Where We Left Off
Last article we deployed Loki in SimpleScalable mode — three write pods, three read pods, two backend pods, all writing to Nutanix Objects via the S3 API. That’s the deployment. This article is the operating manual.
The choices that matter for the people using Loki — what to put in a label, what to leave out, how long to keep what, how to write a query that finishes — happen here, in the values that the deployed pods read. Get them right and Loki is invisible. Get them wrong and you’re either losing data, paging the on-call for noise, or paying for storage you don’t need.
The Label Set We Actually Run
Loki indexes labels, not log contents. Every unique combination of label values creates a stream. Streams are what the ingesters hold in memory and what the index points at. Too many streams and your write path runs out of memory; too few and your queries can’t find anything without a brute-force scan.
Our cap is max_global_streams_per_user: 50000. Right now we’re at about 16,000 — comfortably under, with headroom for the Windows fleet onboarding wave that’s projected to push us to roughly 25k.
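To make "every unique combination of label values creates a stream" concrete, here is a minimal Python sketch. It is an illustration only, not how Loki counts streams internally; the sample label sets and the helper names `count_streams` and `worst_case_streams` are made up for the example.

```python
def count_streams(entries):
    """Count distinct label sets; each distinct set corresponds to one stream."""
    return len({tuple(sorted(labels.items())) for labels in entries})


def worst_case_streams(cardinalities):
    """Upper bound on streams if every label value combined with every other."""
    total = 1
    for distinct_values in cardinalities.values():
        total *= distinct_values
    return total


# Hypothetical sample: four log entries, three distinct label sets.
sample = [
    {"namespace": "api", "pod": "auth-5f8b9c-x7d2k", "container": "auth"},
    {"namespace": "api", "pod": "auth-5f8b9c-x7d2k", "container": "auth"},
    {"namespace": "api", "pod": "auth-5f8b9c-q4l9p", "container": "auth"},
    {"namespace": "api", "pod": "auth-5f8b9c-q4l9p", "container": "sidecar"},
]

streams = count_streams(sample)
bound = worst_case_streams({"namespace": 30, "dc": 2})
```

The worst-case product is usually far above the real count, because most label values never co-occur, but it is a quick way to see how much a proposed new label could add before it ships.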
The label set we run, grouped by source:
Pod logs (from loki.source.kubernetes)
| Label | Cardinality | Source |
|---|---|---|
| namespace | low (~30) | __meta_kubernetes_namespace |
| pod | medium (changes on restart) | __meta_kubernetes_pod_name |
| container | low | __meta_kubernetes_pod_container_name |
| node | low (number of K8s nodes) | __meta_kubernetes_pod_node_name |
| app | low (number of distinct apps) | __meta_kubernetes_pod_label_app_kubernetes_io_name |
| source | constant | static label, "kubernetes" |
| job | constant | "loki.source.kubernetes.pod_logs" |
| cluster | low | external label, "conveyor-platform" |
| dc | low | external label, "east" or "west" |
pod is the highest-cardinality label here because pod names change every time a Deployment rolls out (my-app-5f8b9c-x7d2k → my-app-5f8b9c-q4l9p). The chunks for an old pod become inactive once the pod is gone, and the compactor reclaims them on the standard schedule. Day-to-day, it works.
Kube-audit (from the API server log file)
| Label | Cardinality | Source |
|---|---|---|
| job | constant | "kube-audit" |
| source | constant | "audit" |
| verb | low (get, list, create, update, patch, delete, watch) | parsed from JSON |
| user | medium | parsed from JSON user.username |
| resource | medium | parsed from JSON objectRef.resource |
| audit_ns | low | parsed from JSON objectRef.namespace |
| status_code | very low (HTTP codes) | parsed from JSON responseStatus.code |
| audit_level | very low (Request, RequestResponse, etc.) | parsed from JSON level |
The CRD/lease/self-subject-review filter (covered in the Alloy production post) keeps the audit volume sane. Everything that survives the filter is genuinely useful for security alerting.
Network syslog (from the alloy-network Deployment)
| Label | Cardinality | Source |
|---|---|---|
| source | constant | "network_syslog" |
| device_type | low | "nutanix" / "rubrik" / "dnac" / "network" / "ise" |
| hostname | medium | parsed from RFC 3164/5424 header |
| severity | low | parsed from <PRI> for Cisco, regex for Nutanix |
device_type is the lever we use most. Switches are network, Nutanix CVMs are nutanix, Cisco ISE is ise, Rubrik backup appliances are rubrik. Each gets its own retention rule.
Windows EventLog
| Label | Cardinality | Source |
|---|---|---|
| job | constant | "windows_eventlog" |
| level | very low | "Information" / "Verbose" / "Warning" / "Error" / "Critical" |
| channel | low | Application / Security / System / etc. |
| host | medium | one per Windows server |
level is the workhorse. It’s the only field that determines how long the Windows event lives — Error and Critical get a full year, everything else gets 90 or 180 days. Storage cost ends up dominated by the volume of Information-level events, which is why those have the shortest retention.
Per-Stream Retention: 16 Rules That Earn Their Keep
This is the section I had the most fun with in this article, because the retention table is genuinely useful and it’s not generic. Every rule has a reason.
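Our retention_period is a 365-day global default. Per-stream rules then selectively trim that down (or extend it) for specific log sources. When a stream matches more than one rule, Loki’s compactor applies the matching rule with the highest priority value, which is why the specific selectors below carry priority 2 and the broad catch-alls carry priority 1.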
limits_config:
  retention_period: 365d
  retention_stream:
    # --- kube-audit ---
    - selector: '{job="kube-audit"}'
      priority: 1
      period: 90d
    # --- Pod logs: noisy infra ---
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="kube-system",container="calico-node"}'
      priority: 2
      period: 30d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="loki"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="nginx"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="grafana-sc-dashboard"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="runners"}'
      priority: 2
      period: 90d
    # --- Pod logs catch-all ---
    - selector: '{job="loki.source.kubernetes.pod_logs"}'
      priority: 1
      period: 180d
    # --- windows_eventlog ---
    - selector: '{job="windows_eventlog",level="Information"}'
      priority: 2
      period: 90d
    - selector: '{job="windows_eventlog",level="Verbose"}'
      priority: 2
      period: 90d
    - selector: '{job="windows_eventlog",level="Warning"}'
      priority: 2
      period: 180d
    - selector: '{job="windows_eventlog",level=~"Error|Critical"}'
      priority: 2
      period: 365d
    # --- network syslog by device type ---
    - selector: '{source="network_syslog",device_type="nutanix"}'
      priority: 2
      period: 90d
    - selector: '{source="network_syslog",device_type="rubrik"}'
      priority: 2
      period: 90d
    - selector: '{source="network_syslog",device_type="dnac"}'
      priority: 2
      period: 180d
    - selector: '{source="network_syslog",device_type="network"}'  # switches, firewalls
      priority: 2
      period: 365d
    - selector: '{source="network_syslog",device_type="ise"}'  # auth logs
      priority: 2
      period: 365d
A few decisions in here are worth talking about.
calico-node at 30 days. Calico is the CNI. It’s chatty. On a busy node, the daemon logs status messages every few seconds. We don’t need a year of CNI status to debug anything; the last 30 days covers any rolling-upgrade or BGP-peering investigation we’ve ever needed. The volume difference is significant — keeping calico-node at the catch-all 180 days would consume real storage for zero query value.
Self-logs at 90 days. Loki, the Mimir nginx gateway, the Grafana dashboard sidecar — these all log a lot, but their logs only matter when something is broken with the observability stack itself. Three months covers the worst incident window.
ARC runners at 90 days. Our GitHub Actions runner pods are ephemeral by design. The runner logs from three months ago aren’t going to help us debug a CI job. Trim them.
Pod logs catch-all at 180 days, kube-audit at 90 days. This is the one I’d revisit if I were starting over. Kube-audit at 90 days is shorter than the pod logs they correspond to, which means in some forensic scenarios you can see what the pod did but not what the API server did about it. We picked 90 days for audit because the volume is high and 90 days satisfies our internal policy. If you have a different policy or different volume, this is a knob worth turning.
Switch/firewall and ISE auth logs at 365 days. These are the security and compliance logs. Regulators have opinions. The volume from syslog is modest enough that 365 days isn’t expensive.
Nutanix CVM syslog at 90 days. Operational signal, not security. Three months is plenty for any post-incident investigation.
The general principle: set per-stream retention by what the log is for, not where it comes from. Security logs get long retention. Operational logs get medium retention. Noise gets short retention. The compactor will do the rest.
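If it helps to see how overlapping rules resolve, here is a small Python sketch of the selection logic, assuming the highest priority value wins among matching rules and the global default applies when nothing matches. It is a simplified model, not the compactor's code: it handles only `=` and `=~` matchers, covers a subset of the rules above, and the names `RULES`, `matches` and `retention_days` are invented for the example.

```python
import re

# Each rule: (matchers, priority, period_days).
# A matcher is (label, op, value); "=" is exact, "=~" is a fully anchored regex.
RULES = [
    ([("job", "=", "kube-audit")], 1, 90),
    ([("job", "=", "loki.source.kubernetes.pod_logs"),
      ("namespace", "=", "kube-system"),
      ("container", "=", "calico-node")], 2, 30),
    ([("job", "=", "loki.source.kubernetes.pod_logs")], 1, 180),
    ([("job", "=", "windows_eventlog"), ("level", "=~", "Error|Critical")], 2, 365),
]
GLOBAL_DAYS = 365  # mirrors retention_period: 365d


def matches(matchers, labels):
    """Return True if every matcher in the selector accepts the stream labels."""
    for label, op, value in matchers:
        actual = labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def retention_days(labels):
    """Pick the matching rule with the highest priority, else the global default."""
    hits = [(prio, days) for m, prio, days in RULES if matches(m, labels)]
    if not hits:
        return GLOBAL_DAYS
    return max(hits, key=lambda hit: hit[0])[1]


calico = {"job": "loki.source.kubernetes.pod_logs",
          "namespace": "kube-system", "container": "calico-node"}
other_pod = {"job": "loki.source.kubernetes.pod_logs",
             "namespace": "api", "container": "auth"}
```

A calico-node stream matches both the 30-day rule and the 180-day catch-all; the priority 2 rule wins, so it resolves to 30 days, while any other pod stream falls through to 180.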
LogQL Alerts We Actually Rely On
Some of the most valuable LogQL queries we run are alerting rules, not dashboards. The Loki ruler evaluates these continuously and fires alerts into Alertmanager when they trip.
Here’s the audit-log alert group, lifted directly from observability/loki/rules/audit-log-alerts.yaml:
groups:
  - name: audit-log-alerts
    rules:
      - alert: UnauthorizedAPIAccess
        expr: |
          sum by (namespace) (
            count_over_time(
              {job="audit"} | json | responseStatus_code >= 403 [5m]
            )
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of unauthorized API access attempts"
      - alert: SensitiveResourceAccess
        expr: |
          count_over_time(
            {job="audit"}
              | json
              | objectRef_resource="secrets"
              | verb=~"create|update|delete|patch"
            [5m]
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Sensitive resource modification detected"
      - alert: ClusterAdminBindingCreated
        expr: |
          count_over_time(
            {job="audit"}
              |~ "cluster-admin"
              | json
              | objectRef_resource="clusterrolebindings"
              | verb="create"
            [15m]
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "cluster-admin ClusterRoleBinding created"
ClusterAdminBindingCreated is the one I’d point at if someone asked “what does this stack actually do for us beyond the dashboards?” If anybody — engineer, attacker, runaway controller — creates a new cluster-admin ClusterRoleBinding, the on-call gets paged within 15 minutes. We have the audit log because the API server emits it, we have Loki because we ship it, we have this alert because someone wrote three lines of LogQL. The audit pipeline pays for itself the first time this rule fires for a real reason.
The syslog alert group covers the network side:
groups:
  - name: syslog-alerts
    rules:
      - alert: NetworkDeviceCriticalSyslog
        expr: |
          count_over_time(
            {job="syslog"} | json | severity=~"0|1|2" [5m]
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical syslog from {{ $labels.hostname }}"
      - alert: FirewallDenySpike
        expr: |
          sum by (hostname) (
            count_over_time(
              {job="syslog"} |~ "(?i)(deny|drop|block|reject)" [5m]
            )
          ) > 500
        for: 5m
        labels:
          severity: warning
      - alert: InterfaceFlap
        expr: |
          sum by (hostname) (
            count_over_time(
              {job="syslog"}
                |~ "(?i)(line protocol.*down|link.*down|interface.*changed state to down)"
              [15m]
            )
          ) > 3
        labels:
          severity: warning
NetworkDeviceCriticalSyslog fires on Cisco syslog severity 0–2 (Emergency, Alert, Critical). Three hundred switches all over the fleet can’t realistically be watched by humans. This rule watches them.
FirewallDenySpike catches the pattern where a misconfiguration or a probe scan suddenly causes thousands of deny events. The (?i) makes the regex case-insensitive, which matters because firewall vendors don’t all agree on capitalization.
InterfaceFlap is the small-but-helpful one. Three interface state changes in 15 minutes usually means a cable is going bad or a transceiver is failing. Catching it before the user complaints arrive is a small win every time.
A couple of LogQL patterns worth noticing across all of these:
- Always start with a label selector: {job="audit"}, {job="syslog"}. Never {} |~ "error". The label selector is what determines which chunks Loki has to scan.
- json parser before field filters. | json | objectRef_resource="secrets" parses the JSON once, then filters on the parsed field, which is more precise than a |~ "secrets" regex that matches the string anywhere in the line. Where a cheap line filter can narrow the set first, put it ahead of | json, as ClusterAdminBindingCreated does with |~ "cluster-admin".
- count_over_time over a window is the standard idiom for “how many events in the last N minutes.” Pair with sum by (label) to break down by stream.
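For readers who think better in code than in LogQL, here is a rough Python model of what sum by (hostname) (count_over_time(... [15m])) > 3 computes for InterfaceFlap. It is a sketch of the semantics only, not how the ruler evaluates rules; the sample entries, the timestamps in seconds, and the function name `count_over_time_by` are all made up for the example.

```python
import re
from collections import Counter

# Same regex as the InterfaceFlap rule; (?i) makes it case-insensitive.
FLAP = re.compile(
    r"(?i)(line protocol.*down|link.*down|interface.*changed state to down)"
)


def count_over_time_by(entries, now, window_s, label, pattern):
    """Count matching entries in the window ending at `now`, grouped by a label."""
    counts = Counter()
    for ts, labels, line in entries:
        if not (now - window_s < ts <= now):
            continue  # outside the range window
        if pattern.search(line) is None:
            continue  # line filter did not match
        counts[labels.get(label, "")] += 1
    return counts


down = "Interface Gi1/0/1, changed state to down"
up = "Interface Gi1/0/1, changed state to up"
entries = [
    (50, {"hostname": "sw1"}, down),   # older than the window
    (200, {"hostname": "sw1"}, down),
    (400, {"hostname": "sw1"}, down),
    (500, {"hostname": "sw2"}, down),
    (600, {"hostname": "sw1"}, down),
    (700, {"hostname": "sw1"}, up),    # does not match the regex
    (800, {"hostname": "sw1"}, down),
]

counts = count_over_time_by(entries, now=1000, window_s=900,
                            label="hostname", pattern=FLAP)
firing = sorted(host for host, n in counts.items() if n > 3)
```

In this sample only sw1 crosses the threshold: four down events inside the 15-minute window, against one for sw2.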
The Ingestion-Rate Gotcha That Bit Us Early
We tripped over this one in the first month and the fix is one line of config.
The Loki chart’s default for ingestion_rate_strategy is local. That means the configured ingestion_rate_mb cap is divided across distributor pods. Three distributors, 50 MB/s configured, each one allowed about 16.6 MB/s.
Sounds reasonable in theory. In practice, kube-proxy’s connection load balancing is L4 — it picks a backend pod when a TCP connection is opened and pins all subsequent packets to that pod for the life of the connection. Alloy’s loki.write opens long-lived HTTP/2 connections. If kube-proxy happens to land all three of your Alloy pods’ connections on the same distributor, that distributor has to absorb the full 50 MB/s while two others sit idle. The full-rate distributor hits the local 16.6 MB/s cap, returns 429, and you start losing logs.
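The arithmetic is worth spelling out. The sketch below assumes the cap is split evenly across distributors, as described above, and ignores the burst allowance entirely. The 45 MB/s aggregate load and the function names are hypothetical, chosen only to show a case where the total is under the configured cap but one pod still rejects traffic.

```python
def per_distributor_cap(total_mb_s, distributors):
    """Even split of the configured cap across distributor pods."""
    return total_mb_s / distributors


def rejected_mb_s(offered_mb_s, cap_mb_s):
    """Traffic above the per-pod cap is what comes back as 429s."""
    return max(0.0, offered_mb_s - cap_mb_s)


cap = per_distributor_cap(50, 3)

# Same 45 MB/s aggregate, spread evenly versus pinned to one distributor.
even = [rejected_mb_s(load, cap) for load in (15, 15, 15)]
skewed = [rejected_mb_s(load, cap) for load in (45, 0, 0)]
```

Spread evenly, nothing is rejected. Pinned to one pod, roughly 28 MB/s of the same 45 MB/s is turned away even though the aggregate never exceeded the 50 MB/s setting.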
The fix is ingestion_rate_strategy: global:
loki:
  limits_config:
    ingestion_rate_strategy: global
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
global applies the cap cluster-wide. The distributor that’s getting all the traffic doesn’t care that its peers are idle; the cap is on aggregate volume, not per-pod. Connection stickiness becomes invisible.
The reason this isn’t the chart default is that global requires the distributors to coordinate via the ring (memberlist), which adds a tiny amount of inter-pod traffic. For most deployments — including ours — that overhead is invisible. For very large multi-tenant deployments, local might be slightly more efficient. For everything else, switch to global and stop debugging phantom rate-limit errors.
Structured Metadata for High-Cardinality Fields
Schema v13 added something called structured metadata, which is the right home for high-cardinality fields that you want to filter on but never group by.
Examples:
- Trace IDs. Every request has a unique trace ID. If you make trace_id a label, you create one stream per request, and your stream count explodes within an hour. But you do want to be able to find the logs for a specific trace.
- Request IDs. Same problem.
- User session IDs. Same problem.
The pre-v13 advice was to grep for these inside the log content with |~. That worked but it was slow because Loki had to scan every chunk for the regex.
With structured metadata, you can attach trace_id, request_id, session_id as metadata key-value pairs on each log entry without making them labels. Queries can filter on metadata directly:
{namespace="api", app="auth"} | trace_id = "abc123def456"
Loki uses the labels to find the right chunks, then filters on the structured metadata stored alongside each entry in those chunks to find the matches, all without creating a stream per trace.
In Alloy, you set structured metadata in a loki.process stage:
loki.process "extract_metadata" {
  stage.json {
    expressions = {
      trace_id   = "trace_id",
      request_id = "request_id",
    }
  }

  stage.structured_metadata {
    values = {
      trace_id   = "",
      request_id = "",
    }
  }

  forward_to = [loki.write.local.receiver]
}
This is the right answer for any high-cardinality field that you want to query but don’t want to count as a label. If you find yourself wanting to add a label with thousands or millions of distinct values, structured metadata is what you want instead.
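The difference in stream count is easy to see in a toy model. The Python sketch below is not Loki's storage format; it just groups entries by label set and keeps the remaining fields attached to each entry, which is the idea behind structured metadata. The 1,000 synthetic entries and the names `ingest` and `filter_metadata` are assumptions for the example.

```python
def ingest(entries, promote_to_label=()):
    """Group entries into streams keyed by label set.

    Fields named in promote_to_label become labels; all other metadata
    stays attached to the individual entry.
    """
    streams = {}
    for labels, metadata, line in entries:
        promoted = {k: v for k, v in metadata.items() if k in promote_to_label}
        rest = {k: v for k, v in metadata.items() if k not in promote_to_label}
        key = tuple(sorted({**labels, **promoted}.items()))
        streams.setdefault(key, []).append((rest, line))
    return streams


def filter_metadata(streams, selector, field, value):
    """Select streams by label first, then filter entries on a metadata field."""
    out = []
    for key, items in streams.items():
        labels = dict(key)
        if all(labels.get(k) == v for k, v in selector.items()):
            out.extend(line for meta, line in items if meta.get(field) == value)
    return out


entries = [
    ({"namespace": "api", "app": "auth"}, {"trace_id": f"t{i}"}, f"request {i}")
    for i in range(1000)
]

as_metadata = ingest(entries)
as_label = ingest(entries, promote_to_label=("trace_id",))
found = filter_metadata(as_metadata, {"namespace": "api", "app": "auth"},
                        "trace_id", "t42")
```

With trace_id kept as metadata, the 1,000 entries live in a single stream and the lookup still finds the one matching line. Promote it to a label and the same data becomes 1,000 streams.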
On Multi-Tenancy: We Don’t Use It
A note for anyone reading this expecting a deep multi-tenancy section like every other Loki tutorial.
We run single-tenant — auth_enabled: false in the deployment values from the previous article. Everyone using this Loki is on the same platform team. We don’t need to isolate data between groups; we don’t need per-tenant ingestion limits; we don’t need per-tenant retention. One tenant. Done.
If you do need multi-tenancy:
- Flip auth_enabled: true.
- Configure X-Scope-OrgID on every Alloy loki.write block: tenant_id = "team-name".
- Configure per-tenant overrides in a runtime config file (mounted ConfigMap) for ingestion limits and retention.
- Configure Grafana data sources to send the right X-Scope-OrgID header per data source (one Loki data source per tenant, or one with a template variable).
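On the query side, the tenant is just an HTTP header on each request. As a minimal sketch, the Python below builds (but does not send) a request against Loki's query_range endpoint with X-Scope-OrgID set. The gateway URL, the tenant name and the helper `build_query_request` are placeholders, and since we run single-tenant this is untested against a multi-tenant Loki.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(base_url, tenant, logql, start_ns, end_ns):
    """Build a query_range request scoped to one tenant via X-Scope-OrgID."""
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns})
    return Request(
        f"{base_url}/loki/api/v1/query_range?{params}",
        headers={"X-Scope-OrgID": tenant},
    )


# Hypothetical gateway address and tenant name.
req = build_query_request(
    "http://loki-gateway.example", "team-name", '{job="kube-audit"}', 0, 1
)
# urllib normalizes header names to "X-scope-orgid" internally;
# HTTP header names are case-insensitive, so that is harmless on the wire.
tenant_header = req.get_header("X-scope-orgid")
```

A Grafana data source configured with a custom X-Scope-OrgID header does the equivalent of this on every query it issues.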
The architecture from the previous post doesn’t change; only the auth and per-request headers do. You can flip this on later if your needs change.
The Loki docs cover multi-tenancy in detail. I haven’t run it in production so I won’t pretend to have opinions about which corners it has.
Wrapping Up
Loki’s value lives in the operational choices, not the deployment. Our specific calls:
- Labels stay low-cardinality. The set above gives us every filter we actually use, and our stream count sits comfortably under the 50k cap.
- Per-stream retention by purpose, not by source. Security and compliance logs at 365 days, operational logs at 90–180 days, infrastructure noise at 30 days.
- LogQL alerts on the audit and syslog streams earn their keep: cluster-admin binding creation, severity 0–2 network syslog, firewall deny spikes, interface flaps.
- ingestion_rate_strategy: global, not the chart default. Switch this on day one and skip the 429 debugging.
- Structured metadata for high-cardinality fields: trace IDs, request IDs, anything you want to filter on but never group by. Use v13 schema and the stage.structured_metadata Alloy stage.
- Single-tenant. Multi-tenancy is for cases where you’re running Loki for multiple unrelated consumers. We aren’t.
Next post (article 5) we move to Mimir. Same conceptual stack — agent ships data, simple-scalable architecture, Nutanix Objects backend — but the operational levers for metrics are very different from the ones for logs. Cardinality, scrape intervals, the ingester ring, and what happens when somebody adds a request_id label to a Prometheus counter.
Happy automating!