Loki in Production: Labels, Per-Stream Retention, and the LogQL Alerts We Run

Table of Contents

Where We Left Off
The Label Set We Actually Run
Per-Stream Retention: 16 Rules That Earn Their Keep
LogQL Alerts We Actually Rely On
The Ingestion-Rate Gotcha That Bit Us Early
Structured Metadata for High-Cardinality Fields
On Multi-Tenancy: We Don’t Use It
Wrapping Up

Where We Left Off

Last article we deployed Loki in SimpleScalable mode — three write pods, three read pods, two backend pods, all writing to Nutanix Objects via the S3 API. That’s the deployment. This article is the operating manual.

The choices that matter for the people using Loki — what to put in a label, what to leave out, how long to keep what, how to write a query that finishes — happen here, in the values that the deployed pods read. Get them right and Loki is invisible. Get them wrong and you’re either losing data, paging the on-call for noise, or paying for storage you don’t need.

The Label Set We Actually Run

Loki indexes labels, not log contents. Every unique combination of label values creates a stream. Streams are what the ingesters hold in memory and what the index points at. Too many streams and your write path runs out of memory; too few and your queries can’t find anything without a brute-force scan.

Our cap is max_global_streams_per_user: 50000. Right now we’re at about 16,000 — comfortably under, with headroom for the Windows fleet onboarding wave that’s projected to push us to roughly 25k.
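
For reference, here is a minimal sketch of how that cap and a headroom alert could be written down. The values layout assumes the grafana/loki Helm chart's loki.limits_config key, the same one used later in this article. The alert is a Prometheus rule on the loki_ingester_memory_streams gauge; the group name, alert name, 80% threshold, and replication factor of 3 are assumptions for illustration, not lifted from our repo.

```yaml
# Helm values fragment: the per-tenant stream cap discussed above.
loki:
  limits_config:
    max_global_streams_per_user: 50000
---
# Prometheus rule sketch (hypothetical group and alert names).
# loki_ingester_memory_streams is reported per ingester, so with an assumed
# replication factor of 3 each stream is counted roughly three times.
groups:
  - name: loki-stream-headroom
    rules:
      - alert: LokiStreamCountNearCap
        expr: sum(loki_ingester_memory_streams) / 3 > 0.8 * 50000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active stream count is above 80% of max_global_streams_per_user"
```

Dividing by the replication factor is only an approximation, so treat the threshold as an early warning rather than an exact count.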

The label set we run, grouped by source:

Pod logs (from `loki.source.kubernetes`)

Label	Cardinality	Source
`namespace`	low (~30)	`__meta_kubernetes_namespace`
`pod`	medium (changes on restart)	`__meta_kubernetes_pod_name`
`container`	low	`__meta_kubernetes_pod_container_name`
`node`	low (number of K8s nodes)	`__meta_kubernetes_pod_node_name`
`app`	low (number of distinct apps)	`__meta_kubernetes_pod_label_app_kubernetes_io_name`
`source`	constant	static label, “kubernetes”
`job`	constant	“loki.source.kubernetes.pod_logs”
`cluster`	low	external label, “conveyor-platform”
`dc`	low	external label, “east” or “west”

pod is the highest-cardinality label here because pod names change every time a Deployment rolls out: both the ReplicaSet hash and the pod suffix are new (my-app-5f8b9c-x7d2k → my-app-7c6d8f-q4l9p). Once a pod is gone its stream goes idle, its last chunks are flushed to object storage, and the compactor deletes them when their retention period expires. Day-to-day, it works.

Kube-audit (from the API server log file)

Label	Cardinality	Source
`job`	constant	“kube-audit”
`source`	constant	“audit”
`verb`	low (get, list, create, update, patch, delete, watch)	parsed from JSON
`user`	medium	parsed from JSON `user.username`
`resource`	medium	parsed from JSON `objectRef.resource`
`audit_ns`	low	parsed from JSON `objectRef.namespace`
`status_code`	very low (HTTP codes)	parsed from JSON `responseStatus.code`
`audit_level`	very low (Request, RequestResponse, etc.)	parsed from JSON `level`

The CRD/lease/self-subject-review filter (covered in the Alloy production post) keeps the audit volume sane. Everything that survives the filter is genuinely useful for security alerting.

Network syslog (from the alloy-network Deployment)

Label	Cardinality	Source
`source`	constant	“network_syslog”
`device_type`	low	“nutanix” / “rubrik” / “dnac” / “network” / “ise”
`hostname`	medium	parsed from RFC 3164/5424 header
`severity`	low	parsed from `<PRI>` for Cisco, regex for Nutanix

device_type is the lever we use most. Switches are network, Nutanix CVMs are nutanix, Cisco ISE is ise, Rubrik backup appliances are rubrik. Each gets its own retention rule.

Windows EventLog

Label	Cardinality	Source
`job`	constant	“windows_eventlog”
`level`	very low	“Information” / “Verbose” / “Warning” / “Error” / “Critical”
`channel`	low	Application / Security / System / etc.
`host`	medium	one per Windows server

level is the workhorse. It’s the only field that determines how long the Windows event lives — Error and Critical get a full year, everything else gets 90 or 180 days. Storage cost ends up dominated by the volume of Information-level events, which is why those have the shortest retention.

Per-Stream Retention: 16 Rules That Earn Their Keep

This is the section I had the most fun writing, because the retention table is specific to our environment rather than generic. Every rule has a reason.

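Our retention_period is a 365-day global default. Per-stream rules then selectively trim that down (or extend it) for specific log sources. When a stream matches more than one selector, Loki’s compactor applies the rule with the highest priority value. Specificity plays no part, which is why the pod-log catch-all below sits at priority 1 and the targeted pod-log rules at priority 2.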

limits_config:
  retention_period: 365d
  retention_stream:
    # --- kube-audit ---
    - selector: '{job="kube-audit"}'
      priority: 1
      period: 90d

    # --- Pod logs: noisy infra ---
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="kube-system",container="calico-node"}'
      priority: 2
      period: 30d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="loki"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="nginx"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="observability",container="grafana-sc-dashboard"}'
      priority: 2
      period: 90d
    - selector: '{job="loki.source.kubernetes.pod_logs",namespace="runners"}'
      priority: 2
      period: 90d

    # --- Pod logs catch-all ---
    - selector: '{job="loki.source.kubernetes.pod_logs"}'
      priority: 1
      period: 180d

    # --- windows_eventlog ---
    - selector: '{job="windows_eventlog",level="Information"}'
      priority: 2
      period: 90d
    - selector: '{job="windows_eventlog",level="Verbose"}'
      priority: 2
      period: 90d
    - selector: '{job="windows_eventlog",level="Warning"}'
      priority: 2
      period: 180d
    - selector: '{job="windows_eventlog",level=~"Error|Critical"}'
      priority: 2
      period: 365d

    # --- network syslog by device type ---
    - selector: '{source="network_syslog",device_type="nutanix"}'
      priority: 2
      period: 90d
    - selector: '{source="network_syslog",device_type="rubrik"}'
      priority: 2
      period: 90d
    - selector: '{source="network_syslog",device_type="dnac"}'
      priority: 2
      period: 180d
    - selector: '{source="network_syslog",device_type="network"}'   # switches, firewalls
      priority: 2
      period: 365d
    - selector: '{source="network_syslog",device_type="ise"}'        # auth logs
      priority: 2
      period: 365d
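
To make the resolution concrete, here is a small worked example of how a few streams would land against the rules above. It is an illustration written as YAML, not Loki configuration: the keys (stream, matches, effective) are made up for this sketch, and the payments namespace and paloalto device type are hypothetical streams that none of our targeted rules name.

```yaml
# Illustration only: not Loki configuration. Each entry lists a stream,
# the rules from the table above that match it, and the period that wins.
- stream: '{job="loki.source.kubernetes.pod_logs",namespace="kube-system",container="calico-node"}'
  matches:
    - {priority: 2, period: 30d}    # calico-node rule
    - {priority: 1, period: 180d}   # pod-log catch-all
  effective: 30d                    # highest priority wins
- stream: '{job="loki.source.kubernetes.pod_logs",namespace="payments",container="api"}'
  matches:
    - {priority: 1, period: 180d}   # only the catch-all matches
  effective: 180d
- stream: '{job="windows_eventlog",level="Error"}'
  matches:
    - {priority: 2, period: 365d}   # Error|Critical rule
  effective: 365d
- stream: '{source="network_syslog",device_type="paloalto"}'
  matches: []                       # no per-stream rule matches
  effective: 365d                   # falls through to retention_period
```

The last entry is the case to watch: a new device_type with no rule silently inherits the 365-day default.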

A few decisions in here are worth talking about.

calico-node at 30 days. Calico is the CNI. It’s chatty. On a busy node, the daemon logs status messages every few seconds. We don’t need a year of CNI status to debug anything; the last 30 days covers any rolling-upgrade or BGP-peering investigation we’ve ever needed. The volume difference is significant — keeping calico-node at the catch-all 180 days would consume real storage for zero query value.

Self-logs at 90 days. Loki, the Mimir nginx gateway, the Grafana dashboard sidecar — these all log a lot, but their logs only matter when something is broken with the observability stack itself. Three months covers the worst incident window.

ARC runners at 90 days. Our GitHub Actions runner pods are ephemeral by design. The runner logs from three months ago aren’t going to help us debug a CI job. Trim them.

Pod logs catch-all at 180 days, kube-audit at 90 days. This is the one I’d revisit if I were starting over. Kube-audit at 90 days is shorter than the pod logs it corresponds to, which means that for anything between 90 and 180 days old you can see what the pod did but not what the API server did about it. We picked 90 days for audit because the volume is high and 90 days satisfies our internal policy. If you have a different policy or different volume, this is a knob worth turning.

Switch/firewall and ISE auth logs at 365 days. These are the security and compliance logs. Regulators have opinions. The volume from syslog is modest enough that 365 days isn’t expensive.

Nutanix CVM syslog at 90 days. Operational signal, not security. Three months is plenty for any post-incident investigation.

The general principle: set per-stream retention by what the log is for, not where it comes from. Security logs get long retention. Operational logs get medium retention. Noise gets short retention. The compactor will do the rest.
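
None of these rules take effect unless retention is switched on in the compactor. A minimal sketch of that side follows, assuming the grafana/loki Helm chart's loki.compactor key and an S3-compatible store like the one from the previous article. The interval and delay shown are Loki's documented defaults as far as I know, so check them against your version.

```yaml
# Helm values fragment (assumed layout: grafana/loki chart, loki.compactor key).
loki:
  compactor:
    retention_enabled: true        # without this, retention rules are not applied
    delete_request_store: s3       # needed once retention is enabled on Loki 3.x
    compaction_interval: 10m       # how often the compactor runs
    retention_delete_delay: 2h     # grace period before marked chunks are deleted
```

The delete delay is also your undo window: a bad retention rule can be reverted before the marked chunks are actually removed from object storage.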

LogQL Alerts We Actually Rely On

Some of the most valuable LogQL queries we run are alerting rules, not dashboards. The Loki ruler evaluates these continuously and fires alerts into Alertmanager when they trip.

Here’s the audit-log alert group, lifted directly from observability/loki/rules/audit-log-alerts.yaml:

groups:
  - name: audit-log-alerts
    rules:
      - alert: UnauthorizedAPIAccess
        expr: |
          sum by (namespace) (
            count_over_time(
              {job="audit"} | json | responseStatus_code >= 403 [5m]
            )
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of unauthorized API access attempts"

      - alert: SensitiveResourceAccess
        expr: |
          count_over_time(
            {job="audit"}
              | json
              | objectRef_resource="secrets"
              | verb=~"create|update|delete|patch"
            [5m]
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Sensitive resource modification detected"

      - alert: ClusterAdminBindingCreated
        expr: |
          count_over_time(
            {job="audit"}
              |~ "cluster-admin"
              | json
              | objectRef_resource="clusterrolebindings"
              | verb="create"
            [15m]
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "cluster-admin ClusterRoleBinding created"

ClusterAdminBindingCreated is the one I’d point at if someone asked “what does this stack actually do for us beyond the dashboards?” If anybody — engineer, attacker, runaway controller — creates a new cluster-admin ClusterRoleBinding, the on-call gets paged within 15 minutes. We have the audit log because the API server emits it, we have Loki because we ship it, we have this alert because someone wrote three lines of LogQL. The audit pipeline pays for itself the first time this rule fires for a real reason.

The syslog alert group covers the network side:

groups:
  - name: syslog-alerts
    rules:
      - alert: NetworkDeviceCriticalSyslog
        expr: |
          count_over_time(
            {job="syslog"} | json | severity=~"0|1|2" [5m]
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "Critical syslog from {{ $labels.hostname }}"

      - alert: FirewallDenySpike
        expr: |
          sum by (hostname) (
            count_over_time(
              {job="syslog"} |~ "(?i)(deny|drop|block|reject)" [5m]
            )
          ) > 500
        for: 5m
        labels:
          severity: warning

      - alert: InterfaceFlap
        expr: |
          sum by (hostname) (
            count_over_time(
              {job="syslog"}
                |~ "(?i)(line protocol.*down|link.*down|interface.*changed state to down)"
              [15m]
            )
          ) > 3
        labels:
          severity: warning

NetworkDeviceCriticalSyslog fires on Cisco syslog severity 0–2 (Emergency, Alert, Critical). Three hundred switches all over the fleet can’t realistically be watched by humans. This rule watches them.

FirewallDenySpike catches the pattern where a misconfiguration or a probe scan suddenly causes thousands of deny events. The (?i) makes the regex case-insensitive, which matters because firewall vendors don’t all agree on capitalization.

InterfaceFlap is the small-but-helpful one. More than three link-down messages from the same device in 15 minutes usually means a cable is going bad or a transceiver is failing. Catching it before the user complaints arrive is a small win every time.

A couple of LogQL patterns worth noticing across all of these:

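Always start with a label selector: {job="audit"}, {job="syslog"}. Never {} |~ "error". The label selector is what determines which chunks Loki has to scan.
json parser before field filters. | json | objectRef_resource="secrets" parses the line once, then matches the exact field, so it cannot be fooled by the word “secrets” appearing somewhere else in the event the way a bare |~ "secrets" can. Parsing is the expensive step, though, so where a cheap line filter can discard most lines first, put it ahead of | json, as ClusterAdminBindingCreated does with |~ "cluster-admin".
count_over_time over a window is the standard idiom for “how many events in the last N minutes.” On its own it returns one series per stream; wrap it in sum by (label) to roll those up per label value.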
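
Put together, those patterns could look like the rule below. This is a sketch rather than one of our production rules: the group and alert names are hypothetical, and it assumes the audit events flatten to objectRef_resource and objectRef_namespace under | json, as in the rules above. The order inside the expression is the point: label selector, then a cheap substring filter, then the parser, then a guard that drops lines that failed to parse, then the exact field filters.

```yaml
groups:
  - name: logql-pattern-sketch              # hypothetical group name
    rules:
      - alert: SecretModifiedByNamespace    # hypothetical alert name
        # Selector -> line filter -> json -> error guard -> field filters,
        # then rolled up to one series per namespace of the touched object.
        expr: |
          sum by (objectRef_namespace) (
            count_over_time(
              {job="audit"}
                |= "secrets"
                | json
                | __error__=""
                | objectRef_resource="secrets"
                | verb=~"create|update|delete|patch"
              [5m]
            )
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Secret modified in {{ $labels.objectRef_namespace }}"
```

The |= "secrets" filter is deliberately loose. It only thins out the lines before parsing; the field filter after | json does the precise matching.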

The Ingestion-Rate Gotcha That Bit Us Early

We tripped over this one in the first month and the fix is one line of config.

Loki’s default for ingestion_rate_strategy is global. That means the configured ingestion_rate_mb cap is divided evenly across the healthy distributors. Three distributors, 50 MB/s configured, each one allowed about 16.7 MB/s.

Sounds reasonable in theory. In practice, kube-proxy’s load balancing is L4: it picks a backend pod when a TCP connection is opened and pins all subsequent packets to that pod for the life of the connection. Alloy’s loki.write holds long-lived HTTP connections. If kube-proxy happens to land all three of your Alloy pods’ connections on the same distributor, that distributor takes the full load while the other two sit idle. It hits its 16.7 MB/s share of the cap, returns 429, and you start losing logs even though the cluster as a whole is nowhere near 50 MB/s.

The fix is ingestion_rate_strategy: local:

loki:
  limits_config:
    ingestion_rate_strategy: local
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100

local makes every distributor enforce the full configured rate on its own, with no division by replica count. The distributor that’s getting all the traffic doesn’t care that its peers are idle, and connection stickiness becomes invisible. The trade-off is that the effective cluster-wide ceiling becomes N times the configured value (150 MB/s with three distributors) if traffic ever does spread evenly, so size ingestion_rate_mb with that in mind.

The reason this isn’t the default is that global keeps the aggregate cap equal to the number you configured no matter how many distributors you run. It uses the distributor ring to track the replica count and assumes traffic is balanced across them. For very large multi-tenant deployments, that exactness matters. If your clients hold a handful of sticky connections, as ours do, switch to local (or raise ingestion_rate_mb until one distributor’s share covers your peak) and stop debugging phantom rate-limit errors.
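
Whichever strategy you run, it helps to be told when Loki is rejecting writes instead of finding out from gaps in a dashboard. Below is a minimal Prometheus rule sketch on Loki's discard counter. The group name, alert name, and the 10-minute hold are assumptions for illustration; loki_discarded_bytes_total with reason="rate_limited" is the metric I would expect to move when distributors return 429.

```yaml
groups:
  - name: loki-ingestion-limits        # hypothetical group name
    rules:
      - alert: LokiRateLimitedWrites   # hypothetical alert name
        # Any sustained rate-limited discard means log lines are being dropped.
        expr: |
          sum by (tenant) (
            rate(loki_discarded_bytes_total{reason="rate_limited"}[5m])
          ) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki is rate-limiting writes for tenant {{ $labels.tenant }}"
```

On a single-tenant install like ours the tenant label carries Loki's placeholder tenant, so the breakdown only becomes interesting if multi-tenancy is ever switched on.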

Structured Metadata for High-Cardinality Fields

Schema v13 added something called structured metadata, which is the right home for high-cardinality fields that you want to filter on but never group by.

Examples:

Trace IDs. Every request has a unique trace ID. If you make trace_id a label, you create one stream per request — your stream count explodes within an hour. But you do want to be able to find the logs for a specific trace.
Request IDs. Same problem.
User session IDs. Same problem.

The pre-v13 advice was to grep for these inside the log content with |~. That worked but it was slow because Loki had to scan every chunk for the regex.

With structured metadata, you can attach trace_id, request_id, session_id as metadata key-value pairs on each log entry without making them labels. Queries can filter on metadata directly:

{namespace="api", app="auth"} | trace_id = "abc123def456"

Loki uses the labels to find the right chunks, then filters on the structured metadata stored alongside each entry in those chunks. The metadata is not indexed, but matching it needs no parsing or regex scan of the log line, and nothing creates a stream per trace.

In Alloy, you set structured metadata in a loki.process stage:

loki.process "extract_metadata" {
  stage.json {
    expressions = {
      trace_id   = "trace_id",
      request_id = "request_id",
    }
  }
  stage.structured_metadata {
    values = {
      trace_id   = "",
      request_id = "",
    }
  }
  forward_to = [loki.write.local.receiver]
}

This is the right answer for any high-cardinality field that you want to query but don’t want to count as a label. If you find yourself wanting to add a label with thousands or millions of distinct values, structured metadata is what you want instead.
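
On the Loki side, structured metadata needs a TSDB index on schema v13 and the matching limit switched on. A minimal sketch of the relevant values follows, assuming the grafana/loki Helm chart's loki.schemaConfig and loki.limits_config keys. The from date and index prefix are placeholders, not our real cut-over values.

```yaml
# Helm values fragment (assumed layout: grafana/loki chart).
loki:
  schemaConfig:
    configs:
      - from: "2024-04-01"        # hypothetical cut-over date
        store: tsdb               # structured metadata requires the TSDB index
        object_store: s3
        schema: v13               # and schema v13 or later
        index:
          prefix: loki_index_     # placeholder prefix
          period: 24h
  limits_config:
    allow_structured_metadata: true
```

If an older schema period is still active, entries written into it cannot carry structured metadata, so add v13 as a new period with a future from date rather than editing an existing one.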

On Multi-Tenancy: We Don’t Use It

A note for anyone reading this expecting a deep multi-tenancy section like every other Loki tutorial.

We run single-tenant — auth_enabled: false in the deployment values from the previous article. Everyone using this Loki is on the same platform team. We don’t need to isolate data between groups; we don’t need per-tenant ingestion limits; we don’t need per-tenant retention. One tenant. Done.

If you do need multi-tenancy:

Flip auth_enabled: true.
Set the X-Scope-OrgID header from Alloy by adding tenant_id = "team-name" to the endpoint block of every loki.write component.
Configure per-tenant overrides in a runtime config file (mounted ConfigMap) for ingestion limits and retention.
Configure Grafana data sources to send the right X-Scope-OrgID header per data source (one Loki data source per tenant, or one with a template variable).
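
For the per-tenant overrides step, the runtime config file is a plain YAML map keyed by tenant ID. Here is a minimal sketch of its shape. The tenant names and numbers are hypothetical, and since we run single-tenant this is untested in our environment.

```yaml
# runtime-config.yaml, mounted from a ConfigMap (tenant names are hypothetical).
overrides:
  team-a:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
    retention_period: 90d
  team-b:
    ingestion_rate_mb: 5
    max_global_streams_per_user: 10000
    retention_period: 30d
```

Anything not listed for a tenant falls back to the values in limits_config, so the file only needs the exceptions.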

The architecture from the previous post doesn’t change; only the auth and per-request headers do. You can flip this on later if your needs change.

The Loki docs cover multi-tenancy in detail. I haven’t run it in production so I won’t pretend to have opinions about which corners it has.

Wrapping Up

Loki’s value lives in the operational choices, not the deployment. Our specific calls:

Labels stay low-cardinality. The set above gives us every filter we actually use, and our stream count sits comfortably under the 50k cap.
Per-stream retention by purpose, not by source. Security and compliance logs at 365 days, operational logs at 90–180 days, infrastructure noise at 30 days.
LogQL alerts on the audit and syslog streams earn their keep — cluster-admin binding creation, severity 0–2 network syslog, firewall deny spikes, interface flaps.
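Check ingestion_rate_strategy on day one. A cap that is split per distributor, combined with sticky client connections, produces 429s well below the configured rate. Get it right up front and skip the 429 debugging.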
Structured metadata for high-cardinality fields — trace IDs, request IDs, anything you want to filter on but never group by. Use v13 schema and the stage.structured_metadata Alloy stage.
Single-tenant. Multi-tenancy is for cases where you’re running Loki for multiple unrelated consumers. We aren’t.

Next post (article 5) we move to Mimir. Same conceptual stack — agent ships data, simple-scalable architecture, Nutanix Objects backend — but the operational levers for metrics are very different from the ones for logs. Cardinality, scrape intervals, the ingester ring, and what happens when somebody adds a request_id label to a Prometheus counter.

Happy automating!