Pipelines and Pizza 🍕

Mimir on Kubernetes: 620K Active Series on Nutanix Objects


Where We Left Off

We’ve covered Alloy (collection), Loki deploy, and Loki operations. Mimir is the M in LGTM — the metrics backend that receives every prometheus.remote_write from the Alloy DaemonSet and serves PromQL queries to Grafana.

The shape of Mimir is conceptually identical to Loki: a write path, a read path, S3 in the middle. The component names are different (distributor, ingester, querier, store-gateway, compactor) but the philosophy is the same. The big differences are around sample-level concurrency (Mimir has to handle samples landing slightly out of order from a DaemonSet) and the size of the working set in memory (ingesters hold the active series in RAM until they flush to S3).


Why Mimir for a Net-New Platform

The decision is captured in ADR-008. The short version:

When you have an existing Prometheus deployment and need long-term storage, Thanos is the natural fit. Its sidecar model attaches to your running Prometheus servers without much disruption. You get long-term storage and global query without throwing anything out.

When you’re building net-new, the sidecar model is extra weight. We aren’t sidecaring onto an existing Prometheus — we have Alloy remote_write going directly to a metrics backend. Mimir accepts that traffic on its distributor endpoint, no Prometheus in the middle.

The other big factor: Mimir reads the same way Loki does. Same chart conventions, same simple-scalable deployment pattern, same S3 backend model, same Grafana data source pattern. When the team already learned Loki, they could read a Mimir Helm chart on day one. The cognitive load of running one ecosystem instead of two is real, especially for a team that operates a lot of other things alongside observability.

VictoriaMetrics and Cortex were the other candidates considered:

  • VictoriaMetrics: Excellent performance and storage efficiency, but it’s a different operational model with its own query language extensions. Less Grafana-native.
  • Cortex: Mimir was forked from Cortex in 2022. Grafana Labs is investing in Mimir; Cortex is in maintenance mode. Pick the active project.

Mimir wins for our situation. If you have existing Prometheus, your answer might be Thanos.


The Architecture We Run

We run Mimir in SimpleScalable mode, same as Loki. The component breakdown:

Component       Replicas  Role
Distributor     2         Front door for remote_write; hashes series to ingesters
Ingester        3         Holds active series in memory; flushes TSDB blocks to S3 every 2h
Querier         3         Executes PromQL across ingesters (recent) + store-gateway (historical)
Query-frontend  1         Splits queries, caches results, retries
Store-gateway   1         Serves historical TSDB blocks from S3
Compactor       1         Merges small blocks into large ones, dedups samples, enforces retention
Ruler           1         Evaluates recording rules and alerting rules

The data flow:

Write path:  Alloy → distributor → ingester (RAM + WAL) → S3 (Nutanix Objects)
Read path:   Grafana → query-frontend → querier → ingester (recent)
                                                → store-gateway (historical from S3)
Background:  compactor merges blocks in S3

Distributors are stateless: they hash each incoming series and forward it to the ingesters that own it. Ingesters are stateful: each one holds a slice of the active series set, assigned through a hash ring. The replication factor is 3, which with only three ingesters means every sample lands on all of them. That sounds wasteful, but it’s how Mimir guarantees no data loss across single-ingester restarts.
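To make the replication math concrete, here is a minimal sketch of majority-quorum arithmetic. It assumes writes succeed once a simple majority of replicas acknowledge them; the function names are ours, for illustration only, and are not Mimir APIs.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a simple majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while a majority is still reachable."""
    return replication_factor - write_quorum(replication_factor)


rf = 3
print(f"RF={rf}: quorum={write_quorum(rf)}, tolerated failures={tolerated_failures(rf)}")
# RF=3: quorum=2, tolerated failures=1
```

Under that assumption, a replication factor of 3 survives exactly one ingester being down at a time, which matches the maxUnavailable: 1 disruption budget used for the ingesters.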


Real Numbers

These are our actual production numbers, as of this writing:

Setting                  Value
Active series            ~620,000
Retention                365 days
Ingestion rate cap       100,000 samples/sec
Ingestion burst          200,000 samples
Global series cap        8,000,000
Ingester memory request  8 Gi
Ingester memory limit    16 Gi
Out-of-order window      5 minutes

The 620k active series count is the steady-state for our current scope: three RKE2 nodes per DC, the LGTM stack itself, ArgoCD, CNPG, kube-state-metrics, cAdvisor, kubelet, etcd, apiserver, CoreDNS, the Rubrik exporter, blackbox probes, Telegraf instances feeding network metrics, and a growing fleet of Windows servers via the Windows Alloy agent.

The 8M global series cap is intentionally well above 620k. We’ll cover the bump-up story in a section below.
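For a rough sense of how the series count relates to the ingestion rate cap, here is a back-of-the-envelope sketch. It assumes every active series is scraped once every 30 seconds, which is a simplification: real scrape intervals vary by job, so treat the result as an order-of-magnitude estimate rather than a measured rate.

```python
ACTIVE_SERIES = 620_000        # from the table above
INGESTION_RATE_CAP = 100_000   # samples/sec, from the table above
SCRAPE_INTERVAL_SECONDS = 30   # assumption: one uniform interval for all series


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series contributes one sample per scrape interval."""
    return active_series / scrape_interval_s


rate = samples_per_second(ACTIVE_SERIES, SCRAPE_INTERVAL_SECONDS)
utilization = rate / INGESTION_RATE_CAP
print(f"~{rate:,.0f} samples/sec, {utilization:.0%} of the ingestion rate cap")
# ~20,667 samples/sec, 21% of the ingestion rate cap
```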


Helm Values: The Mimir Side

Trimmed for readability. The full thing has more comments about why-we-did-what.

mimir-distributed:
  kafka:
    enabled: false           # No Kafka ingest path
  minio:
    enabled: false           # We use Nutanix Objects, not bundled MinIO
  rollout_operator:
    enabled: false           # Zone-aware replication disabled
  alertmanager:
    enabled: false           # Standalone Alertmanager handles routing

  global:
    extraEnvFrom:
      - secretRef:
          name: mimir-s3-credentials
    extraEnv:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://tempo-distributor.observability.svc:4317
      - name: OTEL_TRACES_SAMPLER
        value: parentbased_traceidratio
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.1"

  mimir:
    structuredConfig:
      multitenancy_enabled: false           # Single-tenant
      common:
        storage:
          backend: s3
      blocks_storage:
        storage_prefix: blocks
        tsdb:
          dir: /data/tsdb
      limits:
        compactor_blocks_retention_period: 365d
        max_global_series_per_user: 8000000
        ingestion_rate: 100000
        ingestion_burst_size: 200000
        out_of_order_time_window: 5m
        max_global_exemplars_per_user: 100000
      ruler:
        alertmanager_url: http://alertmanager.observability.svc:9093
        external_url: https://grafana.conveyor.internal
      usage_stats:
        enabled: false           # Egress NetworkPolicy blocks stats.grafana.org

  ingester:
    replicas: 3
    podDisruptionBudget:
      maxUnavailable: 1
    zoneAwareReplication:
      enabled: false             # Single-DC
    persistentVolume:
      storageClass: local-path
      size: 20Gi
    resources:
      requests:
        cpu: 2000m
        memory: 8Gi
      limits:
        cpu: 8000m
        memory: 16Gi
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: ingester
                app.kubernetes.io/name: mimir
            topologyKey: kubernetes.io/hostname

  querier:
    replicas: 3
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  query_frontend:
    replicas: 1

  compactor:
    replicas: 1
    persistentVolume:
      storageClass: local-path
      size: 50Gi
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi

  store_gateway:
    replicas: 1
    zoneAwareReplication:
      enabled: false
    persistentVolume:
      storageClass: local-path
      size: 20Gi

  distributor:
    replicas: 2
    podDisruptionBudget:
      maxUnavailable: 1

A handful of decisions to call out:

Single-tenant (multitenancy_enabled: false). Same reasoning as Loki — everyone using this Mimir is on the same platform team. No need to isolate ingestion or query.

Required pod anti-affinity on ingesters (app.kubernetes.io/component: ingester and app.kubernetes.io/name: mimir). Note the second selector — without it, the rule would also match tempo-ingester pods because they share the same component label, and you’d end up with bizarre cross-chart scheduling effects.
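The reason the second selector matters is that matchLabels entries are ANDed together. A small sketch of that matching logic, using hypothetical pod labels (the tempo labels below are illustrative, not copied from a real chart):

```python
def matches(match_labels: dict, pod_labels: dict) -> bool:
    """matchLabels semantics: every listed key/value must be present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical pod labels, for illustration only.
mimir_ingester = {"app.kubernetes.io/component": "ingester", "app.kubernetes.io/name": "mimir"}
tempo_ingester = {"app.kubernetes.io/component": "ingester", "app.kubernetes.io/name": "tempo"}

component_only = {"app.kubernetes.io/component": "ingester"}
component_and_name = {**component_only, "app.kubernetes.io/name": "mimir"}

print(matches(component_only, tempo_ingester))      # True: the rule would also hit tempo
print(matches(component_and_name, tempo_ingester))  # False: the name label excludes tempo
print(matches(component_and_name, mimir_ingester))  # True
```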

usage_stats: enabled: false. Mimir tries to phone home to stats.grafana.org every 4 hours by default. Our egress NetworkPolicy blocks that traffic, so the calls were failing and logging warnings on a clock. Turning the feature off silenced the noise and saved us from filtering warning logs that meant nothing.

Self-tracing to Tempo at 10% sample. Mimir is one of the things we dogfood our own observability stack on. Mimir’s spans go to our Tempo deployment with service.name=mimir. Unlike Alloy (which hardcoded service.name=alloy and required an OTel processor workaround), Mimir respects the standard OTEL_SERVICE_NAME env var.


The Singleton Components, On Purpose

Three of those Mimir components are running at one replica each: query-frontend, store-gateway, compactor. During release-candidate review, this got flagged as a single point of failure. The platform runs ingesters at 3 replicas and distributors at 2, so why are these three at 1?

The answer is in ADR-028, and the short version is: it’s intentional and correct.

Compactor at 1 replica. Mimir’s compactor uses a hash ring with a single token by default. Running multiple compactors without explicit sharding configuration causes them to compete for the same blocks and corrupt them. The chart default is 1 for this reason. To run multiple compactors safely, you’d enable compactor sharding, which adds configuration and complexity. At our block volume, one compactor with adequate resources keeps up easily; a typical compaction run finishes well within its interval. Two compactors get you nothing except a risk of block corruption.

Store-gateway at 1 replica. Serves historical blocks from S3. Queries against recent data (which is the majority of queries) are served by ingesters, not store-gateway. A store-gateway restart causes a ~30-second gap in historical query availability but doesn’t affect ingestion or recent queries. Scaling to 2 requires enabling store-gateway sharding and replication, which adds the same complexity-without-benefit problem as the compactor. We accept the brief restart gap.

Query-frontend at 1 replica. Stateless query scheduler and splitter. If it’s unavailable, queriers still function — they just don’t benefit from query splitting and caching. A restart causes ~10–30 seconds of elevated query latency. Scaling to 2 is safe and low-risk but provides minimal benefit at our query volume, which is internal dashboards, not customer-facing traffic.
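To illustrate what "splitting" buys you, here is a simplified sketch of cutting a long range query into interval-aligned pieces. The 24-hour interval and the function name are assumptions for illustration; the real query-frontend does considerably more than this.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_by_interval(start: datetime, end: datetime, interval: timedelta):
    """Split [start, end) into pieces that break on multiples of `interval` since the epoch."""
    pieces = []
    cursor = start
    while cursor < end:
        # First interval boundary strictly after the cursor.
        next_boundary = EPOCH + ((cursor - EPOCH) // interval + 1) * interval
        piece_end = min(next_boundary, end)
        pieces.append((cursor, piece_end))
        cursor = piece_end
    return pieces


start = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 3, 6, 0, tzinfo=timezone.utc)
pieces = split_by_interval(start, end, timedelta(hours=24))
for piece_start, piece_end in pieces:
    print(piece_start.isoformat(), "->", piece_end.isoformat())
```

Because the middle piece covers a whole, fixed day, its result can be cached and reused by any later query that spans the same day; only the partial pieces at the edges need fresh work.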

The savings from running these three as singletons: roughly 4 CPU and 9 GiB of memory across the cluster. Not huge, but real, and the simpler config is its own benefit.

If our query volume ever grew to the point where the brief restart gaps were unacceptable — customer-facing SLO dashboards, automated trading dashboards, anything where 30 seconds of stale data is a problem — we’d revisit. For internal observability, the singletons are the right call. ADR-028 captures the reasoning so the next reviewer can see why and doesn’t have to ask again.


Four Memcached Caches

Mimir’s read path is dominated by S3 fetch time for historical data, just like Loki. The chart bundles four separate memcached deployments, each tuned for a different cache type:

Cache           Replicas  Per-replica  Total  What it caches
chunks-cache    3         32 GB        96 GB  TSDB chunks fetched from S3
index-cache     3         16 GB        48 GB  TSDB block indexes (which chunks to fetch)
metadata-cache  2         4 GB         8 GB   Block metadata (avoids S3 list calls)
results-cache   2         8 GB         16 GB  Final PromQL query results

168 GB of memcached total. Keys are distributed across replicas via consistent hashing, so a single memcached pod going away costs you roughly 1/(replica count) of that cache’s keys while the remaining replicas keep serving.
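The totals, and the cost of losing one pod, can be checked with a few lines of arithmetic. This sketch assumes keys are spread evenly across the replicas of each cache, which consistent hashing only approximates.

```python
# (replicas, GB per replica), taken from the table above
caches = {
    "chunks-cache": (3, 32),
    "index-cache": (3, 16),
    "metadata-cache": (2, 4),
    "results-cache": (2, 8),
}

total_gb = sum(replicas * per_replica for replicas, per_replica in caches.values())
print(f"total: {total_gb} GB")  # total: 168 GB

for name, (replicas, per_replica) in caches.items():
    lost_fraction = 1 / replicas  # assumption: keys are spread evenly
    print(f"{name}: losing one pod drops ~{lost_fraction:.0%} of keys ({per_replica} GB)")
```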

The TTLs are set wider than the chart defaults:

bucket_store:
  chunks_cache:
    subrange_ttl: 336h          # 14 days
    attributes_ttl: 336h
  metadata_cache:
    metafile_content_ttl: 336h
    metafile_attributes_ttl: 336h
    block_index_attributes_ttl: 336h
    chunks_list_ttl: 336h

limits:
  results_cache_ttl: 336h        # 14 days for PromQL result cache

14 days because the cache typically sits below 5% fill. LRU handles eviction long before any single entry hits its TTL. The TTL is just a backstop against a stale entry surviving forever; in practice, eviction runs continuously and entries are fresh.

The result of all this caching: dashboard refreshes that touch unchanged time windows return in milliseconds. The first query against a new window pays the S3 round trip. The second query against the same window — the next dashboard refresh, a different user opening the same dashboard, the alert rule evaluating again — hits cache.


Out-of-Order Samples: The DaemonSet Tax

Standard Prometheus assumes samples arrive in strictly increasing time order per series. The first sample for cpu_usage{node="node-1c"} at 10:00:00 must arrive before the sample at 10:00:30, which must arrive before 10:01:00, and so on.

A DaemonSet scraper makes this assumption uncomfortable. Three Alloy pods scrape three nodes’ kubelets every 30 seconds. The scrape timing is staggered by pod startup, network jitter, and whatever clock skew exists between the kubelets. If a sample for the kubelet on node-1c stamped 10:00:30.000 has already been ingested, and a sample from a different pod for the same series then arrives stamped 10:00:29.950 (a few ms earlier), strict-ordering Mimir rejects the late arrival.

We fix this with:

limits:
  out_of_order_time_window: 5m

Mimir will accept samples arriving up to 5 minutes out of order. The samples get sorted into the right time-series position on ingest. This is a Mimir 2.x feature; it stays off until you set a non-zero window, but it’s safe to turn on and necessary for the DaemonSet scraping pattern.

The 5-minute window is conservative. We could trim it to 30 seconds and still cover all the practical cases. We leave it at 5 minutes because the cost is negligible and any edge case where a scraper pod is briefly stuck recovers gracefully.
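As a mental model, the window check can be sketched like this. It is a deliberately simplified toy that tracks the newest timestamp for a single series; the real TSDB applies its own bounds and bookkeeping, and the class name here is ours.

```python
from datetime import datetime, timedelta


class SeriesOrderCheck:
    """Toy model of one series: remembers the newest timestamp and applies a window."""

    def __init__(self, window: timedelta):
        self.window = window
        self.newest = None

    def accept(self, ts: datetime) -> bool:
        if self.newest is None or ts > self.newest:
            self.newest = ts  # in-order sample, always accepted
            return True
        # Not newer than what we already hold: accept only inside the window.
        return ts >= self.newest - self.window


t = datetime(2024, 1, 1, 10, 0, 30)

strict = SeriesOrderCheck(timedelta(0))
strict.accept(t)
print(strict.accept(t - timedelta(milliseconds=50)))   # False: 50 ms late, no window

relaxed = SeriesOrderCheck(timedelta(minutes=5))
relaxed.accept(t)
print(relaxed.accept(t - timedelta(milliseconds=50)))  # True: inside the 5m window
print(relaxed.accept(t - timedelta(minutes=10)))       # False: outside the window
```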


Capacity Bump: 5M to 8M Series

The max_global_series_per_user cap started at 5,000,000. That gave us ~8x headroom over the 620k steady state, which felt like plenty at the time.

Then the Windows fleet started onboarding.

Each Windows host running the Windows Alloy agent ships ~6,000 metric series (Windows Exporter on a Windows Server with a few standard collectors enabled). Multiply by the 650-host wave we have planned, and the additional series count comes out to roughly 3.86 million. That’s 77% of the 5M cap, before any other growth.

We bumped the cap to 8M to restore the ~50% headroom we want against unplanned label or fleet growth. The number isn’t magic — the principle is “leave roughly half the cap as headroom for things we haven’t planned for.” If you find yourself running at 80% of any global cap, raise the cap. Hitting the cap means dropped samples, and dropped samples means you can’t fully trust your dashboards.

The bump itself is a one-line change. The reason it’s a story is that we wrote it down in the inline comment next to the value:

max_global_series_per_user: 8000000   # Bumped from 5M — Windows onboarding wave
                                       # (650 hosts) projects ~3.86M (77% of 5M);
                                       # 8M restores ~50% headroom.

The inline comment is the part I want to point at. The number “8 million” by itself looks arbitrary. The number with the math next to it is a defensible choice that the next person to read this config can audit and adjust. Two minutes to type, and it answers every “why did you set it to 8?” question forever.
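The same math as a runnable sketch. The 620k and 3.86M figures come from this post; adding them together to get a projected total is our assumption about how the two combine.

```python
STEADY_STATE = 620_000      # active series today
WINDOWS_WAVE = 3_860_000    # projected additional series from the 650-host wave
OLD_CAP, NEW_CAP = 5_000_000, 8_000_000


def utilization(series: int, cap: int) -> float:
    """Fraction of the cap consumed by the given series count."""
    return series / cap


projected_total = STEADY_STATE + WINDOWS_WAVE
print(f"wave alone vs 5M cap:  {utilization(WINDOWS_WAVE, OLD_CAP):.0%}")     # 77%
print(f"projected total vs 5M: {utilization(projected_total, OLD_CAP):.0%}")  # 90%
print(f"projected total vs 8M: {utilization(projected_total, NEW_CAP):.0%}")  # 56%
```

With today's series included, the projected total would sit at roughly 90% of the old cap and 56% of the new one, so the 8M figure leaves a bit under half the cap free.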


Recording Rules

The ruler also runs as a single replica (see the component table). It evaluates recording rules and alerting rules against the metrics in Mimir. Recording rules pre-compute expensive aggregations so dashboards don’t recompute them on every page load.

Our recording rules are simple and focused. From observability/mimir/rules/recording-rules.yaml:

groups:
  - name: recording-rules
    interval: 1m
    rules:
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: |
          sum by (namespace) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )

      - record: namespace:container_memory_working_set_bytes:sum
        expr: |
          sum by (namespace) (
            container_memory_working_set_bytes{container!=""}
          )

      - record: job:up:sum
        expr: |
          sum by (job) (up)

      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{container!=""}[5m])
          )

The naming convention follows the kube-prometheus standard: <grouping_labels>:<metric>:<operation>. The namespace: prefix says “this is grouped by namespace.” The :sum_rate suffix says “this is a sum of a rate.” Anyone reading a dashboard query can look at namespace:container_cpu_usage_seconds_total:sum_rate and know exactly what aggregation was applied.
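If you want to lint rule names against that convention, a minimal sketch might look like the following. The RuleName type and parse_rule_name helper are hypothetical names for illustration, not part of any Mimir or Prometheus tooling.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    level: str       # labels the output is grouped by
    metric: str      # underlying metric name
    operations: str  # aggregation(s) applied to the metric


def parse_rule_name(name: str) -> RuleName:
    """Split a recorded series name into its three colon-separated parts."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    return RuleName(*parts)


parsed = parse_rule_name("namespace:container_cpu_usage_seconds_total:sum_rate")
print(parsed.level, parsed.metric, parsed.operations)
# namespace container_cpu_usage_seconds_total sum_rate
```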

The dashboards that use these recording rules render measurably faster than the same dashboards built on raw container_cpu_usage_seconds_total{}. The ruler does the heavy aggregation once per minute; every dashboard refresh hits the pre-computed series. The compounding effect when you have a dozen panels per dashboard is substantial.

Beyond recording rules, the alerting rules in observability/mimir/rules/*.yaml cover Kubernetes etcd health, node-level issues, workload health (pod restarts, OOM, etc.), platform services (ArgoCD, ESO, networking), backup status, and a meta-alert group that watches the alerting pipeline itself. The alerting rules are worth a separate post; this one is long enough already.


Wrapping Up

Mimir as we run it is more architecturally similar to Loki than it is different — same SimpleScalable shape, same S3 backend, same single-tenant decision, same per-DC values pattern. The interesting bits are the Mimir-specific knobs:

  • ~620k active series at steady state, capped at 8M with intentional headroom for the Windows onboarding wave.
  • Three singleton components (query-frontend, store-gateway, compactor) — flagged during review as a potential SPOF, kept as singletons after ADR-028 documented why each one is correct at this scale.
  • 168 GB of memcached across four caches (chunks, index, metadata, results), TTLs at 14 days because LRU handles eviction long before TTL.
  • out_of_order_time_window: 5m — necessary for DaemonSet scraping where slight timing variance between Alloy pods scraping the same target causes Mimir to reject samples without it.
  • Required pod anti-affinity with both app.kubernetes.io/component AND app.kubernetes.io/name selectors — otherwise the rule collides with tempo-ingester.
  • Recording rules with kube-prometheus naming — pre-compute heavy aggregations once per minute, every dashboard reads the cheap pre-computed series.

Next post we close out the deployment trilogy with Grafana on Kubernetes — backed by CloudNativePG instead of SQLite, and including the real Grafana 12.4.2 → 13.0.1 upgrade we ran last month. The upgrade-doc walkthrough has more sharp edges than you’d expect from “bump the chart version.”

Happy automating!