Where We Left Off
We’ve covered Alloy (collection), Loki deploy, and Loki operations. Mimir is the M in LGTM — the metrics backend that receives every prometheus.remote_write from the Alloy DaemonSet and serves PromQL queries to Grafana.
The shape of Mimir is conceptually identical to Loki: a write path, a read path, S3 in the middle. The component names are different (distributor, ingester, querier, store-gateway, compactor) but the philosophy is the same. The big differences are around sample-level concurrency (Mimir has to handle samples landing slightly out of order from a DaemonSet) and the size of the working set in memory (ingesters hold the active series in RAM until they flush to S3).
Why Mimir for a Net-New Platform
The decision is captured in ADR-008. The short version:
When you have an existing Prometheus deployment and need long-term storage, Thanos is the natural fit. Its sidecar model attaches to your running Prometheus servers without much disruption. You get long-term storage and global query without throwing anything out.
When you’re building net-new, the sidecar model is extra weight. We aren’t sidecaring onto an existing Prometheus — we have Alloy remote_write going directly to a metrics backend. Mimir accepts that traffic on its distributor endpoint, no Prometheus in the middle.
The other big factor: Mimir reads the same way Loki does. Same chart conventions, same simple-scalable deployment pattern, same S3 backend model, same Grafana data source pattern. When the team already learned Loki, they could read a Mimir Helm chart on day one. The cognitive load of running one ecosystem instead of two is real, especially for a team that operates a lot of other things alongside observability.
VictoriaMetrics and Cortex were the other candidates considered:
- VictoriaMetrics: Excellent performance and storage efficiency, but it’s a different operational model with its own query language extensions. Less Grafana-native.
- Cortex: Mimir was forked from Cortex in 2022. Grafana Labs is investing in Mimir; Cortex is in maintenance mode. Pick the active project.
Mimir wins for our situation. If you have existing Prometheus, your answer might be Thanos.
The Architecture We Run
We run Mimir in SimpleScalable mode, same as Loki. The component breakdown:
| Component | Replicas | Role |
|---|---|---|
| Distributor | 2 | Front door for remote_write; hashes series to ingesters |
| Ingester | 3 | Holds active series in memory; flushes TSDB blocks to S3 every 2h |
| Querier | 3 | Executes PromQL across ingesters (recent) + store-gateway (historical) |
| Query-frontend | 1 | Splits queries, caches results, retries |
| Store-gateway | 1 | Serves historical TSDB blocks from S3 |
| Compactor | 1 | Merges small blocks into large ones, dedups samples, enforces retention |
| Ruler | 1 | Evaluates recording rules and alerting rules |
The data flow:
Alloy → distributor → ingester (RAM + WAL) → S3 (Nutanix Objects)
                          ↑
                          |
Grafana → query-frontend → querier → ingester (recent)
                                   → store-gateway (historical from S3)
                                          ↓
                          compactor merges blocks in S3
Distributors are stateless: they hash each incoming series and forward its samples to the ingesters that own it. Ingesters are stateful: each one holds a slice of the active series set, assigned through a hash ring. The replication factor is 3, and with exactly three ingesters that means every sample lands on every ingester. That sounds wasteful, but it’s how Mimir guarantees no data loss across single-ingester restarts.
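To make the ring behaviour concrete, here is a minimal sketch of how a replication factor turns into a set of target ingesters. It is a simplified model, not Mimir's implementation: the token count, the hash function, and the helper names (`token`, `build_ring`, `replicas_for`) are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a string to a 32-bit position on the ring (illustrative hash choice)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=8):
    """Return a sorted list of (token, ingester) pairs."""
    ring = [
        (token(f"{name}-{i}"), name)
        for name in ingesters
        for i in range(tokens_per_ingester)
    ]
    return sorted(ring)


def replicas_for(series: str, ring, replication_factor: int):
    """Walk clockwise from the series' token, collecting distinct ingesters."""
    start = bisect_right(ring, (token(series), ""))
    chosen = []
    for offset in range(len(ring)):
        owner = ring[(start + offset) % len(ring)][1]
        if owner not in chosen:
            chosen.append(owner)
        if len(chosen) == replication_factor:
            break
    return chosen


ring = build_ring(["ingester-0", "ingester-1", "ingester-2"])
series = 'cpu_usage{node="node-1c"}'

all_three = replicas_for(series, ring, replication_factor=3)
only_two = replicas_for(series, ring, replication_factor=2)
print(sorted(all_three), only_two)
```

With three ingesters and a replication factor of 3, the walk always ends up selecting all three ingesters. Only when the ingester count exceeds the replication factor does each ingester hold a strict subset of the series.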
Real Numbers
These are our actual production numbers, as of this writing:
| Setting | Value |
|---|---|
| Active series | ~620,000 |
| Retention | 365 days |
| Ingestion rate cap | 100,000 samples/sec |
| Ingestion burst | 200,000 samples |
| Global series cap | 8,000,000 |
| Ingester memory request | 8 Gi |
| Ingester memory limit | 16 Gi |
| Out-of-order window | 5 minutes |
The 620k active series count is the steady-state for our current scope: three RKE2 nodes per DC, the LGTM stack itself, ArgoCD, CNPG, kube-state-metrics, cAdvisor, kubelet, etcd, apiserver, CoreDNS, the Rubrik exporter, blackbox probes, Telegraf instances feeding network metrics, and a growing fleet of Windows servers via the Windows Alloy agent.
The 8M global series cap is intentionally well above 620k. We’ll cover the bump-up story in a section below.
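As a rough sanity check on the rate cap, the sample rate can be estimated from the active series count. This sketch assumes every series is scraped every 30 seconds; the post only states that interval for kubelet scrapes, so treat the result as an order-of-magnitude estimate rather than a measured number.

```python
ACTIVE_SERIES = 620_000        # from the table above
SCRAPE_INTERVAL_S = 30         # assumption: uniform 30 s scrape interval
INGESTION_RATE_CAP = 100_000   # samples/sec, from the table above
GLOBAL_SERIES_CAP = 8_000_000  # from the table above

# One sample per series per scrape interval.
estimated_rate = ACTIVE_SERIES / SCRAPE_INTERVAL_S
rate_utilisation = estimated_rate / INGESTION_RATE_CAP
series_utilisation = ACTIVE_SERIES / GLOBAL_SERIES_CAP

print(f"~{estimated_rate:,.0f} samples/sec, {rate_utilisation:.0%} of the rate cap")
print(f"series cap utilisation: {series_utilisation:.1%}")
```

Under that assumption the steady-state load sits at roughly a fifth of the ingestion rate cap and well under a tenth of the series cap.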
Helm Values: The Mimir Side
Trimmed for readability. The full thing has more comments about why-we-did-what.
mimir-distributed:
  kafka:
    enabled: false # No Kafka ingest path
  minio:
    enabled: false # We use Nutanix Objects, not bundled MinIO
  rollout_operator:
    enabled: false # Zone-aware replication disabled
  alertmanager:
    enabled: false # Standalone Alertmanager handles routing
  global:
    extraEnvFrom:
      - secretRef:
          name: mimir-s3-credentials
    extraEnv:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://tempo-distributor.observability.svc:4317
      - name: OTEL_TRACES_SAMPLER
        value: parentbased_traceidratio
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.1"
  mimir:
    structuredConfig:
      multitenancy_enabled: false # Single-tenant
      common:
        storage:
          backend: s3
      blocks_storage:
        storage_prefix: blocks
        tsdb:
          dir: /data/tsdb
      limits:
        compactor_blocks_retention_period: 365d
        max_global_series_per_user: 8000000
        ingestion_rate: 100000
        ingestion_burst_size: 200000
        out_of_order_time_window: 5m
        max_global_exemplars_per_user: 100000
      ruler:
        alertmanager_url: http://alertmanager.observability.svc:9093
        external_url: https://grafana.conveyor.internal
      usage_stats:
        enabled: false # Egress NetworkPolicy blocks stats.grafana.org
  ingester:
    replicas: 3
    podDisruptionBudget:
      maxUnavailable: 1
    zoneAwareReplication:
      enabled: false # Single-DC
    persistentVolume:
      storageClass: local-path
      size: 20Gi
    resources:
      requests:
        cpu: 2000m
        memory: 8Gi
      limits:
        cpu: 8000m
        memory: 16Gi
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: ingester
                app.kubernetes.io/name: mimir
            topologyKey: kubernetes.io/hostname
  querier:
    replicas: 3
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  query_frontend:
    replicas: 1
  compactor:
    replicas: 1
    persistentVolume:
      storageClass: local-path
      size: 50Gi
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 8Gi
  store_gateway:
    replicas: 1
    zoneAwareReplication:
      enabled: false
    persistentVolume:
      storageClass: local-path
      size: 20Gi
  distributor:
    replicas: 2
    podDisruptionBudget:
      maxUnavailable: 1
A handful of decisions to call out:
Single-tenant (multitenancy_enabled: false). Same reasoning as Loki — everyone using this Mimir is on the same platform team. No need to isolate ingestion or query.
Required pod anti-affinity on ingesters (app.kubernetes.io/component: ingester and app.kubernetes.io/name: mimir). Note the second selector — without it, the rule would also match tempo-ingester pods because they share the same component label, and you’d end up with bizarre cross-chart scheduling effects.
usage_stats: enabled: false. Mimir tries to phone home to stats.grafana.org every 4 hours by default. Our egress NetworkPolicy blocks that traffic, so the calls were failing and logging warnings on that same schedule. Turning the feature off silenced the noise and saved us from filtering warning logs that meant nothing.
Self-tracing to Tempo at 10% sample. Mimir is one of the things we dogfood our own observability stack on. Mimir’s spans go to our Tempo deployment with service.name=mimir. Unlike Alloy (which hardcoded service.name=alloy and required an OTel processor workaround), Mimir respects the standard OTEL_SERVICE_NAME env var.
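The anti-affinity note above comes down to how `matchLabels` works: a pod matches only if it carries every listed key/value pair. The sketch below models that rule in a few lines. The pod label sets are hypothetical, and the assumption that tempo-ingester pods carry `app.kubernetes.io/name: tempo` is mine, not something taken from the Tempo chart.

```python
def matches(selector: dict, pod_labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical pod labels for illustration only.
mimir_ingester = {
    "app.kubernetes.io/component": "ingester",
    "app.kubernetes.io/name": "mimir",
}
tempo_ingester = {
    "app.kubernetes.io/component": "ingester",
    "app.kubernetes.io/name": "tempo",  # assumption about the Tempo chart's labels
}

component_only = {"app.kubernetes.io/component": "ingester"}
component_and_name = {
    "app.kubernetes.io/component": "ingester",
    "app.kubernetes.io/name": "mimir",
}

print("component only matches tempo:", matches(component_only, tempo_ingester))
print("component + name matches tempo:", matches(component_and_name, tempo_ingester))
print("component + name matches mimir:", matches(component_and_name, mimir_ingester))
```

The single-label selector treats both ingester types as the same group, which is what would let Tempo pods influence Mimir ingester scheduling. Adding the name label narrows the rule to Mimir's own pods.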
The Singleton Components, On Purpose
Three of those Mimir components are running at one replica each: query-frontend, store-gateway, compactor. During release-candidate review, this got flagged as a single point of failure: the platform runs ingesters at 3 replicas and distributors at 2, so why are these three at 1?
The answer is in ADR-028, and the short version is: it’s intentional and correct.
Compactor at 1 replica. Mimir’s compactor uses a hash ring with a single token by default. Running multiple compactors without explicit sharding configuration causes them to compete for the same blocks and corrupt them. The chart default is 1 for this reason. To run multiple compactors safely, you’d enable compactor sharding, which adds configuration and complexity. At our block volume, one compactor with adequate resources catches up easily: a typical compaction cycle finishes well within its budget. Two compactors get you nothing except a risk of block corruption.
Store-gateway at 1 replica. Serves historical blocks from S3. Queries against recent data (which is the majority of queries) are served by ingesters, not store-gateway. A store-gateway restart causes a ~30-second gap in historical query availability but doesn’t affect ingestion or recent queries. Scaling to 2 requires enabling store-gateway sharding and replication, which adds the same complexity-without-benefit problem as the compactor. We accept the brief restart gap.
Query-frontend at 1 replica. Stateless query scheduler and splitter. If it’s unavailable, queriers still function — they just don’t benefit from query splitting and caching. A restart causes ~10–30 seconds of elevated query latency. Scaling to 2 is safe and low-risk but provides minimal benefit at our query volume, which is internal dashboards, not customer-facing traffic.
The savings from running these three as singletons: roughly 4 CPU and 9 GiB of memory across the cluster. Not huge, but real, and the simpler config is its own benefit.
If our query volume ever grew to the point where the brief restart gaps were unacceptable — customer-facing SLO dashboards, automated trading dashboards, anything where 30 seconds of stale data is a problem — we’d revisit. For internal observability, the singletons are the right call. ADR-028 captures the reasoning so the next reviewer can see why and doesn’t have to ask again.
Four Memcached Caches
Mimir’s read path is dominated by S3 fetch time for historical data, just like Loki. The chart bundles four separate memcached deployments, each tuned for a different cache type:
| Cache | Replicas | Per-replica | Total | What it caches |
|---|---|---|---|---|
| chunks-cache | 3 | 32 GB | 96 GB | TSDB chunks fetched from S3 |
| index-cache | 3 | 16 GB | 48 GB | TSDB block indexes (which chunks to fetch) |
| metadata-cache | 2 | 4 GB | 8 GB | Block metadata (avoids S3 list calls) |
| results-cache | 2 | 8 GB | 16 GB | Final PromQL query results |
168 GB of memcached total. Keys are distributed across each cache’s replicas via consistent hashing, so a single memcached pod going away costs you roughly 1/(replica count) of that cache (the keys that hashed to the lost pod) while the remaining replicas keep serving.
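The table's totals and the cost of losing one pod can be checked with a few lines of arithmetic. This sketch assumes keys spread evenly across replicas, which consistent hashing only approximates, so the per-pod loss figures are estimates.

```python
# (replicas, GB per replica) for each cache, taken from the table above.
caches = {
    "chunks-cache": (3, 32),
    "index-cache": (3, 16),
    "metadata-cache": (2, 4),
    "results-cache": (2, 8),
}

totals_gb = {name: replicas * per_gb for name, (replicas, per_gb) in caches.items()}
total_gb = sum(totals_gb.values())

# Assuming an even key spread, one lost pod takes 1/replicas of that cache with it.
loss_fraction = {name: 1 / replicas for name, (replicas, _) in caches.items()}

for name in caches:
    print(f"{name}: {totals_gb[name]} GB total, "
          f"~{loss_fraction[name]:.0%} of keys lost if one pod goes away")
print(f"all caches: {total_gb} GB")
```

The two-replica caches lose about half their keys when a pod restarts, while the three-replica caches lose about a third. That is one reason the larger, more expensive-to-refill caches get the extra replica.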
The TTLs are set wider than the chart defaults:
bucket_store:
  chunks_cache:
    subrange_ttl: 336h # 14 days
    attributes_ttl: 336h
  metadata_cache:
    metafile_content_ttl: 336h
    metafile_attributes_ttl: 336h
    block_index_attributes_ttl: 336h
    chunks_list_ttl: 336h
limits:
  results_cache_ttl: 336h # 14 days for PromQL result cache
14 days because the cache typically sits below 5% fill. LRU handles eviction long before any single entry hits its TTL. The TTL is just a backstop against a stale entry surviving forever; in practice, eviction runs continuously and entries are fresh.
The result of all this caching: dashboard refreshes that touch unchanged time windows return in milliseconds. The first query against a new window pays the S3 round trip. The second query against the same window — the next dashboard refresh, a different user opening the same dashboard, the alert rule evaluating again — hits cache.
Out-of-Order Samples: The DaemonSet Tax
Standard Prometheus assumes samples arrive in strictly increasing time order per series. The first sample for cpu_usage{node="node-1c"} at 10:00:00 must arrive before the sample at 10:00:30, which must arrive before 10:01:00, and so on.
A DaemonSet scraper makes this assumption uncomfortable. Three Alloy pods scrape three nodes’ kubelets every 30 seconds. The scrape timing is staggered by pod startup, network jitter, and whatever clock skew exists between the kubelets. If one pod’s sample for the kubelet on node-1c, timestamped 10:00:30.000, has already been ingested and a sample for the same series timestamped 10:00:29.950 then arrives from a different pod, strict-ordering Mimir rejects the late sample because its timestamp is older than the newest one stored for that series.
We fix this with:
limits:
  out_of_order_time_window: 5m
Mimir will accept samples whose timestamps are up to 5 minutes older than the newest sample already stored for that series. The samples get sorted into the right time-series position on ingest. Out-of-order ingestion is a Mimir 2.x feature and is disabled by default (the window defaults to 0), but it’s safe to turn on and necessary for the DaemonSet scraping pattern.
The 5-minute window is conservative. We could trim it to 30 seconds and still cover all the practical cases. We leave it at 5 minutes because the cost is negligible and any edge case where a scraper pod is briefly stuck recovers gracefully.
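The acceptance rule can be sketched as a toy per-series buffer. This is a deliberately simplified model, not Mimir's TSDB head: the `SeriesBuffer` class is hypothetical, it tracks only the newest timestamp per series, and it ignores details such as duplicate-timestamp handling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SeriesBuffer:
    """Toy per-series buffer; window_ms=0 models strict ordering."""
    window_ms: int
    newest_ts: Optional[int] = None
    samples: Dict[int, float] = field(default_factory=dict)

    def append(self, ts_ms: int, value: float) -> bool:
        """Reject the sample only if it is older than newest_ts - window_ms."""
        if self.newest_ts is not None and ts_ms < self.newest_ts - self.window_ms:
            return False
        self.samples[ts_ms] = value
        if self.newest_ts is None or ts_ms > self.newest_ts:
            self.newest_ts = ts_ms
        return True

    def ordered(self) -> List[Tuple[int, float]]:
        """Samples sorted by timestamp, regardless of arrival order."""
        return sorted(self.samples.items())


BASE = 1_000_000  # arbitrary timestamp in milliseconds

strict = SeriesBuffer(window_ms=0)
strict_results = [strict.append(BASE, 1.0), strict.append(BASE - 50, 2.0)]

tolerant = SeriesBuffer(window_ms=5 * 60 * 1000)  # 5-minute window
tolerant_results = [
    tolerant.append(BASE, 1.0),
    tolerant.append(BASE - 50, 2.0),       # 50 ms late: inside the window
    tolerant.append(BASE - 300_001, 3.0),  # just past 5 minutes: outside
]

print(strict_results, tolerant_results)
print([ts for ts, _ in tolerant.ordered()])
```

In the strict buffer the sample that is 50 ms late is dropped; with the 5-minute window it is accepted and slotted in before the newer sample. Anything older than the window is still rejected.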
Capacity Bump: 5M to 8M Series
The max_global_series_per_user cap started at 5,000,000. That gave us ~8x headroom over the 620k steady state, which felt like plenty at the time.
Then the Windows fleet started onboarding.
Each Windows host running the Windows Alloy agent ships ~6,000 metric series (Windows Exporter on a Windows Server with a few standard collectors enabled). Multiply by the 650-host wave we have planned, and the additional series count comes out to roughly 3.86 million. That’s 77% of the 5M cap, before any other growth.
We bumped the cap to 8M to restore the ~50% headroom we want against unplanned label or fleet growth. The number isn’t magic; the principle is “leave roughly half the cap as headroom for things we haven’t planned for.” If you find yourself running at 80% of any global cap, raise the cap. Hitting the cap means dropped samples, and dropped samples mean you can’t fully trust your dashboards.
The bump itself is a one-line change. The reason it’s a story is that we wrote it down in the inline comment next to the value:
max_global_series_per_user: 8000000 # Bumped from 5M — Windows onboarding wave
# (650 hosts) projects ~3.86M (77% of 5M);
# 8M restores ~50% headroom.
The inline comment is the part I want to point at. The number “8 million” by itself looks arbitrary. The number with the math next to it is a defensible choice that the next person to read this config can audit and adjust. Two minutes to type, and it answers every “why did you set it to 8?” question forever.
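The math in that comment is easy to reproduce and re-run when the inputs change. The sketch below uses the figures quoted in this section. Adding the 620k steady-state baseline on top of the Windows projection is my own extra step, not something the post's comment does, so read the "total" lines as an illustration.

```python
BASELINE_SERIES = 620_000         # current steady state
WINDOWS_HOSTS = 650               # planned onboarding wave
WINDOWS_WAVE_SERIES = 3_860_000   # projection quoted above
OLD_CAP = 5_000_000
NEW_CAP = 8_000_000

per_host = WINDOWS_WAVE_SERIES / WINDOWS_HOSTS   # implied series per host
wave_vs_old_cap = WINDOWS_WAVE_SERIES / OLD_CAP  # the "77% of 5M" figure
wave_vs_new_cap = WINDOWS_WAVE_SERIES / NEW_CAP

# Extra step (assumption): include today's baseline in the projection.
projected_total = BASELINE_SERIES + WINDOWS_WAVE_SERIES
total_vs_old_cap = projected_total / OLD_CAP
total_vs_new_cap = projected_total / NEW_CAP

print(f"implied series per Windows host: ~{per_host:,.0f}")
print(f"wave alone:  {wave_vs_old_cap:.0%} of 5M, {wave_vs_new_cap:.0%} of 8M")
print(f"with baseline: {total_vs_old_cap:.0%} of 5M, {total_vs_new_cap:.0%} of 8M")
```

The wave alone uses just under half of the 8M cap, which matches the ~50% headroom target. With the existing baseline included the projection is closer to 56% of the cap, still comfortably below the 80% threshold mentioned above.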
Recording Rules
The ruler also runs at a single replica (see the component table above). It evaluates recording rules and alerting rules against the metrics in Mimir. Recording rules pre-compute expensive aggregations so dashboards don’t recompute them on every page load.
Our recording rules are simple and focused. From observability/mimir/rules/recording-rules.yaml:
groups:
  - name: recording-rules
    interval: 1m
    rules:
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: |
          sum by (namespace) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
      - record: namespace:container_memory_working_set_bytes:sum
        expr: |
          sum by (namespace) (
            container_memory_working_set_bytes{container!=""}
          )
      - record: job:up:sum
        expr: |
          sum by (job) (up)
      - record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
        expr: |
          sum by (cluster, namespace, pod, container) (
            irate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
The naming convention follows the kube-prometheus standard: <grouping_labels>:<metric>:<operation>. The namespace: prefix says “this is grouped by namespace.” The :sum_rate suffix says “this is a sum of a rate.” Anyone reading a dashboard query can look at namespace:container_cpu_usage_seconds_total:sum_rate and know exactly what aggregation was applied.
The dashboards that use these recording rules render measurably faster than the same dashboards built on raw container_cpu_usage_seconds_total{}. The ruler does the heavy aggregation once per minute; every dashboard refresh hits the pre-computed series. The compounding effect when you have a dozen panels per dashboard is substantial.
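Because the convention is purely positional, a rule name can be split mechanically into its three parts. The helper below is a hypothetical illustration (`parse_rule_name` and `RuleName` are not part of any Mimir or Prometheus tooling); something like it could be used in a lint step to reject rule names that don't follow the pattern.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    grouping: str   # labels the result is grouped by, e.g. "namespace"
    metric: str     # source metric name
    operation: str  # aggregation applied, e.g. "sum_rate"


def parse_rule_name(name: str) -> RuleName:
    """Split a <grouping_labels>:<metric>:<operation> recording rule name."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a three-part recording rule name: {name!r}")
    return RuleName(*parts)


# The four rule names from the file above.
rule_names = [
    "namespace:container_cpu_usage_seconds_total:sum_rate",
    "namespace:container_memory_working_set_bytes:sum",
    "job:up:sum",
    "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate",
]
parsed = [parse_rule_name(name) for name in rule_names]

for rule in parsed:
    print(f"{rule.metric}: {rule.operation} grouped by {rule.grouping}")
```

A raw metric name such as `container_cpu_usage_seconds_total` has no colons, so the same check also distinguishes recorded series from raw ones.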
Beyond recording rules, the alerting rules in observability/mimir/rules/*.yaml cover Kubernetes etcd health, node-level issues, workload health (pod restarts, OOM, etc.), platform services (ArgoCD, ESO, networking), backup status, and a meta-alert group that watches the alerting pipeline itself. The alerting rules are worth a separate post; this one is long enough already.
Wrapping Up
Mimir as we run it is more architecturally similar to Loki than it is different — same SimpleScalable shape, same S3 backend, same single-tenant decision, same per-DC values pattern. The interesting bits are the Mimir-specific knobs:
- ~620k active series at steady state, capped at 8M with intentional headroom for the Windows onboarding wave.
- Three singleton components (query-frontend, store-gateway, compactor) — flagged during review as a potential SPOF, kept as singletons after ADR-028 documented why each one is correct at this scale.
- 168 GB of memcached across four caches (chunks, index, metadata, results), TTLs at 14 days because LRU handles eviction long before TTL.
- out_of_order_time_window: 5m — necessary for DaemonSet scraping where slight timing variance between Alloy pods scraping the same target causes Mimir to reject samples without it.
- Required pod anti-affinity with both app.kubernetes.io/component AND app.kubernetes.io/name selectors — otherwise the rule collides with tempo-ingester.
- Recording rules with kube-prometheus naming — pre-compute heavy aggregations once per minute, every dashboard reads the cheap pre-computed series.
Next post we close out the deployment trilogy with Grafana on Kubernetes — backed by CloudNativePG instead of SQLite, and including the real Grafana 12.4.2 → 13.0.1 upgrade we ran last month. The upgrade-doc walkthrough has more sharp edges than you’d expect from “bump the chart version.”
Happy automating!