Where We Are
The architectural decisions for this platform started in early February 2026 with the first ADRs — RKE2, Calico (later replaced with Canal), MetalLB, ArgoCD hub-spoke. It is now mid-May. A hundred-some days in is a reasonable place to stop and look back honestly.
This is not the year-end retrospective. We haven’t run this for a year. The platform is tagged v0.6.2 (pre-production) in the changelog — v1.0.0 is reserved for the moment we run a full DR drill across both data centers after WestCoastDC comes back online. We’re operating like it’s production because, for The Conveyor’s monitoring needs, it is. But the changelog is honest about the version number.
So this post is a hundred-day checkpoint. What’s actually deployed. What bit us along the way. What surprised me by working better than expected. What we’d do differently if we started over today. And what’s still on the roadmap for the next hundred days.
Every fix and surprise below is real and recorded in the changelog. No fabricated war stories. The detail in places will be deeper than the analogous section in most retrospective posts because the source material is commits and changelog entries, not memory I haven’t earned yet.
What’s Actually Deployed
The high-level shape:
- Two 3-node RKE2 clusters on Cisco UCS M6 hardware. EastCoastDC is live; WestCoastDC is being built and brought online.
- Two Nutanix Objects S3 endpoints holding chunks, blocks, ruler state, and backups across the two DCs.
- Loki in SimpleScalable mode — 3 write / 3 read / 2 backend, single-tenant, 365-day retention with 14 per-stream override rules, 192 GB memcached chunks cache. Detail in article 5 and article 6.
- Mimir in SimpleScalable mode: ~620K active series at steady state, 8M series cap, 365-day retention, and three components (query-frontend, store-gateway, compactor) run as intentional singletons per ADR-028. Detail in article 7.
- Grafana 13.0.1 (Enterprise binary, unlicensed) on CloudNativePG — 3 replicas, Entra ID OIDC + break-glass local admin, Barman backups to Nutanix Objects. Real upgrade walkthrough in article 8.
- Alloy as DaemonSet + Deployment + Deployment — three deployment topologies, one binary, covered in article 3 and article 4.
- Tempo for tracing at 10% sample. Alloy, Mimir, Grafana, ArgoCD all dogfood the trace pipeline.
- Alertmanager routing to Microsoft Teams and ServiceNow.
- GitOps via ArgoCD — 36 applications, hub-spoke topology, EastCoastDC manages both clusters.
- CloudNativePG for Grafana’s database and any other future Postgres-backed workload.
- A dashboard set of ~118 dashboards across 18 domain-organized folders (article 9).
- External Secrets Operator pulling credentials from Azure Key Vault.
- Entra ID OIDC for Grafana auth.
- A growing fleet of network probes (blackbox-exporter against ~50 retail app endpoints, Cisco syslog ingest, switch dial-out telemetry via Telegraf, Meraki via telegraf-meraki, vSphere via telegraf-vsphere, NetApp via Harvest, Pure FlashArray via native scrape).
- Custom exporters for Rubrik backup state, GitHub Actions runner metrics, CrowdStrike host coverage.
We’re at v0.6.2 in changelog terms. About 50 ADRs documenting the structural decisions, ~118 dashboards, ~25 alert rules, and a deployment surface that turns over via PR review and ArgoCD reconcile.
What Bit Us: Five Stories From the Changelog
These are picked from the actual repository CHANGELOG. Every one has a fix in git history and a commit you can read.
1. The Drain Catch-22 on StatefulSets
This is the one that prompted the most cross-cutting fix in the platform.
The setup. Three multi-replica StatefulSets — tempo-ingester, loki-backend, alertmanager — had no pod anti-affinity. Over time, replicas drift onto the same node. Local-path PVCs pin to the node where the pod first scheduled. When you go to drain a node for an OS patch or RKE2 upgrade, the trap closes.
The trap. Drain evicts replica A on node-1c. Replica A goes Pending — its local-path PVC is pinned to node-1c, which is now cordoned, so it can’t reschedule. The PDB (maxUnavailable: 1) now has zero disruption budget. Drain blocks trying to evict replica B from the same node. Eventually drain times out at 10 minutes. You sit there, cluster half-cordoned, wondering if it’s safer to force-drain or revert.
Discovered. OS-patch playbook testing in early May. Two of three tempo-ingester replicas had ended up on node-1c, exactly the scenario described above.
Fix. Required pod anti-affinity (requiredDuringSchedulingIgnoredDuringExecution) with topologyKey: kubernetes.io/hostname on all three charts. Required, not preferred: the scheduler refuses to place two replicas on the same node, full stop. N replicas land on N distinct nodes, so draining a node only ever evicts one replica per StatefulSet, which the PDB allows. The catch-22 becomes structurally impossible.
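A minimal sketch of the required rule in plain mapping form. The label selector is an assumption for illustration; match whatever labels the chart actually puts on its pods, and place the block under whichever values key the chart reads for affinity.

```yaml
# Sketch only: the label key/value below is assumed, not copied from the platform's charts.
affinity:
  podAntiAffinity:
    # "required" makes this a hard scheduling constraint, not a preference.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: alertmanager   # assumed pod label
        # One replica per node: hostname is the spreading domain.
        topologyKey: kubernetes.io/hostname
```

With a hard rule like this, a replica that cannot find a free node stays Pending instead of doubling up, so the replica count must not exceed the number of schedulable nodes.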
Upstream gotcha. The tempo-distributed chart consumes ingester.affinity as a string passed through tpl, not as a YAML mapping. Overriding it with a normal YAML structure does not take effect, and the only hint is a “destination is a table” coalesce warning from Helm. The fix is to use a block-scalar affinity: | in values-common.yaml. The Loki and Alertmanager charts take the normal mapping form, so the same pattern works directly there.
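Here is a sketch of what the block-scalar form might look like. The top-level wrapper key and the label selector are assumptions; the point is only that the value after affinity: is a string, not a nested mapping.

```yaml
# values-common.yaml (sketch). Wrapper key and pod label are assumed for illustration.
tempo-distributed:
  ingester:
    # The "|" keeps everything below as one string, which the chart renders through tpl.
    affinity: |
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: ingester   # assumed pod label
            topologyKey: kubernetes.io/hostname
```

Because the string goes through tpl, template expressions inside it are evaluated at render time. That is also why a plain mapping cannot be merged over it.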
Migration note. Existing pods that violated the new rule weren’t evicted automatically (IgnoredDuringExecution semantics). To force the spread, we deleted the offending pod and its PVC — data loss bounded because Mimir/Loki/Tempo persistent data lives in S3 and ingester WAL replays from there.
2. The Telegraf PDB Falsy-Zero Gotcha
The telegraf-meraki and telegraf-vsphere deployments are 1-replica pollers (you can’t safely scale them — the Meraki API rate-limits at ~10 req/sec org-wide and vCenter would get double-polled). They had minAvailable: 1 PDBs that blocked drain.
The fix should have been one line: minAvailable: 0. Trivial change, right?
The wrapper-chart values file used podDisruptionBudget: as the YAML key. But the upstream influxdata/telegraf chart consumes PDB settings under pdb:. Helm silently dropped our override. The rendered PDBs kept the chart default minAvailable: 1. Drain stayed blocked.
PR #815 fixed that key. PR #818 hit the second problem: the chart’s PDB template uses {{- if .Values.pdb.minAvailable }}, which treats 0 as falsy. With minAvailable: 0 set, the entire minAvailable field gets silently dropped from the rendered manifest, leaving an empty PDB spec that the controller computes as disruptionsAllowed=0 — drain still blocked.
The honest end state was pdb.create: false. A singleton 1-replica deployment doesn’t benefit from a PDB anyway — PDBs only affect voluntary disruption, and a singleton has zero crash protection regardless of what the PDB says.
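The three attempts can be sketched in one values fragment. The top-level telegraf key is an assumed dependency alias in the wrapper chart; the pdb keys are the ones named above.

```yaml
# Wrapper-chart values (sketch). "telegraf" as the dependency alias is an assumption.
telegraf:
  # Attempt 1 (ignored): the upstream chart never reads this key, and Helm drops it without an error.
  # podDisruptionBudget:
  #   minAvailable: 0

  # Attempt 2 (right key, still broken): 0 is falsy in the template's "if", so the field is omitted.
  # pdb:
  #   minAvailable: 0

  # End state: do not render a PDB at all for a singleton poller.
  pdb:
    create: false
```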
Lesson: upstream chart values keys don’t always match what your wrapper expects, and Helm doesn’t tell you when it ignores a value. Render the manifest and check what actually gets deployed. The values you wrote and the spec the controller sees are not the same thing until you verify.
3. Canal CNI DNAT Breaking NetworkPolicy Rules
The platform started on Calico. We hit gNMI dial-in subscribe instability on NX-OS, switched to dial-out (Telegraf via gRPC port 57000), and along the way migrated CNI from Calico to Canal for unrelated reasons (ADR-033). Canal kept the same Calico policy engine but uses kube-proxy for service routing.
The first symptom: cert-manager webhooks logging ~963 timeout errors per hour. Same for Kyverno. ESO was fine.
The cause: NetworkPolicies allowing egress to the Kubernetes API server used ipBlock: <serviceCIDR> rules. Canal’s kube-proxy does DNAT: the destination gets rewritten from 10.96.0.1:443 (the service IP) to a node IP on port 6443 before Calico evaluates the policy. So the rule never matched.
The fix: replace service-CIDR rules with ipBlock: nodeCIDR targeting actual node IPs (10.225.155.0/24 for EastCoastDC, 10.250.155.0/24 for WestCoastDC) on port 6443. The webhook ingress rules needed the same fix — kube-apiserver calls webhooks from node IPs (hostNetwork), not via the service CIDR.
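A sketch of the corrected egress rule for one namespace. The policy name and pod selector are assumptions; the CIDR and port are the EastCoastDC values given above.

```yaml
# Sketch: name and podSelector are hypothetical; CIDR and port come from the text.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-kube-apiserver   # hypothetical name
  namespace: cert-manager
spec:
  podSelector: {}            # assumed: applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            # Node CIDR, not service CIDR: this is the destination the policy engine sees after DNAT.
            cidr: 10.225.155.0/24
      ports:
        - protocol: TCP
          port: 6443
```

The webhook direction would need the mirror image: an ingress rule whose from block uses the same node CIDR, since the API server calls in from node addresses.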
This was an eight-policy fix across the cluster. ESO worked because its NetworkPolicy already had nodeCIDR for unrelated reasons; cert-manager and Kyverno didn’t. ADR-035 captures the broader principle.
Lesson: CNI behavior shows up at the policy layer in non-obvious ways. When network policies don’t behave as expected, packet-capture-by-eyeball is faster than reading documentation. We confirmed the DNAT empirically with a curl from a debug pod (HTTP 000 after timeout = NetPol drop), then traced the rewrite.
4. Loki Self-Tracing Disabled by an Upstream Schema-URL Conflict
We wanted Loki to ship its own traces to Tempo at 10% sample. The config was correct. The pods came up and the trace data never arrived.
The error in Loki’s startup log: failed to initialise trace resource: conflicting Schema URL.
The cause: Loki vendors grafana/dskit. dskit/tracing/otel.go imports semconv "go.opentelemetry.io/otel/semconv/v1.39.0". But Loki also pulls go.opentelemetry.io/otel v1.42.0, whose resource.Default() is built from semconv/v1.40.0. When dskit.NewResource() calls resource.Merge(resource.Default(), resource.NewWithAttributes(semconv.SchemaURL, ...)), the OTel SDK rejects the merge because the two resources declare different schema URLs (1.39 vs 1.40). dskit bubbles the error up; Loki logs it and starts without tracing.
The fix requires a dskit release with matching semconv versions. There isn’t one yet. The workaround in our values-common.yaml is to disable Loki tracing entirely and document the deferral with a multi-line comment explaining the precise mechanism so the next reader doesn’t go on the same investigation.
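The deferral might look roughly like this in values. The loki.tracing.enabled path is assumed from the grafana/loki chart layout, so check it against your chart version; the comment text paraphrases the mechanism described above.

```yaml
# values-common.yaml (sketch). Key path assumed from the grafana/loki chart; verify before use.
loki:
  # Self-tracing is deliberately OFF. This is not a config bug on our side:
  #   - dskit's tracing code builds its resource with the semconv v1.39.0 schema URL
  #   - the OTel SDK pulled in by Loki builds resource.Default() from semconv v1.40.0
  #   - resource.Merge() rejects the mismatch: "conflicting Schema URL"
  # Re-enable once a dskit release ships with matching semconv versions.
  tracing:
    enabled: false
```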
Lesson: vendored dependencies with mismatched transitive versions produce errors that look like config bugs. “Schema URL conflict” sounds like something I did wrong in a config file. It wasn’t. Reading the actual error against the actual library source code is the only way to know.
5. Mimir Built-In Alertmanager Crash-Looping with 603 Restarts
A small one. The Mimir Helm chart enables a built-in Alertmanager component by default. We run a separate, standalone Alertmanager for cross-component routing, so we didn’t need the built-in one. It crash-looped trying to bootstrap because the alertmanager_storage.s3.bucket_name it expected (mimir-alertmanager) didn’t exist — we never created it because we never intended to use this component.
603 restarts, three different alerts firing continuously (PodCrashLooping, StatefulSetReplicaMismatch, AlertmanagerNotificationFailures).
The fix is a three-line values override:
mimir-distributed:
  alertmanager:
    enabled: false
Then remove the now-orphaned alertmanager_storage block.
Lesson: default-on components in third-party charts are easy to ignore until they’re alert-storming your dashboards. Audit the chart values on first deploy. Anything you don’t actively need should be enabled: false. The chart authors don’t know what your platform looks like; you do.
What Surprised Me by Going Smoothly
For balance, the things that worked better than I expected.
The ConfigMap-sidecar dashboard pattern. I expected this to be painful — managing 118 dashboards as JSON files, doing exports cleanly, keeping the IDs and folder mappings honest. In practice, the sidecar Just Works. We’ve deployed dozens of new dashboards since launch. None of them required UI intervention. The friction of going through PR review is real but small, and it pays back in audit-readiness.
The single-tenant decision. Multi-tenancy is the marquee feature of both Loki and Mimir. We turned it off. Every Loki tutorial we read after that decision made us second-guess it. Three months in, single-tenant has not caused us a single problem. The header complexity, the runtime override file, the cross-tenant query syntax — none of it is in our config, none of it has been missed.
Recording rules for dashboard performance. I expected to spend weeks tuning PromQL queries and waiting for slow dashboards. Instead, we added a handful of recording rules with the standard kube-prometheus naming convention, and the dashboards that build on them render in milliseconds. Recording rules are a feature most people don’t reach for early enough.
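For illustration, a recording rule in the Prometheus rule-group format that the Mimir ruler accepts. The group name, rule name, and expression are assumptions and not one of the platform's actual rules. The name follows the level:metric:operations convention.

```yaml
# Sketch: group name, record name, and expression are illustrative assumptions.
groups:
  - name: dashboard-recording-rules   # hypothetical group name
    interval: 1m
    rules:
      # level:metric:operations => aggregated to namespace, source metric, sum of rates
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```

A dashboard panel then queries the precomputed series by its recorded name instead of re-aggregating raw container series on every refresh.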
The S3 backend on Nutanix Objects. I was nervous about an on-prem S3-compatible store handling the throughput of Loki and Mimir simultaneously. After the initial TLS-trust issue (Nutanix Objects uses an internal CA we had to plumb through every pod), Objects has been entirely stable. Zero storage-related incidents since week two. Throughput is fine. Lifecycle policies work. Versioning works. The S3 abstraction is exactly the same as cloud S3 from the application side.
Grafana 13 upgrade. I expected at least one rollback. The pre-upgrade backup was the most-discussed line item in the upgrade plan; we tested the restore procedure before we ran the upgrade for real. In the actual upgrade, the verification checklist caught the memory regression early, we bumped the limit pre-emptively, and the rollout completed without anyone outside the team noticing. The unified-storage migration ran in seconds. The rollback target sat unused.
What We’d Do Differently
A few honest second-guesses.
Plan for Canal-CNI DNAT in NetworkPolicy from day one. The eight-policy fix described above could have been avoided with an “always use node CIDRs, never service CIDRs” rule baked into our policy templates from the start. We’d have saved the eight follow-up commits and the few hours of debugging.
Document the upstream-chart-key inversion gotchas in CONTRIBUTING.md early. The telegraf pdb: vs podDisruptionBudget: mismatch, the affinity: | block scalar for tempo-distributed — these are exactly the kind of footguns that justify the existence of a chart-conventions document. We added inline comments at each site after we hit each problem; consolidating the patterns in a CONTRIBUTING section would have helped the third and fourth incidents.
Audit chart defaults before first deploy. The Mimir built-in Alertmanager crash-looped for weeks before we noticed. A standing review of “what’s enabled by default that we don’t need” would have caught it on day one. Same principle for the Mimir usage_stats.enabled phone-home that was generating log noise against our egress NetworkPolicy.
Set up real backups sooner. Our Postgres backup to Nutanix Objects via Barman came together late; we ran for a couple of weeks with only local WAL on PVC. That gap was pure risk with no upside: we never had to recover, but if we had, we would have been in trouble. Day-one backups are non-negotiable in any production environment, and “we’ll get to it” is the dangerous version of that conversation.
Build the dashboard-cardinality safety net before the new exporter onboarding wave. The Mimir-cardinality dashboard and the topk(10, count by (__name__)) alert came together after the platform was already live. They should have come first. Watching cardinality from day one means catching a misbehaving exporter before it causes an ingester OOM, not after.
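A sketch of what such an alert could look like as a rule group. The expression in the text is shorthand, since count by (__name__) needs a selector to count over. The threshold, names, and the metric_name label are all assumptions. The label_replace step copies the metric name into an ordinary label, because the rule engine strips __name__ from alert labels and the top-10 results would otherwise collide.

```yaml
# Sketch: alert name, threshold, and the metric_name label are assumptions.
groups:
  - name: cardinality-guardrails   # hypothetical group name
    rules:
      - alert: MetricNameCardinalityHigh   # hypothetical alert name
        # Series count per metric name, top 10 only, firing above a placeholder threshold.
        expr: |
          topk(10,
            label_replace(
              count by (__name__) ({__name__=~".+"}),
              "metric_name", "$1", "__name__", "(.+)"
            )
          ) > 50000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.metric_name }} has {{ $value }} active series"
```

Selecting every series is an expensive query, so a rule like this is better evaluated at a long interval than on every ruler tick.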
What’s Still on the Roadmap
The next hundred days have a clear shape. Most of the open work is in two buckets: bringing WestCoastDC fully online, and expanding telemetry into corners we haven’t covered yet.
WestCoastDC observability cluster. EastCoastDC is live. WestCoastDC is being rebuilt. When it comes back online, the dual-write Loki configuration in Alloy (currently commented out with a re-enable comment) gets re-enabled, the cross-DC Loki gossip in Alertmanager activates, and Postgres streaming replication resumes. That’s the gate for the v1.0.0 tag — a full DR drill where we fail Grafana from EastCoastDC to WestCoastDC and serve real query traffic from the standby site.
The rolling RKE2 + host-OS upgrade playbook. Ansible playbooks for safely cycling all three nodes of a cluster through OS patches and RKE2 upgrades without losing observability. Battle-tested through multiple end-to-end runs against the live EastCoastDC cluster — caught a half-dozen bugs along the way (kubeconfig tilde-expansion, check-mode false-positives on read-only kubectl, Pattern 1 race conditions versus kubectl wait, task-name templating quirks). Currently runs hands-off against EastCoastDC; WestCoastDC enters scope after the cluster is back.
OTel Operator and auto-instrumentation. The OTel Operator is already deployed. The next step is operationalizing it — a single Instrumentation CR in the observability namespace covering Java, Python, Node.js, and .NET, with a one-annotation onboarding contract for application teams. Documented in docs/operations/otel-operator-instrumentation.md.
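The onboarding contract could be sketched as follows. The CR name and the exporter endpoint are assumptions; the sampler argument mirrors the 10% sample rate mentioned earlier.

```yaml
# Sketch: CR name and exporter endpoint are hypothetical.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default                 # hypothetical name
  namespace: observability
spec:
  exporter:
    # Assumed OTLP endpoint; some language agents default to OTLP/HTTP and may need a per-language override.
    endpoint: http://alloy.observability.svc:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
---
# Fragment of an application's pod template: the single annotation that opts a Java workload in.
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "observability/default"
```

The other runtimes use sibling annotations (inject-python, inject-nodejs, inject-dotnet) pointing at the same CR.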
AI-assisted log analysis. Locally-hosted LLMs (Ollama with Mistral, LiteLLM as proxy) for log summarization and anomaly detection. Early experiments are promising — an LLM can summarize 10,000 log lines into a useful paragraph faster than a human can scroll through them. Not production yet; on the roadmap for the next quarter.
Tiered alerting. Right now our alert routing is two-tier: critical to PagerDuty, everything else to Microsoft Teams. We want a third tier for informational alerts that should land in a ticket queue (ServiceNow) without paging or chatting. The Alertmanager configuration is in place; the receivers are the work.
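As a rough sketch of the three-tier shape in Alertmanager config. Receiver names, the severity label values, and all URLs and keys are placeholders. The ServiceNow receiver is shown as a generic webhook, which is an assumption about how that integration would be wired.

```yaml
# Sketch: receiver names, severity values, keys, and URLs are all placeholders.
route:
  receiver: teams-default            # tier 2: everything not matched below
  routes:
    - matchers:
        - severity="critical"        # tier 1: page
      receiver: pagerduty-oncall
    - matchers:
        - severity="info"            # tier 3: ticket queue, no page, no chat
      receiver: servicenow-tickets
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME
  - name: teams-default
    msteams_configs:
      - webhook_url: https://teams.example.invalid/webhook
  - name: servicenow-tickets
    webhook_configs:
      - url: https://servicenow.example.invalid/alerts
```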
More exporter coverage. Specifically: the Nutanix Era database service, Move migration appliance state, AppViewX certificate automation, F5 BIG-IP health. Each is a small wrapper exporter or a Telegraf plugin away. The pattern is well-established now (write an exporter, drop a Helm chart in observability/<name>/, scrape via Alloy annotations or label-based discovery, build a dashboard).
On Calling This Production
A note worth being explicit about.
The CHANGELOG tags the platform at v0.6.2 (pre-production). The README says “Architecture and documentation phase. Implementation follows the deployment roadmap.” Both are technically true and both are doing the platform a slight disservice.
We are operating this platform like it is in production. The Conveyor’s monitoring depends on the EastCoastDC cluster running, ingesting telemetry from real infrastructure, evaluating real alert rules, paging real people. Outages here are real outages.
The version tags reflect a higher bar than “operationally relied upon.” The v1.0.0 line is reserved for the moment when both DCs are live, the DR drill has been run end-to-end, the rollback procedures have been exercised against real failure modes, and we can say with audit-grade confidence that the platform behaves as documented under every condition we expect to encounter.
We’re not there yet. WestCoastDC is the gate. After WestCoastDC comes back and the DR drill runs cleanly, the version cuts to v1.0.0 and the language changes. Until then, the platform is in pre-production by version-policy but operationally serving The Conveyor.
I’m calling it that explicitly because I read too many retrospective posts where the language is loose. “We’ve been running this in production for a year” can mean “we’ve had it deployed for a year and somebody touched the dashboard once.” Or it can mean “this has been carrying real load and surviving real incidents for a year and we’ve earned the right to say so.” The version tag in our changelog is the bright line between the two.
A hundred days in, we’ve earned the operational use. We haven’t earned the v1.0.0 tag yet. Both can be true at the same time.
Series Recap
The full reading order, for anyone finding this post first:
- LGTM Stack on Nutanix — Architecture Overview
- Nutanix Objects as the S3 Backend for Loki and Mimir
- Grafana Alloy on Kubernetes — Three Deployments, One Collector
- Alloy in Production — The DaemonSet Config Running The Conveyor’s Observability
- Deploying Loki on Kubernetes — SimpleScalable on Nutanix Objects
- Loki in Production — Labels, Per-Stream Retention, and the LogQL Alerts We Run
- Mimir on Kubernetes — 620K Active Series on Nutanix Objects
- Grafana 13 on CloudNativePG — The Real Upgrade Walkthrough
- Dashboards That Actually Get Used — 118 Across 18 Folders
- You’re here.
Each article builds on the previous. The architecture decisions in article 1 explain the deployment choices in articles 3 through 8. The collection layer in articles 3 and 4 produces the data the deployment articles consume. By the time you reach the dashboard article, you have shared context for every label, retention rule, and recording rule that the dashboards depend on.
If I were recommending an order for someone considering a similar build, I’d suggest reading the architecture overview first, then jumping ahead to this retrospective to understand what we’d do differently before going back through the deployment articles. The deployment posts are more useful when you already know which decisions we’d revisit.
Wrapping Up
A hundred days. About 50 ADRs. About 118 dashboards. About 25 alert rules. One Grafana major version upgrade (12.4.2 → 13.0.1). One CNI migration (Calico → Canal). Several rounds of NetworkPolicy hardening. The drain catch-22 fix across three charts. The telegraf PDB falsy-zero fix. The Loki self-tracing schema-URL deferral. The Mimir series-cap bump from 5M to 8M. The dual-write Loki re-enable comment that’s still waiting on WestCoastDC.
We are not a year in. We are a hundred days in. Whatever calm shows up in our incident response, and whatever speed shows up in our query performance, is not the calm of a mature platform. It is the calm of a platform that has been built carefully enough that the early incidents have already been written down and fixed.
The next hundred days will look different. WestCoastDC comes online. The v1.0.0 line gets crossed. New telemetry sources get onboarded. The alert routing gets a third tier. Maybe AI-assisted log summarization starts paying off. Maybe it doesn’t. We’ll know when we know.
I started this series in April. I’m publishing the last article in May. The platform existed for both the writing and reading of every word above. If you’re considering a similar build at your own organization — regulated industry, on-prem, observability you can own end-to-end — the most useful thing I can offer is the changelog. It’s the receipt for everything I’ve claimed in these ten articles.
Thanks for reading.
Happy automating!