We failed a DR exercise. Not because our systems went down, but because we could not prove they were up. The auditors asked straightforward questions: which backup jobs ran successfully last Tuesday? What was the failover time for the database cluster? Were there any authentication anomalies during the switchover? And we sat there cycling through a dozen vendor portals, trying to piece together an answer from tools that each showed a slice of the picture but none showed the whole thing.
That was the moment I realized we did not have an observability problem. We had an observability absence. We had monitoring (Nutanix Prism Central, VMware vROps, Azure Analytics, backup vendor dashboards, firewall consoles) but no way to correlate any of it, no way to filter signal from noise, and no way for me to review it as often as I needed to. We were reporting on what the vendors thought we needed to see, not what actually mattered.
Building an observability stack is like making a pizza from scratch. You could order delivery and let someone else pick the toppings, but when you need to control every ingredient and the delivery options do not cover your part of town, it is time to build your own kitchen. Five months in, that kitchen is feeding us answers we never had before, and we are just getting started.
What the LGTM Stack Actually Is
LGTM is not a single product. It is a collection of open-source projects from Grafana Labs that each handle one pillar of observability:
| Letter | Component | Role | What It Replaces |
|---|---|---|---|
| L | Loki | Log aggregation | Splunk, ELK/Elasticsearch, CloudWatch Logs |
| G | Grafana | Visualization and dashboards | Kibana, vendor portals, Azure Monitor |
| T | Tempo | Distributed tracing | Jaeger, Zipkin |
| M | Mimir | Long-term metrics storage | Thanos, Cortex, VictoriaMetrics |
And the collectors that move data into the stack:
| Component | Role | What It Replaces |
|---|---|---|
| Alloy | Primary telemetry collection (metrics, logs, traces) | Grafana Agent (deprecated), Promtail, FluentBit |
| Telegraf | Network device telemetry (NX-OS dial-out, vSphere, Meraki) | Vendor-specific collectors |
| NetApp Harvest | Storage array metrics | NetApp ActiveIQ portal |
| Custom Nutanix Exporter | Prism Central inventory and DR readiness metrics | Manual portal checks |
A few things worth calling out before we go deeper:
Loki is not Elasticsearch. Loki indexes labels (metadata), not log content. Compressed log chunks land in object storage. That is why it is dramatically cheaper to operate: you are not paying for full-text indexing on every line. The trade-off is that grep-style searches across unindexed fields are slower. If you label well, you rarely notice.
Mimir is Prometheus for the long haul. Grafana Labs forked CNCF Cortex to build Mimir, stripped years of accumulated technical debt, and added features from Grafana Enterprise Metrics. Horizontally scalable, speaks native Prometheus remote-write, stores metrics in object storage instead of local disk. If you know Prometheus, you know how to feed Mimir.
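If you already run Prometheus or Alloy, feeding Mimir is a one-block change. Here is a minimal sketch of a standard Prometheus remote_write stanza pointed at Mimir's push endpoint; the hostname and tenant ID are placeholders, and the header only matters if multi-tenancy is enabled:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # Mimir's native remote-write endpoint
    headers:
      X-Scope-OrgID: infrastructure              # only needed with multi-tenancy enabled
    queue_config:
      max_samples_per_send: 2000                 # tune for your series volume
```

Alloy's prometheus.remote_write component takes the same endpoint, just expressed in Alloy syntax.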
Alloy replaced Grafana Agent. As of 2024, Grafana Agent went into maintenance mode and Alloy became the recommended collector. Alloy is Grafana Labs' OpenTelemetry Collector distribution with its own configuration language (Alloy syntax, formerly River). It collects metrics, logs, and traces in one binary.
Tempo is in scope. We deploy Tempo in distributed microservices mode with 30-day retention. Tracing was initially out of scope but came back once the core stack stabilized.
Why We Did Not Buy Something
Let me be direct: there was no budget for this. Zero. That is not a negotiating position; that is reality. When I brought up observability gaps after the DR exercise, the answer was not "here is money to fix it." The answer was "figure it out."
So we figured it out. Honestly? The constraints made the solution better.
What We Evaluated
| Option | Why We Looked | Why We Passed |
|---|---|---|
| Nutanix Prism Central | Already deployed, native to platform | Strong for Nutanix-specific metrics; weak for logs and third-party infra |
| VMware vROps | Already deployed for vSphere | Useful for VMware visibility; misses storage arrays, network devices, application logs |
| Azure Analytics | Already a Microsoft shop | Most of our infrastructure is on-prem; egress for everything makes no sense |
| Splunk | The right feature set | Per-GB pricing is a non-starter for our log volume |
| Datadog / New Relic | Mentioned by every consultant | SaaS-only, off-prem data, premium pricing; same dealbreakers as Splunk |
Every vendor portal gave us a piece of the puzzle. None gave us the whole picture. Layering more vendor portals on top was just adding more browser tabs, not more visibility, like trying to make a pizza by ordering five different appetizers and hoping they add up to dinner.
Why Open Source Won
The Grafana community is enormous, the documentation is solid, and the stack runs on infrastructure we already own. We started with recycled compute capacity (Cisco UCS rack nodes that were sitting underutilized) and an open-source stack with an active community behind it. If the project proves its value (spoiler: it already has), there is a path to enterprise features and support from Grafana Labs without ripping anything out. A pizza kitchen that can grow from a food truck to a restaurant without rebuilding the ovens.
Why Nutanix as the Platform
This is not "Nutanix because that is what we run." There are specific technical reasons it is a good fit for this stack.
Nutanix Objects: S3-Compatible Storage That Already Exists
Loki and Mimir both need object storage. In the cloud you would use S3 or Azure Blob. On-prem you need an S3-compatible store. Nutanix Objects gives us exactly that: an S3-compatible service running on infrastructure we already manage, with native Prometheus metrics for monitoring it.
We deploy Objects in both data centers with cross-DC replication for backups and a global load balancer in front for HA. The Nutanix developer community has documented this pattern for configuring Loki with Objects. We used that as a starting point and extended it to Mimir and Tempo. Article 2 of this series covers the full storage architecture.
The configuration in Loki's Helm values is straightforward:

```yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: https://objects.example.com
      region: us-east-1                    # Required but arbitrary for non-AWS S3
      bucketnames: loki-chunks
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3ForcePathStyle: true               # Required for non-AWS S3 endpoints
      insecure: false
```
Mimir uses the same pattern with slightly different keys:
```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: objects.example.com
          region: us-east-1
          access_key_id: ${S3_ACCESS_KEY}
          secret_access_key: ${S3_SECRET_KEY}
          insecure: false
          s3_force_path_style: true
```
Native Prometheus Metrics Endpoint
Nutanix Objects exposes a Prometheus-compatible metrics endpoint through Prism Central. You scrape cluster-level and bucket-level metrics directly:
- Object store metrics: `https://<prism-central>:9440/oss/api/nutanix/metrics`
- Bucket metrics: `https://<prism-central>:9440/oss/api/nutanix/metrics/<store>/<bucket>`
Same stack monitoring its own storage backend. No separate exporter to deploy and maintain.
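Scraping it is ordinary Prometheus configuration. A minimal sketch, assuming you fill in your own Prism Central hostname and whatever authentication your environment requires:

```yaml
scrape_configs:
  - job_name: nutanix-objects
    scheme: https
    metrics_path: /oss/api/nutanix/metrics
    static_configs:
      - targets: ["prism-central.example.com:9440"]   # placeholder hostname
    tls_config:
      insecure_skip_verify: false   # trust the Prism Central certificate properly in production
```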
Full-Stack Hybrid Cloud
Nutanix gives us Kubernetes (RKE2 on Nutanix compute), S3-compatible object storage (Objects), cross-DC replication, and global load balancing. The entire observability platform runs on one stack. Adding capacity means adding a node: storage, compute, and networking scale together.
Architecture Overview
Think of the stack in four layers: sources generate telemetry, Alloy and Telegraf collect and route it, the LGTM backends store it, and Grafana lets you see it. Like a pizza supply chain: farms grow the ingredients, trucks deliver them to the kitchen, the kitchen stores and preps everything, and the counter is where you actually get your slice.
```
                              DATA SOURCES
   Application logs | Network devices | Storage arrays | Nutanix clusters
                                   |
                                   v
                            COLLECTION LAYER
   Grafana Alloy
     - DaemonSet   (pod logs, node metrics, audit logs)
     - Deployment  (syslog, gNMI, SNMP on MetalLB VIP)
     - Deployment  (OTLP traces receiver)
   Telegraf (NX-OS dial-out) | NetApp Harvest | Nutanix Exporter (custom)
                                   |
                                   v
                              STORAGE LAYER
   Loki (logs)              Mimir (metrics)           Tempo (traces)
   Simple Scalable          Simple Scalable           Distributed
               all three write to Nutanix Objects
          (S3-compatible, cross-DC replication + GSLB)
                                   |
                                   v
                        VISUALIZATION + ALERTING
   Grafana
     - PostgreSQL backend via CloudNativePG (cross-DC replication)
     - OIDC authentication via Entra ID
     - Data sources: Loki, Mimir, Tempo
   Alertmanager
     - Cross-DC gossip, routes to Teams + ServiceNow
```
Data Flow
- Infrastructure and applications generate logs, metrics, and traces across two data centers.
- Grafana Alloy collects most of it: DaemonSets tail pod logs and scrape node metrics, a Deployment receives syslog and SNMP from network devices, and another Deployment receives OTLP traces. Telegraf handles Cisco NX-OS dial-out telemetry where Alloy cannot. NetApp Harvest and a custom Nutanix exporter feed storage and infrastructure metrics.
- Loki stores logs, Mimir stores metrics, and Tempo stores traces. All three write to Nutanix Objects, with per-DC buckets and dual-write for resilience.
- Grafana queries all three backends; its own configuration lives in PostgreSQL managed by the CloudNativePG operator (cross-DC streaming replication for DR).
- Alertmanager routes alerts via cross-DC gossip, sending notifications to Microsoft Teams and creating incidents in ServiceNow.
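To make the last step concrete, here is a rough sketch of the Alertmanager routing; receiver names and URLs are placeholders, the ServiceNow side assumes a webhook bridge in front of it, and native msteams_configs needs Alertmanager 0.26 or newer:

```yaml
route:
  receiver: teams-default
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="critical"']
      receiver: servicenow
receivers:
  - name: teams-default
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/placeholder
  - name: servicenow
    webhook_configs:
      # placeholder: an intermediary that turns alert payloads into ServiceNow incidents
      - url: https://snow-bridge.example.com/alerts
```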
Component-by-Component Breakdown
What we are running and how we deploy it. Every service uses a wrapper Helm chart pattern: a local Chart.yaml wrapping the upstream dependency, with shared values plus per-DC overrides. Everything ships through ArgoCD. No manual helm install against production.
| Component | App Version | Helm Chart | Deployment Mode | Notes |
|---|---|---|---|---|
| Grafana | 13.0.1 Enterprise (unlicensed) | grafana-community 12.1.1 | Single replica | PostgreSQL via CloudNativePG, OIDC via Entra ID |
| Loki | 9.3.6 | grafana-community 9.3.6 | Simple Scalable | Write/read/backend split, S3 on Nutanix Objects |
| Mimir | 6.0.6 | grafana/mimir-distributed 6.0.6 | Simple Scalable | ~620K active series, S3 on Nutanix Objects |
| Tempo | v2.9.0 | grafana 1.61.3 | Distributed | 30-day retention, S3 on Nutanix Objects |
| Alloy | 1.6.0 | grafana 1.6.2 | DaemonSet + 2 Deployments | Universal collector for logs, metrics, traces |
| Alertmanager | Upstream | Custom wrapper | Standalone | Cross-DC gossip, Teams + ServiceNow |
| CloudNativePG | 1.28 | cloudnative-pg 0.28.0 | Operator | Primary (3 inst.) + replica (1 inst.) cross-DC |
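To show the wrapper pattern rather than just describe it, here is a sketch of what one of those local Chart.yaml files looks like; the chart version and repository URL are illustrative and should match whatever upstream release you pin:

```yaml
apiVersion: v2
name: loki-wrapper          # our local wrapper chart
version: 0.1.0
dependencies:
  - name: loki              # upstream chart pulled in as a dependency
    version: "6.24.0"       # illustrative; pin to your target release
    repository: https://grafana.github.io/helm-charts
```

Shared values sit next to this file, and ArgoCD layers the per-DC overrides on top.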
A Note on Deployment Modes
Loki and Mimir each support three deployment modes:
- Monolithic: Single binary, all components in one process. Good for dev and test.
- Simple Scalable: A few read/write/backend components. Good for small-to-medium production.
- Distributed (Microservices): Each component runs independently. Best for very large scale.
We run Simple Scalable for both Loki and Mimir. It lets us scale reads and writes independently without managing a dozen separate microservices per component. The medium pizza: enough to feed the table without ordering one of everything on the menu.
Tempo runs in full distributed mode because its architecture benefits from separating distributors, ingesters, queriers, and compactors at our trace volume.
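For Loki, the mode choice in the Helm chart is a single value plus replica counts. A sketch, assuming your chart version exposes the same keys as the current upstream chart:

```yaml
deploymentMode: SimpleScalable
write:
  replicas: 3
backend:
  replicas: 3
read:
  replicas: 2
singleBinary:
  replicas: 0   # explicitly disable the monolithic target
```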
What We Are Actually Collecting
This is not a lab. Real production traffic across two data centers.
Metrics
| Source | Collection Method | What We Get |
|---|---|---|
| Nutanix clusters (10) | Custom Prism Central exporter + Objects Prometheus endpoint | Cluster health, storage utilization, DR readiness |
| NetApp ONTAP (2 clusters) | NetApp Harvest exporter | Volume performance, aggregate capacity, LUN latency |
| Cisco switches (200+) | Telegraf dial-out (gRPC) + SNMPv3 | Interface stats, CPU, memory, optics |
| Cisco IOS-XE | gNMI dial-in via Alloy | Interface counters, environment |
| VMware vSphere | Telegraf vsphere plugin | VM performance, host metrics |
| Meraki cloud | Telegraf meraki plugin | Wireless, switching, security appliance metrics |
| Kubernetes | Alloy DaemonSet (kubelet, cAdvisor, kube-state-metrics) | Pod health, resource usage, node status |
| Windows/Linux servers | Off-cluster Alloy agents | Performance counters, EventLog, IIS, SQL Server, AD |
Multiple storage vendors, multiple network platforms, multiple compute layers: same Mimir, same Grafana dashboards. Before this project, each of those was a separate portal with separate credentials and separate alerts. Now it is one Grafana with dropdowns. Same PromQL, same alert rules. That consolidation alone justified the project before we shipped a single log line.
Logs
| Source | Collection Method |
|---|---|
| Cisco switches | Syslog to Alloy Deployment (MetalLB VIP) |
| Firewalls | Syslog (UDP/514, TLS/6514) |
| Cisco ISE | Syslog |
| Windows/Linux servers | Off-cluster Alloy agents |
| Nutanix CVMs | Syslog (1515/UDP) |
| Kubernetes pods | Alloy DaemonSet tailing container stdout |
| Kubernetes audit logs | File tail with JSON parsing |
Every single one of those sources was previously unmonitored, monitored in a vendor-specific tool I had to log into separately, or generating alerts that mixed real problems with known noise I could not filter out.
Why Not Just Use Prometheus Directly?
Fair question. Prometheus is excellent for short-term metrics and alerting; if your environment is small enough, a single Prometheus server with local storage works fine. But Prometheus was not designed for long-term storage or horizontal scaling. Local TSDB is limited by disk, retention beyond a few weeks gets expensive, and running multiple Prometheus servers for HA means dealing with federation or deduplication.
Mimir solves that. It accepts Prometheus remote-write, stores data in object storage (virtually unlimited capacity), and provides a query frontend with caching. We currently track around 620,000 active series with a configured limit of 5 million. We bumped that limit once when we hit 98% of the original 1.5 million ceiling after onboarding network device metrics. After that, we wired up a MimirHighCardinalitySeries alert at 80% of the new ceiling so the next surprise becomes a planned conversation, not a 3 a.m. page. That kind of organic growth β and the ability to react to it β would have been painful with standalone Prometheus.
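The capacity alert itself is an ordinary Prometheus-style rule. A sketch assuming the Cortex-inherited ingester metric name and our 5 million ceiling; verify the metric against your Mimir version and account for ingester replication before trusting the number:

```yaml
groups:
  - name: mimir-capacity
    rules:
      - alert: MimirHighCardinalitySeries
        # 4,000,000 = 80% of our configured 5M active-series ceiling (illustrative)
        expr: sum(cortex_ingester_active_series) > 4000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Mimir active series are approaching the configured limit
```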
Why Mimir over Thanos or VictoriaMetrics
We did the homework; it is documented as ADR-008. The short version: Thanos shines when you already have Prometheus deployments to retrofit with long-term storage, but its sidecar model adds a Prometheus layer we do not need. Alloy can remote_write straight to Mimir's distributor. VictoriaMetrics is fast and lean, but it is less Grafana-native and has a smaller ecosystem. Mimir uses the same S3 backend model, the same Helm chart conventions, and the same configuration philosophy as Loki: one ecosystem to learn, one set of operational patterns to run. Net-new beats retrofit when you are starting from a clean floor plan.
Honest Trade-Offs
I am not going to pretend this was painless. Building your own kitchen means you wash your own dishes.
What Is Harder Than Buying a Product
- You own the uptime. When something breaks at 3 AM, there is no support ticket to file. You fix it.
- Upgrades are on you. Grafana Labs ships new versions frequently. Keeping current is real work; skipping versions creates upgrade debt.
- Kubernetes is a prerequisite. If your team does not have Kubernetes experience, the LGTM stack is a steep way to learn it.
- Cardinality will surprise you. Understanding which metrics and labels create cardinality explosions is something you learn by hitting limits. We hit 98% of our Mimir series limit before we fully understood what was happening.
- Prioritization is the real challenge. The hardest part of this project has not been implementation; modern tooling makes deployment straightforward. The hard part is deciding what to instrument first when everything is suddenly visible.
What Is Better Than We Expected
- Zero incremental software spend. Recycled hardware, open-source stack, storage we already had.
- Data never leaves our network. Compliance conversations become non-conversations.
- Benefits show up immediately. This was not a "wait six months for value" project. Every component we deployed answered questions we could not answer before.
- The complexity misconception is wrong. People assume self-hosted observability is too complicated, lacks features, or cannot meet real needs. With modern Helm charts, ArgoCD, and the Grafana ecosystem, it is genuinely not a huge lift. The tooling has matured enormously.
- Multi-vendor visibility in one place. NetApp, Nutanix, Cisco, VMware, Windows: same dashboards, same query language, same alert rules. Concretely, 113 dashboards live across 16 categories: 44 NetApp, 12 Nutanix, 11 network, plus VMware, Meraki, Rubrik, Windows, Linux, observability internals, tracing, jobs, backup, HTTP probes, and applications. One Grafana, one login.
The Kubernetes Platform
A quick note on the compute layer since it comes up in every conversation about self-hosted observability: where does this run?
We use RKE2 (Rancher Kubernetes Engine 2) across two data centers, three nodes per cluster. RKE2 is STIG-hardened out of the box, uses embedded etcd for HA, and is systemd-native, so it fits a regulated environment where security baselines matter.
The hardware is recycled Cisco UCS rack servers we could not repurpose for our HCI platform (long story involving hardware compatibility lists). Rather than let them collect dust, they became the observability cluster. Each node has significant compute and memory headroom; we run a converged topology where every node is both control plane and worker.
Persistent volumes use the built-in RKE2 local-path provisioner: local disk, zero licensing cost, and better latency than iSCSI SAN for write-ahead logs. All long-term data goes to Nutanix Objects.
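Consuming it requires nothing exotic; any claim that names the local-path storage class lands on node-local disk. A minimal sketch with placeholder name and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-wal              # placeholder
spec:
  storageClassName: local-path   # the RKE2 local-path provisioner
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi              # placeholder
```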
Both clusters run the same stack with a dual-write architecture for most data pipelines. If one DC goes down, the other has a full copy. Grafana and its PostgreSQL database use an active/standby model with CloudNativePG streaming replication across DCs.
The Full Series Roadmap
This is article 1 of 10. Here is the full menu:
| # | Article | Publish Date | What You Will Learn |
|---|---|---|---|
| 1 | Building an LGTM Stack on Nutanix (this post) | Apr 28 | Architecture overview, component roles, trade-offs |
| 2 | Nutanix Objects as the Storage Backend for Loki and Mimir | May 1 | S3 config, bucket layout, retention, credentials |
| 3a | Grafana Alloy on Kubernetes: Deployment | May 5 | Alloy topology, Helm config, three-deployment pattern |
| 3b | Alloy in Production: Logs, Metrics, and Scaling | May 8 | What we collect, processing pipelines, resource tuning |
| 4a | Deploying Loki on Kubernetes with Nutanix Objects | May 12 | Helm values, Simple Scalable mode, storage config |
| 4b | Loki in Production: Labels, LogQL, and Retention | May 15 | Label strategy, per-stream retention, real queries |
| 5 | Mimir on Kubernetes: Scalable Metrics | May 19 | Simple Scalable setup, series limits, capacity planning |
| 6 | Grafana on Kubernetes with CloudNativePG | May 22 | PostgreSQL backend, cross-DC replication, upgrades |
| 7 | Dashboards That Actually Get Used | May 26 | What we monitor, alert strategy, multi-vendor views |
| 8 | Lessons Learned: What Worked and What Didn't | May 29 | Retrospective, honest failures, what is next |
Each article includes working Helm values, configuration examples, and real pain points you can learn from. This is not theory; it is what runs in production at an organization in a regulated industry.
Before You Start
Common blockers we hit early. Knowing them up front saves an afternoon.
| Symptom | Most Likely Cause | Quick Fix |
|---|---|---|
| "The authorization mechanism is not supported" error from Loki/Mimir to Objects | Virtual-hosted-style addressing against an S3 endpoint that requires path-style | Set s3ForcePathStyle: true (Loki) / s3_force_path_style: true (Mimir); also set a non-empty region |
| Mimir series limit hit unexpectedly | Default limit too low for network device metrics at scale | Plan series budget before onboarding sources; alert at 70% of limit |
| Alloy syslog receiver OOM during a network event | No best_effort mode; defaults too generous | Set max_message_size, connection limits, lower HPA threshold |
| Nutanix CVM syslog parsed as garbage | RFC 3164 strict parser rejects ISO 8601 timestamps | Use raw mode for CVM listener; classify post-hoc |
| MetalLB webhook drift causes ArgoCD OutOfSync | MetalLB rotates webhook certs in-cluster | Add ignoreDifferences blocks to the ArgoCD Application |
Each of these is documented in upstream issues. Pin the relevant ones in your runbook before you go live.
What Is Next
In the next article, we get into the storage foundation: Nutanix Objects as the S3 backend for Loki and Mimir. We walk through bucket layout across two DCs, retention policies, credential management with External Secrets Operator, and the cross-DC replication strategy that simplifies our DR story.
If you want to follow along, you will need:
- A Kubernetes cluster (we use RKE2; any distribution works)
- Nutanix Objects, or another S3-compatible object store (MinIO works for testing)
- Helm 3.x installed
- kubectl access to your cluster
- ArgoCD recommended for GitOps deployment
Happy automating!
This is article 1 of 10 in the LGTM on Nutanix series. Next up: Article 2 β Nutanix Objects as the Storage Backend.