We failed a DR exercise. Not because our systems went down, but because we could not prove they were up. The auditors asked straightforward questions: which backup jobs ran successfully last Tuesday? What was the failover time for the database cluster? Were there any authentication anomalies during the switchover? And we sat there cycling through a dozen vendor portals, trying to piece together an answer from tools that each showed a slice of the picture but none showed the whole thing.
That was the moment I realized we did not have an observability problem. We had an observability absence. We had monitoring (Nutanix Prism Central, VMware vROps, Azure Analytics, backup vendor dashboards, firewall consoles) but no way to correlate any of it, no way to filter signal from noise, and no way for me to review it as often as I needed to. We were reporting on what the vendors thought we needed to see, not what actually mattered.
Building an observability stack is like making a pizza from scratch. You could order delivery and let someone else pick the toppings, but when you need to control every ingredient and the delivery options do not cover your part of town, it is time to build your own kitchen. Five months in, that kitchen is feeding us answers we never had before, and we are just getting started.
What the LGTM Stack Actually Is
LGTM is not a single product. It is a collection of open-source projects from Grafana Labs that each handle one pillar of observability:
| Letter | Component | Role | What It Replaces |
|---|---|---|---|
| L | Loki | Log aggregation | Splunk, ELK/Elasticsearch, CloudWatch Logs |
| G | Grafana | Visualization and dashboards | Kibana, vendor portals, Azure Monitor |
| T | Tempo | Distributed tracing | Jaeger, Zipkin |
| M | Mimir | Long-term metrics storage | Thanos, Cortex, VictoriaMetrics |
And the collectors that move data into the stack:
| Component | Role | What It Replaces |
|---|---|---|
| Alloy | Primary telemetry collection (metrics, logs, traces) | Grafana Agent (deprecated), Promtail, FluentBit |
| Telegraf | Network device telemetry (NX-OS dial-out, vSphere, Meraki) | Vendor-specific collectors |
| NetApp Harvest | Storage array metrics | NetApp ActiveIQ portal |
| Custom Nutanix Exporter | Prism Central inventory and DR readiness metrics | Manual portal checks |
A few things worth calling out before we go deeper:
Loki is not Elasticsearch. Loki indexes labels (metadata), not log content. Compressed log chunks land in object storage. That is why it is dramatically cheaper to operate: you are not paying for full-text indexing on every line. The trade-off is that grep-style searches across unindexed fields are slower. If you label well, you rarely notice.
Mimir is Prometheus for the long haul. Grafana Labs forked CNCF Cortex to build Mimir, stripped years of accumulated technical debt, and added features from Grafana Enterprise Metrics. Horizontally scalable, speaks native Prometheus remote-write, stores metrics in object storage instead of local disk. If you know Prometheus, you know how to feed Mimir.
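If you already run Prometheus or Alloy, feeding Mimir is a one-block change. Here is a minimal sketch of a standard Prometheus remote_write stanza pointed at Mimir's push endpoint; the hostname and tenant ID are placeholders, and the header only matters if multi-tenancy is enabled:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # Mimir's native remote-write endpoint
    headers:
      X-Scope-OrgID: infrastructure              # only needed with multi-tenancy enabled
    queue_config:
      max_samples_per_send: 2000                 # tune for your series volume
```

Alloy's prometheus.remote_write component takes the same endpoint, just expressed in Alloy syntax.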
Alloy replaced Grafana Agent. As of 2024, Grafana Agent went into maintenance mode and Alloy became the recommended collector. Alloy is Grafana Labs' OpenTelemetry Collector distribution with its own configuration language (Alloy syntax, formerly River). It collects metrics, logs, and traces in one binary.
Tempo is in scope. We deploy Tempo in distributed microservices mode with 30-day retention. Tracing was initially out of scope but came back once the core stack stabilized.
Why We Did Not Buy Something
Let me be direct: there was no budget for this. Zero. That is not a negotiating position; that is reality. When I brought up observability gaps after the DR exercise, the answer was not "here is money to fix it." The answer was "figure it out."
So we figured it out. Honestly? The constraints made the solution better.
What We Evaluated
| Option | Why We Looked | Why We Passed |
|---|---|---|
| Nutanix Prism Central | Already deployed, native to platform | Strong for Nutanix-specific metrics; weak for logs and third-party infra |
| VMware vROps | Already deployed for vSphere | Useful for VMware visibility; misses storage arrays, network devices, application logs |
| Azure Analytics | Already a Microsoft shop | Most of our infrastructure is on-prem; egress for everything makes no sense |
| Splunk | The right feature set | Per-GB pricing is a non-starter for our log volume |
| Datadog / New Relic | Mentioned by every consultant | SaaS-only, off-prem data, premium pricing; same dealbreakers as Splunk |
Every vendor portal gave us a piece of the puzzle. None gave us the whole picture. Layering more vendor portals on top was just adding more browser tabs, not more visibility, like trying to make a pizza by ordering five different appetizers and hoping they add up to dinner.
Why Open Source Won
The Grafana community is enormous, the documentation is solid, and the stack runs on infrastructure we already own. We started with recycled compute capacity (Cisco UCS rack nodes that were sitting underutilized) and an open-source stack with an active community behind it. If the project proves its value (spoiler: it already has), there is a path to enterprise features and support from Grafana Labs without ripping anything out. A pizza kitchen that can grow from a food truck to a restaurant without rebuilding the ovens.
Why Nutanix as the Platform
This is not "Nutanix because that is what we run." There are specific technical reasons it is a good fit for this stack.
Nutanix Objects: S3-Compatible Storage That Already Exists
Loki and Mimir both need object storage. In the cloud you would use S3 or Azure Blob. On-prem you need an S3-compatible store. Nutanix Objects gives us exactly that: an S3-compatible service running on infrastructure we already manage, with native Prometheus metrics for monitoring it.
We deploy Objects in both data centers with cross-DC replication for backups and a global load balancer in front for HA. The Nutanix developer community has documented this pattern for configuring Loki with Objects. We used that as a starting point and extended it to Mimir and Tempo. Article 2 of this series covers the full storage architecture.
The configuration in Loki's Helm values is straightforward:

```yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: https://objects.example.com
      region: us-east-1                    # Required but arbitrary for non-AWS S3
      bucketnames: loki-chunks
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3ForcePathStyle: true               # Required for non-AWS S3 endpoints
      insecure: false
```
Mimir uses the same pattern with slightly different keys:
```yaml
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: objects.example.com
          region: us-east-1
          access_key_id: ${S3_ACCESS_KEY}
          secret_access_key: ${S3_SECRET_KEY}
          insecure: false
          s3_force_path_style: true
```
Native Prometheus Metrics Endpoint
Nutanix Objects exposes a Prometheus-compatible metrics endpoint through Prism Central. You scrape cluster-level and bucket-level metrics directly:
- Object store metrics: `https://<prism-central>:9440/oss/api/nutanix/metrics`
- Bucket metrics: `https://<prism-central>:9440/oss/api/nutanix/metrics/<store>/<bucket>`
Same stack monitoring its own storage backend. No separate exporter to deploy and maintain.
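Scraping it is ordinary Prometheus configuration. A minimal sketch, assuming you fill in your own Prism Central hostname and whatever authentication your environment requires:

```yaml
scrape_configs:
  - job_name: nutanix-objects
    scheme: https
    metrics_path: /oss/api/nutanix/metrics
    static_configs:
      - targets: ["prism-central.example.com:9440"]   # placeholder hostname
    tls_config:
      insecure_skip_verify: false   # trust the Prism Central certificate properly in production
```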
Full-Stack Hybrid Cloud
Nutanix gives us Kubernetes (RKE2 on Nutanix compute), S3-compatible object storage (Objects), cross-DC replication, and global load balancing. The entire observability platform runs on one stack. Adding capacity means adding a node: storage, compute, and networking scale together.
Architecture Overview
Think of the stack in four layers: sources generate telemetry, Alloy and Telegraf collect and route it, the LGTM backends store it, and Grafana lets you see it. Like a pizza supply chain: farms grow the ingredients, trucks deliver them to the kitchen, the kitchen stores and preps everything, and the counter is where you actually get your slice.
```
                              DATA SOURCES
   Application logs | Network devices | Storage arrays | Nutanix clusters
                                   |
                                   v
                            COLLECTION LAYER
   Grafana Alloy
     - DaemonSet   (pod logs, node metrics, audit logs)
     - Deployment  (syslog, gNMI, SNMP on MetalLB VIP)
     - Deployment  (OTLP traces receiver)
   Telegraf (NX-OS dial-out) | NetApp Harvest | Nutanix Exporter (custom)
                                   |
                                   v
                              STORAGE LAYER
   Loki (logs)              Mimir (metrics)           Tempo (traces)
   Simple Scalable          Simple Scalable           Distributed
               all three write to Nutanix Objects
          (S3-compatible, cross-DC replication + GSLB)
                                   |
                                   v
                        VISUALIZATION + ALERTING
   Grafana
     - PostgreSQL backend via CloudNativePG (cross-DC replication)
     - OIDC authentication via Entra ID
     - Data sources: Loki, Mimir, Tempo
   Alertmanager
     - Cross-DC gossip, routes to Teams + ServiceNow
```
Data Flow
- Infrastructure and applications generate logs, metrics, and traces across two data centers.
- Grafana Alloy collects most of it: DaemonSets tail pod logs and scrape node metrics, a Deployment receives syslog and SNMP from network devices, and another Deployment receives OTLP traces. Telegraf handles Cisco NX-OS dial-out telemetry where Alloy cannot. NetApp Harvest and a custom Nutanix exporter feed storage and infrastructure metrics.
- Loki stores logs, Mimir stores metrics, and Tempo stores traces. All three write to Nutanix Objects, with per-DC buckets and dual-write for resilience.
- Grafana queries all three backends; its own configuration lives in PostgreSQL managed by the CloudNativePG operator (cross-DC streaming replication for DR).
- Alertmanager routes alerts via cross-DC gossip, sending notifications to Microsoft Teams and creating incidents in ServiceNow.
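To make the last step concrete, here is a rough sketch of the Alertmanager routing; receiver names and URLs are placeholders, the ServiceNow side assumes a webhook bridge in front of it, and native msteams_configs needs Alertmanager 0.26 or newer:

```yaml
route:
  receiver: teams-default
  group_by: [alertname, cluster]
  routes:
    - matchers: ['severity="critical"']
      receiver: servicenow
receivers:
  - name: teams-default
    msteams_configs:
      - webhook_url: https://example.webhook.office.com/webhookb2/placeholder
  - name: servicenow
    webhook_configs:
      # placeholder: an intermediary that turns alert payloads into ServiceNow incidents
      - url: https://snow-bridge.example.com/alerts
```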
Component-by-Component Breakdown
What we are running and how we deploy it. Every service uses a wrapper Helm chart pattern: a local Chart.yaml wrapping the upstream dependency, with shared values plus per-DC overrides. Everything ships through ArgoCD. No manual helm install against production.
| Component | App Version | Helm Chart | Deployment Mode | Notes |
|---|---|---|---|---|
| Grafana | 13.0.1 Enterprise (unlicensed) | grafana-community 12.1.1 | Single replica | PostgreSQL via CloudNativePG, OIDC via Entra ID |
| Loki | 9.3.6 | grafana-community 9.3.6 | Simple Scalable | Write/read/backend split, S3 on Nutanix Objects |
| Mimir | 6.0.6 | grafana/mimir-distributed 6.0.6 | Simple Scalable | ~620K active series, S3 on Nutanix Objects |
| Tempo | v2.9.0 | grafana 1.61.3 | Distributed | 30-day retention, S3 on Nutanix Objects |
| Alloy | 1.6.0 | grafana 1.6.2 | DaemonSet + 2 Deployments | Universal collector for logs, metrics, traces |
| Alertmanager | Upstream | Custom wrapper | Standalone | Cross-DC gossip, Teams + ServiceNow |
| CloudNativePG | 1.28 | cloudnative-pg 0.28.0 | Operator | Primary (3 inst.) + replica (1 inst.) cross-DC |
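To show the wrapper pattern rather than just describe it, here is a sketch of what one of those local Chart.yaml files looks like; the chart version and repository URL are illustrative and should match whatever upstream release you pin:

```yaml
apiVersion: v2
name: loki-wrapper          # our local wrapper chart
version: 0.1.0
dependencies:
  - name: loki              # upstream chart pulled in as a dependency
    version: "6.24.0"       # illustrative; pin to your target release
    repository: https://grafana.github.io/helm-charts
```

Shared values sit next to this file, and ArgoCD layers the per-DC overrides on top.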
A Note on Deployment Modes
Loki and Mimir each support three deployment modes:
- Monolithic: Single binary, all components in one process. Good for dev and test.
- Simple Scalable: A few read/write/backend components. Good for small-to-medium production.
- Distributed (Microservices): Each component runs independently. Best for very large scale.
We run Simple Scalable for both Loki and Mimir. It lets us scale reads and writes independently without managing a dozen separate microservices per component. The medium pizza: enough to feed the table without ordering one of everything on the menu.
Tempo runs in full distributed mode because its architecture benefits from separating distributors, ingesters, queriers, and compactors at our trace volume.
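For Loki, the mode choice in the Helm chart is a single value plus replica counts. A sketch, assuming your chart version exposes the same keys as the current upstream chart:

```yaml
deploymentMode: SimpleScalable
write:
  replicas: 3
backend:
  replicas: 3
read:
  replicas: 2
singleBinary:
  replicas: 0   # explicitly disable the monolithic target
```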
What We Are Actually Collecting
This is not a lab. Real production traffic across two data centers.
Metrics
| Source | Collection Method | What We Get |
|---|---|---|
| Nutanix clusters (10) | Custom Prism Central exporter + Objects Prometheus endpoint | Cluster health, storage utilization, DR readiness |
| NetApp ONTAP (2 clusters) | NetApp Harvest exporter | Volume performance, aggregate capacity, LUN latency |
| Cisco switches (200+) | Telegraf dial-out (gRPC) + SNMPv3 | Interface stats, CPU, memory, optics |
| Cisco IOS-XE | gNMI dial-in via Alloy | Interface counters, environment |
| VMware vSphere | Telegraf vsphere plugin | VM performance, host metrics |
| Meraki cloud | Telegraf meraki plugin | Wireless, switching, security appliance metrics |
| Kubernetes | Alloy DaemonSet (kubelet, cAdvisor, kube-state-metrics) | Pod health, resource usage, node status |
| Windows/Linux servers | Off-cluster Alloy agents | Performance counters, EventLog, IIS, SQL Server, AD |
Multiple storage vendors, multiple network platforms, multiple compute layers: same Mimir, same Grafana dashboards. Before this project, each of those was a separate portal with separate credentials and separate alerts. Now it is one Grafana with dropdowns. Same PromQL, same alert rules. That consolidation alone justified the project before we shipped a single log line.
Logs
| Source | Collection Method |
|---|---|
| Cisco switches | Syslog to Alloy Deployment (MetalLB VIP) |
| Firewalls | Syslog (UDP/514, TLS/6514) |
| Cisco ISE | Syslog |
| Windows/Linux servers | Off-cluster Alloy agents |
| Nutanix CVMs | Syslog (1515/UDP) |
| Kubernetes pods | Alloy DaemonSet tailing container stdout |
| Kubernetes audit logs | File tail with JSON parsing |
Every single one of those sources was previously unmonitored, monitored in a vendor-specific tool I had to log into separately, or generating alerts that mixed real problems with known noise I could not filter out.
Why Not Just Use Prometheus Directly?
Fair question. Prometheus is excellent for short-term metrics and alerting; if your environment is small enough, a single Prometheus server with local storage works fine. But Prometheus was not designed for long-term storage or horizontal scaling. Local TSDB is limited by disk, retention beyond a few weeks gets expensive, and running multiple Prometheus servers for HA means dealing with federation or deduplication.
Mimir solves that. It accepts Prometheus remote-write, stores data in object storage (virtually unlimited capacity), and provides a query frontend with caching. We currently track around 620,000 active series with a configured limit of 5 million. We bumped that limit once when we hit 98% of the original 1.5 million ceiling after onboarding network device metrics. After that, we wired up a MimirHighCardinalitySeries alert at 80% of the new ceiling so the next surprise becomes a planned conversation, not a 3 a.m. page. That kind of organic growth β and the ability to react to it β would have been painful with standalone Prometheus.
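The capacity alert itself is an ordinary Prometheus-style rule. A sketch assuming the Cortex-inherited ingester metric name and our 5 million ceiling; verify the metric against your Mimir version and account for ingester replication before trusting the number:

```yaml
groups:
  - name: mimir-capacity
    rules:
      - alert: MimirHighCardinalitySeries
        # 4,000,000 = 80% of our configured 5M active-series ceiling (illustrative)
        expr: sum(cortex_ingester_active_series) > 4000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Mimir active series are approaching the configured limit
```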
Why Mimir over Thanos or VictoriaMetrics
We did the homework; it is documented as ADR-008. The short version: Thanos shines when you already have Prometheus deployments to retrofit with long-term storage, but its sidecar model adds a Prometheus layer we do not need. Alloy can remote_write straight to Mimir's distributor. VictoriaMetrics is fast and lean, but it is less Grafana-native and has a smaller ecosystem. Mimir uses the same S3 backend model, the same Helm chart conventions, and the same configuration philosophy as Loki: one ecosystem to learn, one set of operational patterns to run. Net-new beats retrofit when you are starting from a clean floor plan.
Honest Trade-Offs
I am not going to pretend this was painless. Building your own kitchen means you wash your own dishes.
What Is Harder Than Buying a Product
- You own the uptime. When something breaks at 3 AM, there is no support ticket to file. You fix it.
- Upgrades are on you. Grafana Labs ships new versions frequently. Keeping current is real work; skipping versions creates upgrade debt.
- Kubernetes is a prerequisite. If your team does not have Kubernetes experience, the LGTM stack is a steep way to learn it.
- Cardinality will surprise you. Understanding which metrics and labels create cardinality explosions is something you learn by hitting limits. We hit 98% of our Mimir series limit before we fully understood what was happening.
- Prioritization is the real challenge. The hardest part of this project has not been implementation; modern tooling makes deployment straightforward. The hard part is deciding what to instrument first when everything is suddenly visible.
What Is Better Than We Expected
- Zero incremental software spend. Recycled hardware, open-source stack, storage we already had.
- Data never leaves our network. Compliance conversations become non-conversations.
- Benefits show up immediately. This was not a "wait six months for value" project. Every component we deployed answered questions we could not answer before.
- The complexity misconception is wrong. People assume self-hosted observability is too complicated, lacks features, or cannot meet real needs. With modern Helm charts, ArgoCD, and the Grafana ecosystem, it is genuinely not a huge lift. The tooling has matured enormously.
- Multi-vendor visibility in one place. NetApp, Nutanix, Cisco, VMware, Windows: same dashboards, same query language, same alert rules. Concretely, 113 dashboards live across 16 categories: 44 NetApp, 12 Nutanix, 11 network, plus VMware, Meraki, Rubrik, Windows, Linux, observability internals, tracing, jobs, backup, HTTP probes, and applications. One Grafana, one login.
The Kubernetes Platform
A quick note on the compute layer since it comes up in every conversation about self-hosted observability: where does this run?
We use RKE2 (Rancher Kubernetes Engine 2) across two data centers, three nodes per cluster. RKE2 is STIG-hardened out of the box, uses embedded etcd for HA, and is systemd-native, so it fits a regulated environment where security baselines matter.
The hardware is recycled Cisco UCS rack servers we could not repurpose for our HCI platform (long story involving hardware compatibility lists). Rather than let them collect dust, they became the observability cluster. Each node has significant compute and memory headroom; we run a converged topology where every node is both control plane and worker.
Persistent volumes use the built-in RKE2 local-path provisioner: local disk, zero licensing cost, and better latency than iSCSI SAN for write-ahead logs. All long-term data goes to Nutanix Objects.
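Consuming it requires nothing exotic; any claim that names the local-path storage class lands on node-local disk. A minimal sketch with placeholder name and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-wal              # placeholder
spec:
  storageClassName: local-path   # the RKE2 local-path provisioner
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi              # placeholder
```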
Both clusters run the same stack with a dual-write architecture for most data pipelines. If one DC goes down, the other has a full copy. Grafana and its PostgreSQL database use an active/standby model with CloudNativePG streaming replication across DCs.
The Full Series Roadmap
This is article 1 of 10. Here is the full menu:
| # | Article | Publish Date | What You Will Learn |
|---|---|---|---|
| 1 | Building an LGTM Stack on Nutanix (this post) | Apr 28 | Architecture overview, component roles, trade-offs |
| 2 | Nutanix Objects as the Storage Backend for Loki and Mimir | May 1 | S3 config, bucket layout, retention, credentials |
| 3a | Grafana Alloy on Kubernetes: Deployment | May 5 | Alloy topology, Helm config, three-deployment pattern |
| 3b | Alloy in Production: Logs, Metrics, and Scaling | May 8 | What we collect, processing pipelines, resource tuning |
| 4a | Deploying Loki on Kubernetes with Nutanix Objects | May 12 | Helm values, Simple Scalable mode, storage config |
| 4b | Loki in Production: Labels, LogQL, and Retention | May 15 | Label strategy, per-stream retention, real queries |
| 5 | Mimir on Kubernetes: Scalable Metrics | May 19 | Simple Scalable setup, series limits, capacity planning |
| 6 | Grafana on Kubernetes with CloudNativePG | May 22 | PostgreSQL backend, cross-DC replication, upgrades |
| 7 | Dashboards That Actually Get Used | May 26 | What we monitor, alert strategy, multi-vendor views |
| 8 | Lessons Learned: What Worked and What Didn't | May 29 | Retrospective, honest failures, what is next |
Each article includes working Helm values, configuration examples, and real pain points you can learn from. This is not theory; it is what runs in production at an organization in a regulated industry.
Before You Start
Common blockers we hit early. Knowing them up front saves an afternoon.
| Symptom | Most Likely Cause | Quick Fix |
|---|---|---|
| "The authorization mechanism is not supported" error from Loki/Mimir to Objects | Virtual-hosted-style addressing against an S3 endpoint that requires path-style | Set s3ForcePathStyle: true (Loki) / s3_force_path_style: true (Mimir); also set a non-empty region |
| Mimir series limit hit unexpectedly | Default limit too low for network device metrics at scale | Plan series budget before onboarding sources; alert at 70% of limit |
| Alloy syslog receiver OOM during a network event | No best_effort mode; defaults too generous | Set max_message_size, connection limits, lower HPA threshold |
| Nutanix CVM syslog parsed as garbage | RFC 3164 strict parser rejects ISO 8601 timestamps | Use raw mode for CVM listener; classify post-hoc |
| MetalLB webhook drift causes ArgoCD OutOfSync | MetalLB rotates webhook certs in-cluster | Add ignoreDifferences blocks to the ArgoCD Application |
Each of these is documented in upstream issues. Pin the relevant ones in your runbook before you go live.
What Is Next
In the next article, we get into the storage foundation: Nutanix Objects as the S3 backend for Loki and Mimir. We walk through bucket layout across two DCs, retention policies, credential management with External Secrets Operator, and the cross-DC replication strategy that simplifies our DR story.
If you want to follow along, you will need:
- A Kubernetes cluster (we use RKE2; any distribution works)
- Nutanix Objects, or another S3-compatible object store (MinIO works for testing)
- Helm 3.x installed
- kubectl access to your cluster
- ArgoCD recommended for GitOps deployment
Happy automating!
This is article 1 of 10 in the LGTM on Nutanix series. Next up: Article 2 β Nutanix Objects as the Storage Backend.