
Building an LGTM Observability Stack on Nutanix: Why We Did It and What It Looks Like


We failed a DR exercise. Not because our systems went down β€” because we could not prove they were up. The auditors asked straightforward questions: which backup jobs ran successfully last Tuesday? What was the failover time for the database cluster? Were there any authentication anomalies during the switchover? And we sat there cycling through a dozen vendor portals, trying to piece together an answer from tools that each showed a slice of the picture but none showed the whole thing.

That was the moment I realized we did not have an observability problem. We had an observability absence. We had monitoring β€” Nutanix Prism Central, VMware vROps, Azure Analytics, backup vendor dashboards, firewall consoles β€” but no way to correlate any of it, no way to filter signal from noise, and no way for me to review it as often as I needed to. We were reporting on what the vendors thought we needed to see, not what actually mattered.

Building an observability stack is like building a pizza from scratch. You could order delivery and let someone else pick the toppings, but when you need to control every ingredient and the delivery options do not cover your part of town, it is time to build your own kitchen. Five months in, that kitchen is feeding us answers we never had before β€” and we are just getting started.


What the LGTM Stack Actually Is

LGTM is not a single product. It is a collection of open-source projects from Grafana Labs that each handle one pillar of observability:

| Letter | Component | Role | What It Replaces |
|--------|-----------|------|------------------|
| L | Loki | Log aggregation | Splunk, ELK/Elasticsearch, CloudWatch Logs |
| G | Grafana | Visualization and dashboards | Kibana, vendor portals, Azure Monitor |
| T | Tempo | Distributed tracing | Jaeger, Zipkin |
| M | Mimir | Long-term metrics storage | Thanos, Cortex, VictoriaMetrics |

And the collectors that move data into the stack:

| Component | Role | What It Replaces |
|-----------|------|------------------|
| Alloy | Primary telemetry collection (metrics, logs, traces) | Grafana Agent (deprecated), Promtail, FluentBit |
| Telegraf | Network device telemetry (NX-OS dial-out, vSphere, Meraki) | Vendor-specific collectors |
| NetApp Harvest | Storage array metrics | NetApp ActiveIQ portal |
| Custom Nutanix Exporter | Prism Central inventory and DR readiness metrics | Manual portal checks |

A few things worth calling out before we go deeper:

Loki is not Elasticsearch. Loki indexes labels (metadata), not log content. Compressed log chunks land in object storage. That is why it is dramatically cheaper to operate β€” you are not paying for full-text indexing on every line. The trade-off is that grep-style searches across unindexed fields are slower. If you label well, you rarely notice.

Mimir is Prometheus for the long haul. Grafana Labs forked CNCF Cortex to build Mimir, stripped years of accumulated technical debt, and added features from Grafana Enterprise Metrics. Horizontally scalable, speaks native Prometheus remote-write, stores metrics in object storage instead of local disk. If you know Prometheus, you know how to feed Mimir.
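To make that concrete: if you already have a Prometheus server somewhere, feeding Mimir is a one-stanza change. A minimal sketch, with a hypothetical gateway hostname and a tenant header you only need when multi-tenancy is enabled:

# prometheus.yml excerpt (hostname is illustrative; /api/v1/push is Mimir's push endpoint)
remote_write:
  - url: https://mimir.example.com/api/v1/push
    headers:
      X-Scope-OrgID: infrastructure    # tenant ID; omit if multi-tenancy is disabled
    queue_config:
      max_samples_per_send: 2000       # batch size; tune for your series volume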

Alloy replaced Grafana Agent. As of 2024, Grafana Agent went into maintenance mode and Alloy became the recommended collector. Alloy is Grafana Labs’ OpenTelemetry Collector distribution with its own configuration language (Alloy syntax, formerly River). It collects metrics, logs, and traces in one binary.
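On Kubernetes we run Alloy through the grafana/alloy Helm chart. A rough sketch of the DaemonSet flavor, with hypothetical endpoints and the caveat that chart keys can shift between versions:

# values.yaml for the grafana/alloy chart (a sketch; verify keys against your chart version)
controller:
  type: daemonset                # one Alloy pod per node for log tailing and node metrics
alloy:
  configMap:
    create: true
    content: |
      // Alloy configuration syntax; one binary ships both logs and metrics
      loki.write "default" {
        endpoint {
          url = "https://loki.example.com/loki/api/v1/push"
        }
      }
      prometheus.remote_write "mimir" {
        endpoint {
          url = "https://mimir.example.com/api/v1/push"
        }
      }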

Tempo is in scope. We deploy Tempo in distributed microservices mode with 30-day retention. Tracing was initially out of scope but came back once the core stack stabilized.
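Tempo's storage wiring follows the same S3 pattern as Loki and Mimir. A minimal sketch in plain Tempo config terms rather than Helm values, with illustrative bucket and endpoint names:

# Tempo storage and retention excerpt (illustrative; map these into your Helm values)
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: objects.example.com
      forcepathstyle: true       # path-style addressing for non-AWS S3
compactor:
  compaction:
    block_retention: 720h        # 30-day retention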


Why We Did Not Buy Something

Let me be direct: there was no budget for this. Zero. That is not a negotiating position β€” that is reality. When I brought up observability gaps after the DR exercise, the answer was not β€œhere is money to fix it.” The answer was β€œfigure it out.”

So we figured it out. Honestly? The constraints made the solution better.

What We Evaluated

| Option | Why We Looked | Why We Passed |
|--------|---------------|---------------|
| Nutanix Prism Central | Already deployed, native to platform | Strong for Nutanix-specific metrics; weak for logs and third-party infra |
| VMware vROps | Already deployed for vSphere | Useful for VMware visibility; misses storage arrays, network devices, application logs |
| Azure Analytics | Already a Microsoft shop | Most of our infrastructure is on-prem; egress for everything makes no sense |
| Splunk | The right feature set | Per-GB pricing is a non-starter for our log volume |
| Datadog / New Relic | Mentioned by every consultant | SaaS-only, off-prem data, premium pricing β€” same dealbreakers as Splunk |

Every vendor portal gave us a piece of the puzzle. None gave us the whole picture. Layering more vendor portals on top was just adding more browser tabs, not more visibility β€” like trying to make a pizza by ordering five different appetizers and hoping they add up to dinner.

Why Open Source Won

The Grafana community is enormous, the documentation is solid, and the stack runs on infrastructure we already own. We started with recycled compute capacity β€” Cisco UCS rack nodes that were sitting underutilized β€” and a well-documented open-source stack. If the project proves its value (spoiler: it already has), there is a path to enterprise features and support from Grafana Labs without ripping anything out. A pizza kitchen that can grow from a food truck to a restaurant without rebuilding the ovens.


Why Nutanix as the Platform

This is not β€œNutanix because that is what we run.” There are specific technical reasons it is a good fit for this stack.

Nutanix Objects: S3-Compatible Storage That Already Exists

Loki and Mimir both need object storage. In the cloud you would use S3 or Azure Blob. On-prem you need an S3-compatible store. Nutanix Objects gives us exactly that β€” an S3-compatible service running on infrastructure we already manage, with native Prometheus metrics for monitoring it.

We deploy Objects in both data centers with cross-DC replication for backups and a global load balancer in front for HA. The Nutanix developer community has documented this pattern for configuring Loki with Objects. We used that as a starting point and extended it to Mimir and Tempo. Article 2 of this series covers the full storage architecture.

The configuration in Loki’s Helm values is straightforward:

loki:
  storage:
    type: s3
    s3:
      endpoint: https://objects.example.com
      region: us-east-1            # Required but arbitrary for non-AWS S3
      bucketnames: loki-chunks
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3ForcePathStyle: true       # Required for non-AWS S3 endpoints
      insecure: false

Mimir uses the same pattern with slightly different keys:

mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: objects.example.com
          region: us-east-1
          access_key_id: ${S3_ACCESS_KEY}
          secret_access_key: ${S3_SECRET_KEY}
          insecure: false
          s3_force_path_style: true

Native Prometheus Metrics Endpoint

Nutanix Objects exposes a Prometheus-compatible metrics endpoint through Prism Central. You scrape cluster-level and bucket-level metrics directly:

  • Object store metrics: https://<prism-central>:9440/oss/api/nutanix/metrics
  • Bucket metrics: https://<prism-central>:9440/oss/api/nutanix/metrics/<store>/<bucket>

Same stack monitoring its own storage backend. No separate exporter to deploy and maintain.
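In Prometheus scrape-config terms, pointing a scraper at that endpoint looks roughly like the sketch below (we drive the equivalent through our collectors); the hostname and auth details are illustrative:

# Prometheus-style scrape job for Nutanix Objects metrics (hostname and auth are illustrative)
scrape_configs:
  - job_name: nutanix-objects
    scheme: https
    metrics_path: /oss/api/nutanix/metrics
    static_configs:
      - targets: ['prism-central.example.com:9440']
    # basic_auth or a bearer token may be required depending on your Prism Central setup
    tls_config:
      insecure_skip_verify: false   # flip only if Prism presents a self-signed certificate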

Full-Stack Hybrid Cloud

Nutanix gives us Kubernetes (RKE2 on Nutanix compute), S3-compatible object storage (Objects), cross-DC replication, and global load balancing. The entire observability platform runs on one stack. Adding capacity means adding a node β€” storage, compute, and networking scale together.


Architecture Overview

Think of the stack in four layers: sources generate telemetry, Alloy and Telegraf collect and route it, the LGTM backends store it, and Grafana lets you see it. Like a pizza supply chain β€” farms grow the ingredients, trucks deliver them to the kitchen, the kitchen stores and preps everything, and the counter is where you actually get your slice.

+------------------------------------------------------------------+
|                        DATA SOURCES                               |
|                                                                   |
|  +-------------+  +-------------+  +----------+  +------------+  |
|  | Application |  | Network     |  | Storage  |  | Nutanix    |  |
|  | Logs        |  | Devices     |  | Arrays   |  | Clusters   |  |
|  +------+------+  +------+------+  +-----+----+  +-----+------+  |
|         |                |               |              |         |
+---------+----------------+---------------+--------------+---------+
          |                |               |              |
          v                v               v              v
+------------------------------------------------------------------+
|                     COLLECTION LAYER                              |
|                                                                   |
|  +------------------------------------------------------------+  |
|  |                    Grafana Alloy                           |  |
|  |  DaemonSet (pod logs, node metrics, audit logs)            |  |
|  |  Deployment (syslog, gNMI, SNMP on MetalLB VIP)            |  |
|  |  Deployment (OTLP traces receiver)                         |  |
|  +-----+------------------+-------------------+---------------+  |
|        |                  |                   |                   |
|  +-----+------+    +-----+------+    +-------+------+            |
|  | Telegraf   |    | NetApp     |    | Nutanix      |            |
|  | (NX-OS     |    | Harvest    |    | Exporter     |            |
|  |  dial-out) |    |            |    | (custom)     |            |
|  +------------+    +------------+    +--------------+            |
|                                                                   |
+--------+------------------+-------------------+-------------------+
         |                  |                   |
         v                  v                   v
+------------------------------------------------------------------+
|                      STORAGE LAYER                                |
|                                                                   |
|  +--------------+   +---------------+   +------------------+     |
|  |    Loki      |   |    Mimir      |   |     Tempo        |     |
|  | (Logs)       |   |  (Metrics)    |   |   (Traces)       |     |
|  | SimpleScale  |   | SimpleScale   |   |  Distributed     |     |
|  +------+-------+   +-------+-------+   +--------+---------+     |
|         |                   |                     |               |
|         +-------------------+---------------------+               |
|                             |                                     |
|                    +--------v--------+                            |
|                    | Nutanix Objects |                            |
|                    | (S3-compatible) |                            |
|                    | Cross-DC + GSLB |                            |
|                    +-----------------+                            |
|                                                                   |
+------------------------------------------------------------------+
         |                  |                   |
         v                  v                   v
+------------------------------------------------------------------+
|                   VISUALIZATION + ALERTING                        |
|                                                                   |
|  +------------------------------------------------------------+  |
|  |                      Grafana                               |  |
|  |  PostgreSQL backend via CloudNativePG (cross-DC repl.)     |  |
|  |  OIDC authentication via Entra ID                          |  |
|  |  Data sources: Loki, Mimir, Tempo                          |  |
|  +------------------------------------------------------------+  |
|                                                                   |
|  +------------------------------------------------------------+  |
|  |                    Alertmanager                            |  |
|  |  Cross-DC gossip, routes to Teams + ServiceNow             |  |
|  +------------------------------------------------------------+  |
|                                                                   |
+------------------------------------------------------------------+

Data Flow

  1. Infrastructure and applications generate logs, metrics, and traces across two data centers.
  2. Grafana Alloy collects most of it β€” DaemonSets tail pod logs and scrape node metrics, a Deployment receives syslog and SNMP from network devices, and another Deployment receives OTLP traces. Telegraf handles Cisco NX-OS dial-out telemetry where Alloy cannot. NetApp Harvest and a custom Nutanix exporter feed storage and infrastructure metrics.
  3. Loki stores logs, Mimir stores metrics, and Tempo stores traces. All three write to Nutanix Objects, with per-DC buckets and dual-write for resilience.
  4. Grafana queries all three backends; its own configuration lives in PostgreSQL managed by the CloudNativePG operator (cross-DC streaming replication for DR).
  5. Alertmanager routes alerts via cross-DC gossip, sending notifications to Microsoft Teams and creating incidents in ServiceNow.
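The alerting path in step 5 is plain upstream Alertmanager. Reduced to a routing sketch with placeholder webhook URLs (our real config fans out to both targets, and the cross-DC gossip is a set of cluster-peer flags on the pods rather than anything in this file):

# alertmanager.yml routing sketch (receiver URLs are placeholders)
route:
  receiver: teams-default                 # non-critical alerts notify Teams
  group_by: ['alertname', 'datacenter']
  routes:
    - matchers: ['severity="critical"']
      receiver: servicenow                # critical alerts create ServiceNow incidents
receivers:
  - name: teams-default
    webhook_configs:
      - url: https://teams-relay.example.com/alert
  - name: servicenow
    webhook_configs:
      - url: https://servicenow-bridge.example.com/incident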

Component-by-Component Breakdown

What we are running and how we deploy it. Every service uses a wrapper Helm chart pattern β€” a local Chart.yaml wrapping the upstream dependency, with shared values plus per-DC overrides. Everything ships through ArgoCD. No manual helm install against production.

| Component | App Version | Helm Chart | Deployment Mode | Notes |
|-----------|-------------|------------|-----------------|-------|
| Grafana | 13.0.1 Enterprise (unlicensed) | grafana-community 12.1.1 | Single replica | PostgreSQL via CloudNativePG, OIDC via Entra ID |
| Loki | 9.3.6 | grafana-community 9.3.6 | Simple Scalable | Write/read/backend split, S3 on Nutanix Objects |
| Mimir | 6.0.6 | grafana/mimir-distributed 6.0.6 | Simple Scalable | ~620K active series, S3 on Nutanix Objects |
| Tempo | v2.9.0 | grafana 1.61.3 | Distributed | 30-day retention, S3 on Nutanix Objects |
| Alloy | 1.6.0 | grafana 1.6.2 | DaemonSet + 2 Deployments | Universal collector for logs, metrics, traces |
| Alertmanager | Upstream | Custom wrapper | Standalone | Cross-DC gossip, Teams + ServiceNow |
| CloudNativePG | 1.28 | cloudnative-pg 0.28.0 | Operator | Primary (3 inst.) + replica (1 inst.) cross-DC |
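The wrapper chart pattern mentioned above is nothing exotic. A sketch with illustrative names and versions:

# Chart.yaml for a wrapper chart (names and versions are illustrative)
apiVersion: v2
name: loki
version: 1.0.0                   # our wrapper version, bumped when values change
dependencies:
  - name: loki
    version: 6.16.0              # upstream chart pinned here
    repository: https://grafana.github.io/helm-charts

Shared values sit next to the wrapper; ArgoCD layers a per-DC values file on top of them.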

A Note on Deployment Modes

Loki and Mimir each support three deployment modes:

  • Monolithic: Single binary, all components in one process. Good for dev and test.
  • Simple Scalable: A few read/write/backend components. Good for small-to-medium production.
  • Distributed (Microservices): Each component runs independently. Best for very large scale.

We run Simple Scalable for both Loki and Mimir. It lets us scale reads and writes independently without managing a dozen separate microservices per component. The medium pizza β€” enough to feed the table without ordering one of everything on the menu.

Tempo runs in full distributed mode because its architecture benefits from separating distributors, ingesters, queriers, and compactors at our trace volume.
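In Helm terms, the mode is a single values key plus replica counts for each target. A sketch against the grafana/loki chart; exact keys vary by chart version:

# Loki Helm values sketch for Simple Scalable mode (keys may differ by chart version)
deploymentMode: SimpleScalable
write:
  replicas: 3        # ingest path scales independently...
read:
  replicas: 2        # ...from the query path
backend:
  replicas: 3        # compactor, ruler, and index gateway live here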


What We Are Actually Collecting

This is not a lab. Real production traffic across two data centers.

Metrics

| Source | Collection Method | What We Get |
|--------|-------------------|-------------|
| Nutanix clusters (10) | Custom Prism Central exporter + Objects Prometheus endpoint | Cluster health, storage utilization, DR readiness |
| NetApp ONTAP (2 clusters) | NetApp Harvest exporter | Volume performance, aggregate capacity, LUN latency |
| Cisco switches (200+) | Telegraf dial-out (gRPC) + SNMPv3 | Interface stats, CPU, memory, optics |
| Cisco IOS-XE | gNMI dial-in via Alloy | Interface counters, environment |
| VMware vSphere | Telegraf vsphere plugin | VM performance, host metrics |
| Meraki cloud | Telegraf meraki plugin | Wireless, switching, security appliance metrics |
| Kubernetes | Alloy DaemonSet (kubelet, cAdvisor, kube-state-metrics) | Pod health, resource usage, node status |
| Windows/Linux servers | Off-cluster Alloy agents | Performance counters, EventLog, IIS, SQL Server, AD |

Multiple storage vendors, multiple network platforms, multiple compute layers β€” same Mimir, same Grafana dashboards. Before this project, each of those was a separate portal with separate credentials and separate alerts. Now it is one Grafana with dropdowns. Same PromQL, same alert rules. That consolidation alone justified the project before we shipped a single log line.

Logs

| Source | Collection Method |
|--------|-------------------|
| Cisco switches | Syslog to Alloy Deployment (MetalLB VIP) |
| Firewalls | Syslog (UDP/514, TLS/6514) |
| Cisco ISE | Syslog |
| Windows/Linux servers | Off-cluster Alloy agents |
| Nutanix CVMs | Syslog (1515/UDP) |
| Kubernetes pods | Alloy DaemonSet tailing container stdout |
| Kubernetes audit logs | File tail with JSON parsing |

Every single one of those sources was previously either unmonitored, monitored in a vendor-specific tool I had to log into separately, or generating alerts that mixed real problems with known noise I could not filter out.
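The plumbing behind the syslog rows above is ordinary Kubernetes: a LoadBalancer Service in front of the Alloy syslog Deployment, with MetalLB handing out the VIP the network devices point at. A sketch with illustrative names, ports, and annotation (the annotation key depends on your MetalLB version, and older Kubernetes releases may want separate Services for mixed UDP/TCP):

# Service exposing the Alloy syslog listeners on a MetalLB VIP (names and IPs illustrative)
apiVersion: v1
kind: Service
metadata:
  name: alloy-syslog
  namespace: observability
  annotations:
    metallb.universe.tf/loadBalancerIPs: 10.0.50.20
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: alloy-syslog
  ports:
    - name: syslog-udp
      port: 514
      protocol: UDP
      targetPort: 10514          # the port the Alloy syslog receiver listens on
    - name: syslog-tls
      port: 6514
      protocol: TCP
      targetPort: 10601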

Why Not Just Use Prometheus Directly?

Fair question. Prometheus is excellent for short-term metrics and alerting; if your environment is small enough, a single Prometheus server with local storage works fine. But Prometheus was not designed for long-term storage or horizontal scaling. Local TSDB is limited by disk, retention beyond a few weeks gets expensive, and running multiple Prometheus servers for HA means dealing with federation or deduplication.

Mimir solves that. It accepts Prometheus remote-write, stores data in object storage (virtually unlimited capacity), and provides a query frontend with caching. We currently track around 620,000 active series with a configured limit of 5 million. We bumped that limit once when we hit 98% of the original 1.5 million ceiling after onboarding network device metrics. After that, we wired up a MimirHighCardinalitySeries alert at 80% of the new ceiling so the next surprise becomes a planned conversation, not a 3 a.m. page. That kind of organic growth β€” and the ability to react to it β€” would have been painful with standalone Prometheus.
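The guardrail itself is a plain Prometheus-style rule. A sketch assuming the 5 million limit from above and the default replication factor of 3; the real rule and its labels differ slightly:

# Prometheus rule file sketch for the cardinality guardrail (thresholds are illustrative)
groups:
  - name: mimir-capacity
    rules:
      - alert: MimirHighCardinalitySeries
        # active series summed across ingesters, de-duplicated by replication factor
        expr: sum(cortex_ingester_active_series) / 3 > 4000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Mimir active series above 80% of the configured limit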

Why Mimir over Thanos or VictoriaMetrics

We did the homework β€” it is documented as ADR-008. The short version: Thanos shines when you already have Prometheus deployments to retrofit with long-term storage, but its sidecar model adds a Prometheus layer we do not need. Alloy can remote_write straight to Mimir’s distributor. VictoriaMetrics is fast and lean, but less Grafana-native and a smaller ecosystem. Mimir uses the same S3 backend model, the same Helm chart conventions, and the same configuration philosophy as Loki β€” one ecosystem to learn, one set of operational patterns to run. Net-new beats retrofit when you are starting from a clean floor plan.


Honest Trade-Offs

I am not going to pretend this was painless. Building your own kitchen means you wash your own dishes.

What Is Harder Than Buying a Product

  • You own the uptime. When something breaks at 3 AM, there is no support ticket to file. You fix it.
  • Upgrades are on you. Grafana Labs ships new versions frequently. Keeping current is real work; skipping versions creates upgrade debt.
  • Kubernetes is a prerequisite. If your team does not have Kubernetes experience, the LGTM stack is a steep way to learn it.
  • Cardinality will surprise you. Understanding which metrics and labels create cardinality explosions is something you learn by hitting limits. We hit 98% of our Mimir series limit before we fully understood what was happening.
  • Prioritization is the real challenge. The hardest part of this project has not been implementation β€” modern tooling makes deployment straightforward. The hard part is deciding what to instrument first when everything is suddenly visible.

What Is Better Than We Expected

  • Zero incremental software spend. Recycled hardware, open-source stack, storage we already had.
  • Data never leaves our network. Compliance conversations become non-conversations.
  • Benefits show up immediately. This was not a β€œwait six months for value” project. Every component we deployed answered questions we could not answer before.
  • The misconception is wrong. People assume self-hosted observability is too complicated, lacks features, or cannot meet real needs. With modern Helm charts, ArgoCD, and the Grafana ecosystem, it is genuinely not a huge lift. The tooling has matured enormously.
  • Multi-vendor visibility in one place. NetApp, Nutanix, Cisco, VMware, Windows β€” same dashboards, same query language, same alert rules. Concretely: 113 dashboards live across 16 categories β€” 44 NetApp, 12 Nutanix, 11 network, plus VMware, Meraki, Rubrik, Windows, Linux, observability internals, tracing, jobs, backup, HTTP probes, and applications. One Grafana, one login.

The Kubernetes Platform

A quick note on the compute layer since it comes up in every conversation about self-hosted observability: where does this run?

We use RKE2 (Rancher Kubernetes Engine 2) across two data centers, three nodes per cluster. RKE2 is STIG-hardened out of the box, uses embedded etcd for HA, and is systemd-native β€” it fits a regulated environment where security baselines matter.

The hardware is recycled Cisco UCS rack servers we could not repurpose for our HCI platform (long story involving hardware compatibility lists). Rather than let them collect dust, they became the observability cluster. Each node has significant compute and memory headroom; we run a converged topology where every node is both control plane and worker.

Persistent volumes use the built-in RKE2 local-path provisioner β€” local disk, zero licensing cost, better latency than iSCSI SAN for write-ahead logs. All long-term data goes to Nutanix Objects.

Both clusters run the same stack with a dual-write architecture for most data pipelines. If one DC goes down, the other has a full copy. Grafana and its PostgreSQL database use an active/standby model with CloudNativePG streaming replication across DCs.
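The primary side of that PostgreSQL pair is a short CloudNativePG manifest. A minimal sketch with illustrative names; the cross-DC replica cluster and its connection details are omitted here (article 6 covers them):

# CloudNativePG primary cluster for Grafana's database (minimal sketch; DR replica wiring omitted)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grafana-db
  namespace: observability
spec:
  instances: 3                   # the three-instance primary mentioned above
  storage:
    size: 20Gi
    storageClass: local-path     # the RKE2 local-path provisioner noted earlier
  bootstrap:
    initdb:
      database: grafana
      owner: grafana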


The Full Series Roadmap

This is article 1 of 10. Here is the full menu:

| # | Article | Publish Date | What You Will Learn |
|---|---------|--------------|---------------------|
| 1 | Building an LGTM Stack on Nutanix (this post) | Apr 28 | Architecture overview, component roles, trade-offs |
| 2 | Nutanix Objects as the Storage Backend for Loki and Mimir | May 1 | S3 config, bucket layout, retention, credentials |
| 3a | Grafana Alloy on Kubernetes: Deployment | May 5 | Alloy topology, Helm config, three-deployment pattern |
| 3b | Alloy in Production: Logs, Metrics, and Scaling | May 8 | What we collect, processing pipelines, resource tuning |
| 4a | Deploying Loki on Kubernetes with Nutanix Objects | May 12 | Helm values, Simple Scalable mode, storage config |
| 4b | Loki in Production: Labels, LogQL, and Retention | May 15 | Label strategy, per-stream retention, real queries |
| 5 | Mimir on Kubernetes: Scalable Metrics | May 19 | Simple Scalable setup, series limits, capacity planning |
| 6 | Grafana on Kubernetes with CloudNativePG | May 22 | PostgreSQL backend, cross-DC replication, upgrades |
| 7 | Dashboards That Actually Get Used | May 26 | What we monitor, alert strategy, multi-vendor views |
| 8 | Lessons Learned: What Worked and What Didn't | May 29 | Retrospective, honest failures, what is next |

Each article includes working Helm values, configuration examples, and real pain points you can learn from. This is not theory β€” it is what runs in production at an organization in a regulated industry.


Before You Start

Common blockers we hit early. Knowing them up front saves an afternoon.

| Symptom | Most Likely Cause | Quick Fix |
|---------|-------------------|-----------|
| β€œThe authorization mechanism is not supported” from Loki/Mimir to Objects | Virtual-hosted-style addressing against an S3 endpoint that requires path-style | Set s3ForcePathStyle: true (Loki) / s3_force_path_style: true (Mimir); also set a non-empty region |
| Mimir series limit hit unexpectedly | Default limit too low for network device metrics at scale | Plan series budget before onboarding sources; alert at 70% of limit |
| Alloy syslog receiver OOM during a network event | No best_effort mode; defaults too generous | Set max_message_size, connection limits, lower HPA threshold |
| Nutanix CVM syslog parsed as garbage | RFC 3164 strict parser rejects ISO 8601 timestamps | Use raw mode for CVM listener; classify post-hoc |
| MetalLB webhook drift causes ArgoCD OutOfSync | MetalLB rotates webhook certs in-cluster | Add ignoreDifferences blocks to the ArgoCD Application |

Each of these is documented in upstream issues. Pin the relevant ones in your runbook before you go live.
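For the last row, the ignoreDifferences block looks roughly like this (an illustrative excerpt of the ArgoCD Application spec):

# ArgoCD Application excerpt ignoring MetalLB's rotated webhook caBundle (illustrative)
spec:
  ignoreDifferences:
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jqPathExpressions:
        - '.webhooks[].clientConfig.caBundle'
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true   # keep ignored fields out of sync comparisons too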


What Is Next

In the next article, we get into the storage foundation: Nutanix Objects as the S3 backend for Loki and Mimir. We walk through bucket layout across two DCs, retention policies, credential management with External Secrets Operator, and the cross-DC replication strategy that simplifies our DR story.

If you want to follow along, you will need:

  • A Kubernetes cluster (we use RKE2; any distribution works)
  • Nutanix Objects, or another S3-compatible object store (MinIO works for testing)
  • Helm 3.x installed
  • kubectl access to your cluster
  • ArgoCD recommended for GitOps deployment

Happy automating!


This is article 1 of 10 in the LGTM on Nutanix series. Next up: Article 2 β€” Nutanix Objects as the Storage Backend.