Where We Left Off
We’ve deployed Alloy, Loki, Mimir, and Grafana. The data is flowing. The query backends are tuned. The Grafana 13 upgrade was clean.
This article is about what’s on top — the dashboard set. There are 118 dashboards in the repo right now, organized into 18 folders. None of them are auto-generated from a community import without review. Every one was either built or adapted for our environment. Every one is a ConfigMap in Git, deployed by ArgoCD, served by Grafana’s dashboard sidecar.
This post is the architecture and rules behind that set — not the dashboards themselves, but how we maintain a hundred-plus dashboards without the set becoming the kind of graveyard that everyone has seen at least once.
How We Organize: By Domain, Not By Tier
The standard advice on dashboards is to organize by tier: Tier 1 NOC TV, Tier 2 service-level, Tier 3 debug. The idea is that you start with a glance at the wall TV, drill into a service when something looks off, and end up at a deep debug view to actually fix it.
That framework works. We don’t use it.
What we use instead is domain organization. A folder per source of data. A nutanix folder for Nutanix dashboards. A netapp folder for NetApp Harvest dashboards. A windows folder for Windows servers. A network folder for Cisco switch and ISE dashboards. Inside each folder, there’s typically an overview dashboard that surfaces the high-level health and a set of drill-down dashboards for specific subsystems.
The reasoning: when someone has a question, they have it about a domain, not a tier. “How is Nutanix looking?” goes to the Nutanix folder. “Is the NetApp cluster behaving?” goes to NetApp. “What’s the ArgoCD state right now?” goes to observability. You don’t need a “first I check the NOC, then I check the service, then I check the debug view” mental model — you just go to the folder for what you’re looking at.
It also matches how our team is shaped. The platform team owns the observability and kubernetes folders. The network team owns the network and meraki folders. The storage team owns NetApp, storage-array (Pure), and the Nutanix folders. The Windows admins own windows. Domain folders map to ownership in a way that tier folders don’t.
If you have a NOC team watching a TV wall and want a Tier 1 overview dashboard, build one — but don’t make it the organizing principle of the whole set. Make it a single dashboard in the observability folder named cluster-health or similar.
The Folder-by-Folder Inventory
The actual folder set, with the dashboard count and a one-line description:
| Folder | Count | What’s in it |
|---|---|---|
| netapp | 42 | NetApp Harvest dashboards — by far the largest set. Aggregate, cluster, disk, FlexCache, FlexGroup, SVM, volume, snapmirror, fpolicy, headroom, NFS, S3, security, support, mailbox, and many more. Harvest ships great default content. |
| observability | 16 | The stack watching itself — alertmanager-overview, alloy-pipeline-health, argocd-overview, cache-performance, cluster-health, etcd-health, k8s-namespace-overview, k8s-workload-health, log-error-leaderboard, loki-explorer, mimir-cardinality, postgres-health, scrape-target-health, telegraf-health, tempo-operational, tempo-services-observability. |
| nutanix | 11 | Prism Central + cluster dashboards — alerts-analytics, disk-health, DR-readiness, host-inventory, images, logs, network, objects, prism, vm-efficiency, vm-lifecycle. |
| windows | 9 | Windows server dashboards — overview, task scheduler, eventlog, services, plus deeper ones for specific roles (domain controller, SQL, IIS). |
| network | 9 | Cisco switch + ISE dashboards — catalyst-per-interface, catalyst-switch-health, cisco-ise-auth, network-syslog-overview, nexus-overview, nexus-per-interface, nexus-switch-health-syslog, nexus-switch-telemetry, node-status. |
| vendor-status | 9 | Synthetic probes via blackbox-exporter — retail app reachability, internal services, Azure endpoints, vendor SaaS. |
| linux | 6 | Linux server dashboards — overview, network, entropy/security, disk I/O, etc. Sourced from node_exporter via Alloy. |
| backup | 4 | Rubrik backup appliance dashboards — job status, SLA, capacity, recent activity. |
| jobs | 3 | Scheduled job dashboards — cron health, batch job status, recent runs. |
| http-probes | 3 | Detailed blackbox-exporter views — probe latency, TLS expiry, status code trends. |
| meraki | 1 | Meraki network overview via telegraf-meraki. |
| security | 1 | CrowdStrike host coverage — RFM degraded, containment, sensor version, host age. |
| storage-array | 1 | Pure FlashArray dashboards (will grow as we onboard more arrays). |
| applications | 1 | Application-level dashboards. |
| database | 1 | Postgres / CNPG dashboards beyond what’s in observability. |
| tracing | 1 | Tempo trace exploration. |
| vmware | 1 | vCenter via telegraf-vsphere. |
| preview | 1 | Staging area for in-progress dashboards before promotion. |
A few observations on the shape:
NetApp dominates at 42 dashboards because Harvest ships a massive default set. We didn’t build 42 NetApp dashboards from scratch — we imported the Harvest defaults, kept the ones we use, and dropped the ones we don’t. The Harvest project provides much better content for NetApp than I could build in a year.
The observability folder is the second largest because the stack itself has a lot to watch. Loki ingestion, Mimir cardinality, Alloy pipeline health, ArgoCD sync status, etcd write latency, Postgres connection state — each of those is a thing that can go wrong and needs its own dashboard.
Single-dashboard folders are fine. meraki, security, vmware, database, applications — each has one dashboard right now. The folder structure is the navigational unit; we’d rather have a thinly-populated folder that’s discoverable than a fat catch-all “misc” folder where nobody can find anything.
preview is a staging area. New dashboards land there for the team to review before being promoted to their permanent home. It’s not a graveyard — we delete dashboards from preview that don’t graduate within a sprint. Keeps the set honest.
The ConfigMap Sidecar Pattern
Every dashboard in the repo is a JSON file at a path like observability/grafana/dashboards/<folder>/<name>.json. The chart template wraps each JSON file into a ConfigMap with two important pieces of metadata:
- A label (`grafana_dashboard: "true"`) that tells the Grafana sidecar container to load this ConfigMap as a dashboard.
- An annotation (`grafana_dashboard_folder: <folder>`) that tells the sidecar which Grafana folder to put the dashboard in.
The sidecar container runs alongside Grafana, watches for labeled ConfigMaps in the observability namespace, and writes each one as a file into a directory that Grafana’s dashboard provisioning reads. New dashboard? Drop the JSON in the right folder, ArgoCD syncs the ConfigMap, the sidecar picks it up, done. Dashboard removed from the repo? The ConfigMap goes away, the sidecar removes the file, and the dashboard disappears from Grafana.
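The wrapping step is easy to picture as a small function. This is a minimal sketch, not our chart template: the function name, the `dashboard-<folder>-<name>` naming scheme, and the use of Python instead of Helm templating are all assumptions made for illustration. Only the label, the annotation, the namespace, and the path layout come from the setup described above.

```python
import json
from pathlib import PurePosixPath

# Dashboard root taken from the path pattern described in the text.
DASHBOARD_ROOT = PurePosixPath("observability/grafana/dashboards")


def dashboard_configmap(path: str, dashboard_json: str,
                        namespace: str = "observability") -> dict:
    """Wrap one dashboard JSON file in a ConfigMap manifest for the sidecar."""
    # relative_to raises ValueError if the file is outside the dashboard root.
    rel = PurePosixPath(path).relative_to(DASHBOARD_ROOT)
    if len(rel.parts) != 2 or rel.suffix != ".json":
        raise ValueError(f"expected <folder>/<name>.json, got {rel}")
    folder, filename = rel.parts
    json.loads(dashboard_json)  # fail early on invalid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            # Hypothetical naming scheme; any unique, DNS-safe name works.
            "name": f"dashboard-{folder}-{rel.stem}",
            "namespace": namespace,
            # The label the sidecar selects on.
            "labels": {"grafana_dashboard": "true"},
            # The annotation that picks the Grafana folder.
            "annotations": {"grafana_dashboard_folder": folder},
        },
        "data": {filename: dashboard_json},
    }


if __name__ == "__main__":
    manifest = dashboard_configmap(
        "observability/grafana/dashboards/netapp/volume.json",
        '{"title": "Volume", "panels": []}',
    )
    print(json.dumps(manifest, indent=2))
```

Deriving the folder annotation from the directory name keeps the repo layout and the Grafana folder tree from drifting apart. One thing to keep in mind with this pattern is that a ConfigMap is capped at 1 MiB, so very large dashboards need trimming.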
The Grafana chart values that enable this:
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: observability
      folderAnnotation: grafana_dashboard_folder
      provider:
        foldersFromFilesStructure: true
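`foldersFromFilesStructure: true` tells Grafana’s file provisioner to create folders from the directory layout under the sidecar’s dashboard directory. The sidecar writes each dashboard into a subdirectory named after its folder annotation, so the annotation ends up driving the Grafana folder. That matters when you’re managing folders centrally rather than placing files by hand.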
The reason we prefer this over running terraform apply against the Grafana API is GitOps consistency. The dashboards live in the same repo as the Loki values, the Mimir values, the Alloy config, the alert rules. Every change goes through PR review. Every deployment is auditable in git history. When the auditor asks “who changed the Cluster Health dashboard at 4 PM on Friday,” there’s a commit with a name on it.
The trade-off is that you can’t edit a dashboard live in the Grafana UI and have it stick — the next ArgoCD sync would revert your changes. This sounds bad until you’ve experienced the alternative, which is the Cluster Health v2 (FINAL) problem where everyone forks dashboards locally and nobody knows which one is current. Editing through Git, with proper review, is friction that pays you back.
The workflow for a new dashboard:
- Build it in your local Grafana (or a dev instance) until you’re happy.
- Export the JSON.
- Strip dashboard-specific fields that shouldn’t be in source (`id`, `version`, instance-specific UIDs of datasources where applicable).
- Drop the JSON at `observability/grafana/dashboards/<folder>/<name>.json`.
- PR. Review. Merge. Sync.
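The strip step is the one most worth scripting, since it is easy to forget. Here is a minimal sketch, assuming the export is the plain dashboard JSON model with top-level `id` and `version` fields. The function names are hypothetical, and datasource UIDs are deliberately left alone because how to rewrite them depends on how your datasources are provisioned.

```python
import json

# Top-level fields Grafana assigns per instance; they should not live in Git.
INSTANCE_FIELDS = ("id", "version")


def clean_dashboard(exported: dict) -> dict:
    """Drop instance-assigned top-level fields from an exported dashboard."""
    # Only the top level is filtered: panel-level "id" values are part of the
    # dashboard definition and are preserved.
    return {k: v for k, v in exported.items() if k not in INSTANCE_FIELDS}


def to_source(exported_json: str) -> str:
    """Return normalized JSON text ready to commit."""
    cleaned = clean_dashboard(json.loads(exported_json))
    # Sorted keys and fixed indentation keep PR diffs small and stable.
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    exported = '{"id": 7, "version": 3, "uid": "abc", "title": "Cluster Health"}'
    print(to_source(exported), end="")
```

Normalizing key order is a design choice rather than a requirement: it means two exports of the same dashboard produce the same file, so review diffs show only real changes.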
It’s slower than clicking save in the UI. It’s also the difference between “a dashboard we run” and “a dashboard somebody saved in production once.”
Multi-Vendor Storage: One Dashboard, Three Vendors
This is the dashboard pattern I’m happiest with from a build-quality perspective.
We have storage from three vendors:
- Nutanix Files / Objects: built into the HCI platform, monitored via the `nutanix-exporter` deployment in the cluster.
- NetApp ONTAP: physical filers, monitored via Harvest (NetApp’s official exporter).
- Pure FlashArray: physical block arrays, monitored via Pure’s native `pure-fa-openmetrics-exporter` (ADR-048).
Each vendor has its own metric names, label conventions, and view of what “capacity” or “performance” means. The metric prefixes alone tell the story:
- Nutanix: `nutanix_cluster_storage_capacity_bytes`, `nutanix_cluster_num_read_iops`
- NetApp Harvest: `volume_size_total`, `volume_size_used`, `volume_read_ops`
- Pure: `purefa_array_space_bytes{space="capacity"}`, `purefa_array_performance_throughput_iops{dimension="read"}`
For a multi-vendor capacity dashboard to work, you need these to land in one consistent shape. The lever for that is Mimir recording rules.
We have a recording rule group that normalizes each vendor’s capacity and performance metrics into a common infrastructure:storage_* namespace, with a vendor label that lets you filter or break down. The recording rule for Nutanix capacity looks like:
- record: infrastructure:storage_total_bytes:sum
  expr: |
    label_replace(
      sum by (cluster_name) (nutanix_cluster_storage_capacity_bytes),
      "vendor", "nutanix", "", ""
    )
NetApp’s recording rule against volume_size_total and Pure’s against purefa_array_space_bytes{space="capacity"} follow the same pattern with vendor-appropriate inner queries. After the recording rules evaluate, we have:
infrastructure:storage_total_bytes:sum{vendor="nutanix"}
infrastructure:storage_total_bytes:sum{vendor="netapp"}
infrastructure:storage_total_bytes:sum{vendor="pure"}
The multi-vendor capacity dashboard can then query the normalized series with a single PromQL expression and a vendor template variable. Filtering to one vendor gives you that vendor’s view. Showing all three gives you a unified picture for capacity planning.
The cost of this pattern is a non-trivial set of recording rules — each vendor needs its own normalization for each metric (capacity, performance, etc.). The payoff is one dashboard that anybody can use, not three vendor-specific dashboards that require translating units and labels in your head every time you switch tabs.
The same approach extends to performance metrics (IOPS, throughput, latency). Different rule definitions, same naming convention, same dashboard pattern.
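Because every vendor rule has the same outer shape, the set can be generated rather than hand-written. The sketch below is an illustration under assumptions, not our actual rule file: the metric names come from the list above, but the grouping labels for NetApp and Pure (`cluster`, `instance`) and the helper names are guesses you would need to adapt.

```python
import json

# Inner queries per vendor. Metric names are from the text; the grouping
# labels for NetApp and Pure are assumptions for illustration.
CAPACITY_QUERIES = {
    "nutanix": "sum by (cluster_name) (nutanix_cluster_storage_capacity_bytes)",
    "netapp": "sum by (cluster) (volume_size_total)",
    "pure": 'sum by (instance) (purefa_array_space_bytes{space="capacity"})',
}


def with_vendor(inner: str, vendor: str) -> str:
    """Wrap a query so every resulting series carries vendor="<vendor>"."""
    # Empty source label and empty regex always match, so the label is
    # attached unconditionally.
    return f'label_replace({inner}, "vendor", "{vendor}", "", "")'


def recording_rules(record: str, queries: dict) -> list:
    """Build one recording rule per vendor, all writing the same series name."""
    return [
        {"record": record, "expr": with_vendor(inner, vendor)}
        for vendor, inner in queries.items()
    ]


if __name__ == "__main__":
    rules = recording_rules("infrastructure:storage_total_bytes:sum", CAPACITY_QUERIES)
    group = {"name": "storage-normalization", "rules": rules}  # hypothetical group name
    print(json.dumps(group, indent=2))
```

Adding a performance metric then means adding one more dictionary of inner queries and one more call with a different record name, which keeps the naming convention consistent by construction.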
Self-Observability: We Watch the Watchers
The observability folder has 16 dashboards, and most of them are watching the stack itself. The set:
- `alertmanager-overview`: alert volume, notification success/failure, current silences.
- `alloy-pipeline-health`: Alloy DaemonSet ingestion rate, dropped samples, WAL queue depth.
- `argocd-overview`: application sync status, drift detection, last reconcile time.
- `cache-performance`: memcached hit rates across Loki and Mimir caches.
- `cluster-health`: high-level Kubernetes node and pod health.
- `etcd-health`: etcd request latency, leader changes, DB size.
- `k8s-namespace-overview` / `k8s-workload-health`: pod restart rates, OOM kills, resource utilization.
- `log-error-leaderboard`: top error-emitting pods over the last hour (LogQL).
- `loki-explorer`: interactive log exploration view (more of a starting-point dashboard than a static view).
- `mimir-cardinality`: top metrics by series count, label cardinality breakdown.
- `postgres-health`: CNPG cluster state, replication lag, query stats.
- `scrape-target-health`: which Prometheus scrape targets are up/down across the cluster.
- `telegraf-health`: Telegraf instance health for the network/vSphere/Meraki polling.
- `tempo-operational` / `tempo-services-observability`: Tempo backend metrics and traced-service view.
The principle is dogfooding. We use our own LGTM stack to monitor the LGTM stack. When Loki has a problem, we see it on a Grafana dashboard fed by Mimir. When Mimir has a problem, we see it on a Grafana dashboard fed by self-scraped Mimir metrics. When Grafana itself has a problem… we look at logs in Loki and curl health endpoints from a laptop. There’s a recursion limit, and we live with it.
The mimir-cardinality dashboard is the one that’s saved us from a stream/series explosion the most often. It tracks the top label names by cardinality and the top metrics by series count over time. When some new exporter starts emitting a metric with a request_id label, this dashboard catches it before it causes an ingester OOM. Pair it with an alert on “any single label’s cardinality grew by >20% in an hour” and you have a real safety net.
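The growth check behind that alert is simple arithmetic. Here is a minimal sketch of the comparison, assuming you already have two snapshots of distinct-value counts per label taken an hour apart. How the snapshots are collected is left open, and the function name and input shape are hypothetical.

```python
def cardinality_growth(previous: dict, current: dict, threshold: float = 0.20) -> dict:
    """Return {label: growth_ratio} for labels that grew past the threshold.

    previous/current map label name -> number of distinct values.
    """
    flagged = {}
    for label, now in current.items():
        before = previous.get(label, 0)
        if before == 0:
            # A brand-new label has no baseline; flag it if it has any values.
            if now > 0:
                flagged[label] = float("inf")
            continue
        growth = (now - before) / before
        if growth > threshold:
            flagged[label] = growth
    return flagged


if __name__ == "__main__":
    an_hour_ago = {"pod": 1000, "namespace": 40, "instance": 200}
    right_now = {"pod": 1100, "namespace": 40, "instance": 250, "request_id": 90000}
    for label, growth in cardinality_growth(an_hour_ago, right_now).items():
        print(label, growth)
```

In the example data, `pod` grows 10% and is ignored, `instance` grows 25% and is flagged, and the newly appeared `request_id` label is flagged because it has no baseline at all. Treating new labels as infinite growth is a deliberate choice here, since a new high-cardinality label is exactly the case the dashboard exists to catch.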
Rules That Keep the Set From Sprawling
The standing rules that keep 118 dashboards from becoming 213:
Every folder has an owner. Listed in a README inside each folder, or in the team’s runbook. No owner, no folder, no dashboard. Owners review their dashboards quarterly. Anything that hasn’t been opened in 90 days gets archived or deleted.
preview/ is a staging area, not storage. Dashboards in preview/ graduate to a permanent folder within a sprint or they get deleted. The folder is not a graveyard for half-baked ideas.
Build for a specific question. Every dashboard should answer one question (or one tight cluster of questions). “Cluster Health” answers “is the cluster healthy right now.” “Loki Explorer” answers “can I find this log line.” “Mimir Cardinality” answers “is anything about to blow up our ingester.” A dashboard that doesn’t have a clear answer-this-question purpose is a dashboard that won’t get used.
Use community starters, then trim. Harvest’s NetApp dashboards and Grafana’s stock kube-prometheus dashboards are excellent. Import them, then drop the panels you don’t use. Don’t try to build NetApp visibility from scratch when NetApp has a team of engineers building Harvest dashboards full-time.
Variables on every panel. Time range, cluster, namespace, hostname — whatever scopes the dashboard. Hardcoded scopes mean a dashboard that only works for one scenario.
Provision as code. The ConfigMap-sidecar pattern from above. The “v2 (FINAL)” problem is real, and the cost of avoiding it is one PR per dashboard change. It’s worth it.
Link from one to another. A panel on the cluster health dashboard that’s red should have a click-through to the dashboard that explains why. The drill-down breadcrumb is what turns a set of dashboards into a system.
Default time ranges thoughtfully. Most of our dashboards default to “last 1 hour.” Some default to “last 24 hours” or “last 7 days” because that’s the right window for that question. Default time range is a small detail that compounds — when an engineer opens a dashboard at 3 AM and the default range doesn’t show the incident, they have to fiddle with the time picker before they can start working. Pick the right default.
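Several of these rules are mechanical enough to check in CI. The sketch below is one possible lint, not something we claim to run as-is: it assumes the `<folder>/<name>.json` layout, a `README.md` per folder as the owner record, and the standard `templating.list`, `time`, and `links` fields of the dashboard JSON. The function names and problem messages are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def lint_dashboard(dashboard: dict) -> list:
    """Return a list of rule violations for one parsed dashboard."""
    problems = []
    if not (dashboard.get("templating") or {}).get("list"):
        problems.append("no template variables defined")
    time_range = dashboard.get("time") or {}
    if not (time_range.get("from") and time_range.get("to")):
        problems.append("no default time range set")
    # Only dashboard links and panel links are checked; data links
    # configured elsewhere in a panel are not inspected.
    panel_links = any(p.get("links") for p in dashboard.get("panels") or [])
    if not (dashboard.get("links") or panel_links):
        problems.append("no dashboard-level or panel-level links")
    return problems


def lint_repo(root: Path, readme_name: str = "README.md") -> dict:
    """Map each offending folder or file (relative to root) to its problems."""
    report = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(folder.glob("*.json"))
        if files and not (folder / readme_name).is_file():
            report[folder.name] = ["no README naming an owner"]
        for f in files:
            problems = lint_dashboard(json.loads(f.read_text()))
            if problems:
                report[f"{folder.name}/{f.name}"] = problems
    return report


if __name__ == "__main__":
    # Demo against a throwaway directory instead of a real repo.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "meraki").mkdir()
        (root / "meraki" / "overview.json").write_text('{"title": "Meraki"}')
        for path, problems in lint_repo(root).items():
            print(path, problems)
```

A lint like this cannot tell whether a dashboard answers a real question or whether its default window is the right one. It only catches the cases where nobody thought about it at all, which is where most sprawl starts.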
Wrapping Up
A maintained dashboard set is a multiplier on the rest of the stack. Without it, all the Loki ingestion and Mimir compaction in the world is just data nobody looks at.
The shape that works for us:
- 118 dashboards across 18 domain-organized folders. Domain organization maps to ownership and matches how questions actually get asked.
- ConfigMap + sidecar pattern makes every dashboard a Git artifact. Slower than UI edits; eliminates the v2 (FINAL) problem entirely.
- Multi-vendor normalization through recording rules: one dashboard reads `infrastructure:storage_*{vendor="..."}`, three vendors land in one view.
- Self-observability is a top-priority folder. Watch the watchers.
- Standing rules (owners, quarterly review, build-for-a-question, preview/ as staging) keep the set from sprawling.
One post left in the series — the First 100 Days retrospective. What worked, what didn’t, what we’d do differently if we were starting over today. It’s the post I wanted to write all along, and now that we’ve covered every other piece of the stack, it can actually rest on shared context rather than handwaving.
Happy automating!