Pipelines and Pizza 🍕

Grafana 13 on CloudNativePG: The Real Upgrade Walkthrough

Where We Left Off

Loki for logs, Mimir for metrics, both running on Kubernetes with Nutanix Objects as the S3 backend. Grafana is the consumer — the dashboards, the alerts, the user-facing window into everything we’ve spent the last seven articles deploying.

Grafana the binary is straightforward. Grafana the service in a regulated environment, with three replicas, OIDC auth, cross-DC active/standby, automated backups, and a documented upgrade process that survives audit — that’s the part this article is about.

This is also the article where I get to walk through the actual Grafana 12.4.2 → 13.0.1 upgrade we ran last month. It went well. It almost didn’t. The pre-upgrade Postgres backup is the reason it went well.


SQLite vs PostgreSQL for Grafana

Grafana defaults to SQLite for its internal database. That’s fine for a single-user laptop install. For three replicas behind an ingress in a regulated environment, it’s not.

The reasons line up almost exactly with how SQLite differs from PostgreSQL anywhere:

  • Concurrency. SQLite uses a single-writer file lock. Multiple Grafana replicas all writing to the same SQLite file fight over the lock; the chart actually refuses to start more than one replica with SQLite for this reason. PostgreSQL handles concurrent writers natively.
  • High availability. SQLite is one file on one disk. The file lives on a PVC. The PVC is pinned to one node. Lose the node and you lose Grafana state until recovery. With PostgreSQL streaming replication, a primary loss is a promotion of an existing replica.
  • Backup. You can copy a SQLite file. You hope nobody’s writing to it. You hope the backup is consistent. PostgreSQL has continuous WAL archiving, scheduled base backups, and point-in-time recovery as documented features.
  • Schema migrations. SQLite migrations run on Grafana startup. If a migration goes wrong, there is no rollback — the file is forward-migrated and you’re restoring from backup. With PostgreSQL, you can take a backup right before the migration runs and restore to that exact state if needed. We’ll see this play out in the Grafana 13 upgrade section.
  • Audit and compliance. When a regulator asks “show me your database backup and recovery procedure,” “we copy a SQLite file off a PVC” is not the answer that earns a clean audit. CNPG with WAL archival to S3 is.

For a personal-laptop install, SQLite is fine. For our use, no.


Why CloudNativePG

The decision is in ADR-010. Short version: GitOps purity.

The competing options were:

  • PostgreSQL on dedicated VMs. Traditional model, well-understood, but creates an operational island. Separate backup, separate monitoring, separate patching, separate access management. Everything else in our stack is ArgoCD-managed; running a database on VMs means an exception to that.
  • CrunchyData PGO. Also a great Postgres operator. CloudNativePG has stronger CNCF community momentum and is now a CNCF Sandbox project. For net-new, CNPG was the cleaner pick.
  • Managed cloud Postgres. Adds a cloud dependency to an on-prem architecture, adds latency from on-prem K8s to cloud Postgres, defeats the on-prem design principle. No.

CloudNativePG (CNPG) treats a PostgreSQL cluster as a Kubernetes CRD. You declare a Cluster resource in YAML. The operator handles provisioning, replication, failover, WAL archival, backup, even Postgres minor-version upgrades. Everything is declarative. ArgoCD reconciles the CRDs the same way it reconciles every other resource. DBA work becomes “edit YAML, merge PR, watch ArgoCD apply it.”


The Cluster We Run

The actual Cluster spec for grafana-db on EastCoastDC:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grafana-db
  namespace: postgres
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised

  storage:
    storageClass: local-path
    size: 10Gi

  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 4000m
      memory: 4Gi

  postgresql:
    parameters:
      shared_buffers: "1GB"
      effective_cache_size: "3GB"
      max_connections: "200"
      wal_level: "replica"
      max_wal_senders: "10"
      max_replication_slots: "10"

  bootstrap:
    initdb:
      database: grafana
      owner: grafana
      secret:
        name: grafana-db-credentials

  backup:
    barmanObjectStore:
      destinationPath: s3://gpl-backups/postgres/grafana-db/
      endpointURL: https://objects.conveyor.internal
      s3Credentials:
        accessKeyId:
          name: postgres-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-s3-credentials
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 4
      data:
        compression: gzip
    retentionPolicy: "7d"
    target: prefer-standby

A few decisions worth talking about:

3 instances on EastCoastDC (1 primary + 2 hot replicas), 1 instance on WestCoastDC streaming from EastCoastDC. EastCoastDC is the active site for Grafana; WestCoastDC is the standby. On a failover, WestCoastDC’s single instance gets promoted to primary, WestCoastDC serves Grafana, and we re-establish replication from WestCoastDC back to EastCoastDC once it’s back online.

primaryUpdateStrategy: unsupervised. When CNPG needs to update the primary (Postgres patch version, configuration change), it does so automatically: replicas are updated first, then the primary is either restarted in place or switched over to an updated replica, depending on primaryUpdateMethod. The alternative supervised mode waits for a manual switchover or restart before touching the primary. For a database that isn’t customer-facing, unsupervised is the lower-overhead choice.

Local-path PVCs at 10 Gi. Grafana’s database is small — dashboards, users, data source configs, alert state. The size is measured in megabytes, not gigabytes. Ten gigs is overprovisioning for headroom. Local-path because the data is durable through Barman archival to S3, not through PVC durability.

shared_buffers: "1GB" — PostgreSQL’s main in-memory cache. Set to roughly 25% of the memory limit (4 GiB), which is the standard tuning recommendation. effective_cache_size: "3GB" is the planner hint about how much OS-level cache is available; it’s not an allocation.

barmanObjectStore backup to Nutanix Objects. Continuous WAL archival plus scheduled base backups go to an S3 bucket on Nutanix Objects through CNPG’s Barman Cloud integration. CNPG handles the orchestration. The retentionPolicy: "7d" is a recovery window: CNPG keeps the base backups and WAL files needed to restore to any point in the last 7 days and prunes anything older.

target: prefer-standby — base backups run against a replica when possible, so they don’t load the primary. CNPG falls back to the primary if no healthy replica is available.

For WestCoastDC, the replication credentials and the EastCoastDC connection details live in the externalClusters stanza of its Cluster spec. The replication is async: transactions committed on the primary are streamed to the replicas, with a small replication lag visible in Postgres metrics.
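
A WestCoastDC replica cluster along these lines might look like the following. This is a minimal sketch rather than the production manifest: the external cluster name grafana-db-east, the host name, and the Secret grafana-db-east-replication are hypothetical placeholders, and it assumes TLS client-certificate authentication for the streaming_replica user.

```yaml
# Sketch of a CNPG replica cluster for WestCoastDC (illustrative only).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grafana-db
  namespace: postgres
spec:
  instances: 1
  storage:
    storageClass: local-path
    size: 10Gi

  # Seed the instance with a base backup streamed from the source cluster.
  bootstrap:
    pg_basebackup:
      source: grafana-db-east   # hypothetical; must match an externalClusters entry

  # Keep following the source as a read-only replica until promoted.
  replica:
    enabled: true
    source: grafana-db-east

  externalClusters:
    - name: grafana-db-east
      connectionParameters:
        host: grafana-db-rw.eastcoastdc.example.internal   # hypothetical endpoint
        user: streaming_replica
        dbname: postgres
        sslmode: verify-full
      sslKey:
        name: grafana-db-east-replication   # hypothetical Secret holding the client cert
        key: tls.key
      sslCert:
        name: grafana-db-east-replication
        key: tls.crt
      sslRootCert:
        name: grafana-db-east-replication
        key: ca.crt
```

With a layout like this, promoting WestCoastDC during a failover comes down to setting replica.enabled to false in the spec and letting the operator reconcile.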


Grafana Connecting to CNPG

The Grafana side is short. Three replicas, no PVC, all state in Postgres:

grafana:
  image:
    repository: grafana/grafana-enterprise
    tag: "13.0.1"
  replicas: 3
  deploymentStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  podDisruptionBudget:
    minAvailable: 1

  persistence:
    enabled: false                # All state in Postgres

  grafana.ini:
    database:
      type: postgres
      host: grafana-db-rw.postgres.svc:5432
      name: grafana
      user: $__env{username}
      password: $__env{password}
      ssl_mode: require
    server:
      root_url: https://grafana.conveyor.internal
    auth.azuread:
      enabled: true
      auto_login: true
      # ... Entra ID config
    auth:
      disable_login_form: false   # Intentional — break-glass admin

A handful of details:

grafana-db-rw.postgres.svc:5432 is the CNPG service that always routes to the current primary. If a failover happens, CNPG updates the endpoints behind this service and Grafana’s connections move with it. There are also grafana-db-ro (read-only, load-balances replicas) and grafana-db-r (any instance) services if you need them for read scaling. For Grafana, write traffic dominates and there’s no benefit to splitting reads.

maxUnavailable: 0 during rolling updates. We keep all three replicas serving during an upgrade — a new replica comes up before an old one goes down. The chart default is fine for most apps; for Grafana behind a TV dashboard wall, the strict mode prevents brief 502s during pod transitions.

disable_login_form: false is intentional. We use Entra ID OIDC as the primary auth, but the local admin login form stays enabled as a break-glass option. If Entra ID is down, or OIDC is misconfigured and locking everyone out, the local admin password (stored in Azure Key Vault, accessed via External Secrets) is the way back in. The trade-off is that local form auth is enabled at all, which our security team explicitly accepted for the emergency-access value.

Image renderer enabled for PDF/PNG dashboard exports. Grafana 13 requires a shared JWT between Grafana and the renderer (the old in-process renderer was removed), which the chart provisions automatically from a Kubernetes secret.

GODEBUG: "x509negativeserial=1" in the pod env. Our SolarWinds TLS certificate has a negative serial number (technically violates RFC 5280). Go’s x509 library rejects negative serials by default. This GODEBUG flag re-enables parsing them. A one-line workaround for a vendor cert we can’t change. The kind of thing that’s invisible in normal operation and impossible to diagnose without the env var visible somewhere.
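
For reference, a sketch of how that flag could be set in the same values file, assuming the chart's env map is used for extra container environment variables:

```yaml
grafana:
  env:
    # Go 1.23+ rejects X.509 certificates with negative serial numbers by default;
    # this setting restores the older, more permissive parsing.
    GODEBUG: "x509negativeserial=1"
```

Keeping it in the values file next to a comment is what makes the workaround discoverable later.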


The Real Grafana 12.4.2 to 13.0.1 Upgrade

This is the section I want to spend real time on, because the upgrade has a sharp edge that we navigated successfully but that I’d hate to hit blind.

Last month we upgraded Grafana from 12.4.2 to 13.0.1. The chart bump was straightforward — 11.3.6 → 12.1.1. The image went from grafana/grafana:12.4.2 (OSS) to grafana/grafana-enterprise:13.0.1 (Enterprise binary, unlicensed — behaves identically to OSS but keeps the option open to license later without re-deploying). We also skipped v13.0.0, which was withdrawn for a Git Sync migration bug.

The upgrade ran in phases:

Phase 0–1: Zero-impact prep

Done weeks before the actual upgrade. Things that could be deployed against 12.4.2 without breaking anything but were prerequisites for 13:

  • Enabled the grafanaAdvisor feature toggle — surfaces deprecated configs and plugin issues in-UI at /a/grafana-advisor-app. Caught two things we needed to fix before the upgrade.
  • Pre-wired the renderer JWT token. Grafana 13’s image renderer uses JWT auth instead of the legacy DB-token model. We added the Key Vault secret, the ExternalSecret, and the env var wiring on 12.4.2 (where it’s a no-op) so it was already there on 13.0.
  • Pinned the Infinity datasource plugin to a specific version (3.8.0). Floating-version plugins are a recipe for upgrade-day surprises.
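
The Advisor toggle from the first bullet can be expressed in the same values file. A minimal sketch, assuming the toggle is enabled through the feature_toggles section of grafana.ini:

```yaml
grafana:
  grafana.ini:
    feature_toggles:
      # Comma-separated list of toggles; only the Advisor is shown here.
      enable: grafanaAdvisor
```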

Phase 2: Dashboard conversion

Grafana 13 makes Scenes the default runtime with no opt-out. Older dashboards using the legacy graph panel type need to be converted to timeseries to avoid Scenes migration artifacts.

We programmatically converted 70 panels across several dashboards from graph to timeseries while still on 12.4.2. The conversions are visible in the renders, but the Scenes runtime handles converted timeseries panels cleanly while it stumbles on legacy graph panels.

The biggest single conversion was argocd-overview.json — schemaVersion 30 → 40, 31 graph panels → timeseries (we initially undercounted because some were inside collapsed rows). Stacking, legend calculations, heatmaps, and the underlying queries were all preserved.

Phase 3: The actual upgrade

EastCoastDC first, then WestCoastDC after a bake period. Per-DC because each DC has its own Postgres instance to back up, its own ArgoCD application to sync, its own verification pass.

The sequence per DC:

  1. Take a fresh Postgres backup. The most important step. Create an on-demand Backup CRD against the grafana-db cluster, wait for status completed, record the backup name. The whole point of this is the rollback target if the upgrade goes wrong.
  2. Merge the chart bump PR. Standard ArgoCD-managed deployment.
  3. Force ArgoCD hard refresh to avoid the 3-minute poll wait: kubectl annotate application grafana argocd.argoproj.io/refresh=hard --overwrite.
  4. Watch the rolling update. With maxUnavailable: 0, new pods come up first, become ready, then old pods terminate. No 502s. kubectl rollout status deploy/grafana --timeout=600s.
  5. Watch for unified-storage migration in the new pod logs. On first 13.0 startup, Grafana migrates folders and dashboards to its new unified storage backend. Messages look like migration / unified_storage / folder / dashboard. For our ~110 dashboards and ~12 folders, the migration finished in tens of seconds.
  6. Run the verification checklist before bake.
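
The on-demand backup in step 1 is a small manifest. A minimal sketch, where the backup name grafana-db-pre-13-upgrade is a hypothetical example:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: grafana-db-pre-13-upgrade   # hypothetical name; record it as the rollback target
  namespace: postgres
spec:
  # Use the barmanObjectStore configuration already defined on the Cluster.
  method: barmanObjectStore
  cluster:
    name: grafana-db
```

After applying it, kubectl get backup -n postgres shows the phase; the upgrade should not proceed until it reports completed.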

The verification checklist is the part I’d want to share with anyone planning a Grafana 13 upgrade. Ordered fastest-to-slowest-to-fail, so regressions surface early:

  • Infinity-backed dashboard renders. First check — catches plugin/React-19 breakage fastest.
  • Entra ID OIDC login succeeds, the user lands in the expected role.
  • Break-glass local admin works (test in an incognito window).
  • All datasources return healthy from GET /api/datasources. Cross-DC federation (loki-remote, mimir-remote) responsive.
  • Image renderer fires — trigger a panel PNG render, confirm success.
  • Five high-traffic dashboards render under Scenes: a NetApp Harvest one, a Windows one, the freshly-converted argocd-overview, a network dashboard, a vendor-status dashboard.
  • Unified alerting fires — trigger a test rule that routes to the Alertmanager datasource.
  • MCP API calls (UIDs only, no numeric IDs) succeed.
  • Pod memory stays within limits. Known 13.0.1 regression: ~20% higher than 12.4.2. If pods OOM, bump resources.limits.memory to 3 Gi.
  • Annotation / dashboard save round-trip works.
  • Advisor page loads and runs a fresh report.

The memory regression was the only post-upgrade surprise. We caught it on the first pod of the rollout, pre-emptively bumped the memory limit, and the rest of the rollout completed cleanly. No outage.

Bake on EastCoastDC for as long as the team is comfortable. Then repeat on WestCoastDC. The WestCoastDC pass is identical except that ArgoCD auto-sync is disabled there (WestCoastDC runs on a delayed-sync cadence), so the sync has to be triggered manually.


Backups and the Irreversible Moment

This is the part that justified the entire CNPG investment.

The Grafana 13 unified-storage migration is irreversible. Once it runs on first 13.0 boot, the schema in the Postgres database has changed, and Grafana 12.x can’t read it anymore. A SQLite rollback would mean restoring the file. A CNPG rollback means restoring the database from the backup we took in step 1.

The rollback procedure if we’d needed it:

# 1. Scale Grafana to 0 to stop further writes
kubectl scale deploy/grafana --replicas=0

# 2. Restore Postgres from the pre-upgrade backup
#    (creates a new Cluster from the Backup CRD with bootstrap.recovery)

# 3. Pin the chart back to 11.3.6, revert the values changes
git revert <upgrade-commit>
git push

# 4. Let ArgoCD reconcile; scale Grafana back to 3
kubectl scale deploy/grafana --replicas=3
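
Step 2 is the only part that isn't a one-liner. CNPG restores by bootstrapping a new Cluster from the backup rather than rewinding the existing one. A minimal sketch, assuming the pre-upgrade Backup was named grafana-db-pre-13-upgrade and using a hypothetical new cluster name grafana-db-restore:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grafana-db-restore   # hypothetical; recovery always creates a new cluster
  namespace: postgres
spec:
  instances: 3
  storage:
    storageClass: local-path
    size: 10Gi
  bootstrap:
    recovery:
      backup:
        name: grafana-db-pre-13-upgrade   # the Backup recorded before the upgrade
```

Because the restored cluster has a different name, its read-write service would be grafana-db-restore-rw, so the database host in grafana.ini has to follow it (or the old cluster has to be removed and the name reused) before Grafana is scaled back up.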

The cost of this rollback is any dashboard edits or annotations made during the bake window — those changes happened against the post-migration schema and don’t survive a restore to the pre-migration backup. Caught within hours? Probably fine. Caught after a full day of active use? You’re looking at a difficult conversation about lost work.

In our case the verification checklist passed and we didn’t need to roll back. But the option to roll back is what made the upgrade feasible at all. If Grafana ran on SQLite, this would have been a backup-the-file-and-pray exercise. With CNPG, it was a documented, reversible operation that the change management board could approve with a straight face.

The general principle: the upgrade plan and the rollback plan are not separate documents. They are one document where one section describes the forward path and another describes the reverse. The pre-upgrade backup is the explicit restore point: everything after it can be rolled back to that point, with bounded data loss.

The scheduled backup pattern keeps the long-term picture sane too:

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: grafana-db-daily
  namespace: postgres
spec:
  schedule: "0 0 2 * * *"   # 2 AM daily
  cluster:
    name: grafana-db
  target: prefer-standby

Continuous WAL archiving plus a daily base backup means we can restore to any point in the last 7 days (the retentionPolicy on the Cluster). If somebody deletes a dashboard at 11 PM, we can restore the database as of 10:55 PM into a separate, temporary cluster, copy the dashboard JSON out, and put it back without disturbing the rest of production.

We’ve never needed to do that. But we’ve been able to.
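
For the record, that kind of restore would be a point-in-time recovery into a throwaway cluster that reads from the same object store. A minimal sketch, where the cluster name grafana-db-pitr and the timestamp are placeholders:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: grafana-db-pitr   # hypothetical throwaway cluster
  namespace: postgres
spec:
  instances: 1
  storage:
    storageClass: local-path
    size: 10Gi
  bootstrap:
    recovery:
      source: grafana-db
      recoveryTarget:
        # Placeholder: replay WAL up to just before the accidental delete.
        targetTime: "2025-01-15 22:55:00+00"
  externalClusters:
    # The name matches the original cluster so CNPG looks in its folder of the bucket.
    - name: grafana-db
      barmanObjectStore:
        destinationPath: s3://gpl-backups/postgres/grafana-db/
        endpointURL: https://objects.conveyor.internal
        s3Credentials:
          accessKeyId:
            name: postgres-s3-credentials
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: postgres-s3-credentials
            key: SECRET_ACCESS_KEY
        wal:
          maxParallel: 4
```

The sketch deliberately has no backup section, so the temporary cluster only reads from the bucket and never archives into the production path. Once the dashboard JSON has been copied out, the cluster can simply be deleted.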


Wrapping Up

Putting Grafana on CNPG isn’t glamorous infrastructure work. It’s not the kind of project that goes on a quarterly OKR. But every quarter I’ve run this setup, the CNPG part has paid for itself in some small way — an easy upgrade with a clear rollback target, a 5-minute “restore this dashboard from yesterday” recovery, a clean answer to the auditor’s question about database backup procedures.

What’s worth keeping from this article:

  • Grafana with CNPG over SQLite — concurrency, HA, backups, recoverable upgrades. Worth the additional moving parts in any production environment.
  • CNPG topology: 3-instance primary cluster on EastCoastDC, 1-instance replica on WestCoastDC streaming asynchronously, primary updates unsupervised, backups to Nutanix Objects via Barman.
  • Grafana 13 upgrade in phases: prep weeks before, dashboard conversions on 12.4.2, then the actual chart bump with a fresh CNPG backup as the rollback target.
  • The verification checklist runs fastest-to-slowest-to-fail. Catches the cheapest regressions first.
  • The irreversible moment is the unified-storage migration on first 13.0 boot. The pre-upgrade Postgres backup is the rollback target.

Next post (article 9) covers dashboards that actually get used — the ~118 we maintain across 18 domain-organized folders, what makes a dashboard worth opening at 2 AM, the multi-vendor storage section that wraps NetApp Harvest + Pure FlashArray + Nutanix into one view, and the ConfigMap-sidecar pattern that puts every dashboard under GitOps.

Happy automating!