Alloy in Production: The DaemonSet Config Running The Conveyor's Observability

Table of Contents

Where We Left Off
Pod Logs
Kubernetes Audit Logs
Node Metrics
Kubelet and cAdvisor
Control Plane: apiserver and etcd
Annotated Pods, CoreDNS, and CNPG
Synthetic Probes with Blackbox Exporter
Resource Sizing
Lessons From Three Months In Production
Wrapping Up

Where We Left Off

In the last post we covered why Alloy, the three deployment topologies we run, and how the DaemonSet, the syslog Deployment, and the traces Deployment fit together. That article was the menu. This one is the recipe — the actual config running on the cluster right now, collecting telemetry in our monitoring platform.

Everything in this post comes from observability/alloy/configs/daemonset.alloy in our GitOps repo. The pipeline has been running since February, and the HA side is in the middle of coming online. Most of the lines below have a story behind them — usually a bug, sometimes a surprise, occasionally an upstream gotcha that cost a Friday afternoon. Where the story is interesting, I will share it.

A note on tense before we start: the LGTM stack on Nutanix is in production for us, but the GitOps repo is still tagged pre-1.0. v0.6.2 is the current version. The “production” line gets crossed when we cut v1.0.0 after DR comes online and we run a full DR drill. Until then we run the platform like it is in production — because operationally, it is.

Pod Logs

The most-used part of the pipeline. Every container running on every node in the cluster lands its stdout and stderr here.

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Keep only running pods
  rule {
    source_labels = ["__meta_kubernetes_pod_phase"]
    regex         = "Pending|Succeeded|Failed|Unknown"
    action        = "drop"
  }

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label  = "node"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    target_label  = "app"
  }
}

loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.process.pod_pipeline.receiver]
}

loki.process "pod_pipeline" {
  stage.static_labels {
    values = {
      source = "kubernetes",
    }
  }

  stage.drop {
    expression = "^\\s*$"
  }

  // Forward to local Loki only — remote dual-write disabled
  // until DR DC is online. Re-enable:
  //   forward_to = [loki.write.local.receiver, loki.write.remote.receiver]
  forward_to = [loki.write.local.receiver]
}

A few things worth calling out:

The phase-drop rule. Discovery picks up every pod in every phase, including ones that have not started yet or have already exited. Tailing logs from a Succeeded pod that the kubelet is about to garbage-collect generates noise and occasional errors. Dropping non-Running pods at relabel time keeps the pipeline tidy.
stage.drop for blank lines. A handful of our applications emit empty log lines on shutdown. They cost nothing to drop and they keep LogQL aggregations honest.
Dual-write to two Loki endpoints, currently single. The intended steady state is dual-write: every log line lands in both DCs’ Loki instances at the same time, so a DC outage does not lose log telemetry. DR is still being built, so the remote receiver is commented out with a re-enable comment right next to it. Future me, please read that comment.

The two loki.write components themselves are at the top of the file and pull their URLs from env vars:

loki.write "local" {
  endpoint {
    url = env("LOKI_WRITE_URL_LOCAL")
  }
  external_labels = {
    cluster = "conveyor-platform",
    dc      = env("DC_LABEL"),
  }
}

loki.write "remote" {
  endpoint {
    url = env("LOKI_WRITE_URL_REMOTE")
  }
  external_labels = {
    cluster = "conveyor-platform",
    dc      = env("DC_LABEL"),
  }
}

The DC-specific values (ProdDC vs DRDC, local vs remote Loki URLs) come from the per-DC Helm values file. One Alloy config; two DCs; zero config drift.

Kubernetes Audit Logs

Auditors love these. The control plane writes JSON-formatted audit events to /var/log/kube-audit/audit.log on each control plane node. We tail that file and ship it to Loki with the relevant fields promoted to labels.

local.file_match "audit_logs" {
  path_targets = [{
    __path__ = "/var/log/kube-audit/audit.log",
    job      = "kube-audit",
    source   = "audit",
  }]
}

loki.source.file "audit_logs" {
  targets    = local.file_match.audit_logs.targets
  forward_to = [loki.process.audit_pipeline.receiver]
}

loki.process "audit_pipeline" {
  stage.json {
    expressions = {
      verb         = "verb",
      user         = "user.username",
      resource     = "objectRef.resource",
      audit_ns     = "objectRef.namespace",
      api_group    = "objectRef.apiGroup",
      status_code  = "responseStatus.code",
      audit_level  = "level",
    }
  }

  stage.labels {
    values = {
      verb        = "",
      user        = "",
      resource    = "",
      audit_ns    = "",
      status_code = "",
      audit_level = "",
    }
  }

  // Drop noisy low-value audit events
  stage.drop {
    source     = "resource"
    expression = "^(selfsubjectaccessreviews|selfsubjectrulesreviews|leases|customresourcedefinitions)$"
  }

  forward_to = [loki.write.local.receiver]
}

That last stage.drop is a story.

The first version of this pipeline didn’t have it. We turned on audit logging, deployed Alloy, opened Grafana, and immediately the Loki ingester started rejecting entries. The errors looked like: entry too large, max size is 256KB.

The culprit was every CustomResourceDefinition update. CRDs carry their full OpenAPI v3 schema in spec.versions[].schema.openAPIV3Schema. For something like a CNPG Cluster CRD, that JSON blob is around 368KB. Every time ArgoCD reconciled and touched the CRD, the audit pipeline emitted an entry larger than Loki was willing to accept.

We had three options: raise Loki’s max_line_size limit (treats the symptom, not the cause), filter the schema field in the JSON stage (fragile, the field path changes between API versions), or just drop CRD audit events entirely. We picked the third: the security value of auditing CRD changes is low, and we capture them via ArgoCD’s own audit trail. Decision documented in ADR-036; problem solved with one regex.

The other noisy resources in the drop list (selfsubjectaccessreviews, selfsubjectrulesreviews, leases) show up on every healthy cluster in enormous volume. RBAC self-checks fire on every controller startup; lease renewals fire every few seconds for each leader-elected controller and each node heartbeat. None of them tell you anything an attacker would do that you couldn’t catch elsewhere. Out they go.
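A drop list like this is easy to sanity-check offline before it ships. The sketch below is a rough Python model of the stage.drop above, not Alloy itself: it pulls objectRef.resource out of an audit event and applies the same anchored regex. The should_drop helper and the sample events are made up for illustration, and the "missing field means keep" behavior is my understanding of the stage rather than something tested against every Alloy version.

```python
import json
import re

# Same pattern as the stage.drop expression in the audit pipeline.
NOISY_RESOURCES = re.compile(
    r"^(selfsubjectaccessreviews|selfsubjectrulesreviews|leases|customresourcedefinitions)$"
)


def should_drop(raw_line: str) -> bool:
    """Return True if the audit event's objectRef.resource is in the drop list."""
    event = json.loads(raw_line)
    resource = (event.get("objectRef") or {}).get("resource")
    # Events without an objectRef (non-resource requests) are kept.
    return resource is not None and NOISY_RESOURCES.match(resource) is not None


# Hypothetical, heavily trimmed audit events.
samples = [
    '{"verb": "update", "objectRef": {"resource": "leases", "namespace": "kube-system"}}',
    '{"verb": "get", "objectRef": {"resource": "secrets", "namespace": "argocd"}}',
    '{"verb": "get", "requestURI": "/healthz"}',
]

for line in samples:
    print(should_drop(line), line)
```

Running a day of real audit lines through something like this before changing the regex shows exactly what a new entry in the list would throw away.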

Node Metrics

The classic node_exporter set. Alloy ships a built-in equivalent via prometheus.exporter.unix, so we use that instead of the separate node_exporter binary. One fewer DaemonSet to manage.

prometheus.exporter.unix "node" {
  set_collectors = [
    "cpu", "meminfo", "diskstats", "filesystem", "loadavg",
    "netdev", "netstat", "stat", "time", "uname", "vmstat",
  ]
}

prometheus.scrape "node_metrics" {
  targets         = prometheus.exporter.unix.node.targets
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "30s"
}

The set_collectors list is deliberately short. node_exporter ships with dozens of collectors, many of them enabled by default, and set_collectors replaces that default set with exactly the list above. The ones we keep give us CPU, memory, disk, filesystem, load, network, basic stat counters, time sync, kernel/OS identification, and VM stats. That covers every dashboard we care about and a few I haven’t built yet.

The prometheus.remote_write it forwards to looks like this:

prometheus.remote_write "mimir" {
  endpoint {
    url = env("MIMIR_WRITE_URL")
  }
  external_labels = {
    cluster = "conveyor-platform",
    dc      = env("DC_LABEL"),
  }
}

Mimir writes are single-DC. Unlike Loki, we don’t dual-write metrics across data centers — Mimir uses Nutanix Objects as the long-term backend, and the buckets are replicated at the object layer. Two streams writing to the same Mimir would be duplicate ingestion for no benefit.

Kubelet and cAdvisor

This is the section where every Alloy DaemonSet config goes wrong the first time. The mistake looks like this: scrape kubelet and cAdvisor from every node, get every node’s metrics back, multiply by the number of DaemonSet pods (one per node), and watch Mimir ingest the same data N times.

The fix is to scrape only the node the local Alloy pod is running on:

discovery.kubernetes "nodes" {
  role = "node"
}

discovery.relabel "kubelet" {
  targets = discovery.kubernetes.nodes.targets

  // Only scrape the node this Alloy instance runs on
  // (prevents duplicate samples from DaemonSet)
  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }

  rule {
    replacement  = "/metrics"
    target_label = "__metrics_path__"
  }
  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "node"
  }
}

prometheus.scrape "kubelet" {
  targets           = discovery.relabel.kubelet.output
  forward_to        = [prometheus.remote_write.mimir.receiver]
  scrape_interval   = "30s"
  scheme            = "https"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  }
}

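The env("HOSTNAME") keep rule is the important part, and it only works if HOSTNAME holds the node name. Left alone, Kubernetes sets a pod’s hostname to the pod name, which for a DaemonSet pod is not the node name, so the env var has to be set from spec.nodeName through the downward API (or the pod has to run on the host network). With that in place, each Alloy instance discovers all nodes, drops everything that isn’t its own host, and scrapes once. Three nodes × three Alloy pods = three scrapes, not nine.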

The same pattern repeats for cAdvisor:

discovery.relabel "cadvisor" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }

  rule {
    replacement  = "/metrics/cadvisor"
    target_label = "__metrics_path__"
  }
  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "node"
  }
}

prometheus.scrape "cadvisor" {
  targets           = discovery.relabel.cadvisor.output
  forward_to        = [prometheus.remote_write.mimir.receiver]
  scrape_interval   = "30s"
  scheme            = "https"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  }
}
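The keep rule in the two blocks above can be modeled in a few lines of Python. This is a toy model of the relabel step, not Alloy code: the node names and the pod name are hypothetical, and the only behavior borrowed from Prometheus-style relabeling is that the regex is anchored at both ends.

```python
import re

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def kept_targets(all_nodes, hostname, use_keep_rule=True):
    """Model one Alloy pod: discover every node, optionally keep only its own."""
    if not use_keep_rule:
        return list(all_nodes)
    # Relabel regexes are anchored at both ends, so fullmatch is the
    # closest Python equivalent.
    return [n for n in all_nodes if re.fullmatch(hostname, n)]


def total_scrapes(all_nodes, use_keep_rule):
    """Sum the scrape targets across one Alloy pod per node."""
    return sum(
        len(kept_targets(all_nodes, hostname, use_keep_rule))
        for hostname in all_nodes
    )


print(total_scrapes(NODES, use_keep_rule=False))  # every pod scrapes every node
print(total_scrapes(NODES, use_keep_rule=True))   # every pod scrapes only its own node
print(kept_targets(NODES, "alloy-x7k2p"))         # hypothetical pod name as HOSTNAME
```

The last call is the failure mode worth remembering: if HOSTNAME holds a pod name rather than a node name, the rule keeps nothing and the scrape quietly has zero targets.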

Control Plane: apiserver and etcd

These two are RKE2-specific. The apiserver listens on each node’s InternalIP on port 6443, and RKE2 exposes etcd metrics on port 2381 over plain HTTP: no TLS, no auth, and reachable only from inside the cluster network.

The apiserver scrape:

discovery.relabel "apiserver" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }

  rule {
    source_labels = ["__meta_kubernetes_node_address_InternalIP"]
    regex         = "(.+)"
    replacement   = "${1}:6443"
    target_label  = "__address__"
  }

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "node"
  }
}

prometheus.scrape "apiserver" {
  targets           = discovery.relabel.apiserver.output
  forward_to        = [prometheus.remote_write.mimir.receiver]
  scrape_interval   = "30s"
  metrics_path      = "/metrics"
  scheme            = "https"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
  }
  job_name = "apiserver"
}

Same HOSTNAME dedup trick. The address rewrite takes the discovered node’s InternalIP and slaps :6443 on the end — that’s where RKE2’s apiserver lives on each control-plane node.

The etcd scrape is more interesting, partly for the scheme it uses and partly for the cert handling that sits around it:

discovery.relabel "etcd" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }

  rule {
    source_labels = ["__meta_kubernetes_node_address_InternalIP"]
    regex         = "(.+)"
    replacement   = "${1}:2381"
    target_label  = "__address__"
  }

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "node"
  }
}

prometheus.scrape "etcd" {
  targets         = discovery.relabel.etcd.output
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "30s"
  metrics_path    = "/metrics"
  scheme          = "http"
}

Wait — scheme = "http"? On etcd? The thing storing all my cluster state in plaintext over the wire?

Yes. RKE2 exposes etcd’s metrics endpoint on :2381 with no TLS by design. The real etcd peer/client traffic on :2379 and :2380 is mutual-TLS and bound tightly. The :2381 endpoint is metrics only — no read or write access to keys, no peer protocol, just Prometheus text format. It’s only reachable from inside the cluster network, and our network policy locks it down to the Alloy service account. Compliant with our internal review; documented.

The etcd cert init container in our Helm values is for a different reason: the etcd CA cert (/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt) lives on the host filesystem owned by root:root with mode 0600. Alloy runs as UID 473 (the chart’s default non-root user). It cannot read root-owned files. So we run a privileged init container that copies the cert from the hostPath mount to a shared emptyDir with mode 0644, and the main Alloy container reads it from there. This isn’t used by the :2381 scrape today, but it’s wired up for future TLS work on related endpoints.

Annotated Pods, CoreDNS, and CNPG

Annotation-based discovery is the conventional path: pods that want to be scraped set prometheus.io/scrape: "true", optionally prometheus.io/port, optionally prometheus.io/path, and the scraper picks them up.

discovery.relabel "pod_metrics" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
    regex         = "true"
    action        = "keep"
  }
  // Keep only pods on this node so each DaemonSet instance scrapes them once
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_path"]
    regex         = "(.+)"
    target_label  = "__metrics_path__"
  }
  rule {
    source_labels = [
      "__meta_kubernetes_pod_annotation_prometheus_io_port",
      "__meta_kubernetes_pod_ip",
    ]
    regex        = "(\\d+);(\\d+\\.\\d+\\.\\d+\\.\\d+)"
    replacement  = "${2}:${1}"
    target_label = "__address__"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
}

prometheus.scrape "pod_metrics" {
  targets         = discovery.relabel.pod_metrics.output
  forward_to      = [prometheus.remote_write.mimir.receiver]
  honor_labels    = true
  scrape_interval = "30s"
}
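The address rule in the relabel block above is the least obvious part, so here is a small Python model of it. It assumes the usual relabel semantics (source labels joined with ";", regex anchored at both ends, no change when the regex does not match); the label values are invented for the example.

```python
import re

# Same pattern as the __address__ rule: "<port>;<IPv4>".
ADDRESS_RULE = re.compile(r"(\d+);(\d+\.\d+\.\d+\.\d+)")


def rewrite_address(labels: dict) -> dict:
    """Apply the port-annotation address rule to one discovered target."""
    port = labels.get("__meta_kubernetes_pod_annotation_prometheus_io_port", "")
    ip = labels.get("__meta_kubernetes_pod_ip", "")
    joined = f"{port};{ip}"  # source_labels are joined with ";" by default
    match = ADDRESS_RULE.fullmatch(joined)  # relabel regexes are fully anchored
    out = dict(labels)
    if match:
        out["__address__"] = f"{match.group(2)}:{match.group(1)}"
    # No match (missing annotation, IPv6 pod IP): the discovered address stays.
    return out


# Hypothetical targets as discovery would hand them over.
with_port = {
    "__address__": "10.42.1.17:8080",
    "__meta_kubernetes_pod_ip": "10.42.1.17",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9100",
}
without_port = {
    "__address__": "10.42.1.18:8080",
    "__meta_kubernetes_pod_ip": "10.42.1.18",
}

print(rewrite_address(with_port)["__address__"])
print(rewrite_address(without_port)["__address__"])
```

The second case is the one that surprises people: a pod with the scrape annotation but no port annotation is still scraped, on whatever container port discovery picked.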

This covers most apps that expose Prometheus metrics. Slap two annotations on the pod spec, redeploy, the metrics show up. No code change to Alloy required. The Rubrik exporter, kube-state-metrics, and a dozen others ride on this pipeline.

CoreDNS doesn’t ride on it though, because RKE2 doesn’t annotate its pods. So CoreDNS gets a dedicated scrape:

discovery.relabel "coredns" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    regex         = "kube-system"
    action        = "keep"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    regex         = "coredns"
    action        = "keep"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_ip"]
    regex         = "(.+)"
    replacement   = "${1}:9153"
    target_label  = "__address__"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label  = "node"
  }
}

prometheus.scrape "coredns" {
  targets         = discovery.relabel.coredns.output
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "30s"
  job_name        = "coredns"
}

CNPG (CloudNativePG) is in the same boat — the operator-managed pods don’t carry Prometheus annotations either, but they do carry a cnpg.io/cluster label naming the database cluster they belong to. So we discover by that label instead:

discovery.relabel "cnpg_metrics" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_label_cnpg_io_cluster"]
    regex         = ".+"
    action        = "keep"
  }
  // Keep only pods on this node so each DaemonSet instance scrapes them once
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    regex         = env("HOSTNAME")
    action        = "keep"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_ip"]
    regex         = "(.+)"
    replacement   = "${1}:9187"
    target_label  = "__address__"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_cnpg_io_cluster"]
    target_label  = "cluster_name"
  }
}

prometheus.scrape "cnpg_metrics" {
  targets         = discovery.relabel.cnpg_metrics.output
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "30s"
}

The pattern — keep on a label, build the address from the pod IP and a known port, promote the cluster name into a metric label — generalizes. Anywhere you have operator-managed pods with predictable labels and a known metrics port, label-based discovery beats waiting for annotations to land upstream.

Synthetic Probes with Blackbox Exporter

The last block in the DaemonSet is probably the one I get the most value out of relative to its size. Blackbox exporter probes endpoints over HTTP, TCP, or ICMP and exports probe_success, probe_duration_seconds, and a handful of TLS-related metrics. The Alloy side just feeds it a target list and points each scrape at /probe?target=<url>&module=<module>.

discovery.relabel "blackbox_targets" {
  targets = [
    { "__address__" = "http://grafana.observability.svc:80/api/health",
      "module" = "http_2xx", "instance" = "grafana" },
    { "__address__" = "http://argocd-server.argocd.svc:80/healthz",
      "module" = "http_2xx_skip_verify", "instance" = "argocd" },
    { "__address__" = "http://loki-read.observability.svc:3100/ready",
      "module" = "http_2xx", "instance" = "loki" },
    { "__address__" = "https://prism-east.conveyor.internal:9440",
      "module" = "http_2xx_skip_verify", "instance" = "nutanix-prism-east" },
    { "__address__" = "https://observability-kv.vault.azure.net/healthstatus",
      "module" = "http_2xx_tls", "instance" = "azure-keyvault" },
    // ...about 50 more — internal retail apps, public marketing sites,
    // Azure services, vendor SaaS, the long tail of things that page
    // someone at 2 AM when they're down.
  ]

  rule {
    source_labels = ["__address__"]
    target_label  = "__param_target"
  }
  rule {
    source_labels = ["module"]
    target_label  = "__param_module"
  }
  rule {
    target_label = "__address__"
    replacement  = "blackbox-exporter-prometheus-blackbox-exporter.observability.svc.cluster.local:9115"
  }
}

prometheus.scrape "blackbox" {
  targets         = discovery.relabel.blackbox_targets.output
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "60s"
  scrape_timeout  = "15s"
  metrics_path    = "/probe"
  job_name        = "blackbox"
}

Why a static list and not service discovery? Because most of what we probe doesn’t run in this cluster. Internal retail apps, public-facing customer portals, Azure endpoints, vendor SaaS — they live anywhere but here. The static list is curated in the Helm values for blackbox-exporter and reflected here. Adding a target is two lines in a values file, a PR, an ArgoCD sync, and you have a new probe.

The three relabel rules at the end are the standard blackbox idiom. __param_target and __param_module become URL query parameters when Alloy builds the scrape request, and the address rewrite points the actual TCP connection at the blackbox-exporter pod instead of the target. Alloy never opens a socket to your CRM directly — it asks blackbox to do it from inside the cluster network and report what happened.
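To make the idiom concrete, here is a minimal Python sketch of what those three rules do to one target and the URL that comes out the other end. It is a model of the behavior described above, not Alloy's implementation; the relabel and scrape_url helpers are hypothetical, and the http scheme is assumed because the scrape block does not set one.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BLACKBOX = "blackbox-exporter-prometheus-blackbox-exporter.observability.svc.cluster.local:9115"


def relabel(target: dict) -> dict:
    """Apply the three blackbox relabel rules to one static target."""
    out = dict(target)
    out["__param_target"] = out["__address__"]  # rule 1: probed URL becomes ?target=
    out["__param_module"] = out["module"]       # rule 2: module becomes ?module=
    out["__address__"] = BLACKBOX               # rule 3: connect to blackbox instead
    return out


def scrape_url(labels: dict, metrics_path: str = "/probe", scheme: str = "http") -> str:
    """Build the URL a scraper would request from the relabeled target."""
    params = {
        key[len("__param_"):]: value
        for key, value in sorted(labels.items())
        if key.startswith("__param_")
    }
    return f"{scheme}://{labels['__address__']}{metrics_path}?{urlencode(params)}"


grafana = {
    "__address__": "http://grafana.observability.svc:80/api/health",
    "module": "http_2xx",
    "instance": "grafana",
}

url = scrape_url(relabel(grafana))
print(url)
print(parse_qs(urlsplit(url).query))
```

Pasting the printed URL into curl from a pod inside the cluster is a quick way to see what blackbox returns for a target before adding it to the list.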

Resource Sizing

For a 3-node cluster sized like ours — a few dozen apps, kube-state-metrics, CNPG, ArgoCD, the LGTM stack itself — here are the Helm requests and limits we run:

Resource	Request	Limit
CPU	200m	1000m
Memory	512Mi	4Gi

The 4Gi memory limit is generous on purpose. Alloy’s memory profile is mostly driven by active series count for the metrics pipeline and the volume of log streams being tailed. When WestCoastDC comes online and Loki dual-write is re-enabled, the WAL buffering on the remote-write side will push memory up. Better to size for that day than to wake up to OOMKilled pods when it arrives.

A couple of other knobs in the values that matter:

priorityClassName: platform-critical — Alloy is what alerts run on. If the cluster is under pressure, evicting Alloy makes the on-call situation worse, not better.
tolerations: [{ operator: Exists }] — Run on every node including control plane. Otherwise you miss kubelet/cAdvisor/etcd/apiserver metrics on the control plane nodes, which is where they matter most.
securityContext with runAsUser: 473, runAsNonRoot: true, allowPrivilegeEscalation: false, capabilities.drop: [ALL], seccompProfile: RuntimeDefault — the strict-but-runnable baseline. The DaemonSet also mounts host paths (varlog, dockercontainers) and runs without readOnlyRootFilesystem: true, but everything else clamps down.

The pod also gets an init container that copies the etcd CA cert from the host to an emptyDir with permissions UID 473 can read. We covered the why in the etcd section above.

Lessons From Three Months In Production

Five things that bit us, in rough order of how long it took to figure out.

1. Duplicate samples from the DaemonSet

The first 30 minutes after we turned on kubelet scraping, Mimir’s ingestion rate spiked to three times what it should have been. Three Alloy pods, each scraping all three nodes’ kubelets — duplicate samples landed in Mimir under indistinguishable label sets and the deduplication-on-ingest didn’t catch all of them.

The fix is the env("HOSTNAME") keep rule in every discovery.relabel block that targets node-level endpoints. With it, each Alloy instance only scrapes its own node. Without it, you scale your scrape volume by the size of your DaemonSet, which is the opposite of what you want.

2. CRD audit entries blowing past Loki’s 256KB limit

Covered in the kube-audit section above. The short version: Kubernetes audit entries include the full request body, CRDs carry huge OpenAPI schemas, and Loki rejects any single entry over 256KB. The right fix was to drop CRD audit events at the Alloy stage instead of raising the Loki limit. ADR-036 has the full reasoning.

The general lesson is broader though: anything that emits a single log entry over a couple hundred KB should be filtered or split before it gets to Loki. Loki is not a database for blobs. If you find yourself wanting to raise max_line_size, stop and look at what is actually emitting that line.
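Finding out what is emitting the oversized lines does not need anything fancy. The sketch below is one way to do it for an audit log, assuming one JSON event per line; the oversized_by_resource helper and the synthetic events are made up for the example, and the limit is the 256KB figure from above.

```python
import json
from collections import Counter

LOKI_LIMIT_BYTES = 256 * 1024  # the 256KB limit discussed above


def oversized_by_resource(lines, limit=LOKI_LIMIT_BYTES):
    """Count entries larger than `limit` bytes, grouped by objectRef.resource."""
    counts = Counter()
    for line in lines:
        if len(line.encode("utf-8")) <= limit:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        if not isinstance(event, dict):
            counts["<unparseable>"] += 1
            continue
        resource = (event.get("objectRef") or {}).get("resource", "<none>")
        counts[resource] += 1
    return counts


# Synthetic events: one CRD update padded past the limit, one small lease renewal.
big_crd = json.dumps({
    "objectRef": {"resource": "customresourcedefinitions"},
    "requestObject": {"schema": "x" * (300 * 1024)},
})
small_lease = json.dumps({"objectRef": {"resource": "leases"}})

print(oversized_by_resource([big_crd, small_lease]))
```

Against a real file, passing the open file handle as lines works as is, since an open text file iterates line by line.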

3. Alloy hardcodes service.name=alloy in its tracing block

We send a small sample of Alloy’s own traces to Tempo to dogfood the tracing pipeline. The first time we looked, every Alloy span across both DCs landed in Tempo under service.name=alloy. Indistinguishable. Useless.

It turns out Alloy’s tracing config block emits a hardcoded service.name=alloy on every span. There’s no field to override it. The workaround is to route the tracing pipeline through an OTel processor that rewrites the resource attribute before exporting:

tracing {
  sampling_fraction = 0.1
  write_to          = [otelcol.processor.transform.self_trace_name.input]
}

otelcol.processor.transform "self_trace_name" {
  trace_statements {
    context = "resource"
    statements = [
      "set(attributes[\"service.name\"], \"" + env("OTEL_SERVICE_NAME") + "\")",
    ]
  }
  output {
    traces = [otelcol.exporter.otlp.self_trace.input]
  }
}

otelcol.exporter.otlp "self_trace" {
  client {
    endpoint = "tempo-distributor.observability.svc:4317"
    tls { insecure = true }
  }
}

Now alloy-daemonset shows up distinctly from alloy-network and alloy-traces, and we can actually use the spans.

4. The Rubrik exporter cache stalling scrapes

We wrote a small Python exporter for Rubrik backup metrics — Rubrik’s GraphQL API is slow enough that you can’t hit it on every scrape. The first version called collect_metrics() synchronously inside the request handler whenever the cache expired. Result: the unlucky scrape that triggered the refresh ran ~3 minutes, exceeded Alloy’s 10-second scrape_timeout, recorded up=0, and made the dashboard go blank every 15 minutes on the dot.

Fix was to move the refresh to a background thread on the same TTL cadence, and have the request handler only ever read the cache under a lock. Bumped exporter to v0.2.0. The lesson — any custom exporter that wraps a slow upstream needs to decouple the fetch from the scrape, full stop. Don’t make Alloy wait.
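The shape of that fix, stripped of everything Rubrik-specific, looks roughly like the sketch below. It is not the exporter's actual code: BackgroundCache, slow_fetch, and the metric name are stand-ins, and a real exporter would render the cached values in its HTTP handler instead of printing them.

```python
import threading
import time


class BackgroundCache:
    """Serve the last fetched value instantly; refresh it on a background thread."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch            # slow upstream call (hypothetical)
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._value = None             # None until the first fetch completes
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            try:
                value = self._fetch()  # runs outside the lock, however long it takes
            except Exception:
                pass                   # keep serving the last good value
            else:
                with self._lock:       # lock held only for the swap
                    self._value = value
            self._stop.wait(self._ttl)

    def get(self):
        """What the scrape handler calls: never touches the upstream."""
        with self._lock:
            return self._value


def slow_fetch():
    """Stand-in for a slow API call such as the Rubrik GraphQL query."""
    time.sleep(0.2)
    return {"rubrik_jobs_failed": 0}  # made-up metric name


if __name__ == "__main__":
    cache = BackgroundCache(slow_fetch, ttl_seconds=1.0)
    cache.start()
    time.sleep(0.5)
    print(cache.get())
    cache.stop()
```

The design choice that matters is that the lock is held only for the pointer swap, so a scrape can never queue behind the upstream call. The trade-off is that the first scrapes after startup see an empty cache, which the handler has to treat as "no data yet" rather than an error.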

5. Built-in UI is your friend, especially the component graph

Port-forward to :12345 on any Alloy pod and you get a live graph of every component, its inputs, its outputs, its health, and its target list. When discovery.kubernetes shows zero targets, you see it. When loki.write is queued up and not flushing, you see it. When a relabel rule is dropping everything, you click through and see which rule.

I don’t run kubectl logs against Alloy first anymore. I port-forward and look at the graph. It’s saved me an hour at least three times. Like having a clear glass door on the oven — you can see exactly which slice is burning before you open it up.

kubectl port-forward -n alloy ds/alloy 12345:12345

Wrapping Up

That’s the production DaemonSet config — three months into running it on a live cluster, and most of the shape has held. A few cells in the troubleshooting matrix got filled in along the way, but the architecture from day one is what’s still running.

Five things to take from this post:

Use env("HOSTNAME") to dedup DaemonSet scrapes against per-node endpoints. Without it, you scale your ingest volume by the size of your cluster.
Filter your audit pipeline aggressively. CRDs, RBAC self-reviews, lease renewals: little of it has security value, all of it has volume cost.
Label-based discovery beats annotation discovery when the operator doesn’t cooperate. CNPG and CoreDNS both ride on labels here.
Decouple slow exporters from Alloy scrapes. Background refresh, synchronous read from cache. Never make Alloy wait on a 3-minute API call.
The Alloy UI on port 12345 is the first place to look when something is wrong. Before kubectl logs. Before anything else.

Next post we move one layer down the stack — deploying Loki on Kubernetes. Single-binary vs. simple-scalable, storage to Nutanix Objects, the ingester/distributor/querier topology, and the local-path PVC patterns that play nice with rolling node upgrades. The collection layer is in. Now we build the place the logs actually land.

Happy automating!