The Grafana Observability Stack: A Pragmatic Deep Dive

Something broke in production at 2 AM. The on-call engineer opens Slack, sees a firing alert, clicks the dashboard link, stares at a spike in error rate, pivots to the logs for the failing service, finds a trace ID, follows it across six microservices, and discovers a database connection pool exhaustion caused by a deployment three hours ago that changed a timeout value.

That entire workflow — alert to dashboard to logs to trace to root cause — is what an observability stack does. Not monitoring. Not logging. Observability.

The distinction matters. Monitoring tells you when something is wrong. Observability tells you why.

What Observability Actually Means

Observability rests on three pillars, each answering a different question:

Metrics answer “what is happening right now?” They are numeric time-series data points. CPU usage over time. Request rate per endpoint. Error count per status code. Metrics are cheap to store, fast to query, and excellent for alerting. They tell you the system’s vital signs.

Logs answer “what happened?” They are timestamped text records of discrete events. A request was received. A query was executed. An error was thrown. Logs are rich in detail but expensive to store and slow to search at scale because they are unstructured or semi-structured text.

Traces answer “how did it happen?” They are records of a single request’s journey through a distributed system. This service called that service, which called a database, which timed out. Traces connect the dots that metrics and logs leave scattered.

Each pillar alone is insufficient. Metrics without logs means you know something is slow but not why. Logs without traces means you know a query failed but not which user request triggered it. Traces without metrics means you can debug individual requests but can’t see systemic patterns.

Modern distributed systems — microservices communicating over networks, stateless containers spinning up and dying, asynchronous message queues decoupling producers and consumers — make all three pillars mandatory. A monolith’s stack trace tells you the full story. A distributed system’s stack trace tells you one chapter.

Where the Grafana Stack Fits

The observability ecosystem is crowded. Commercial SaaS platforms (Datadog, New Relic, Splunk) offer turnkey solutions at significant cost. The Elastic stack (ELK) has been the open-source default for a decade. Cloud providers offer native tools (CloudWatch, Cloud Monitoring, Azure Monitor) that are deeply integrated but lock you in.

The Grafana stack occupies a specific niche: open-source, composable, cost-conscious, and cloud-native. It is not a monolithic platform. It is a collection of purpose-built tools that share design principles and integrate tightly. You can adopt pieces independently. You can swap components. You can run it on a single machine or across multiple Kubernetes clusters.

The core components:

  • Grafana: Visualization and dashboards
  • Prometheus: Metrics collection and storage
  • Loki: Log aggregation
  • Tempo: Distributed tracing
  • Alertmanager: Alert routing and deduplication

Each has a distinct job. Each was designed to do that job well and nothing else. This modularity is the stack’s greatest strength and, sometimes, its most frustrating characteristic.

Core Components Explained

Grafana: The Glass

Grafana is the visualization layer. It does not collect data. It does not store data. It queries data sources and renders dashboards.

This is an important architectural decision. By decoupling visualization from storage, Grafana can query Prometheus, Loki, Tempo, Elasticsearch, PostgreSQL, InfluxDB, CloudWatch, and dozens of other data sources from a single interface. Your dashboards are not trapped inside a vendor’s ecosystem.

A Grafana dashboard is a collection of panels. Each panel runs a query against a data source and renders the result as a graph, table, stat, heatmap, or other visualization. Dashboards are JSON documents, version-controllable, and shareable.

Grafana’s query editor adapts to each data source. For Prometheus, you write PromQL. For Loki, you write LogQL. For Tempo, you search by trace ID or use TraceQL. The experience is consistent even though the underlying query languages differ.

Beyond dashboards, Grafana provides:

  • Explore mode: Ad-hoc querying for debugging, distinct from curated dashboards
  • Alerting: Grafana has its own alerting engine (separate from Alertmanager) that can evaluate queries and fire alerts
  • Annotations: Mark events on graphs (deployments, incidents) for correlation
  • Correlations: Link from a metric panel to related logs, or from a log line to a trace

Grafana is free and open-source (AGPLv3). Grafana Labs offers a commercial cloud version (Grafana Cloud) with managed backends, but the open-source version is fully functional.

Prometheus: The Metrics Engine

Prometheus is a time-series database (TSDB) purpose-built for metrics. It was the second project to graduate from the Cloud Native Computing Foundation (CNCF), after Kubernetes. It is the de facto standard for metrics in cloud-native environments.

Data Model: Prometheus stores metrics as time-series identified by a metric name and a set of key-value labels. For example:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 1547

This is a counter named http_requests_total with three labels. The labels are the dimensions for this metric. You can query “all requests to /api/users” or “all 200 responses across all endpoints” by selecting on labels.

Labels are Prometheus’s superpower. They replace the need for metric name hierarchies (like servers.web01.requests.get.200) with flat, queryable dimensions. But they are also Prometheus’s foot-gun: high-cardinality labels (like user IDs or request IDs) cause combinatorial explosion in stored time-series, consuming memory and disk.
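The combinatorial explosion is easy to estimate with a rough sketch. The label cardinalities below are hypothetical, and the product is a worst-case upper bound, since real workloads rarely produce every combination:

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on time-series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


# Hypothetical cardinalities for http_requests_total.
bounded = {"method": 5, "endpoint": 40, "status": 8}
with_user_id = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))       # 1600
print(worst_case_series(with_user_id))  # 160000000
```

A single unbounded label turns a metric with a few thousand possible series into one with a hundred million.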

PromQL: Prometheus has its own query language. It is functional, composable, and powerful once you internalize the model.

# Request rate per second over the last 5 minutes
rate(http_requests_total[5m])

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

PromQL is not intuitive at first. The distinction between instant vectors and range vectors, the behavior of rate() versus increase(), and the mechanics of histogram quantile calculation all require study. But once learned, it is expressive enough for almost any metrics query.
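To make those mechanics less abstract, here is a simplified Python sketch of what increase(), rate(), and histogram_quantile() compute. It is not Prometheus's implementation: it ignores the extrapolation Prometheus applies at window boundaries and assumes a non-empty histogram with 0 < q <= 1. The function names are made up for illustration.

```python
def simple_increase(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a restart), so count from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second average increase over the window covered by the samples."""
    window = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / window


def simple_histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count); last bound is inf."""
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into the open-ended bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


counter = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_increase(counter))  # 100.0, despite the reset at t=45
print(simple_rate(counter))      # about 1.67 per second

latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
print(simple_histogram_quantile(0.5, latency_buckets))  # 0.1
```

The quantile is an estimate bounded by bucket edges, which is why bucket layout matters as much as the query.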

Pull Model: Prometheus scrapes targets. It does not receive pushed data. At each scrape interval (15 seconds is a common setting; the built-in default is 1 minute), Prometheus sends an HTTP GET request to each target’s /metrics endpoint and ingests the response.

This pull model has consequences:

  • Service discovery: Prometheus needs to know what to scrape. It integrates with Kubernetes, Consul, DNS, EC2, and many other discovery mechanisms.
  • Firewall direction: Prometheus initiates connections outward, which simplifies firewall rules in many environments.
  • Short-lived processes: Jobs that start, do work, and exit may complete between scrape intervals. Prometheus offers a Pushgateway for this case, but it is considered a band-aid, not a primary ingestion path.
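What a target returns on each scrape is plain text. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. The helper name is hypothetical and label-value escaping is omitted; a real service would normally use an official client library instead.

```python
def render_counter(name, help_text, samples):
    """samples: list of (labels_dict, value). Returns exposition-format text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sorting keeps the output stable; escaping of label values is omitted.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "endpoint": "/api/users", "status": "200"}, 1547)],
)
print(body)
```

Serving this string from a /metrics handler is all a target needs to do; Prometheus handles timestamps, storage, and retries on its side.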

Storage: Prometheus stores data locally on disk in a custom time-series format. It compresses efficiently (around 1.5 bytes per sample with default settings). Local storage is fast and reliable but has limitations: it is not replicated, and scaling requires careful planning.

For long-term storage, Prometheus supports remote write and remote read protocols. Tools like Thanos, Cortex, and Mimir extend Prometheus with durable object storage, global query views across multiple Prometheus instances, and compaction. Grafana Mimir is Grafana Labs’ offering for this.

Loki: The Log Aggregator

Loki is Grafana Labs’ answer to Elasticsearch for log aggregation. Its design philosophy can be summarized in one sentence: index the metadata, not the content.

Traditional log aggregation systems (Elasticsearch, Splunk) build full-text indices on log content. This enables powerful search but requires significant CPU during ingestion and substantial disk for the indices. The index can be larger than the raw log data.

Loki takes a different approach. It indexes only the labels (metadata) attached to log streams, not the log content itself. When you query, Loki identifies relevant streams by label selectors, then performs a brute-force grep through the log chunks for those streams.

# Find all error logs from the payment service in production
{namespace="production", app="payment-service"} |= "error"

# Parse structured logs and filter
{app="api-gateway"} | json | status >= 500 | line_format "{{.method}} {{.path}} {{.status}}"

This design has profound implications:

  • Ingestion is cheap: No tokenization, no inverted index construction. Compress logs and write them to object storage.
  • Storage is cheap: Logs are stored as compressed chunks in object storage (S3, GCS, MinIO). Labels are indexed in a small index (BoltDB, or more recently, TSDB-based indexing).
  • Full-text search is slow: Grep across compressed chunks is not fast. Loki compensates by parallelizing the scan across many workers, but it will never match Elasticsearch’s full-text query speed.

The trade-off is explicit: Loki is optimized for “I know roughly what I’m looking for and where it is” rather than “search everything for this keyword.” If you know the service name and time window, Loki is fast. If you want to search all logs from all services for a stack trace fragment, Loki is slow compared to a full-text index.

LogQL: Loki’s query language mirrors PromQL but extends it with log-specific operations: line filtering (|=, !=, |~, !~), parsing (json, logfmt, regexp), and formatting. It can also produce metrics from logs:

# Count error log lines per minute
count_over_time({app="payment-service"} |= "error" [1m])

# Average response time from structured logs
avg_over_time({app="api-gateway"} | json | unwrap response_time [5m])

This is powerful because it lets you derive metrics from logs without running a separate metrics pipeline. But it is slower than dedicated metrics and should not replace Prometheus for high-frequency queries.
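Conceptually, a log-derived metric is a scan plus a count. The following sketch mimics count_over_time with a line filter over in-memory entries. The function and sample data are illustrative only, and the window boundaries are an assumption, not a statement of Loki's exact semantics.

```python
def count_over_time(entries, needle, window_seconds, now):
    """Count log lines containing `needle` whose timestamp falls in
    (now - window_seconds, now]. entries: list of (timestamp_seconds, line)."""
    start = now - window_seconds
    return sum(1 for ts, line in entries if start < ts <= now and needle in line)


entries = [
    (10, "payment accepted"),
    (70, "error: card declined"),
    (95, "error: upstream timeout"),
    (110, "payment accepted"),
]
print(count_over_time(entries, "error", window_seconds=60, now=120))  # 2
```

Every evaluation rescans the raw lines in the window, which is why this costs more than reading a pre-aggregated Prometheus counter.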

Agents: Loki does not scrape logs. Agents push logs to Loki. The primary agents are:

  • Promtail: The original Loki agent, purpose-built for log collection
  • Grafana Alloy (the successor to Grafana Agent): A unified telemetry collector that can take over the roles of Promtail, Prometheus in agent mode, and the OpenTelemetry Collector in a single binary
  • Fluent Bit / Fluentd: Popular alternatives with Loki output plugins

Tempo: The Tracing Backend

Tempo is Grafana Labs’ distributed tracing backend. Like Loki’s approach to logs, Tempo’s philosophy is: store traces cheaply, index minimally.

Traditional tracing backends (Jaeger with Elasticsearch/Cassandra, Zipkin with Cassandra) index trace data for search. Tempo stores traces in object storage and indexes only the trace ID and a few key attributes. Finding a trace by ID is fast. Searching for traces by arbitrary attributes requires the newer TraceQL or integration with other tools.

Tempo accepts traces in multiple formats:

  • OpenTelemetry (OTLP)
  • Jaeger (Thrift, gRPC)
  • Zipkin

This multi-format support means you can adopt Tempo without changing your instrumentation. If your services already emit Jaeger spans, point them at Tempo.

TraceQL: Tempo’s query language for searching traces by attributes:

# Find traces where the HTTP status is 500 and duration exceeds 2 seconds
{span.http.status_code = 500 && duration > 2s}

# Find traces involving a specific service
{resource.service.name = "payment-service"}

TraceQL is newer and less mature than PromQL or LogQL, but it fills a critical gap: finding traces without already having a trace ID.

Correlations: Tempo’s value multiplies when connected to Grafana, Prometheus, and Loki. The workflow becomes:

  1. See a latency spike in a Prometheus metric graph
  2. Click “exemplars” to jump to a specific trace
  3. View the trace in Tempo, see which span is slow
  4. Click from a span to the Loki logs for that service during that time window

This integrated debugging workflow is the primary argument for using the full Grafana stack rather than mixing vendors.

Alertmanager: The Alert Router

Alertmanager is a Prometheus project, not a Grafana project, but it is integral to the stack.

Prometheus evaluates alerting rules against metric data and fires alerts to Alertmanager. Alertmanager handles the rest:

  • Grouping: Combine related alerts into single notifications. If 50 pods are failing the same health check, you get one alert, not 50.
  • Inhibition: Suppress certain alerts when others are firing. If the entire cluster is down, suppress individual service alerts.
  • Silencing: Temporarily mute alerts during maintenance windows.
  • Routing: Send alerts to different receivers (Slack, PagerDuty, email, webhooks) based on labels.

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # `matchers` replaces the deprecated `match` field in current Alertmanager releases
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-oncall'
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
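The grouping and routing behavior in that configuration can be approximated in a few lines of Python. This is a sketch of the idea only: it ignores timing (group_wait, group_interval), inhibition, and nested routes, and the alert labels are invented for the example.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "namespace")
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "slack-default"


def pick_receiver(labels):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts):
    """Bucket alerts by receiver and by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = (pick_receiver(labels), tuple(labels.get(k, "") for k in GROUP_BY))
        groups[key].append(labels)
    return dict(groups)


# Hypothetical: 50 pods failing the same health check.
alerts = [
    {"alertname": "HealthCheckFailing", "namespace": "production",
     "severity": "critical", "pod": f"api-{i}"}
    for i in range(50)
]
groups = group_alerts(alerts)
print(len(groups))  # 1 notification group instead of 50
```

Because pod is not in group_by, all fifty alerts collapse into one notification for the on-call receiver.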

Alert design is more important than alert tooling. An Alertmanager that routes 200 noisy alerts to PagerDuty will burn out your on-call engineers regardless of how elegantly it deduplicates them. We’ll address this later.

Data Flow

The typical data flow in a Grafana stack deployment:

Application Services
  ├── Expose /metrics endpoint ──► Prometheus scrapes ──► Prometheus TSDB
  ├── Write logs to stdout ──► Agent (Alloy) collects ──► Push to Loki ──► Object Storage
  └── Emit traces (OTLP) ──► Tempo receives ──► Object Storage

Prometheus
  ├── Evaluates alert rules ──► Fires to Alertmanager ──► Routes to Slack/PagerDuty
  └── Serves PromQL queries ──► Grafana dashboards

Loki
  └── Serves LogQL queries ──► Grafana Explore / dashboards

Tempo
  └── Serves trace queries ──► Grafana Explore / trace view

Grafana
  └── Queries all three backends ──► Unified dashboards and Explore

Design Philosophy

Three principles unify the stack:

Label-based: Everything is organized by labels. Prometheus metrics have labels. Loki log streams have labels. Tempo traces have resource attributes that function like labels. Labels are the common language for querying, grouping, and routing across all three pillars.

Cost-efficient by default: Loki and Tempo are designed to run on commodity object storage (S3, GCS, MinIO). They avoid expensive full-text indexing. This makes them dramatically cheaper to operate than Elasticsearch-based alternatives at scale, at the cost of raw query performance.

Cloud-native: Every component is designed to run in containers, scale horizontally, and integrate with Kubernetes. Service discovery, configuration, and deployment all assume a container orchestration environment.

Architecture Deep Dive

Pull vs Push

The stack uses both models:

Pull (Prometheus): Prometheus scrapes targets on a configurable interval. Targets expose a /metrics HTTP endpoint that returns the current value of all metrics in Prometheus exposition format. The target is stateless — it does not buffer data, send retries, or manage a connection to Prometheus. This simplicity is valuable.

Pull requires that Prometheus can reach targets over the network. In environments with strict network segmentation, VPCs, or NAT, this can be problematic. Remote write (push) mode exists for these scenarios, using Prometheus Agent mode or Grafana Alloy to scrape locally and forward to a remote Prometheus-compatible backend.

Push (Loki, Tempo): Logs and traces are pushed by agents. Applications emit traces to Tempo’s endpoint. Agents (Alloy, Promtail) tail log files or capture container stdout and push batches to Loki.

Push is natural for logs and traces because these are event-driven data. You cannot “scrape” a log line — it exists when it happens and must be captured immediately. Push also handles short-lived processes better because the process pushes its data before exiting.
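For a sense of what an agent actually pushes, the sketch below builds the JSON body accepted by Loki's HTTP push endpoint (/loki/api/v1/push): streams identified by labels, each carrying [timestamp-in-nanoseconds, line] pairs. The helper name and label values are illustrative, and no request is sent.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push body: one stream with its labels and log lines."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    # Timestamps are strings of nanoseconds since the Unix epoch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


payload = build_push_payload(
    {"app": "payment-service", "env": "production"},
    ["request received", "query executed"],
    now_ns=1_700_000_000_000_000_000,
)
print(payload)
```

A real agent adds batching, retries, and backpressure around this payload, which is most of what Alloy and Promtail do.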

Storage Backends

Prometheus:

  • Local disk: Default. Fast, simple, no external dependencies. Limited by disk size. Not replicated.
  • Remote write to Mimir/Thanos/Cortex: For long-term retention and global queries. Mimir stores blocks in object storage (S3/GCS/Azure Blob), compacts them, and serves queries across a distributed cluster.

Loki:

  • Filesystem: For development and small deployments. Stores chunks and index on local disk.
  • Object storage + index: Production configuration. Log chunks are stored in S3/GCS/MinIO. The index (which maps labels to chunk locations) is stored in BoltDB-shipper (being deprecated) or the newer TSDB index format that also uses object storage.
  • Single-store (TSDB): The current recommended setup. Both index and chunks go to object storage. No separate index database needed.

Tempo:

  • Local disk: For development.
  • Object storage: Production. Traces are stored as blocks in S3/GCS/MinIO. The backend is append-only; traces are written and later compacted.

Object storage is the unifying theme for production deployments. S3-compatible storage (including MinIO for on-premises) is the lowest common denominator. This keeps storage costs low and decouples compute from storage, allowing each to scale independently.

Loki’s “Index-Less Logs” Concept

This is the most frequently misunderstood aspect of the Grafana stack.

Loki is not truly index-less. It does maintain an index, but the index maps label sets to chunk references, not words to log lines. The distinction is critical.

In Elasticsearch, when you ingest a log line like "Connection timeout to database host db-01 after 30s", the system tokenizes it and creates inverted index entries for “Connection”, “timeout”, “database”, “host”, “db-01”, ”30s”, and so on. Any of these tokens can be searched efficiently.

In Loki, that same log line is compressed and written to a chunk associated with its label set (e.g., {app="payment-service", env="production"}). The index records that this chunk exists for this label set during this time range. That is all.

When you query {app="payment-service"} |= "timeout":

  1. Loki finds all chunks for {app="payment-service"} in the relevant time range (fast index lookup)
  2. Loki decompresses those chunks and greps for “timeout” (brute-force scan)

Step 1 is fast. Step 2 depends on how many chunks match and how much data they contain.
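The two steps can be modeled with a toy store. This is a deliberately simplified sketch, not Loki's data structures: it keeps everything in memory, ignores time ranges, and uses gzip for the chunks.

```python
import gzip


class TinyLogStore:
    """Toy model: an index from label sets to compressed chunks."""

    def __init__(self):
        self.index = {}  # frozenset of (label, value) pairs -> list of chunks

    def write(self, labels, lines):
        key = frozenset(labels.items())
        chunk = gzip.compress("\n".join(lines).encode())
        self.index.setdefault(key, []).append(chunk)

    def query(self, selector, needle):
        wanted = set(selector.items())
        hits = []
        for key, chunks in self.index.items():
            if not wanted <= key:  # Step 1: cheap label match on the index
                continue
            for chunk in chunks:   # Step 2: decompress and scan every line
                for line in gzip.decompress(chunk).decode().splitlines():
                    if needle in line:
                        hits.append(line)
        return hits


store = TinyLogStore()
store.write({"app": "payment-service", "env": "production"},
            ["GET /pay 200", "Connection timeout to database host db-01 after 30s"])
store.write({"app": "api-gateway", "env": "production"}, ["upstream timeout"])

print(store.query({"app": "payment-service"}, "timeout"))
```

The api-gateway chunk is never decompressed for this query, which is the entire cost model: narrow selectors keep step 2 small.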

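This is why every Loki query must include a label selector with at least one matcher that cannot match the empty string; Loki rejects a bare {} selector outright. A catch-all such as {app=~".+"} |= "timeout" is accepted but prohibitively expensive, because it scans every chunk from every matching stream. Good label design is essential.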

Scalability and High Availability

Prometheus HA: Run two identical Prometheus instances scraping the same targets. They independently scrape, store, and alert. Alertmanager deduplicates alerts from both. For query deduplication, use Thanos Query or Mimir as a fan-out layer.

This approach is simple but wastes resources (double ingestion, double storage). It is the standard recommendation because the alternative — clustered Prometheus — does not exist natively. Mimir and Thanos solve this at the cost of additional infrastructure.

Loki scaling modes:

  • Monolithic: All components in one process. Good for small volumes; the Loki documentation cites roughly 20GB/day as the guideline.
  • Simple Scalable Deployment (SSD): Read path and write path separated into independently scalable targets. Scales to a few TB/day.
  • Microservices mode: Each component (distributor, ingester, querier, query-frontend, compactor, ruler) runs as a separate service. Scales beyond what SSD handles. Complex to operate.

Tempo scaling modes: Similar to Loki. Monolithic for small deployments, microservices for large ones. Key components: distributor, ingester, compactor, querier.

Grafana HA: Grafana is stateless (assuming external database for dashboard storage). Run multiple replicas behind a load balancer. Use PostgreSQL or MySQL for dashboard and user storage instead of the default SQLite.

From Local Development to Production

Running Locally

Docker Compose is the standard path for local development. Grafana Labs maintains example configurations in their documentation.

A minimal local stack:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo-config.yml:/etc/tempo/tempo.yaml
    ports:
      - "3200:3200"   # Tempo API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    command: -config.file=/etc/tempo/tempo.yaml

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - loki
      - tempo

  alloy:
    image: grafana/alloy:latest
    volumes:
      - ./alloy-config.alloy:/etc/alloy/config.alloy
      - /var/log:/var/log:ro
    command: run /etc/alloy/config.alloy
    depends_on:
      - loki
      - prometheus

This gives you the full stack on a single machine. Add your application’s /metrics endpoint as a scrape target in prometheus.yml, send OTLP traces to localhost:4317, and configure Alloy to tail your application’s log files.

Single-Node vs Multi-Node

Single-node is appropriate for:

  • Local development
  • Small teams (fewer than 10 services)
  • Low data volume (under 10GB/day of logs, under 100k active time-series)

Move to multi-node when:

  • Data volumes exceed single-disk capacity
  • You need high availability (cannot tolerate single-machine failure)
  • Query latency becomes unacceptable
  • You need to scale read and write paths independently

Development vs Staging vs Production Concerns

Development: Optimize for fast startup and zero maintenance. Use monolithic mode for all components. Store data locally. Disable authentication. Short retention (24 hours).

Staging: Mirror production topology at smaller scale. Run the same deployment mode (SSD or microservices) as production. Test alerting rules. Validate dashboards with realistic data volume.

Production: Full HA deployment. Object storage backends. Authentication and authorization enabled. Long retention (30-90 days for hot, years for cold). Backup configurations. Proper resource limits. Network policies.

Configuration Management and Secrets

Every component in the stack is configured via YAML files. Store these in version control. Use templating (Helm values, Kustomize overlays, environment variable substitution) for environment-specific values.

Secrets (API keys, object storage credentials, database passwords) must not be in version control. Options:

  • Kubernetes Secrets (base64-encoded, not encrypted by default — enable encryption at rest)
  • External secret stores (HashiCorp Vault, AWS Secrets Manager) with Kubernetes operators (External Secrets Operator)
  • SOPS for encrypted secrets in Git

Grafana dashboards should be provisioned as code (JSON files in a provisioning directory or using Grafonnet/Jsonnet), not hand-crafted in the UI. Dashboard drift between environments is a real operational problem.
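As a rough illustration of dashboards as code, the sketch below generates a dashboard document in Python and serializes it to JSON. The field subset (title, panels, gridPos, targets) is an assumption covering only the basics; real dashboards carry more fields (data source references, schema version, and so on), and the helper name is made up.

```python
import json


def make_dashboard(title, queries):
    """queries: list of (panel_title, promql_expression)."""
    panels = []
    for i, (panel_title, expr) in enumerate(queries):
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": panel_title,
            # Two panels per row on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 12, "x": (i % 2) * 12, "y": (i // 2) * 8},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return {"title": title, "panels": panels}


dashboard = make_dashboard("API overview", [
    ("Request rate", "sum(rate(http_requests_total[5m]))"),
    ("Error rate", 'sum(rate(http_requests_total{status=~"5.."}[5m]))'),
    ("p99 latency",
     "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"),
])
print(json.dumps(dashboard, indent=2))
```

Generating the JSON from one function per environment keeps panels identical across development, staging, and production, which is the drift problem provisioning is meant to solve.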

Containers and Kubernetes

Running the Stack in Containers

Every component of the Grafana stack is distributed as a container image. Grafana Labs publishes images to Docker Hub and their own registry. Images are multi-arch (amd64, arm64) and follow semantic versioning.

Container considerations:

  • Resource limits are mandatory. Prometheus and Loki ingesters are memory-hungry. Without limits, a single component can exhaust a node’s memory and get neighboring pods OOM-killed or evicted.
  • Persistent volumes for stateful components. Prometheus needs PVCs for its TSDB. Loki ingesters need PVCs for their WAL (write-ahead log). Tempo ingesters need PVCs for in-flight traces.
  • Graceful shutdown matters. Ingesters need time to flush data to long-term storage before terminating. Set terminationGracePeriodSeconds appropriately (60-120 seconds).

Kubernetes-Native Deployment

The Grafana stack is designed for Kubernetes. Key integration points:

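Service discovery: Prometheus discovers scrape targets through the Kubernetes API. Pod annotations (prometheus.io/scrape: "true", prometheus.io/port: "8080") or ServiceMonitor custom resources (from Prometheus Operator) tell Prometheus what to scrape. The annotations are a convention rather than a built-in feature: they only take effect when the scrape configuration contains relabel rules that read them.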

Namespace-based multi-tenancy: Run separate instances per namespace, or use Loki/Mimir’s native multi-tenancy with tenant IDs injected from namespace labels.

Network policies: Restrict traffic so that only Prometheus can reach application metrics endpoints. Restrict Loki/Tempo ingress to only the agent. Restrict Grafana egress to only the backends.

Helm Charts and Operators

kube-prometheus-stack: The widely used Helm chart that deploys Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics with pre-configured dashboards and alerting rules. It is comprehensive but complex — the chart has hundreds of configurable values.

Loki Helm chart: Grafana Labs maintains official Helm charts for Loki in all three deployment modes (monolithic, SSD, microservices). The SSD mode chart is recommended for most production deployments.

Tempo Helm chart: Similar structure to Loki. A distributed chart for microservices mode.

Grafana Alloy Helm chart: Deploys Alloy as a DaemonSet (for node-level log/metric collection) or a Deployment (for centralized collection).

Prometheus Operator: Extends Kubernetes with custom resources:

  • ServiceMonitor: Declares what services Prometheus should scrape
  • PodMonitor: Declares what pods Prometheus should scrape
  • PrometheusRule: Declares alerting and recording rules
  • AlertmanagerConfig: Declares alert routing

These CRDs let application teams define their monitoring configuration alongside their application manifests. This is the recommended approach for organizations with multiple teams.

Resource Sizing

Rules of thumb (adjust based on actual workload):

Prometheus:

  • Memory: ~2-3 KB per active time-series for the in-memory head block and index. 1 million active series ≈ 2-3 GB RAM. Add headroom for queries.
  • Disk: ~1.5 bytes per sample after compaction. 1 million series scraped every 15s for 15 days ≈ 130 GB.
  • CPU: Low for ingestion. Spikes during compaction and complex queries.
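The disk estimate is simple arithmetic and worth rerunning with your own numbers. A minimal sketch, assuming the ~1.5 bytes per sample figure above holds for your data:

```python
def prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                          bytes_per_sample=1.5):
    """Estimate TSDB disk usage: series x samples per series x bytes per sample."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample


disk = prometheus_disk_bytes(active_series=1_000_000, scrape_interval_s=15,
                             retention_days=15)
print(f"{disk / 1e9:.1f} GB")  # 129.6 GB
```

The estimate scales linearly with series count and retention, and inversely with the scrape interval, so any one of the three can be traded against the others.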

Loki:

  • Ingesters: Memory-proportional to active streams. Each active stream consumes memory for its chunk buffer. Start with 2 GB per ingester, scale based on stream count.
  • Queriers: Memory-proportional to query scope. Wide-range queries over many streams consume more memory.
  • Object storage: Roughly 60-70% of raw log volume after compression, depending on log content.

Tempo:

  • Ingesters: Memory-proportional to in-flight traces (traces not yet flushed to backend).
  • Queriers: Memory-proportional to trace size during retrieval.
  • Object storage: Roughly 50-60% of raw trace size after compression.

Grafana:

  • Lightweight. 256 MB RAM and 0.5 CPU is sufficient for moderate use. Scale horizontally for many concurrent users.

Multi-Cluster Observability

For organizations with multiple Kubernetes clusters:

Federation model: Each cluster runs its own Prometheus, Loki, and Tempo. A central Grafana queries all of them, or a central Mimir/multi-tenant Loki aggregates data from all clusters.

Remote-write model: Each cluster’s Prometheus writes to a central Mimir. Each cluster’s Alloy pushes logs to a central Loki and traces to a central Tempo. The central stack handles storage, querying, and alerting.

Considerations:

  • Network bandwidth between clusters and central stack
  • Latency for cross-cluster queries
  • Tenant isolation if clusters belong to different teams
  • Failure modes: if the central stack is down, clusters lose observability (buffer locally or accept data loss)

Common Pitfalls in Kubernetes

1. High-cardinality labels: Adding Kubernetes labels like pod or container_id to metrics creates a new time-series for every pod restart. In a deployment with 100 replicas that restart daily, this creates thousands of stale time-series. Use relabel_configs to drop unnecessary labels.

2. Log volume explosion: A single application logging at DEBUG level can produce more data than the rest of the cluster combined. Rate-limit logs at the application level. Use Alloy’s pipeline stages to drop or sample verbose streams before they reach Loki.

3. Resource under-provisioning: The observability stack itself consumes cluster resources. Budget 10-15% of cluster resources for observability. Under-provisioned Prometheus OOMs during compaction. Under-provisioned Loki ingesters drop logs.

4. Missing PersistentVolumeClaims: Running Prometheus or Loki ingesters without persistent storage means data loss on pod restart. Always configure PVCs for stateful components.

5. Scrape interval mismatch: Setting the Prometheus scrape interval to 5 seconds “for more granularity” on 200 targets generates 3x more samples than a 15-second interval, with matching growth in storage and query cost. A 15-second interval is sufficient for most use cases.

6. Alert storms during cluster operations: Node drains, rolling updates, and cluster upgrades trigger cascading alerts. Use Alertmanager silences for planned maintenance. Design alerts that are resilient to expected transient failures (e.g., for: 5m instead of instant-fire).

Operational Concerns

Performance Tuning

Prometheus:

  • Use recording rules to pre-compute expensive queries. A dashboard querying histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) across 1000 series every refresh is expensive. A recording rule computes it once per evaluation interval.
  • Keep WAL compression enabled to reduce disk usage (it is on by default in recent Prometheus releases).
  • Tune --storage.tsdb.retention.time and --storage.tsdb.retention.size to bound local storage.

Loki:

  • Choose appropriate chunk encoding (gzip vs snappy vs zstd). Snappy is faster; gzip/zstd compress more.
  • Tune max_chunk_age and chunk_idle_period to balance flush frequency against chunk size. Larger chunks compress better but delay query availability.
  • Use bloom filters (experimental/newer feature) for faster content filtering.
  • Limit query lookback periods in Grafana to prevent users from accidentally querying months of data.

Tempo:

  • Tune max_bytes_per_trace to prevent a single pathological trace from consuming excessive resources.
  • Enable search (TraceQL) only if needed — it adds overhead to the compaction process.

Storage Cost Management

Object storage is cheap but not free at scale. Strategies:

  • Tiered retention: Keep recent data (7-30 days) in hot storage with fast query access. Move older data to cheaper tiers (S3 Infrequent Access, GCS Coldline). Delete data beyond a maximum retention period.
  • Sampling: For tracing, sample a percentage of traces in production (e.g., 10% of successful requests, 100% of errors). This dramatically reduces Tempo storage.
  • Log filtering: Not all logs need to reach Loki. Filter out health check logs, verbose DEBUG logs, and other noise at the agent level.
  • Metric aggregation: Use recording rules to downsample high-resolution metrics for long-term storage. Keep 15-second resolution for 7 days, 1-minute resolution for 30 days, 5-minute resolution for a year.
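The sampling rule from the list above (keep every error, keep a fixed fraction of successes) can be sketched as a deterministic function of the trace ID. This is an illustration, not Tempo or OpenTelemetry configuration: deciding on error status assumes the decision is made after the trace completes (tail-based sampling), and the function name is hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, success_rate: float = 0.10) -> bool:
    """Keep all error traces and a stable ~success_rate fraction of the rest."""
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


kept = sum(keep_trace(f"trace-{i}", is_error=False) for i in range(10_000))
print(f"kept {kept} of 10000 successful traces")
```

Hashing the trace ID instead of calling a random number generator means every collector makes the same decision for the same trace, so traces are never stored half-sampled.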

Retention Strategies

Each component handles retention differently:

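  • Prometheus: --storage.tsdb.retention.time=15d and/or --storage.tsdb.retention.size=50GB (whichever limit is reached first applies). Local storage only. Long-term retention is handled by Mimir with per-tenant retention policies.
  • Loki: Compactor-based retention. Enable retention on the compactor (retention_enabled) and set retention_period in limits_config. It can be overridden per tenant.
  • Tempo: Block-based retention. Configure block_retention on the compactor; blocks older than that are deleted from the backend. (max_block_duration controls when ingesters cut a block, not how long data is kept.)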
  • Grafana: Dashboards don’t expire. Annotations can be configured with max age.

Alert Fatigue and Alert Design

Alert fatigue kills on-call culture faster than any technical failure. Principles:

Alert on symptoms, not causes: Alert on “error rate exceeds 1%” rather than “CPU usage above 80%.” High CPU might be normal during batch processing. High error rate always matters.

Use for durations: for: 5m means the condition must hold continuously for 5 minutes before the alert fires. This keeps transient spikes from paging anyone.
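
The effect of a for duration can be shown with a simplified model of rule evaluation. The sketch below is not Prometheus's implementation. It assumes one sample per evaluation, and the function name and the numbers are hypothetical.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the timestamps at which the alert would be firing.

    samples: (timestamp_seconds, value) pairs in time order, one per
    rule evaluation.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)   # held long enough: firing
        else:
            pending_since = None    # any dip resets the timer
    return firing


# Error rate in percent, evaluated every 60 seconds.
error_rate = [
    (0, 0.2), (60, 3.0), (120, 2.5), (180, 0.3), (240, 0.4),
    (300, 1.5), (360, 1.8), (420, 2.0), (480, 2.2), (540, 1.9),
    (600, 2.4), (660, 2.1),
]
print(firing_times(error_rate, threshold=1.0, for_seconds=300))
# [600, 660]: the two-minute spike at t=60 never fires.
```

The trade-off is detection delay: a real outage starting at t=300 is only reported at t=600, so the duration should be shorter for critical symptoms than for slow-burning ones.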

Tier severity levels:

  • Critical: Requires immediate human action. Pages on-call. Examples: service down, data loss, SLA breach.
  • Warning: Requires action within business hours. Sends to Slack. Examples: disk filling up, certificate expiring in 7 days.
  • Info: No action required. Logged for context. Examples: deployment completed, backup finished.
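
The three tiers map naturally onto a routing table keyed by a severity label. The sketch below is a plain-Python illustration of that idea, not Alertmanager configuration. The route names and the fallback behaviour are assumptions.

```python
# Hypothetical destinations for each severity tier.
ROUTES = {
    "critical": "page-on-call",
    "warning": "slack",
    "info": "log-only",
}


def route(alert: dict) -> str:
    """Pick a destination from the alert's severity label."""
    severity = alert.get("labels", {}).get("severity")
    # Assumption: alerts with a missing or unknown severity are treated as
    # warnings, so they stay visible without paging anyone.
    return ROUTES.get(severity, ROUTES["warning"])


print(route({"labels": {"alertname": "ServiceDown", "severity": "critical"}}))
print(route({"labels": {"alertname": "DiskFillingUp", "severity": "warning"}}))
```

Keeping the severity in a label, rather than encoding it in the alert name, lets routing change without touching the alert rules themselves.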

Use runbooks: Every alert should link to documentation explaining what it means, why it fires, and how to resolve it. An alert without a runbook is a puzzle delivered at 3 AM.

Review alerts quarterly: If an alert fires frequently and is always ignored, either fix the underlying issue or delete the alert. An alert that nobody reads is worse than no alert because it normalizes ignoring alerts.

Security

Authentication: Grafana supports local accounts, LDAP, OAuth (Google, GitHub, Azure AD, Okta), and SAML. Use SSO in production.

Authorization / RBAC: Grafana has organization-level and folder-level permissions. Editors can modify dashboards in their folders. Viewers can only read. Admins manage data sources and users.

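Multi-tenancy: Loki, Mimir, and Tempo support multi-tenancy via an X-Scope-OrgID HTTP header. Each tenant’s data is stored and queried separately, so a request scoped to one tenant cannot read another tenant’s data. The header itself is not authenticated by these components, so put an authenticating gateway or reverse proxy in front of them that sets X-Scope-OrgID and rejects client-supplied values.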
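
As an illustration of how a tenant-scoped request is shaped, the sketch below builds the URL and headers for a Loki range query without sending it. The helper name, base URL, and tenant ID are hypothetical; the path and the start/end parameters follow Loki's HTTP query API as I understand it.

```python
from urllib.parse import urlencode


def build_loki_query(base_url: str, tenant_id: str, logql: str,
                     start_ns: int, end_ns: int) -> tuple[str, dict]:
    """Return (url, headers) for a tenant-scoped Loki range query."""
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns})
    url = f"{base_url}/loki/api/v1/query_range?{params}"
    # The tenant is selected purely by this header.
    headers = {"X-Scope-OrgID": tenant_id}
    return url, headers


url, headers = build_loki_query(
    base_url="http://loki.example.internal:3100",  # hypothetical address
    tenant_id="team-payments",                     # hypothetical tenant
    logql='{app="checkout"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)
print(url)
print(headers)
```

Because the tenant is just a header value, whatever sits in front of Loki, Mimir, or Tempo decides which tenant a caller may claim.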

Network security: Use TLS for all internal communication. Use mTLS between components in zero-trust environments. Restrict network access with Kubernetes NetworkPolicies or service mesh (Istio, Linkerd).

Data sensitivity: Logs often contain PII, API keys, passwords, and other sensitive data. Scrub sensitive fields at the agent level before data reaches Loki. Do not rely on access controls alone.
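
A minimal sketch of what scrubbing means in practice is shown below. The patterns and placeholder strings are assumptions for illustration and are far from exhaustive. In a real deployment the equivalent rules would live in the log agent's pipeline configuration, not in application code.

```python
import re

# Hypothetical redaction rules: (pattern, replacement).
PATTERNS = [
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    # key=value secrets such as password=..., api_key=..., token=...
    (re.compile(r"(?i)\b(password|passwd|api[_-]?key|token)=\S+"),
     r"\1=<redacted>"),
]


def scrub(line: str) -> str:
    """Apply every redaction rule to a single log line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(scrub("user=alice@example.com password=hunter2 status=200"))
# user=<email> password=<redacted> status=200
```

Pattern-based scrubbing only catches what it anticipates, so it complements structured logging that never writes secrets in the first place.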

Backup and Disaster Recovery

Prometheus: Back up the data directory or use remote write to a durable backend. Alerting rules and configuration should be in version control.

Loki and Tempo: Data in object storage is inherently durable (S3 provides 99.999999999% durability). Back up the configuration and index if using BoltDB. With TSDB index in object storage, the entire state is in the object store.

Grafana: Back up the database (PostgreSQL/MySQL). If using provisioning-from-code, dashboards can be recreated from Git. User preferences and annotations may be lost without database backup.

Configuration: All YAML configurations, Helm values, and Terraform/Pulumi definitions should be in version control. The stack should be rebuildable from code.

Grafana Stack vs Elastic Stack

This comparison is inevitable and important. Both stacks are used for observability. They make fundamentally different trade-offs.

Philosophical Differences

The Elastic stack (Elasticsearch, Logstash/Beats, Kibana) started as a log search engine. Elasticsearch is a general-purpose search and analytics engine built on Apache Lucene. It evolved into an observability platform by adding metrics (Metricbeat), APM (Elastic APM), and security analytics.

The Grafana stack started as a metrics visualization tool. Grafana began as a dashboard for time-series data (it started as a fork of Kibana aimed at Graphite) and later became the default front end for Prometheus. Loki and Tempo were added later to round out the three pillars, designed from the ground up with specific trade-offs (no full-text indexing) that Elasticsearch would never make.

Elastic thinks search-first: Everything is a document. Every field is indexed by default. Query anything.

Grafana thinks labels-first: Everything is a stream. Identify the stream by labels. Search within the stream.
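
The two philosophies can be contrasted with a toy model. The sketch below is a deliberately simplified illustration, not how either Elasticsearch or Loki is implemented, and the sample logs are made up.

```python
from collections import defaultdict

# (labels, log line) pairs from two hypothetical services.
logs = [
    ({"app": "checkout", "env": "prod"}, "payment failed: card declined"),
    ({"app": "checkout", "env": "prod"}, "order 1234 created"),
    ({"app": "search", "env": "prod"}, "query timeout after 5s"),
    ({"app": "search", "env": "prod"}, "payment widget rendered"),
]

# Search-first: every token of every line goes into an inverted index.
inverted = defaultdict(set)
for doc_id, (_, line) in enumerate(logs):
    for token in line.replace(":", " ").split():
        inverted[token].add(doc_id)

# Labels-first: only the label set is indexed; lines are stored per stream.
streams = defaultdict(list)
for labels, line in logs:
    streams[frozenset(labels.items())].append(line)


def search_inverted(token: str) -> list[str]:
    """Index lookup: no scanning, but the index grows with the content."""
    return [logs[i][1] for i in sorted(inverted.get(token, ()))]


def search_streams(selector: dict, needle: str) -> list[str]:
    """Pick streams by label, then brute-force scan their lines."""
    wanted = set(selector.items())
    hits = []
    for stream_labels, lines in streams.items():
        if wanted <= stream_labels:
            hits.extend(line for line in lines if needle in line)
    return hits


print(search_inverted("payment"))
print(search_streams({"app": "checkout"}, "payment"))
print(len(inverted), "index entries vs", len(streams), "streams")
```

The inverted index answers "payment" across all services in one lookup but needs an entry per distinct token. The label index stays tiny and is fast only when the selector narrows the scan to a few streams.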

Loki vs Elasticsearch for Logs

Dimension | Loki | Elasticsearch
--- | --- | ---
Indexing | Labels only (metadata) | Full-text inverted index
Query speed (keyword search) | Slower (brute-force scan within streams) | Faster (index lookup)
Query speed (known stream + time) | Fast (label selector + time range) | Fast (but overkill)
Ingestion cost | Low (compress and write) | High (tokenize, index, store)
Storage cost | Low (compressed chunks in object storage) | High (data + index, typically 1.5-3x raw size)
Operations | Simpler (object storage backend) | Complex (cluster management, shard allocation, rebalancing)
Query language | LogQL (PromQL-like) | Lucene / KQL / ES
Full-text search | Limited (grep-like) | Excellent (fuzzy, stemming, relevance scoring)
Aggregations | Basic (count, rate, avg from log content) | Advanced (nested, pipeline, matrix)
Schema management | None (schema-free) | Dynamic mapping (can cause mapping explosions)

When Loki wins: You know which service’s logs you need. You want cheap storage. You already run Prometheus and Grafana. Your primary workflow is “check logs for a specific service during a time window.”

When Elasticsearch wins: You need to search across all logs for an arbitrary string. You need advanced text analytics (tokenization, stemming, fuzzy matching). You need complex aggregations on structured log fields. Your primary workflow is “find the needle in the haystack.”

Prometheus vs Elastic Metrics

Prometheus is purpose-built for metrics. Elasticsearch handles metrics through Metricbeat and the data streams feature, but it is a general-purpose engine applying search-oriented architecture to metric storage.

Prometheus advantages:

  • PromQL is more expressive for metrics than KQL
  • Pull model with native Kubernetes service discovery
  • Efficient time-series storage format (~1.5 bytes per sample)
  • Massive ecosystem of exporters

Elasticsearch advantages:

  • Unified storage with logs (no context switching between backends)
  • Better at high-cardinality metrics (tag-based with document model)
  • Cross-index queries combining logs and metrics

In practice, Prometheus is the better metrics engine. Elastic’s strength is unification: if you already run Elasticsearch for logs, adding metrics avoids a second system.

Tempo vs Elastic APM

Tempo is a trace storage backend. Elastic APM is a broader Application Performance Monitoring solution that includes tracing, profiling, error tracking, and service maps.

Tempo advantages:

  • Accepts any trace format (OTLP, Jaeger, Zipkin)
  • Cheap storage (object storage, minimal indexing)
  • Simple to operate

Elastic APM advantages:

  • Richer feature set (service maps, anomaly detection, correlations)
  • Integrated with the rest of Elastic for cross-referencing
  • Auto-instrumentation agents for many languages
  • More mature search capabilities for traces

If you need a tracing backend, Tempo is simpler and cheaper. If you need a full APM platform, Elastic APM offers more out of the box.

Cost Model Comparison

Elasticsearch cost drivers:

  • Compute: CPU-heavy indexing. Budget significant CPU for ingest nodes.
  • Memory: JVM heap per node (typically 50% of available RAM). A 3-node cluster with 32 GB each costs real money.
  • Disk: Data + indices. Plan for 1.5-3x raw data volume. SSD expected for acceptable query performance.
  • Operations: Shard management, index lifecycle policies, cluster health monitoring. Requires dedicated expertise.

Grafana stack cost drivers:

  • Object storage: The primary cost. Cheap per GB ($0.023/GB/month for S3 Standard). Compression reduces it further.
  • Compute: Ingesters and queriers need memory. Less than Elasticsearch for equivalent data volumes.
  • Disk: Minimal. Used for WAL and caching, not primary storage.
  • Operations: Simpler per-component, but multiple components to manage.

At scale (terabytes per day of logs), the Grafana stack’s cost advantage is substantial — often 5-10x cheaper than Elasticsearch for log storage. The gap narrows for smaller deployments where operational complexity dominates cost.
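
The storage part of this comparison can be sketched with simple arithmetic. The S3 price and the 1.5-3x expansion factor come from the figures above. The compression ratio, the SSD price, and the daily volume are assumptions, and the sketch ignores replicas, compute, and request charges, so its ratio is not the overall cost difference.

```python
def loki_storage_cost(raw_gb_per_day: float, retention_days: int,
                      compression_ratio: float,
                      object_price_per_gb_month: float = 0.023) -> float:
    """Monthly object-storage cost for compressed log chunks."""
    stored_gb = raw_gb_per_day * retention_days / compression_ratio
    return stored_gb * object_price_per_gb_month


def elasticsearch_storage_cost(raw_gb_per_day: float, retention_days: int,
                               expansion_factor: float,
                               ssd_price_per_gb_month: float) -> float:
    """Monthly disk cost for data plus indices, before replicas."""
    stored_gb = raw_gb_per_day * retention_days * expansion_factor
    return stored_gb * ssd_price_per_gb_month


# Hypothetical workload: 1 TB of raw logs per day, kept for 30 days.
loki = loki_storage_cost(1000, 30, compression_ratio=5)           # assumed 5x
es = elasticsearch_storage_cost(1000, 30, expansion_factor=2,     # mid-range
                                ssd_price_per_gb_month=0.08)      # assumed
print(f"Loki storage: ${loki:,.0f}/month, Elasticsearch storage: ${es:,.0f}/month")
```

Plugging in your own volume, retention, and prices is more useful than the example numbers. Compute for ingesters, queriers, and Elasticsearch nodes usually dominates the bill and has to be estimated separately.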

Operational Complexity

Elasticsearch:

  • Cluster management (master nodes, data nodes, ingest nodes)
  • Shard sizing and rebalancing
  • Index lifecycle management
  • JVM tuning (heap size, GC settings)
  • Mapping explosions from uncontrolled log schemas
  • Upgrade procedures (rolling upgrades with shard allocation control)
  • Single system to manage but that system is complex

Grafana stack:

  • Multiple independent systems (Prometheus, Loki, Tempo, Grafana, Alertmanager)
  • Each system has its own configuration, scaling model, and failure modes
  • More moving parts but each part is simpler
  • Object storage management (lifecycle rules, cost monitoring)
  • Upgrade procedures per-component (stagger upgrades, test compatibility)

Neither is “simple.” Elasticsearch concentrates complexity in one system. The Grafana stack distributes it across several. Choose your preferred flavor of complexity.

Use Cases and Real-World Scenarios

Microservices

The Grafana stack was designed for this. Prometheus excels at per-service metrics with label-based aggregation. Loki naturally segments logs by service (Kubernetes labels map directly to Loki labels). Tempo connects requests across service boundaries. The integrated Grafana workflow (metric → trace → log) is purpose-built for debugging distributed request flows.

Kubernetes-Heavy Platforms

Native Kubernetes integration in every component. Service discovery, pod-level metrics, container log collection, and namespace-based tenancy all work out of the box. The kube-prometheus-stack Helm chart provides a ready-made monitoring solution for any Kubernetes cluster.

Cost-Sensitive Startups

Loki and Tempo’s object-storage-first architecture makes them ideal for startups that need observability without the hardware cost of an Elasticsearch cluster. A small team can run the monolithic deployment mode on a single machine and move to the simple scalable deployment (SSD) mode when data volumes grow. Grafana Cloud’s free tier (50 GB logs, 10k metrics series, 50 GB traces) is sufficient for small projects.

Large Enterprises

Enterprises benefit from multi-tenancy (isolating teams’ data), RBAC (controlling who sees what), and SSO (integrating with corporate identity providers). Grafana’s folder-based permissions and Loki/Mimir’s tenant isolation support organizational structures. Grafana Cloud and Grafana Enterprise add features like auditing, SLA reporting, and support contracts.

Hybrid and Cloud-Native Systems

Grafana’s data source model shines in hybrid environments. A single Grafana instance can query Prometheus in Kubernetes, CloudWatch in AWS, and an on-premises Elasticsearch cluster simultaneously. This makes Grafana a natural “single pane of glass” even in heterogeneous environments.

Limitations and Trade-offs

What the Grafana Stack Does Poorly

Full-text log search: Loki is not a search engine. If your primary use case is “search all logs for this error message across all services,” Loki will be slower than Elasticsearch. Significantly slower for broad queries.

Complex log analytics: Aggregations, statistical analysis, and machine learning on log data are not Loki’s strengths. Elasticsearch’s aggregation framework is far more powerful.

Out-of-the-box APM: Tempo is a tracing backend, not an APM platform. Features like auto-instrumentation, service maps, error tracking, and anomaly detection require additional tools or manual implementation. Elastic APM and commercial platforms (Datadog, New Relic) provide these out of the box.

Unified query language: PromQL, LogQL, and TraceQL are three different languages. Correlation between metrics, logs, and traces requires manual navigation in Grafana. Elastic’s single query language across all data types is more convenient for users who want to query everything in one place.

Long-term metrics without additional tooling: Prometheus alone has limited local retention. For months or years of metrics, you need Mimir, Thanos, or Cortex — additional systems to deploy and manage.

Ease of initial setup: For someone unfamiliar with the ecosystem, setting up Prometheus + Loki + Tempo + Grafana + Alertmanager + Alloy + object storage is more daunting than deploying a single Elasticsearch cluster (though the latter’s operational complexity catches up quickly).

When Elastic or Other Tools Outperform

  • Security analytics and SIEM: Elastic Security is a mature product. The Grafana stack has no equivalent.
  • Business analytics on logs: When logs are a source of business intelligence, Elasticsearch’s aggregation capabilities are superior.
  • When you need one system: If operational simplicity (fewer systems to manage) outweighs cost, Elasticsearch’s all-in-one model is appealing.
  • When search flexibility is paramount: Ad-hoc, free-form search across all data is Elasticsearch’s core strength.

Common Misconceptions

“Loki is free Elasticsearch”: No. Loki makes fundamentally different trade-offs. It is cheaper but less capable for search-heavy workloads.

“Prometheus handles everything metrics”: Prometheus is powerful but not unlimited. High-cardinality metrics, long-term retention, and global aggregation require additional tools (Mimir, Thanos).

“The Grafana stack is simple”: Simpler per-component, but the total system complexity of five or more components is non-trivial. It requires understanding each component’s architecture, failure modes, and tuning parameters.

“You must use all the components”: You can use Grafana with Elasticsearch. You can use Prometheus without Loki. You can adopt incrementally. The components are designed to work together but do not require each other.

Conclusion

The Grafana observability stack is a compelling choice for organizations that prioritize cost efficiency, cloud-native architecture, and composability. Its label-based philosophy unifies metrics, logs, and traces under a consistent mental model. Its object-storage-first design for Loki and Tempo keeps costs low at scale. Its separation of concerns (one tool per pillar) means you adopt only what you need.

Its weaknesses are real. Loki is not a search engine. Tempo is not an APM platform. The ecosystem requires understanding multiple components with different configuration models. Full-text log search in Loki will always be slower than in Elasticsearch.

Choose the Grafana stack when:

  • You operate in Kubernetes-native environments
  • Cost efficiency at scale matters more than query flexibility
  • You need metrics-first observability with logs and traces as supporting pillars
  • Your team is comfortable managing multiple focused tools rather than one large platform
  • You want to avoid vendor lock-in (open-source core, portable data formats)

Choose the Elastic stack when:

  • Log search and analytics is your primary use case
  • You need security analytics (SIEM)
  • You prefer a single system over multiple components
  • Full-text search flexibility is non-negotiable
  • You need advanced APM features out of the box

Consider commercial platforms (Datadog, New Relic) when:

  • You want zero operational burden for the observability infrastructure itself
  • Developer experience and ease of onboarding are top priorities
  • The cost (which can be substantial) is acceptable relative to engineering time saved

The observability landscape is converging. OpenTelemetry is standardizing telemetry collection. Grafana is adding more Elasticsearch-like features (bloom filters, structured metadata). Elastic is adding more Prometheus-like features (TSDB data streams). The gap between stacks is narrowing.

But architectural philosophies don’t converge easily. Loki will never build full-text indices. Elasticsearch will never drop them. The choice between indexing everything and indexing only metadata is a fundamental design decision, not a feature gap to be closed.

Choose based on your workload, your team’s expertise, and your cost model. Not based on hype.
