Prometheus Storage Full? Scaling to S3 with Thanos Sidecar

It started with a classic paging alert at 3:14 AM: DiskUsageHigh: 95% on prometheus-data. We were running a standard Prometheus setup on Kubernetes, collecting metrics from about 400 microservices. Initially, a 500GB Persistent Volume seemed like overkill for 14 days of retention. But as our engineering team embraced Observability, adding high-cardinality metrics (like user-specific request tags), that buffer evaporated. We found ourselves in a loop of resizing EBS volumes every month, forcing downtime and increasing costs.

The Bottleneck: Local TSDB Limits

The core issue isn't just disk space; it's the architectural limitation of standard Monitoring Systems that rely on local storage. Prometheus is brilliant at scraping and writing to its local Time Series Database (TSDB), but it is not designed to be a distributed database for Long-term Storage.

In our scenario (AWS EKS, Prometheus v2.45), simple vertical scaling hit a wall. Resizing the volume required re-mounting, and querying 6 months of historical data for capacity planning caused OOM (Out of Memory) kills on the Prometheus pod because it tried to load massive chunks of data into RAM.

Critical Failure: When the Prometheus pod crashed due to OOM during a heavy query, we lost visibility (blind spots) for the 15 minutes it took to restart and replay the WAL (Write Ahead Log).

We needed a way to offload older data to cheap object storage (S3) while keeping the scraping lightweight. This is where the Thanos Architecture comes in, specifically the Sidecar pattern.

Why Federation Failed Us

Before settling on Thanos, we tried Prometheus Federation. The idea was to have "slave" Prometheus servers scrape apps, and a "master" server scrape the slaves. This failed miserably in practice. The "master" server became a massive single point of failure (SPOF). It essentially replicated the same storage problem one level higher, and configuring the filtering rules to avoid duplicating raw metrics was a maintenance nightmare. Federation is good for aggregating specific subsets of data, not for full-blown HA and archival.
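
For context, hierarchical federation works by having the top-level server scrape the /federate endpoint of the lower-level ones, with explicit match[] selectors deciding which series get pulled up. A rough sketch of what that looks like (job names and targets here are illustrative, not our exact config):

# prometheus.yml on the "master" -- illustrative sketch only
scrape_configs:
  - job_name: "federate"
    scrape_interval: 30s
    honor_labels: true            # keep the job/instance labels from the leaf servers
    metrics_path: /federate
    params:
      "match[]":                  # every series subset must be listed explicitly
        - '{job="kubernetes-pods"}'
        - '{__name__=~"job:.*"}'  # pre-aggregated recording rules only
    static_configs:
      - targets:
          - "prometheus-team-a:9090"   # the "slave" servers (names assumed)
          - "prometheus-team-b:9090"

Every new team or metric subset meant another match[] entry on the "master", which is exactly the maintenance burden that pushed us away from this approach.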

The Solution: Thanos Sidecar & Object Store

The Thanos Sidecar pattern is elegant because the sidecar runs in the same pod, right next to your Prometheus container. Every 2 hours, as soon as Prometheus finalizes a TSDB block, the sidecar picks it up and uploads it to an Object Store (AWS S3, GCS, etc.).

This solves two problems:

  1. Long-term Storage: Prometheus only needs to keep 2-6 hours of data locally. Everything else lives in S3.
  2. Prometheus HA: By running two identical Prometheus replicas (A and B), both scraping the same targets, and having Thanos Query deduplicate the data on read, we achieve high availability without complex clustering protocols (a minimal Querier sketch follows this list).
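
For reference, that read-path deduplication lives in the Thanos Query component, not in Prometheus itself. Below is a minimal sketch of a Querier Deployment (names and store endpoints are assumptions for illustration); the important flag is --query.replica-label, which must name the external label that differs between replica A and B.

# thanos-query-deployment.yaml -- minimal sketch; names and store endpoints are assumed
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: thanosio/thanos:v0.32.2
        args:
          - "query"
          - "--http-address=0.0.0.0:9090"
          - "--grpc-address=0.0.0.0:10901"
          # Series that differ only in this label are merged at query time
          - "--query.replica-label=replica"
          # gRPC endpoints of the two sidecars and the Store Gateway (addresses assumed)
          - "--store=dnssrv+_grpc._tcp.prometheus-headless.monitoring.svc.cluster.local"
          - "--store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local"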

Configuration Logic

Here is the Kubernetes StatefulSet configuration required to inject the Thanos Sidecar. Note the shared volume mount logic; the supporting resource names (ConfigMap, Secret, volume size) are placeholders you should adapt to your environment.

# prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus # headless Service assumed to exist with this name
  replicas: 2 # Running 2 replicas for HA
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      # 1. The Standard Prometheus Container
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--storage.tsdb.retention.time=4h" # Keep local retention short!
          - "--storage.tsdb.min-block-duration=2h" # equal min/max disables local compaction
          - "--storage.tsdb.max-block-duration=2h"
          - "--web.enable-lifecycle"
        volumeMounts:
          - name: prometheus-storage
            mountPath: /prometheus
          - name: config-volume
            mountPath: /etc/prometheus

      # 2. The Thanos Sidecar Container
      - name: thanos-sidecar
        image: thanosio/thanos:v0.32.2
        args:
          - "sidecar"
          - "--tsdb.path=/prometheus"
          - "--prometheus.url=http://127.0.0.1:9090"
          - "--objstore.config-file=/etc/thanos/bucket_config.yaml"
        env:
          - name: POD_NAME # pod name via the downward API, useful for per-replica identification
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        ports:
          - name: grpc
            containerPort: 10901 # Thanos Query connects to the sidecar over gRPC here
        volumeMounts:
          - name: prometheus-storage # MUST share the same volume as Prometheus
            mountPath: /prometheus
          - name: thanos-config
            mountPath: /etc/thanos
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config # ConfigMap holding prometheus.yml (name assumed)
        - name: thanos-config
          secret:
            secretName: thanos-objstore-config # Secret holding bucket_config.yaml (name assumed)
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi # small buffer; only ~4h of blocks live here

The --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h flags on Prometheus are critical. Setting them to the same value disables Prometheus's local compaction, so every 2-hour block is final the moment it is written and the Thanos sidecar can upload it immediately. If you leave block durations dynamic, Prometheus keeps compacting older blocks into larger ones, which delays uploads significantly and can produce overlapping blocks in the bucket.

You also need the bucket_config.yaml secret to define where the data goes. Here is an S3 example:

# bucket_config.yaml
type: S3
config:
  bucket: "my-company-metrics-archive"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  # Use IAM Roles (ServiceAccount) in prod, keys are for demo only
  # access_key: "..."
  # secret_key: "..."
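
The sidecar reads that file from the thanos-config volume, so it needs to live in a Secret whose key matches the --objstore.config-file path. One way to wire that up, assuming the Secret name thanos-objstore-config used in the StatefulSet above (name and namespace are assumptions):

# thanos-objstore-secret.yaml -- sketch; Secret name and namespace are assumptions
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  bucket_config.yaml: |   # the key becomes the file name under /etc/thanos
    type: S3
    config:
      bucket: "my-company-metrics-archive"
      endpoint: "s3.us-east-1.amazonaws.com"
      region: "us-east-1"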

Note: For Prometheus HA, ensure both replicas upload to the same bucket. The Thanos Compactor (a separate component) handles downsampling, while the Querier handles real-time deduplication based on the replica label.

Performance & Cost Impact

After deploying the Sidecar and connecting it to a Thanos Querier (which serves as the global view interface), we measured the impact on our infrastructure.

Metric                 Prometheus Standalone           Prometheus + Thanos Sidecar
Local Disk Retention   14 Days (Risk of Full Disk)     4 Hours (Stable)
Query Range Limit      ~3 Days (OOM Risk)              Unlimited (streaming from S3)
EBS Cost               $0.10/GB-month (gp3)            Minimal (small local buffer)
S3 Cost                N/A                             $0.023/GB-month (Standard)

The shift to S3 reduced our storage costs by approximately 75% for data older than a month. More importantly, the stability of the collection layer improved. Prometheus became "dumb" and reliable—it just scrapes and dumps data. The heavy lifting of querying historical data is offloaded to the Thanos Store Gateway, which streams directly from S3.
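
For completeness, the Store Gateway is a separate Thanos deployment pointed at the same bucket configuration. A minimal sketch of its container spec (paths and port values are illustrative):

# thanos-store.yaml (container spec only; runs in its own StatefulSet) -- minimal sketch
- name: thanos-store
  image: thanosio/thanos:v0.32.2
  args:
    - "store"
    - "--data-dir=/var/thanos/store"   # local cache of block index headers
    - "--objstore.config-file=/etc/thanos/bucket_config.yaml"
    - "--grpc-address=0.0.0.0:10901"   # Thanos Query connects here
    - "--http-address=0.0.0.0:10902"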

Edge Cases & Configuration Gotchas

While this architecture solves the retention problem, it introduces new variables you must manage:

  1. Network Egress Costs: If your Prometheus is in AWS and your S3 bucket is in another region, the cross-region data transfer costs will destroy your budget. Always colocate metrics and buckets.
  2. Compactor Concurrency: To query data efficiently over months or years, you need the Thanos Compactor to downsample data (e.g., 5-minute resolution for 1-month-old data). The Compactor is a singleton and requires significant disk space for temporary processing; a flag sketch follows the warning below.
  3. External Labels: You must configure unique external labels in your Prometheus config (e.g., cluster="prod-us", replica="0"). Without these, Thanos cannot distinguish or deduplicate the blocks in S3, leading to query failures; a minimal config fragment follows this list.
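
Here is a minimal prometheus.yml fragment for point 3 (the label names and values are one common convention, not a requirement). The replica value must be unique per replica; since Prometheus does not expand environment variables in its config file, it is usually templated in at deploy time, e.g. from the pod name:

# prometheus.yml (global section, per replica) -- sketch; label values are illustrative
global:
  scrape_interval: 30s
  external_labels:
    cluster: "prod-us"
    replica: "prometheus-0"   # the second replica would use "prometheus-1"
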
Warning: Do not enable Compactor retention enforcement and S3 bucket lifecycle policies simultaneously. Let the Compactor manage deletions to avoid data corruption.
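
To make that concrete, here is a hedged sketch of the Compactor flags that cover downsampling and bucket-side retention (the retention windows are illustrative, not our production values). It runs as a single replica against the same bucket_config.yaml:

# thanos-compact container args -- sketch; retention windows are illustrative
args:
  - "compact"
  - "--wait"                                   # run continuously instead of one-shot
  - "--data-dir=/var/thanos/compact"           # needs generous scratch disk
  - "--objstore.config-file=/etc/thanos/bucket_config.yaml"
  - "--retention.resolution-raw=30d"           # raw samples kept 30 days
  - "--retention.resolution-5m=180d"           # 5-minute downsamples kept 6 months
  - "--retention.resolution-1h=2y"             # 1-hour downsamples kept 2 years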

Conclusion

Implementing the Thanos Sidecar pattern transformed our Monitoring Systems from a fragile, disk-bound liability into a scalable, stateless architecture. By decoupling storage from collection, we achieved true Prometheus HA and cost-effective Long-term Storage on S3. If you are managing Kubernetes clusters at scale, this move is not optional; it is the industry standard for robust Observability.
