It started with a classic paging alert at 3:14 AM: DiskUsageHigh: 95% on prometheus-data. We were running a standard Prometheus setup on Kubernetes, collecting metrics from about 400 microservices. Initially, a 500GB Persistent Volume seemed like overkill for 14 days of retention. But as our engineering team embraced Observability, adding high-cardinality metrics (like user-specific request tags), that buffer evaporated. We found ourselves in a loop of resizing EBS volumes every month, forcing downtime and increasing costs.
The Bottleneck: Local TSDB Limits
The core issue isn't just disk space; it's the architectural limitation of standard Monitoring Systems that rely on local storage. Prometheus is brilliant at scraping and writing to its local Time Series Database (TSDB), but it is not designed to be a distributed database for Long-term Storage.
In our scenario (AWS EKS, Prometheus v2.45), simple vertical scaling hit a wall. Resizing the volume required re-mounting, and querying 6 months of historical data for capacity planning caused OOM (Out of Memory) kills on the Prometheus pod because it tried to load massive chunks of data into RAM.
We needed a way to offload older data to cheap object storage (S3) while keeping the scraping lightweight. This is where the Thanos Architecture comes in, specifically the Sidecar pattern.
Why Federation Failed Us
Before settling on Thanos, we tried Prometheus Federation. The idea was to have "slave" Prometheus servers scrape apps, and a "master" server scrape the slaves. This failed miserably in practice. The "master" server became a massive single point of failure (SPOF). It essentially replicated the same storage problem one level higher, and configuring the filtering rules to avoid duplicating raw metrics was a maintenance nightmare. Federation is good for aggregating specific subsets of data, not for full-blown HA and archival.
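For the record, the "master" side of that setup looked roughly like this; the job name and the match[] selectors are illustrative, not our production values:

# prometheus.yml on the "master" (federation fragment)
scrape_configs:
  - job_name: "federate"
    honor_labels: true              # keep the labels set by the downstream servers
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~"api-.*"}'         # every selector here is a filtering rule to maintain
    static_configs:
      - targets:
          - "prometheus-slave-0:9090"
          - "prometheus-slave-1:9090"

Every new team meant another match[] entry to negotiate, and forgetting honor_labels silently mangled the data. That maintenance burden is what the paragraph above calls a nightmare.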
The Solution: Thanos Sidecar & Object Store
The Thanos Sidecar pattern is elegant because it sits literally "next to" your Prometheus container, sharing its data volume. Every 2 hours, as soon as Prometheus finalizes a TSDB block, the sidecar uploads it to an Object Store (AWS S3, GCS, etc.).
This solves two problems:
- Long-term Storage: Prometheus only needs to keep 2-6 hours of data locally. Everything else lives in S3.
- Prometheus HA: By running two identical Prometheus replicas (A and B), both scraping the same targets, and having Thanos Query deduplicate the data on read, we achieve high availability without complex clustering protocols.
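Deduplication only works if each replica advertises its identity. Here is a minimal sketch of the external labels involved; note that Prometheus does not expand environment variables in its config file, so the replica value has to be templated in at startup (an init container running sed, or the Prometheus Operator, handles this for you):

# prometheus.yml (fragment)
global:
  external_labels:
    cluster: prod-us
    replica: $(POD_NAME)   # rendered to prometheus-0 / prometheus-1 before startup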
Configuration Logic
Here is the Kubernetes StatefulSet required to inject the Thanos Sidecar (the ConfigMap and Secret names are placeholders for your own). Note the shared volume mount logic.
# prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  serviceName: prometheus        # headless Service required by StatefulSets
  replicas: 2                    # Running 2 replicas for HA
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        # 1. The Standard Prometheus Container
        - name: prometheus
          image: prom/prometheus:v2.45.0
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time=4h"    # Keep local short!
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
            - "--web.enable-lifecycle"
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: config-volume
              mountPath: /etc/prometheus
        # 2. The Thanos Sidecar Container
        - name: thanos-sidecar
          image: thanosio/thanos:v0.32.2
          args:
            - "sidecar"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://127.0.0.1:9090"
            - "--objstore.config-file=/etc/thanos/bucket_config.yaml"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: prometheus-storage  # MUST share the same volume
              mountPath: /prometheus
            - name: thanos-config
              mountPath: /etc/thanos
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config            # your scrape config (name is a placeholder)
        - name: thanos-config
          secret:
            secretName: thanos-objstore-config # the bucket config below (name is a placeholder)
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi                      # small buffer only; history lives in S3
The --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h flags on Prometheus are critical. Setting them to the same value disables Prometheus's local compaction, so every block is finalized after exactly 2 hours and the sidecar can pick it up and upload it immediately; all further compaction is delegated to the Thanos Compactor running against the bucket. If you leave local compaction enabled, Prometheus merges blocks the sidecar may already have shipped, and upload latency increases significantly.
You also need the bucket_config.yaml, mounted from the thanos-config Secret above, to tell the sidecar where the data goes. Here is an S3 example:
# bucket_config.yaml
type: S3
config:
  bucket: "my-company-metrics-archive"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  # Use IAM Roles (ServiceAccount) in prod, keys are for demo only
  # access_key: "..."
  # secret_key: "..."
With blocks flowing into S3, the read path needs one more piece: the Thanos Querier. Point it at both sidecars and pass --query.replica-label=replica so that duplicate series from the HA pair are merged at read time on the replica label.
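A minimal sketch of the Querier container, assuming a headless Service named thanos-sidecar-grpc in a monitoring namespace exposing the sidecars' gRPC port (the Service name and namespace are ours to pick, not fixed by Thanos):

# thanos-query (container fragment)
- name: thanos-query
  image: thanosio/thanos:v0.32.2
  args:
    - "query"
    - "--http-address=0.0.0.0:9090"     # UI + PromQL API; Grafana points here
    - "--grpc-address=0.0.0.0:10901"
    # Merge duplicate series from the HA pair on this external label
    - "--query.replica-label=replica"
    # DNS SRV discovery of every sidecar (and, later, the Store Gateway)
    - "--endpoint=dnssrv+_grpc._tcp.thanos-sidecar-grpc.monitoring.svc.cluster.local"

Grafana then uses the Querier, not the individual Prometheus pods, as its data source.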
Performance & Cost Impact
After deploying the Sidecar and connecting it to a Thanos Querier (which serves as the global view interface), we measured the impact on our infrastructure.
| Metric | Prometheus Standalone | Prometheus + Thanos Sidecar |
|---|---|---|
| Local Disk Retention | 14 Days (Risk of Full Disk) | 4 Hours (Stable) |
| Query Range Limit | ~3 Days (OOM Risk) | Unlimited (streaming from S3) |
| EBS Cost | $0.10/GB-month (gp3) | Minimal (small buffer) |
| S3 Cost | N/A | $0.023/GB-month (Standard) |
The shift to S3 reduced our storage costs by approximately 75% for data older than a month. More importantly, the stability of the collection layer improved. Prometheus became "dumb" and reliable—it just scrapes and dumps data. The heavy lifting of querying historical data is offloaded to the Thanos Store Gateway, which streams directly from S3.
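For completeness, a minimal sketch of the Store Gateway container: it speaks the same gRPC StoreAPI as the sidecars, so the Querier discovers it the same way, and its data-dir is only a local cache that is safe to lose:

# thanos-store (container fragment)
- name: thanos-store
  image: thanosio/thanos:v0.32.2
  args:
    - "store"
    - "--data-dir=/var/thanos/store"    # index-header cache, not the data itself
    - "--objstore.config-file=/etc/thanos/bucket_config.yaml"
    - "--grpc-address=0.0.0.0:10901"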
Edge Cases & Configuration Gotchas
While this architecture solves the retention problem, it introduces new variables you must manage:
- Network Egress Costs: If your Prometheus is in AWS and your S3 bucket is in another region, the cross-region data transfer costs will destroy your budget. Always colocate metrics and buckets.
- Compactor Concurrency: To query data efficiently over months or years, you need the Thanos Compactor to downsample data (e.g., 5-minute resolution for 1-month-old data). The Compactor must run as a singleton: exactly one instance per bucket, because concurrent runs against the same bucket are unsafe. It also requires significant disk space for temporary processing; see the sketch after this list.
- External Labels: You must configure unique external labels in your Prometheus config (e.g., cluster="prod-us", replica="0"). Without these, Thanos cannot distinguish or deduplicate the blocks in S3, leading to query failures.
Conclusion
Implementing the Thanos Sidecar pattern transformed our Monitoring Systems from a fragile, disk-bound liability into a scalable, stateless architecture. By decoupling storage from collection, we achieved true Prometheus HA and cost-effective Long-term Storage on S3. If you are managing Kubernetes clusters at scale, this move is not optional; it is the industry standard for robust Observability.