Surviving the 2-Minute Warning: Zero Downtime on EKS Spot Instances

It started with a subtle anomaly in our Datadog dashboards. Every day at roughly 10:00 AM UTC—coinciding with the daily market price fluctuation in our chosen availability zone—our API gateway success rate would dip from 99.99% to 99.5%. For a platform handling thousands of transactions per second, that 0.49% drop translated to hundreds of failed customer requests and 502 Bad Gateway errors. The culprit? We were running our stateless workloads on AWS Spot Instances to align with our FinOps strategy, aiming for that sweet 70% cost reduction. However, we were paying for those savings with reliability.

The Anatomy of a Spot Termination

In our environment, we were running Amazon EKS 1.28 with a mix of `t3.large` and `m5.large` instances managed by Karpenter. The premise of Spot Instances is simple: you get unused compute capacity at a steep discount, but AWS can reclaim it with a mere two-minute warning. This warning is the Spot Instance interruption notice, often shortened to ITN.

When AWS needs the capacity back, it issues this warning via the EC2 Instance Metadata Service (IMDS). If your Kubernetes cluster isn't actively listening for this specific signal, the underlying EC2 instance simply vanishes after two minutes. The Kubernetes control plane eventually notices the node is `NotReady`, but by then, it's too late. The pods running on that node are killed abruptly, in-flight requests are severed, and your load balancer is left routing traffic to a black hole until health checks catch up.

The Symptom: Your logs show `upstream prematurely closed connection` or `Connection reset by peer` exactly when an EC2 instance ID disappears from the cluster.
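
You can watch for this signal yourself by querying IMDS from the instance. With IMDSv2 (token-based), a quick check looks roughly like this; the endpoint returns a 404 until AWS actually schedules the interruption:

# Request an IMDSv2 session token, then poll the Spot interruption endpoint
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action
# Once an interruption is scheduled, the response is JSON along the lines of:
# {"action": "terminate", "time": "2024-05-14T10:02:00Z"}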

Why Standard Kubernetes Grace Periods Failed

My first attempt to fix this was arguably naive. I assumed that simply increasing the `terminationGracePeriodSeconds` in our deployment YAML would solve the issue. My logic was: "If I give the pod 60 seconds to shut down, it will finish processing requests."

I updated the configuration and deployed it. The next day, the errors persisted. Why? Because `terminationGracePeriodSeconds` only applies when Kubernetes initiates the pod eviction (like during a `kubectl delete pod` or a rolling update). In a Spot interruption scenario, Kubernetes didn't know the node was dying. The OS shut down underneath the Kubelet, and the Kubelet never got the chance to execute the graceful shutdown logic. We needed a bridge between the AWS hardware events and the Kubernetes API.

The Solution: Node Termination Handler & Signal Bridging

To achieve true High Availability on Spot Instances, we must intercept the termination signal and translate it into a Kubernetes node drain. This is where the AWS Node Termination Handler (NTH) comes in. While you can run this in Queue Processor mode (using SQS and EventBridge), for most clusters the IMDS Processor (DaemonSet) mode is faster to implement and sufficiently robust.
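
For reference, the Queue Processor alternative is largely a values.yaml change. A minimal sketch, assuming you have already created an SQS queue, routed the EC2 Spot interruption events to it via EventBridge, and given the handler IAM access (the queue URL below is a placeholder):

# values.yaml (Queue Processor mode) -- sketch only
enableSqsTerminationDraining: true
awsRegion: "us-east-1"
queueURL: "https://sqs.us-east-1.amazonaws.com/<account-id>/nth-events"

We stayed with the IMDS DaemonSet mode, which is what the rest of this post assumes.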

The NTH runs a small pod on every node. It polls the EC2 metadata service for the interruption notice. As soon as it sees the "2-minute warning," it immediately cordons the node (preventing new pods from scheduling) and drains it (evicting existing pods gracefully).

However, installing NTH is only half the battle. You must also ensure your application handles the `SIGTERM` signal correctly. Below is the production-ready configuration we used to stabilize our cluster.

1. Application Layer: The PreStop Hook

Even with NTH draining the node, your application needs to stop accepting new connections while finishing current ones. NGINX and Spring Boot handle this differently, but for a generic Node.js or Go service, a `preStop` hook is essential to create a buffer between the load balancer removing the endpoint and the app shutting down.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      # Essential: give K8s enough time for the preStop hook plus the app's own shutdown
      terminationGracePeriodSeconds: 45
      containers:
      - name: app
        image: my-app:v2
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          # Fail fast so traffic stops routing here immediately
          periodSeconds: 2
          failureThreshold: 1

Logic Breakdown: The `sleep 10` is not arbitrary. When NTH drains a node, Kubernetes updates the Endpoints API to remove the pod IP. This change takes time to propagate to kube-proxy and the AWS Load Balancer Target Groups. The `sleep` keeps the container alive and accepting traffic just long enough for the load balancer to stop routing new requests to it. Without this, you drain the node, but the LB still sends traffic for a few seconds, resulting in errors.

2. Infrastructure Layer: Installing NTH

We used Helm to deploy the Node Termination Handler. The key configuration here is enabling the spot interruption handling and ensuring it has the right permissions.

# helm repo add eks https://aws.github.io/eks-charts
# helm install aws-node-termination-handler eks/aws-node-termination-handler --namespace kube-system --values values.yaml

# values.yaml configuration
enableSpotInterruptionDraining: true
enableRebalanceMonitoring: false # Set true to cordon nodes on rebalance recommendations (see the Pro Tip below)
enableScheduledEventDraining: true

# Optional: post a notification (e.g., to Slack) whenever the handler drains a node.
# This has no effect on how quickly draining happens; leave it unset if you don't need it.
# webhookURL: "https://hooks.slack.com/services/..."

# Resource limits to ensure the handler itself doesn't get evicted
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 100m
    memory: 128Mi
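
After the install, it is worth confirming the handler actually landed on every node. With the release name used above, the DaemonSet should be named aws-node-termination-handler (the chart derives it from the release name), so a quick check looks like:

# The chart creates one handler pod per node in IMDS mode
kubectl get daemonset aws-node-termination-handler -n kube-system
kubectl get pods -n kube-system -o wide | grep aws-node-termination-handler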

Verification & ROI

After implementing the Node Termination Handler and refining our `preStop` hooks, we simulated a spot interruption using the AWS Fault Injection Simulator. The results were immediate and dramatic.

Metric                      | Before Optimization          | After NTH Implementation
--------------------------- | ---------------------------- | --------------------------
5xx Errors during Scale-in  | ~1.5% of traffic             | 0.00% (Zero)
Pod Rescheduling Time       | Unpredictable (Node failure) | Controlled (Eviction API)
Compute Cost Savings        | 15% (Hesitant Spot usage)    | 72% (Aggressive Spot usage)
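
During the experiment itself, two kubectl watches were enough to confirm the behaviour: the node flips to SchedulingDisabled the moment NTH cordons it, and its pods are evicted and rescheduled well before the instance disappears (the node name below is illustrative):

# Terminal 1: node status changes to Ready,SchedulingDisabled when NTH cordons it
kubectl get nodes --watch

# Terminal 2: pods on the doomed node (name is an example) get evicted and rescheduled
kubectl get pods -o wide --watch | grep ip-10-0-42-17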

The combination of Kubernetes Draining triggered by NTH and the application-level wait periods allowed us to fully embrace Spot Instances without compromising our SLA. The FinOps impact was substantial; because we were no longer afraid of interruptions, we moved our critical production workloads to Spot, reducing our monthly AWS bill by nearly $4,000 for this cluster alone.

Edge Cases & Critical Warnings

While this solution works for 95% of web services, there are edge cases where relying on the 2-minute warning is dangerous.

Hard Limit: You strictly have 120 seconds. If your pod requires 5 minutes to flush data to disk or finish a batch job, NTH cannot save you. The instance will terminate.

For long-running batch processing or stateful sets (databases), you should avoid Spot Instances or implement checkpointing logic that can resume work after an interruption. Also, be aware of "Spot Storms" where AWS reclaims an entire Availability Zone's capacity at once. In such cases, if you don't have a Pod Disruption Budget (PDB) properly configured, NTH might drain all nodes simultaneously, leaving your service with zero replicas. Always ensure your PDB sets `minAvailable` to maintain service continuity.
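
A minimal sketch for the payment-service Deployment shown earlier (it matches on the app: payment-service label and assumes at least three replicas, so two pods must stay up during any voluntary disruption, including NTH-initiated drains):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  # Keep at least 2 pods running during drains; a percentage like "50%" also works
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-service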

Pro Tip: Use the `enableRebalanceMonitoring` flag in NTH. This listens for "Rebalance Recommendation" events, which often arrive earlier than the 2-minute termination notice, giving you slightly more time to migrate gracefully.
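
In the Helm chart this is again a values.yaml tweak. As we understand the chart's options, enableRebalanceMonitoring cordons the node when a recommendation arrives, while enableRebalanceDraining additionally evicts the pods; a sketch:

# values.yaml -- act on rebalance recommendations
enableRebalanceMonitoring: true   # cordon on rebalance recommendation
enableRebalanceDraining: true     # also drain the node, not just cordon it

One caveat: rebalance recommendations can fire for instances that are never actually reclaimed, so enabling this trades some extra pod churn for the earlier warning.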

Conclusion

Running production workloads on Spot Instances is a high-reward strategy, but it requires a shift in mindset from "uptime guarantees" to "failure handling." By bridging the gap between AWS infrastructure events and Kubernetes pod lifecycles using the Node Termination Handler, you turn a chaotic server failure into a boring, automated operational event. This is the essence of resilience.
