Debugging Kubernetes OOMKilled in Go: Memory Leaks & pprof

There is nothing more frustrating than waking up to a PagerDuty alert because your pods are stuck in a CrashLoopBackOff. You check the logs, but they end abruptly. You check the Kubernetes events, and there it is: OOMKilled (Exit Code 137). The container exceeded its memory limit, and the Linux kernel stepped in to terminate the process. We recently hit this exact scenario in a high-throughput microservice handling 15k TPS: memory usage would creep up slowly until the inevitable crash, despite our calculations suggesting plenty of headroom.

The Root Cause: Go GC vs. Kubernetes Limits

When debugging memory issues in Go on Kubernetes, it is crucial to understand that the Go runtime and the cgroup memory limit Kubernetes enforces do not talk to each other by default. The Go garbage collector (GC) is designed to optimize for throughput and CPU usage, often allowing the heap to grow significantly before triggering a collection. If that growth outpaces the hard limit set in resources.limits.memory, the kernel's OOM killer terminates the process before the GC ever gets a chance to reclaim space.

Critical Misconception: High memory usage isn't always a code leak. It is often a configuration mismatch where the Go runtime is unaware of the container's memory ceiling.
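
To see why the default pacing gets you into trouble, recall that with the default GOGC=100 the collector roughly lets the heap double relative to the live set of the previous cycle before it runs again: a ~600 MiB live set can legitimately pace toward ~1.2 GiB, sailing straight past a 1 Gi limit. The minimal sketch below (illustrative, not from our service) prints the runtime's current pacing target so you can compare it against your container limit.

// gcpacing.go
package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)

    // HeapAlloc is the live-plus-uncollected heap; NextGC is the heap size at
    // which the runtime plans to trigger the next collection (roughly twice
    // the live set with GOGC=100 and no memory limit configured).
    fmt.Printf("HeapAlloc: %d MiB\n", ms.HeapAlloc/1024/1024)
    fmt.Printf("NextGC:    %d MiB\n", ms.NextGC/1024/1024)
}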

Step 1: Instrumentation with pprof

To differentiate between a configuration issue and a genuine memory leak (e.g., unclosed response bodies, growing maps, or goroutine leaks), you must inspect the heap. The standard tool for this is pprof. We need to expose the debug endpoint to capture a live profile.

Add the following import to your main entry file. This automatically registers the debug handlers under /debug/pprof/.

// main.go
package main

import (
    "log"
    "net/http"

    _ "net/http/pprof" // Implicitly registers the /debug/pprof handlers on DefaultServeMux
)

func main() {
    // Start a separate server for profiling to avoid exposing it publicly
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Your application logic...
}
Security Note: Never expose pprof on a public port. Bind it to localhost or use a sidecar/forwarding mechanism to access it securely.
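
A related gotcha: the blank import registers the handlers on http.DefaultServeMux, so if your main application also serves DefaultServeMux, the profiler rides along on your public port. One way around this, sketched below (the helper name is ours), is to wire the pprof handlers onto a dedicated mux for the loopback listener.

// pprof_server.go
package main

import (
    "log"
    "net/http"
    "net/http/pprof"
)

// startPprofServer serves the profiler from its own mux on loopback, so the
// handlers never appear on the application's public router. Note that merely
// importing net/http/pprof still registers on DefaultServeMux via init(), so
// avoid exposing DefaultServeMux publicly as well.
func startPprofServer() {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

    go func() {
        log.Println(http.ListenAndServe("localhost:6060", mux))
    }()
}

func main() {
    startPprofServer()
    select {} // stand-in for your application logic
}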

Step 2: Capturing and Analyzing the Heap

Once the application is running (and approaching its memory limit), you need to capture a snapshot. Use kubectl port-forward to tunnel into the pod.

# Port forward to the profiling port
kubectl port-forward pod/my-app-pod-12345 6060:6060

# Capture the heap profile
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

This command opens a web interface at http://localhost:8080. Switch the sample type to "alloc_space" to see cumulative allocations over the life of the process, and "inuse_space" to see the memory currently held live on the heap. If inuse_space is dominated by a single allocation site that keeps growing between snapshots, you have a leak. If memory is high (or merely fragmented) but stable, you likely have a tuning issue rather than a leak.
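
For reference, the kinds of leaks listed earlier (growing maps, unclosed response bodies, goroutine pile-ups) all share the same signature in this view: one allocation site whose inuse_space climbs between snapshots. A deliberately leaky, hypothetical handler looks like this:

// leak_example.go
package main

import (
    "fmt"
    "log"
    "net/http"
)

// responses acts as an unbounded cache: nothing ever evicts entries, so the
// inuse_space attributed to the allocation inside handler only ever grows.
var responses = map[string][]byte{}

func handler(w http.ResponseWriter, r *http.Request) {
    // Keying on the full URL (query strings, IDs) makes the cardinality, and
    // therefore the heap, effectively unbounded.
    key := r.URL.String()
    if _, ok := responses[key]; !ok {
        responses[key] = make([]byte, 64*1024) // simulate a cached payload
    }
    fmt.Fprintf(w, "cached entries: %d\n", len(responses))
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe("localhost:8081", nil))
}

In the flame graph this shows up as the make call inside handler dominating inuse_space and climbing as new keys arrive; the fix is eviction (LRU/TTL) or keying on bounded values. Unclosed response bodies misbehave similarly, keeping connection buffers and their reader goroutines alive.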

Step 3: The Fix (GOMEMLIMIT)

If your profile shows legitimate memory usage (not a leak) but you are still getting OOMKilled, the solution is the GOMEMLIMIT environment variable introduced in Go 1.19. It sets a soft memory limit for the runtime: as the heap approaches it, the GC runs more aggressively instead of letting the heap keep growing, preventing the out-of-memory kill.

Best Practice Configuration: Set GOMEMLIMIT to roughly 90% of your Kubernetes memory limit, leaving headroom for goroutine stacks, cgo allocations, and other memory the GC does not manage.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-service
spec:
  template:
    spec:
      containers:
        - name: app
          image: my-go-app:latest
          resources:
            limits:
              memory: "1Gi"
            requests:
              memory: "512Mi"
          env:
            # 90% of 1Gi Limit
            - name: GOMEMLIMIT
              value: "900MiB"

Performance Verification

We applied GOMEMLIMIT to our production clusters and observed immediate stabilization. The sawtooth pattern of memory usage flattened out as the GC began triggering predictably near the limit.

| Metric | Before Optimization | After (GOMEMLIMIT) |
| --- | --- | --- |
| Restart Count (24h) | 12 (OOMKilled) | 0 |
| Peak Memory | 1.1 GB (Crashed) | 850 MB (Stable) |
| GC CPU Overhead | Low (Lazy GC) | Moderate (Adaptive) |

Result: By making the Go Runtime "container-aware" via GOMEMLIMIT, we eliminated OOM crashes without increasing resource costs.

Conclusion

Kubernetes OOMKilled errors in Go are often a symptom of the runtime failing to recognize container constraints rather than of genuine leaks in application code. By using pprof to rule out logic errors and applying GOMEMLIMIT to bound the heap, you can keep your pods stable under load. Always validate your findings with a stress test before promoting changes to production.
