Help! My LLM Is a Resource Hog: How We Tamed Inference With Kubernetes and Open Source Muscle
Talk at KubeCon + CloudNativeCon North America 2025 on how to optimize LLM inference using Kubernetes and open-source tools.
Speakers: Aditya Soni (Forrester Research) & Hrittik Roy (vCluster)
A client came to us with a problem we're seeing more and more often: their large language model (LLM) was deployed, but inference was painfully slow, GPU usage was unpredictable, and costs were spiraling out of control. Kubernetes alone wasn't enough; they needed a production-ready, efficient, and scalable stack.
In this talk, we'll walk through how we diagnosed and solved the issue using open-source CNCF tools, turning a chaotic deployment into a well-oiled inference machine.
You'll learn how to:
- Use KServe and Kubeflow to serve LLMs reliably.
- Benchmark and auto-scale workloads using Volcano and KEDA while optimizing resource usage and latency.
- Track model performance and drift with Prometheus, Grafana, and OpenTelemetry.
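To give a flavor of what this stack looks like in practice, here is a minimal, hedged sketch of two of the pieces above: a KServe `InferenceService` that requests a GPU for the model server, and a KEDA `ScaledObject` that scales the backing deployment on a Prometheus metric. All names (`llm-demo`, `llm-predictor`, the `storageUri`, the Prometheus address and query) are illustrative placeholders, not the speakers' actual configuration:

```yaml
# Sketch only: resource names, model URI, and metrics are assumptions.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "s3://models/llm-demo"   # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: "1"   # pin each replica to one GPU
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-scaler
spec:
  scaleTargetRef:
    name: llm-predictor    # hypothetical deployment behind the service
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder
        query: sum(rate(inference_requests_total[1m]))    # assumed metric
        threshold: "10"    # scale out above ~10 req/s per replica
```

The general idea, as the talk's bullets suggest, is that KServe handles serving while KEDA reacts to the same Prometheus metrics you already use for observability, so scaling decisions and dashboards share one source of truth.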
We'll share benchmarks, architectures, and lessons from the field, all based on open-source tooling you can try today. Whether you're running LLMs at scale or just exploring GenAI, this talk is packed with real-world solutions to help you do more with less.