Conference · Atlanta, USA

Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle

KubeCon + CloudNativeCon North America 2025

Kubernetes · LLM · GPU · Inference · CNCF

Abstract

LLM inference is the new resource hog. GPUs sit underutilised, model loading dominates cold-start time, and teams ship workloads that look fine in isolation but fall over the moment another tenant lands on the same node. This KubeCon NA 2025 session walks through how we tamed inference on Kubernetes using open source primitives — sensible scheduling, GPU sharing strategies, and tenant isolation patterns that prevent one model from starving another. Expect a tour of the trade-offs between vertical scaling, multi-instance GPUs, and tenant clusters, with examples drawn from production deployments.
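As a taste of the GPU sharing strategies the session covers, here is a minimal sketch of one such pattern: a Pod that requests a single MIG (Multi-Instance GPU) partition instead of an entire GPU, using the extended resource name exposed by the NVIDIA device plugin. The pod name and image are hypothetical placeholders, and the `1g.5gb` profile is just one example slice size.

```yaml
# Sketch only: one inference Pod claiming a MIG slice rather than a full GPU,
# so several tenants can share a single physical device without starving each other.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference          # hypothetical name
spec:
  containers:
  - name: server
    image: example.com/llm-server:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: "1"   # one MIG partition (NVIDIA device plugin resource)
        memory: 16Gi
        cpu: "4"
```

Because the MIG slice is a first-class schedulable resource, the Kubernetes scheduler places the Pod only on nodes advertising a free partition, which is what keeps one tenant's model from landing on, and starving, another's.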
