Meetup · Colombo, Sri Lanka

Breaking Down Inference Optimization: The Three Different Layers

CNCG Colombo

Inference, LLM, GPU, Kubernetes, CNCF

Abstract

Inference optimization gets discussed as one big problem, which is why teams end up tuning the wrong layer and wondering why latency or cost barely moved. This CNCG Colombo session splits the work into three layers and shows what each one actually controls.

1. Model layer. Quantization, distillation, speculative decoding, and the trade-offs against accuracy.
2. Runtime layer. Batching strategies, KV cache management, paged attention, and how serving engines like vLLM and TGI change the picture.
3. Infrastructure layer. GPU sharing, autoscaling on the right signal, tenant isolation, and the scheduling decisions that determine whether a node is full or wasted.

For each layer: what to measure, what to change first, and where the diminishing returns kick in. Attendees leave with a mental model for diagnosing which layer is the bottleneck before reaching for the next optimization.
