A software engineering team implemented a custom Kubernetes admission controller named Veltrim to optimize autoscaling of pods running large language model (LLM) workloads. The controller uses a Lua policy engine to prevent premature scale-down of pods that hold active LLM cache contexts, which reduces latency spikes and worker churn. The team reports that p95 latency fell from 4.1 seconds to 57 ms, worker churn dropped from 180 to 12 pods/hour, and cluster-autoscaler scale-up events shortened from 4.3 minutes to 1.7 minutes.
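To make the scale-down guard concrete, here is a minimal sketch in Python of how a validating admission webhook could refuse to delete a pod that still holds active cache contexts. It is not Veltrim's implementation, which the summary describes as a Lua policy engine. The annotation name and the function name are hypothetical, and the sketch assumes that each pod publishes its active context count as an annotation, which the source does not state.

```python
from typing import Any

# Hypothetical annotation: the source does not say how Veltrim detects
# active cache contexts, so this signal is an assumption for illustration.
ACTIVE_CONTEXTS_ANNOTATION = "veltrim.example/active-cache-contexts"


def review_pod_delete(review: dict[str, Any]) -> dict[str, Any]:
    """Build an AdmissionReview response that denies deleting busy pods."""
    request = review["request"]
    allowed, message = True, ""

    if request.get("operation") == "DELETE":
        # For DELETE requests the existing object arrives in `oldObject`.
        pod = request.get("oldObject") or {}
        annotations = (pod.get("metadata") or {}).get("annotations") or {}
        try:
            active = int(annotations.get(ACTIVE_CONTEXTS_ANNOTATION, "0"))
        except ValueError:
            active = 0  # Treat an unparseable value as "no active contexts".
        if active > 0:
            allowed = False
            message = f"pod holds {active} active LLM cache context(s)"

    response: dict[str, Any] = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"message": message}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

A real deployment would serve this function over HTTPS and register it with a ValidatingWebhookConfiguration scoped to pod DELETE operations. Treating an unparseable annotation as zero is a design choice that favors letting scale-down proceed over blocking it on bad data.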
Use Case
