Today, we’re excited to announce container image caching for Amazon SageMaker AI inference, the next major advancement in our faster scaling optimization journey. This feature reduces end-to-end scale-out latency by up to 2x for generative AI models.
Over the years, Amazon SageMaker AI has continued to reduce latency across these scaling stages: detecting the need to scale out, provisioning instances, downloading container images, fetching model weights, and starting containers. Amazon SageMaker AI previously introduced sub-minute Amazon CloudWatch metrics to help detect scale-out needs up to 6x faster than traditional mechanisms and launched an inference component data caching solution that stores container images and model artifacts on already running instances. This approach reduced cold start latency for inference component scaling operations that reuse existing instances. Together, these features improved auto scaling responsiveness for scenarios where an inference component can be placed on an already provisioned instance and use the existing cache.
With container caching, Amazon SageMaker AI extends these scaling improvements to scenarios where new instances must be launched. This is the case our previous instance-store-based caching couldn’t help with: that cache lives on already running instances, so a newly launched instance still had to download the container image. Container caching removes that download latency. In this post, we show how container caching addresses the container image download bottleneck and demonstrate the performance improvements you can expect.
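To make the effect on the scaling timeline concrete, the following is a minimal sketch that treats scale-out latency as the sum of the stages listed earlier and drops the container image download when the image is cached. It assumes the stages run strictly one after another, which may not hold in practice. The stage keys, the `scale_out_latency` helper, and every duration are hypothetical placeholders chosen for easy arithmetic, not measured SageMaker AI numbers.

```python
# Stage names mirror the scaling stages described in the post.
# The keys themselves are illustrative and not part of any SageMaker AI API.
STAGES = (
    "detect_scale_out",
    "provision_instance",
    "download_container_image",
    "fetch_model_weights",
    "start_container",
)


def scale_out_latency(durations_s, image_cached=False):
    """Sum stage durations in seconds, skipping the image pull when it is cached.

    Assumption: stages are sequential, so total latency is a plain sum.
    """
    total = 0.0
    for stage in STAGES:
        if image_cached and stage == "download_container_image":
            continue  # a cached image means no download step on the new instance
        total += durations_s[stage]
    return total


# Hypothetical durations in seconds, for illustration only.
example = {
    "detect_scale_out": 30.0,
    "provision_instance": 60.0,
    "download_container_image": 120.0,
    "fetch_model_weights": 60.0,
    "start_container": 30.0,
}

baseline = scale_out_latency(example)
cached = scale_out_latency(example, image_cached=True)
print(f"baseline={baseline:.0f}s cached={cached:.0f}s speedup={baseline / cached:.2f}x")
```

In this simple model, the benefit depends on how large the image download is relative to the other stages, which is why large generative AI containers are where caching matters most.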
The following diagram shows the steps during instance scaling when a new instance is launched.
Container image download is often a major contributor to endpoint scale-out latency, especially for generative AI workloads. These workloads use large containers such as SageMaker Large Model Inference (LMI, powered by vLLM), vLLM, and NVIDIA Triton. Caching the container removes the container image pull step during new instance scale-out events for the common endpoint patterns:
The following image shows how the scaling timeline changes for the Qwen3-8B (16 GB) model on an ml.g6.2xlarge instance using the LMI container (17.7 GB compressed).
