Amazon SageMaker AI provides fully managed real-time inference hosting for machine learning models. You deploy a model to a SageMaker endpoint backed by one or more compute instances, and SageMaker handles provisioning and scaling. SageMaker supports multiple endpoint architectures. This post focuses on the two most relevant to generative AI workloads with detailed observability: Single-model endpoints (SME) and Inference component (IC) endpoints.
Monitoring and troubleshooting generative AI inference endpoints operating at scale is challenging. When your large language model (LLM) endpoint’s P99 latency spikes, you must determine in minutes whether the root cause is GPU memory pressure, a saturated KV cache, unbalanced traffic across Availability Zones, or an auto scaling policy that hasn’t triggered. The shift from training to serving is reshaping how teams deploy LLMs and other generative AI models in production. Machine learning (ML) platform engineers, MLOps teams, and site reliability engineers (SREs) must keep inference endpoints healthy, responsive, and cost-efficient, often across dozens of models and hundreds of GPU instances.
SageMaker endpoints emit metrics such as invocation counts, model latency, and overhead latency to Amazon CloudWatch. These aggregate metrics are useful for understanding overall endpoint health, but as teams scale to multi-model deployments on GPU fleets, they need deeper signals. Amazon SageMaker AI now emits over 100 detailed inference metrics covering GPU health, token-level latency, KV cache pressure, traffic distribution across Availability Zones (AZs), inference component placement, and cold start diagnostics. These metrics flow to a built-in SageMaker Insights dashboard in Amazon CloudWatch, a fully managed observability solution that removes the need for custom Grafana dashboards and Prometheus configuration. The SageMaker Insights dashboard supports both endpoint types and automatically shows IC-specific panels when inference components are detected.
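As a rough illustration of the aggregate metrics mentioned above, the following sketch pulls invocation counts, model latency, and overhead latency for one endpoint through the CloudWatch `GetMetricData` API using boto3. It is not taken from the original post. The endpoint name, the `AllTraffic` variant name, and the choice of the `p99` statistic are assumptions to adapt to your own deployment, and the helper function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_latency_queries(endpoint_name, variant_name="AllTraffic", period_seconds=60):
    """Build GetMetricData queries for the aggregate SageMaker endpoint metrics.

    Assumption: the endpoint has a production variant named `variant_name`.
    """
    dimensions = [
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ]
    # (query id, CloudWatch metric name, statistic); swap "p99" for "Average" if preferred.
    wanted = [
        ("invocations", "Invocations", "Sum"),
        ("model_p99", "ModelLatency", "p99"),        # reported in microseconds
        ("overhead_p99", "OverheadLatency", "p99"),  # reported in microseconds
    ]
    return [
        {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for query_id, metric_name, stat in wanted
    ]


def fetch_last_hour(endpoint_name, variant_name="AllTraffic"):
    """Fetch the last hour of datapoints; requires AWS credentials and boto3."""
    import boto3  # imported lazily so the query builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_latency_queries(endpoint_name, variant_name),
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )
    # Map each query id to a list of (timestamp, value) pairs.
    return {
        result["Id"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

Calling `fetch_last_hour("my-endpoint")` with a hypothetical endpoint name returns one time series per query id. These series describe the endpoint as a whole, which is why they cannot by themselves separate GPU memory pressure from KV cache saturation or uneven traffic across AZs.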
For more details on SageMaker inference, see Deploy models for real-time inference.
SageMaker inference endpoints emit native OpenTelemetry metrics to CloudWatch. The SageMaker Insights dashboard is located in the CloudWatch console under Infrastructure Monitoring → SageMaker Insights. It queries these metrics using PromQL and renders visualizations at the fleet, endpoint, and inference-component level across three tabs: Performance, Capacity, and Reliability.
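To give a feel for the kind of PromQL the dashboard might run, the sketch below builds a standard percentile query over a histogram metric. The metric name `example_request_latency_seconds` and the `endpoint_name` label are hypothetical placeholders, not the names SageMaker actually emits, and the query assumes the metric is exposed as a classic Prometheus histogram with `_bucket` series. Check the metric names and types in your own CloudWatch account before relying on it.

```python
def percentile_query(metric, quantile=0.99, window="5m", group_by=("endpoint_name",)):
    """Return a PromQL expression for a latency percentile from histogram buckets.

    `metric` is the histogram base name (without the `_bucket` suffix).
    All metric and label names passed in are the caller's assumptions.
    """
    if not 0 < quantile < 1:
        raise ValueError("quantile must be between 0 and 1 (exclusive)")
    # `le` must be kept in the grouping so histogram_quantile can see the buckets.
    labels = ", ".join(("le",) + tuple(group_by))
    return (
        f"histogram_quantile({quantile}, "
        f"sum by ({labels}) (rate({metric}_bucket[{window}])))"
    )


# Hypothetical metric name, used purely for illustration.
print(percentile_query("example_request_latency_seconds"))
```

Grouping by an endpoint-level label gives one P99 series per endpoint. Adding a further label to `group_by` (for example an inference component or Availability Zone label, if the emitted metrics carry one) would break the same percentile down to the level at which the dashboard renders its panels.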
For background on the OpenTelemetry and PromQL support in CloudWatch, see Introducing OpenTelemetry PromQL support in Amazon CloudWatch.
