Deploying massive language fashions (LLMs) at scale on Amazon SageMaker AI Inference makes observability a crucial pillar of any manufacturing machine studying (ML) technique. Not like typical software program that returns deterministic outputs, LLMs generate variable, free-form responses which can be tough to validate with customary metrics. LLM output high quality can change over time as enter distributions shift, and high quality monitoring helps detect these adjustments early. For generative AI workloads, observability additionally contains the mannequin serving infrastructure, the place unpredictable token consumption, GPU reminiscence strain, and latency spikes make capability planning and value management a shifting goal.
A complete observability method for LLM inference should deal with two distinct however complementary dimensions: mannequin serving infrastructure (amount) and LLM high quality. Amount monitoring focuses on the operational well being of inference infrastructure, monitoring request throughput and useful resource utilization. These metrics assist detect bottlenecks, right-size compute assets, and management prices. High quality monitoring focuses on the efficiency of the LLMs themselves, evaluating response accuracy, compliance, and consistency over time.
Most groups construct LLM observability in phases. The primary stage establishes visibility into core operational metrics akin to latency, errors, and useful resource utilization. These alerts affirm the reliability of inference endpoints. The subsequent stage provides LLM high quality by way of sampling and analysis, which floor points akin to mannequin drift, degradation, or surprising habits in generated responses.
With each dimensions in place, you’ll be able to introduce thresholds and automatic alerts that mix infrastructure and high quality alerts. Over time, the apply extends to comparative evaluation throughout fashions and configurations so you’ll be able to repeatedly tune value, efficiency, and output high quality. Amount and high quality metrics are interdependent: an endpoint can seem operationally wholesome whereas producing poor or unsafe responses, or it could actually ship high-quality outputs whereas operating inefficiently on over-provisioned infrastructure. Manufacturing-grade LLM observability emerges when each dimensions are monitored, correlated, and optimized collectively.
This put up demonstrates a complete observability answer utilizing Amazon Managed Grafana dashboards that gives a holistic view of each high quality and amount for LLMs served on Amazon SageMaker AI endpoints with inference parts.
Workflow structure
For full visibility into LLMs throughout the 2 monitoring dimensions of amount and high quality, we constructed an answer utilizing three core AWS providers, every chosen for a particular function in LLM observability. The next high-level information circulation diagram reveals the three core parts: Amazon SageMaker AI endpoints with inference parts, Amazon CloudWatch, and Amazon Managed Grafana.
Amazon SageMaker AI Inference Parts function the mannequin internet hosting layer. A single SageMaker AI endpoint can host a number of inference parts, every operating a unique LLM (for instance, gpt-oss-20b and Qwen2.5-7B-Instruct as proven within the previous structure). Inference parts allow you to deploy, scale, and handle a number of fashions on shared infrastructure whereas conserving per-model isolation for site visitors routing, scaling insurance policies, and metric attribution.
Amazon CloudWatch serves because the centralized metrics retailer. It receives two distinct streams of information from every inference element: enhanced metrics and customized high quality metrics. Enhanced metrics are revealed mechanically by SageMaker AI once you allow them on the endpoint configuration. The metrics embrace instance-level, container-level, and per-GPU dimensions, supplying you with granular visibility into invocation counts, latency, error charges, and GPU/CPU utilization per mannequin. Enhanced metrics are logged to the /aws/sagemaker/InferenceComponents/ namespace (for instance, /aws/sagemaker/InferenceComponents/gpt-oss-20b). For particulars, see the Amazon SageMaker AI enhanced metrics documentation and the enhanced metrics deep-dive weblog put up.
Customized high quality metrics seize LLM output high quality, akin to composite high quality scores, security scores, and analysis latency. These are revealed to a separate user-configured CloudWatch namespace at /aws/sagemaker/inference-quality/, which retains high quality alerts cleanly separated from operational metrics. The next desk summarizes the 2 CloudWatch metric namespaces.
| CloudWatch Metric Namespace | Captures | Goal |
| /aws/sagemaker/InferenceComponents/ | Enhanced metrics: instance-level, container-level, and per-GPU dimensions | Supplies granular visibility into invocation counts, latency, error charges, and GPU/CPU utilization per mannequin |
| /aws/sagemaker/inference-quality/ | Customized high quality metrics: composite high quality scores, security scores, and analysis latency | Captures LLM output high quality alerts, stored cleanly separated from operational metrics |
Amazon Managed Grafana gives the visualization layer, utilizing CloudWatch as its native information supply. On this put up, we describe two devoted dashboards that floor SageMaker AI endpoint LLM amount and high quality metrics, as proven within the following screenshot.

The Grafana quantity-based dashboard shows GPU reminiscence utilization, CPU utilization, and invocation metrics per inference element. The standard-based Grafana dashboard shows composite high quality scores, security scores, and high quality analysis latency, in contrast throughout fashions, as proven within the following picture. You possibly can lengthen the Grafana dashboard by creating new dashboards based mostly on your online business or software use instances.

Monitoring amount
Amount monitoring provides you operational visibility into LLMs served on SageMaker AI endpoints. With out it, you’ll be able to lose monitor of site visitors patterns, useful resource saturation, value attribution, and scaling habits, all of which immediately influence availability and spend. For multi-model endpoints utilizing inference parts, amount monitoring solutions crucial operational questions: What number of requests is every mannequin serving? Are GPUs right-sized or over-provisioned? Which mannequin is driving value?
Past infrastructure metrics, amount monitoring helps you assess the operational well being and enterprise influence of your LLM inference parts throughout efficiency and reliability, useful resource utilization, and any enterprise metrics particular to your group. Collectively, these views present the place latency is happening, whether or not value will increase are pushed by site visitors development or inefficient GPU allocation, and whether or not scaling insurance policies are responding appropriately to demand.
The next Amazon Managed Grafana dashboard samples put these amount monitoring dimensions into apply throughout three key areas. The primary group of panels covers LLM invocations and latency. As proven within the following pattern Grafana dashboard output, panels show Mannequin Latency as a time-series development, Complete Invocations evaluating fashions (for instance, gpt-oss versus Qwen), and Per-Copy Invocations damaged down for every mannequin. These panels assist operators perceive request throughput patterns, establish latency spikes, and evaluate invocation distribution throughout mannequin copies.

The subsequent panel focuses on GPU compute and reminiscence utilization. The next Grafana dashboard samples current GPU Compute share and GPU Reminiscence share panels for each the fashions (for instance, Qwen and gpt-oss). This cross-model comparability helps ML engineers and website reliability engineers (SREs) rapidly decide whether or not a efficiency problem is GPU-compute-bound or memory-limited, and whether or not one mannequin is consuming disproportionate assets on shared infrastructure.

The third set of panels gives endpoint utilization and value particulars. The next Cluster Overview and Value Grafana dashboard pattern reveals Used GPUs versus Free GPUs and Complete Cases to visualise cluster capability, alongside per-model Value/hour panels (for instance, gpt-oss and Qwen). This view reveals which mannequin is driving value, whether or not GPUs are over-provisioned or saturated, and whether or not auto scaling insurance policies are responding to demand.

The next desk summarizes the three amount monitoring areas lined within the Grafana dashboard, together with their related metrics and goal:
| Metric Kind | Dashboard Metric Names | Captures | Goal |
| Mannequin Invocations & Latency | Mannequin Latency, Complete Invocations (gpt-oss vs Qwen), Per-Copy Invocations (gpt-oss), Per-Copy Invocations (Qwen) | Request throughput, response time, and per-copy invocation distribution | Determine latency spikes, evaluate mannequin throughput, and perceive invocation load balancing throughout copies |
| GPU Compute & Reminiscence Utilization | GPU Compute % (Qwen), GPU Compute % (gpt-oss), GPU Reminiscence % (Qwen), GPU Reminiscence % (gpt-oss) | Per-model GPU compute and reminiscence utilization percentages | Decide if points are GPU-compute-bound or memory-limited, and detect disproportionate useful resource consumption throughout fashions |
| Endpoint Utilization & Value | Used GPUs / Free GPUs / Cases, Value/hour (gpt-oss), Value/hour (Qwen) | Cluster capability, GPU allocation standing, and per-model hourly value attribution | Determine value drivers, detect over-provisioned or saturated GPUs, and validate auto scaling responsiveness |
Collectively, these dashboards give operators a single pane of glass to correlate value, capability, and utilization throughout fashions served on the endpoint. To arrange these dashboards in your atmosphere, observe the AWS samples GitHub repository pattern pocket book and lengthen the answer to create dashboards tailor-made to your group’s necessities.
Monitoring high quality
Whereas amount metrics inform you whether or not the LLM serving infrastructure is wholesome, high quality metrics inform you whether or not LLMs are nonetheless performing as anticipated. LLM efficiency can degrade silently over time due to adjustments in enter immediate distributions, idea drift, or shifts in real-world circumstances. Not like a latency spike or a 500 error, high quality degradation not often triggers conventional alerts.
High quality monitoring addresses this by evaluating mannequin outputs throughout dimensions that matter to the enterprise: response high quality (relevance to consumer queries, factual accuracy, completeness, and consistency), security and compliance (dangerous content material detection, bias monitoring, privateness compliance, and regulatory adherence), consumer expertise high quality (helpfulness, readability, acceptable tone, and multi-turn dialog coherence), and domain-specific high quality (technical accuracy for specialised domains, quotation high quality for Retrieval Augmented Technology (RAG) functions, and code correctness for programming assistants). Collectively, these dimensions assist governance groups implement guardrails, product homeowners monitor user-facing high quality over time, and information scientists pinpoint whether or not a top quality drop is brought on by a particular immediate sample, a mannequin replace, or a knowledge distribution shift.
The next Amazon Managed Grafana dashboard pattern output demonstrates high quality monitoring throughout the SageMaker AI endpoint inference parts (for instance, LLMs gpt-oss-20b and Qwen2.5-7B-Instruct). The instance dashboard tracks 4 high quality scores, every displayed as a time-series line chart with configurable alert thresholds (proven as dashed traces at roughly 85% and 95%). The primary panel reveals the Composite High quality Rating, an mixture well being indicator that mixes high quality dimensions. This metric shows the general high quality development over time, making it simple to identify sustained degradation versus intermittent high quality drops which will correlate with particular immediate varieties.

The second group of panels tracks particular LLM response high quality metrics: Security Rating, Relevance Rating, and Skilled Tone Rating. Security Rating screens dangerous or non-compliant content material detection. On the dashboard output, this rating stays essentially the most steady of all 4 metrics, persistently hovering throughout the goal threshold band, which signifies dependable security guardrails throughout each fashions. Relevance Rating measures how properly LLM responses deal with consumer intent, serving to groups establish immediate classes which will problem an LLM’s comprehension. Skilled Tone Rating evaluates whether or not outputs keep an acceptable tone for the deployment context.

These high quality scores are computed utilizing analysis metrics akin to an LLM-as-judge sample with configurable analysis rubrics. In these examples, we use Anthropic Claude Sonnet 4.6 served through Amazon Bedrock because the evaluator mannequin, which is permitted below customary Amazon Bedrock service phrases for LLM-as-judge use instances. You possibly can substitute your personal analysis system, offered you affirm the chosen mannequin’s phrases allow evaluating outputs from different fashions, you confirm the data-residency necessities are met, and also you pin the evaluator mannequin to a particular model so high quality scores stay comparable over time.
At a look, you’ll be able to evaluate high quality throughout LLMs aspect by aspect, figuring out which LLM is extra steady, which high quality dimension is the first threat driver, and whether or not high quality points are intermittent (suggesting sensitivity to particular immediate varieties) or sustained (suggesting mannequin degradation). Past visualization, threshold-based alert guidelines are deployed mechanically through Grafana Alerting, dimensioned by the inference element in order that alerts fireplace per inference element. When a top quality rating breaches its configured threshold, you’ll be able to obtain these notifications through Amazon Easy Notification Service (Amazon SNS), enabling speedy SRE triage. Fashionable SRE groups use their present automated triage processes, for instance by integrating these alerts with Slack, PagerDuty, or OpsGenie to chop response occasions to seconds by mechanically correlating logs, classifying alert severity, and prioritizing incidents for mitigation.
The next Grafana Alerting dashboard pattern output reveals threshold-based alert guidelines firing per inference element, with notifications routed to configured channels for rapid SRE triage.

This view provides governance and product groups the proof wanted to make choices about engineering changes, remediation actions, root trigger evaluation, mannequin swapping, or different refinements. To arrange this dashboard in your atmosphere and be taught extra concerning the high quality metrics, observe the AWS samples GitHub repository pocket book.
Conclusion
Observability of LLM inference stacks in manufacturing requires greater than monitoring uptime and error charges. As this put up demonstrated, a complete technique should deal with two complementary dimensions: amount and high quality. Amount covers the operational well being of your infrastructure, together with GPU utilization, value attribution, scaling habits, and request throughput. High quality covers the continued efficiency of your fashions, together with response relevance, security compliance, factual accuracy, {and professional} tone.
By combining Amazon SageMaker AI endpoint enhanced metrics, Amazon CloudWatch, and Amazon Managed Grafana, you’ll be able to construct a unified observability layer with out customized instrumentation. Enhanced metrics provide you with per-model, per-GPU granularity on shared infrastructure. CloudWatch gives a single metrics retailer for each operational and high quality alerts. Grafana brings it collectively in dashboards that serve completely different stakeholders: SREs monitoring useful resource saturation and scaling, governance groups monitoring security and compliance thresholds, and product homeowners evaluating mannequin high quality aspect by aspect.
To get began, take a look at the AWS samples GitHub repository, which incorporates pattern notebooks to configure enhanced metrics, publish customized high quality metrics and alerts, and arrange the Grafana dashboards proven on this put up.
Concerning the authors
