Organizations increasingly deploy custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving frameworks, such as SGLang, vLLM, or TorchServe, to gain greater control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility introduces a critical technical challenge: response format incompatibility with Strands agents. While these custom serving frameworks typically return responses in OpenAI-compatible formats for broad ecosystem support, Strands agents expect model responses aligned with the Bedrock Messages API format.
The challenge is particularly significant because support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. While the Amazon Bedrock distributed inference engine has supported OpenAI messaging formats since December 2025, the flexibility of SageMaker AI allows customers to host a wide variety of foundation models, some requiring esoteric prompt and response formats that don't conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration even though both systems are technically functional. The solution is to implement custom model parsers that extend SageMakerAIModel and translate the model server's response format into what Strands expects, enabling organizations to use their preferred serving frameworks without sacrificing compatibility with the Strands Agents SDK.
This post demonstrates how to build custom model parsers for Strands agents when working with LLMs hosted on SageMaker that don't natively support the Bedrock Messages API format. We'll walk through deploying Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-creator, then implement a custom parser to integrate it with Strands agents.
Strands Custom Parsers
Strands agents expect model responses in a specific format aligned with the Bedrock Messages API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats, often OpenAI-compatible for broad ecosystem support. Without a custom parser, you'll encounter errors like:
TypeError: 'NoneType' object is not subscriptable
This happens because the default Strands Agents SageMakerAIModel class attempts to parse responses assuming a specific structure that your custom endpoint doesn't provide. In this post and the companion code base, we illustrate how to extend the SageMakerAIModel class with custom parsing logic that translates your model server's response format into what Strands expects.
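To see why the default parsing fails, compare the two response shapes. The snippet below is a minimal, self-contained sketch; the field names follow the public OpenAI and Bedrock Converse response schemas, and no SDK is required:

```python
# An OpenAI-compatible completion, as typically returned by SGLang or vLLM.
openai_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}
    ]
}

# The default parser expects a Bedrock Messages-style payload, roughly:
#   {"output": {"message": {"role": ..., "content": [{"text": ...}]}}}
# Looking up "output" on the OpenAI payload yields None, and subscripting
# None is exactly what raises the TypeError shown above.
output = openai_response.get("output")
try:
    message = output["message"]
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable
```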
Implementation Overview
Our implementation consists of three layers:
- Model Deployment Layer: Llama 3.1 served by SGLang on SageMaker, returning OpenAI-compatible responses
- Parser Layer: A custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1's response format
- Agent Layer: A Strands agent that uses the custom provider for conversational AI, correctly parsing the model's responses
We start by using awslabs/ml-container-creator, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts needed to build LLM serving containers, including Dockerfiles, CodeBuild configurations, and deployment scripts.
Install ml-container-creator
The first step is to build the serving container for our model. We use this open-source project to build the container and generate deployment scripts for it. The following commands illustrate how to install awslabs/ml-container-creator and its dependencies, which include npm and Yeoman. For more information, review the project's README and Wiki to get started.
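The commands below are a sketch of a typical installation, assuming a working Node.js/npm setup; the clone-and-link flow follows standard Yeoman generator conventions, so check the project's README for the exact steps:

```shell
# Install Yeoman globally (requires Node.js and npm)
npm install -g yo

# Clone the generator and link it so `yo` can discover it locally
git clone https://github.com/awslabs/ml-container-creator.git
cd ml-container-creator
npm install
npm link
```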
Generate the Deployment Project
Once installed and linked, the yo command lets you run installed generators; yo ml-container-creator runs the generator we need for this exercise.
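Running the generator launches an interactive prompt; the example answers shown here are illustrative, and the exact prompt names may differ by version:

```shell
# Launch the generator; it prompts for framework, model, and instance details
yo ml-container-creator
# Example answers (illustrative):
#   Serving framework: sglang
#   Model ID: meta-llama/Llama-3.1-8B-Instruct
#   Instance type: ml.g5.2xlarge
```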
The generator creates a complete project structure:
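A generated project typically looks something like the following; the directory names are illustrative, and the actual layout may differ by version:

```
my-sglang-project/
├── Dockerfile            # SGLang serving image definition
├── buildspec.yml         # CodeBuild configuration
├── deploy/
│   ├── submit_build.sh   # builds the image and pushes it to Amazon ECR
│   └── deploy.sh         # creates the SageMaker real-time endpoint
└── test/
    └── test_endpoint.sh  # smoke-tests the deployed endpoint
```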
Build and Deploy
Projects built by awslabs/ml-container-creator include templatized build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.
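In a generated project, the build and deploy typically run as two commands from the project root:

```shell
# Submit the container build to AWS CodeBuild and push the image to Amazon ECR
./deploy/submit_build.sh

# Create the SageMaker AI real-time endpoint from the pushed image
./deploy/deploy.sh
```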
The deployment process:
- CodeBuild builds the Docker image with SGLang and Llama 3.1
- The image is pushed to Amazon ECR
- SageMaker creates a real-time endpoint
- SGLang downloads the model from Hugging Face and loads it into GPU memory
- The endpoint reaches InService status (approximately 10–15 minutes)
We can test the endpoint using ./test/test_endpoint.sh, or with a direct invocation:
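A direct invocation can be sketched with the AWS CLI; the endpoint name is a placeholder, and the OpenAI-style payload shape assumes SGLang's chat completions interface:

```shell
# Invoke the endpoint with an OpenAI-style chat payload (names illustrative)
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name <your-endpoint-name> \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}' \
  response.json

cat response.json
```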
Understanding the Response Format
Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses that adhere to the Bedrock Messages API format. Until late last year, this was a standard compatibility mismatch. Since December 2025, the Amazon Bedrock distributed inference engine has supported OpenAI messaging formats:
However, support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI allows customers to host many kinds of foundation models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt/response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that don't exist in the standard OpenAI Messages format, causing TypeError-style failures.
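The mismatch is easiest to see as a concrete translation. The helper below is a minimal, self-contained sketch that converts a non-streaming OpenAI-style completion into the Bedrock Converse-style shape Strands expects; the field names follow the public schemas of both APIs, and the function name is our own:

```python
def to_bedrock_messages(openai_response: dict) -> dict:
    """Translate an OpenAI-compatible completion into a Bedrock
    Messages-style response (non-streaming sketch)."""
    choice = openai_response["choices"][0]
    usage = openai_response.get("usage", {})
    # Map OpenAI finish_reason values onto Bedrock stopReason values
    stop_reasons = {"stop": "end_turn", "length": "max_tokens"}
    return {
        "output": {
            "message": {
                "role": choice["message"]["role"],
                # Bedrock represents content as a list of blocks, not a string
                "content": [{"text": choice["message"]["content"]}],
            }
        },
        "stopReason": stop_reasons.get(choice.get("finish_reason"), "end_turn"),
        "usage": {
            "inputTokens": usage.get("prompt_tokens", 0),
            "outputTokens": usage.get("completion_tokens", 0),
            "totalTokens": usage.get("total_tokens", 0),
        },
    }
```

The same field-by-field mapping is what a custom parser performs, chunk by chunk, for streaming responses.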
Implementing a Custom Model Parser
Custom model parsers are a feature of the Strands Agents SDK that provides robust compatibility and flexibility for customers building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel:
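The following is a sketch of such a provider. The import path, base-class contract, and stream() signature are assumptions based on the post's description rather than verified SDK details, and a stand-in base class keeps the sketch runnable without the SDK installed; the per-chunk translation logic is the part that carries over to a real implementation:

```python
from typing import Any, AsyncGenerator, Optional

try:
    # Assumed import path for the Strands Agents SDK base class
    from strands.models.sagemaker import SageMakerAIModel
except ImportError:
    class SageMakerAIModel:  # stand-in for environments without strands
        def __init__(self, **kwargs: Any) -> None:
            self.config = kwargs


def parse_openai_chunk(chunk: dict) -> Optional[dict]:
    """Translate one OpenAI-style streaming chunk into a Bedrock
    Converse-style contentBlockDelta event (None if there is no text)."""
    delta = chunk.get("choices", [{}])[0].get("delta", {})
    text = delta.get("content")
    if text is None:
        return None
    return {"contentBlockDelta": {"delta": {"text": text}}}


class LlamaModelProvider(SageMakerAIModel):
    """Custom provider that re-parses the endpoint's OpenAI-style output."""

    async def _iter_chunks(self, request: dict) -> AsyncGenerator[dict, None]:
        # Placeholder chunk source for the sketch; a real provider would
        # stream from the SageMaker runtime client instead.
        for chunk in request.get("chunks", []):
            yield chunk

    async def stream(self, request: dict) -> AsyncGenerator[dict, None]:
        # Translate each raw chunk into the event shape Strands expects,
        # skipping keep-alive chunks that carry no text.
        async for chunk in self._iter_chunks(request):
            event = parse_openai_chunk(chunk)
            if event is not None:
                yield event
```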
The stream method overrides the behavior of SageMakerAIModel and allows the agent to parse responses based on the requirements of the underlying model. While the vast majority of models do support OpenAI's Messages API protocol, this capability enables power users to leverage highly specialized LLMs on SageMaker AI to power agent workloads with the Strands Agents SDK. Once the custom model response logic is built, the Strands Agents SDK makes it simple to initialize agents with custom model providers:
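Initialization can be sketched as follows; the Agent import path and constructor arguments are assumptions based on typical Strands Agents SDK usage, and a stand-in class plus a placeholder provider keep the sketch self-contained when the SDK is not installed:

```python
try:
    from strands import Agent  # assumed import path
except ImportError:
    class Agent:  # stand-in for environments without the Strands SDK
        def __init__(self, model: object, system_prompt: str = "") -> None:
            self.model = model
            self.system_prompt = system_prompt


class DemoModelProvider:
    """Placeholder for the custom LlamaModelProvider built above."""


# Wire the custom provider into an agent; the prompt text is illustrative.
model = DemoModelProvider()
agent = Agent(model=model, system_prompt="You are a helpful assistant.")
```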
The complete implementation of this custom parser, including a Jupyter notebook with detailed explanations and the ml-container-creator deployment project, is available in the companion GitHub repository.
Conclusion
Building custom model parsers for Strands agents lets users leverage different LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining the clean agent interface of Strands.
Key takeaways:
- awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
- Custom parsers bridge the gap between model server response formats and Strands expectations
- The stream() method is the critical integration point for custom providers
About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
