Optimizing LLM inference on Amazon SageMaker AI with BentoML’s LLM- Optimizer

The rise of highly effective giant language fashions (LLMs) that may be consumed by way of API calls has made it remarkably easy to combine synthetic intelligence (AI) capabilities into functions. But regardless of this comfort, a major variety of enterprises are selecting to self-host their very own fashions—accepting the complexity of infrastructure administration, the price of GPUs within the serving stack, and the problem of preserving fashions up to date. The choice to self-host typically comes down to 2 important elements that APIs can’t deal with. First, there may be knowledge sovereignty: the necessity to make it possible for delicate info doesn’t depart the infrastructure, whether or not as a result of regulatory necessities, aggressive issues, or contractual obligations with clients. Second, there may be mannequin customization: the flexibility to effective tune fashions on proprietary knowledge units for industry-specific terminology and workflows or create specialised capabilities that general-purpose APIs can’t supply.

Amazon SageMaker AI addresses the infrastructure complexity of self-hosting by abstracting away the operational burden. By means of managed endpoints, SageMaker AI handles the provisioning, scaling, and monitoring of GPU assets, permitting groups to give attention to mannequin efficiency moderately than infrastructure administration. The system gives inference-optimized containers with well-liked frameworks like vLLM pre-configured for optimum throughput and minimal latency. For example, the Giant Mannequin Inference (LMI) v16 container picture makes use of vLLM v0.10.2, which makes use of the V1 engine and comes with assist for brand spanking new mannequin architectures and new {hardware}, such because the Blackwell/SM100 era. This managed method transforms what usually requires devoted machine studying operations (MLOps) experience right into a deployment course of that takes only a few traces of code.

Reaching optimum efficiency with these managed containers nonetheless requires cautious configuration. Parameters like tensor parallelism diploma, batch dimension, most sequence size, and concurrency limits can dramatically impression each latency and throughput—and discovering the correct steadiness to your particular workload and value constraints is an iterative course of that may be time-consuming.

BentoML’s LLM-Optimizer addresses this problem by enabling systematic benchmarking throughout totally different parameter configurations, changing handbook trial-and-error with an automatic search course of. The instrument means that you can outline constraints comparable to particular latency targets or throughput necessities, making it easy to establish configurations that meet your service degree aims. You should utilize LLM-Optimizer to seek out optimum serving parameters for vLLM domestically or in your improvement atmosphere, apply those self same configurations on to the SageMaker AI endpoint for a seamless transition to manufacturing. This put up illustrates this course of by discovering an optimum deployment for a Qwen-3-4B mannequin on an Amazon SageMaker AI endpoint.

This put up is written for working towards ML engineers, options architects, and system builders who already deploy fashions on Amazon SageMaker or related infrastructure. We assume familiarity with GPU situations, endpoints, and mannequin serving, and give attention to sensible efficiency optimization. The reasons of inference metrics are included not as a newbie tutorial, however to construct shared instinct. For particular parameters like batch dimension & tensor parallelism, and the way they immediately impression value and latency in manufacturing.

Answer overview

The step-by-step breakdown is as follows:

Outline constraints in Jupyter Pocket book: The method begins inside SageMaker AI Studio, the place customers open a Jupyter Pocket book to outline the deployment targets and constraints of the use case. These constraints can embrace goal latency, desired throughput, and output tokens.
Run theoretical and empirical benchmarks with the BentoML LLM-Optimizer: The LLM-Optimizer first runs a theoretical GPU efficiency estimate to establish possible configurations for the chosen {hardware} (on this instance, an ml.g6.12xlarge). It executes benchmark assessments utilizing the vLLM serving engine throughout a number of parameter combos comparable to tensor parallelism, batch dimension, and sequence size to empirically measure latency and throughput. Primarily based on these benchmarks, the optimizer robotically determines essentially the most environment friendly serving configuration that satisfies the supplied constraints.
Generate and deploy optimized configuration in a SageMaker endpoint: As soon as the benchmarking is full, the optimizer returns a JSON configuration file containing the optimum parameter values. This JSON is handed from the Jupyter Pocket book to the SageMaker Endpoint configuration, which deploys the LLM (on this instance, the Qwen/Qwen3-4B mannequin utilizing the vLLM-based LMI container) in a managed HTTP endpoint utilizing the optimum runtime parameters.

The next determine is an summary of the workflow performed all through the put up.

Earlier than leaping into the theoretical underpinnings of inference optimization, it’s price grounding why these ideas matter within the context of real-world deployments. When groups transfer from API-based fashions to self-hosted endpoints, they inherit the accountability for tuning efficiency parameters that immediately have an effect on value and consumer expertise. Understanding how latency and throughput work together by way of the lens of GPU structure and arithmetic depth allows engineers to make these trade-offs intentionally moderately than by trial and error.

Temporary overview of LLM efficiency

Earlier than diving into the sensible software of this workflow, we cowl key ideas that construct instinct for why inference optimization is important for LLM-powered functions. The next primer isn’t tutorial; it’s to supply the psychological mannequin wanted to interpret LLM-Optimizer’s outputs and perceive why sure configurations yield higher outcomes.

Key efficiency metrics

Throughput (requests/second): What number of requests your system completes per second. Larger throughput means serving extra customers concurrently.

Latency (seconds): The entire time from when a request arrives till the entire response is returned. Decrease latency means sooner consumer expertise.

Arithmetic depth: The ratio of computation carried out to knowledge moved. This determines whether or not your workload is:

Reminiscence-bound: Restricted by how briskly you possibly can transfer knowledge (low arithmetic depth)

Compute-bound: Restricted by uncooked GPU processing energy (excessive arithmetic depth)

The roofline mannequin

The roofline mannequin visualizes efficiency by plotting throughput towards arithmetic depth. For deeper content material on the roofline mannequin, go to the AWS Neuron Batching documentation. The mannequin reveals whether or not your software is bottlenecked by reminiscence bandwidth or computational capability. For LLM inference, this mannequin helps establish in the event you’re restricted by:

Reminiscence bandwidth: Information switch between GPU reminiscence and compute models (typical for small batch sizes)
Compute capability: Uncooked floating-point operations (FLOPS) out there on the GPU (typical for big batch sizes)

Roofline model

The throughput-latency trade-off

In observe, optimizing LLM inference follows a basic trade-off: as you improve throughput, latency rises. This occurs as a result of:

Bigger batch sizes → Extra requests processed collectively → Larger throughput
Extra concurrent requests → Longer queue wait instances → Larger latency
Tensor parallelism → Distributes mannequin throughout GPUs → Impacts each metrics in another way

The problem lies find the optimum configuration throughout a number of interdependent parameters:

Tensor parallelism diploma (what number of GPUs to make use of)
Batch dimension (most variety of tokens processed collectively)
Concurrency limits (most variety of simultaneous requests)
KV cache allocation (reminiscence for consideration states)

Every parameter impacts throughput and latency in another way whereas respecting {hardware} constraints like GPU reminiscence and compute bandwidth. This multi-dimensional optimization downside is exactly why LLM-Optimizer is efficacious—it systematically explores the configuration area moderately than counting on handbook trial-and-error.

Latency vs. batch size

For an summary on LLM Inference as a complete, BentoML has supplied useful assets of their LLM Inference Handbook.

Sensible software: Discovering an optimum deployment of Qwen3-4B on Amazon SageMaker AI

Within the following sections, we stroll by way of a hands-on instance of figuring out and making use of optimum serving configurations for LLM deployment. Particularly, we:

Deploy the Qwen/Qwen3-4B mannequin utilizing vLLM on an ml.g6.12xlarge occasion (4x NVIDIA L4 GPUs, 24GB VRAM every).
Outline lifelike workload constraints:
- Goal: 10 requests per second (RPS)
- Enter size: 1,024 tokens
- Output size: 512 tokens
Discover a number of serving parameter combos:
- Tensor parallelism diploma (1, 2, or 4 GPUs)
- Max batched tokens (4K, 8K, 16K)
- Concurrency ranges (32, 64, 128)
Analyze outcomes utilizing:
- Theoretical GPU reminiscence calculations
- Benchmarking knowledge
- Throughput vs. latency trade-offs

By the top, you’ll see how theoretical evaluation, empirical benchmarking, and managed endpoint deployment come collectively to ship a production-ready LLM setup that balances latency, throughput, and value.

Conditions

The next are the conditions wanted to run by way of this instance:

Entry to SageMaker Studio. This makes deployment & inference easy, or an interactive improvement atmosphere (IDE) comparable to PyCharm or Visible Studio Code.
To benchmark and deploy the mannequin, examine that the really helpful occasion varieties are accessible, primarily based on the mannequin dimension. To confirm the required service quotas, full the next steps:
- On the Service Quotas console, beneath AWS Companies, choose Amazon SageMaker.
- Confirm ample quota for the required occasion sort for “endpoint deployment” (within the appropriate area).
- If wanted, request a quota improve/contact AWS for assist.

The next code particulars learn how to set up the required packages:

pip set up vllm
pip set up git+https://github.com/bentoml/llm-optimizer.git

Run the LLM-Optimizer

To get began, instance constraints have to be outlined primarily based on the focused workflow.

Instance constraints:

Enter tokens: 1024
Output tokens: 512
E2E Latency: <= 60 seconds
Throughput: >= 5 RPS

Run the estimate

Step one with llm-optimizer is to run an estimation. Working an estimate analyzes the Qwen/Qwen3-4b mannequin on 4x L4 GPUs and estimate the efficiency for an enter size of 1024 tokens, and an output of 512 tokens. As soon as run, the theoretical bests for latency and throughput are calculated mathematically and returned. The roofline evaluation returned identifies the workloads bottlenecks, and a variety of server and shopper arguments are returned, to be used within the following step, working the precise benchmark.

Below the hood, LLM-Optimizer performs roofline evaluation to estimate LLM serving efficiency. It begins by fetching the mannequin structure from HuggingFace to extract parameters like hidden dimensions, variety of layers, consideration heads, and whole parameters. Utilizing these architectural particulars, it calculates the theoretical FLOPs required for each prefill (processing enter tokens) and decode (producing output tokens) phases, accounting for consideration operations, MLP layers, and KV cache entry patterns. It compares the arithmetic depth (FLOPs per byte moved) of every part towards the GPU’s {hardware} traits—particularly the ratio of compute capability (TFLOPs) to reminiscence bandwidth (TB/s)—to find out whether or not prefill and decode are memory-bound or compute-bound. From this evaluation, the instrument estimates TTFT (time-to-first-token), ITL (inter-token latency), and end-to-end latency at numerous concurrency ranges. It additionally calculates three theoretical concurrency limits: KV cache reminiscence capability, prefill compute capability, and decode throughput capability. Lastly, it generates tuning instructions that sweep throughout totally different tensor parallelism configurations, batch sizes, and concurrency ranges for empirical benchmarking to validate the theoretical predictions.

The next code particulars learn how to run an preliminary estimation primarily based on the chosen constraints:

llm-optimizer estimate  
--model Qwen/Qwen3-4B  
--input-len 1024  
--output-len 512  
--gpu L40  
--num-gpus 4

Anticipated output:

Auto-detected 4 GPU(s)
💡 Inferred precision from mannequin config: bf16

=== Configuration ===
Mannequin: Qwen/Qwen3-4B
GPU: 4x L40
Precision: bf16
Enter/Output: 1024/512 tokens
Goal: throughput

Fetching mannequin configuration...
Mannequin: 3668377600.0B parameters, 36 layers

=== Efficiency Evaluation ===
Finest Latency (concurrency=1):
  TTFT: 16.8 ms
  ITL: 1.4 ms
  E2E: 0.72 s

Finest Throughput (concurrency=1024):
  Output: 21601.0 tokens/s
  Enter: 61062.1 tokens/s
  Requests: 24.71 req/s
  Bottleneck: Reminiscence

=== Roofline Evaluation ===
{Hardware} Ops/Byte Ratio: 195.1 ops/byte
Prefill Arithmetic Depth: 31846.2 ops/byte
Decode Arithmetic Depth: 31.1 ops/byte
Prefill Section: Compute Certain
Decode Section: Reminiscence Certain

=== Concurrency Evaluation ===
KV Cache Reminiscence Restrict: 1258 concurrent requests
Prefill Compute Restrict: 21 concurrent requests
Decode Capability Restrict: 25 concurrent requests
Theoretical General Restrict: 21 concurrent requests
Empirical Optimum Concurrency: 16 concurrent requests

=== Tuning Instructions ===

--- VLLM ---
Easy (concurrency + TP/DP):
  llm-optimizer --framework vllm --model Qwen/Qwen3-4B --gpus 4 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 4), (2, 2), (4, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=3072;max_concurrency=[512, 1024, 1536]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json
Superior (further parameters):
  llm-optimizer --framework vllm --model Qwen/Qwen3-4B --gpus 4 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 4), (2, 2), (4, 1)];max_num_batched_tokens=[16384, 24576, 32768]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=3072;max_concurrency=[512, 1024, 1536]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json

Run the benchmark

With the estimation outputs in hand, an knowledgeable resolution could be made on what parameters to make use of for benchmarking primarily based on the beforehand outlined constraints. Below the hood, LLM-Optimizer transitions from theoretical estimation to empirical validation by launching a distributed benchmarking loop that evaluates real-world serving efficiency on the goal {hardware}. For every permutation of server and shopper arguments, the instrument robotically spins up a vLLM occasion with the desired tensor parallelism, batch dimension, and token limits, then drives load utilizing an artificial or dataset-based request generator (e.g., ShareGPT). Every run captures low-level metrics—time-to-first-token (TTFT), inter-token latency (ITL), end-to-end latency, tokens per second, and GPU reminiscence utilization—throughout concurrent request patterns. These measurements are aggregated right into a Pareto frontier, permitting LLM-Optimizer to establish configurations that greatest steadiness latency and throughput throughout the consumer’s constraints. In essence, this step grounds the sooner theoretical roofline evaluation in actual efficiency knowledge, producing reproducible metrics that immediately inform deployment tuning.

The next code runs the benchmark, utilizing info from the estimate:

llm-optimizer 
  --framework vllm 
  --model Qwen/Qwen3-4B 
  --server-args "tensor_parallel_size=[1,2,4];max_num_batched_tokens=[4096,8192,16384]" 
  --client-args "max_concurrency=[32,64,128];num_prompts=1000;dataset_name=sharegpt" 
  --output-json vllm_results.json

This outputs the next permutations to the vLLM engine for testing. The next are easy calculations on the totally different combos of shopper & server arguments that the benchmark runs:

3 tensor_parallel_size x 3 max_num_batched_tokens settings = 9
3 max_concurrency x 1 num prompts = 3
9 * 3 = 27 totally different assessments

As soon as accomplished, three artifacts are generated:

An HTML file containing a Pareto dashboard of the outcomes: An interactive visualization that highlights the trade-offs between latency and throughput throughout the examined configurations.
A JSON file summarizing the benchmark outcomes: This compact output aggregates the important thing efficiency metrics (e.g., latency, throughput, GPU utilization) for every check permutation and is used for programmatic evaluation or downstream automation.
A JSONL file containing the total report of particular person benchmark runs: Every line represents a single check configuration with detailed metadata, enabling fine-grained inspection, filtering, or customized plotting.

Instance benchmark report output:

{"config": {"client_args": {"max_concurrency": 32, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 32, "period": 178.69010206999883, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195764, "request_throughput": 5.5962808707125085, "input_throughput": 1690.7371840979215, "output_throughput": 1095.6118874637414, "mean_e2e_latency_ms": 5516.473195931989, "median_e2e_latency_ms": 3601.3218250000136, "std_e2e_latency_ms": 6086.249975393793, "p95_e2e_latency_ms": 17959.23558074991, "p99_e2e_latency_ms": 23288.202798799084, "mean_ttft_ms": 134.24923809297798, "median_ttft_ms": 75.87540699933015, "std_ttft_ms": 219.7887602629944, "p95_ttft_ms": 315.9690581494033, "p99_ttft_ms": 1222.5397153301492, "mean_tpot_ms": 28.140094508604655, "median_tpot_ms": 27.28665116875758, "std_tpot_ms": 7.497764233364623, "p95_tpot_ms": 36.30593537913286, "p99_tpot_ms": 48.05242155004177, "mean_itl_ms": 27.641122410215683, "median_itl_ms": 21.38108600047417, "std_itl_ms": 28.983685761892183, "p95_itl_ms": 64.98022639971161, "p99_itl_ms": 133.48110956045272, "concurrency": 30.871733420192484, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}
{"config": {"client_args": {"max_concurrency": 64, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 64, "period": 151.1696548789987, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195768, "request_throughput": 6.615084229704922, "input_throughput": 1998.5360173099916, "output_throughput": 1295.068115070481, "mean_e2e_latency_ms": 8939.159275709007, "median_e2e_latency_ms": 6008.622306500911, "std_e2e_latency_ms": 9605.635172303826, "p95_e2e_latency_ms": 27139.969452801306, "p99_e2e_latency_ms": 37183.75254391998, "mean_ttft_ms": 251.3472756509782, "median_ttft_ms": 116.74506849976751, "std_ttft_ms": 491.6096066277092, "p95_ttft_ms": 1224.981592999029, "p99_ttft_ms": 2902.0978502906837, "mean_tpot_ms": 48.65581712437634, "median_tpot_ms": 45.59879392866151, "std_tpot_ms": 31.47685312628492, "p95_tpot_ms": 65.96288688333136, "p99_tpot_ms": 130.59083745436504, "mean_itl_ms": 44.61668980280019, "median_itl_ms": 33.35350599991216, "std_itl_ms": 44.581804322583615, "p95_itl_ms": 111.47860099845275, "p99_itl_ms": 222.5829249997332, "concurrency": 59.133291551563126, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}
{"config": {"client_args": {"max_concurrency": 128, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 128, "period": 133.0894289429998, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195771, "request_throughput": 7.513744765020255, "input_throughput": 2270.0375409183894, "output_throughput": 1471.0033813718405, "mean_e2e_latency_ms": 14910.240386960006, "median_e2e_latency_ms": 10384.713371499856, "std_e2e_latency_ms": 15223.620712896502, "p95_e2e_latency_ms": 43486.963950149395, "p99_e2e_latency_ms": 61421.81745829036, "mean_ttft_ms": 663.0696945789732, "median_ttft_ms": 189.89979050093098, "std_ttft_ms": 1407.5295299267668, "p95_ttft_ms": 4652.777336598592, "p99_ttft_ms": 7000.883197711337, "mean_tpot_ms": 91.83800469031593, "median_tpot_ms": 77.46479336456856, "std_tpot_ms": 94.19538916493616, "p95_tpot_ms": 125.3206487750731, "p99_tpot_ms": 500.0748501195875, "mean_itl_ms": 73.16857466775902, "median_itl_ms": 49.85373300041829, "std_itl_ms": 72.57371615955182, "p95_itl_ms": 172.3669967985188, "p99_itl_ms": 328.1056552407972, "concurrency": 112.03174065271433, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}

Unpacking the benchmark outcomes, we are able to use the metrics p99 e2e latency and request throughput at numerous ranges of concurrency to make an knowledgeable resolution. The benchmark outcomes revealed that tensor parallelism of 4 throughout the out there GPUs persistently outperformed decrease parallelism settings, with the optimum configuration being tensor_parallel_size=4, max_num_batched_tokens=8192, and max_concurrency=128, attaining 7.51 requests/second and a couple of,270 enter tokens/second—a 2.7x throughput enchancment over the naive single-GPU baseline (2.74 req/s).Whereas this configuration delivered peak throughput, it got here with elevated p99 end-to-end latency of 61.4 seconds beneath heavy load; for latency-sensitive workloads, the candy spot was tensor_parallel_size=4 with max_num_batched_tokens=4096 at average concurrency (32), which maintained sub-24-second p99 latency whereas nonetheless delivering 5.63 req/s—greater than double the baseline throughput. The info demonstrates that transferring from a naive single-GPU setup to optimized 4-way tensor parallelism with tuned batch sizes can unlock substantial efficiency positive factors, with the particular configuration alternative relying on whether or not the deployment prioritizes most throughput or latency assurances.

To visualise the outcomes, LLM-Optimizer gives a handy perform to view the outputs plotted in a Pareto dashboard. The Pareto dashboard could be displayed with the next line of code:

llm-optimizer visualize --data-file vllm_results.json --port 8080 --serve

Bento output

With the proper artifacts now in hand, the mannequin with the proper configurations could be deployed.

Deploying to Amazon SageMaker AI

With the optimum serving parameters recognized by way of LLM-Optimizer, the ultimate step is to deploy the tuned mannequin into manufacturing. Amazon SageMaker AI gives a great atmosphere for this transition, abstracting away the infrastructure complexity of distributed GPU internet hosting whereas preserving fine-grained management over inference parameters. By utilizing LMI containers, builders can deploy open-source frameworks like vLLM at scale, with out managing CUDA dependencies, GPU scheduling, or load balancing manually.

SageMaker AI LMI containers are high-performance Docker photographs particularly designed for LLM inference. These containers combine natively with frameworks comparable to vLLM and TensorRT, and supply built-in assist for multi-GPU tensor parallelism, steady batching, streaming token era, and different optimizations important to low-latency serving. The LMI v16 container used on this instance contains vLLM v0.10.2 and the V1 engine, supporting new mannequin architectures and enhancing each latency and throughput in comparison with earlier variations.

Now that the most effective quantitative values for inference serving have been decided, these configurations could be handed on to the container as atmosphere variables. (please refer right here for in-depth steerage):

env = {
    "HF_MODEL_ID": "Qwen/Qwen3-4B",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_MAX_ROLLING_BATCH_PREFILL_TOKENS": "8192",
    "OPTION_TENSOR_PARALLEL_DEGREE": "4",
}

When these atmosphere variables are utilized, SageMaker robotically injects them into the container’s runtime configuration layer, which initializes the vLLM engine with the specified arguments. Throughout startup, the container downloads the mannequin weights from Hugging Face, configures the GPU topology for tensor parallel execution throughout the out there units (on this case, on the ml.g6.12xlarge occasion), and registers the mannequin with the SageMaker Endpoint Runtime. This makes certain that the mannequin runs with the identical optimized settings validated by LLM-Optimizer, bridging the hole between experimentation and manufacturing deployment.

The next code demonstrates learn how to bundle and deploy the mannequin for real-time inference on SageMaker AI:

image_uri = f"763104351884.dkr.ecr.{area}.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128"
model_name = name_from_base("qwen3-4b-stateful")

create_model = sm_client.create_model(
ModelName = model_name,
ExecutionRoleArn = position,
PrimaryContainer = {
"Picture": image_uri,
"Setting": env,
},
)
model_arn = create_model["ModelArn"]

As soon as the mannequin assemble is created, you possibly can create and activate the endpoint:

create_endpoint = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

After deployment, the endpoint is able to deal with dwell visitors and could be invoked immediately for inference:

request = {
    "messages": [
            {"role": "user", "content": "What is Amazon SageMaker?"}
            ],
            "max_tokens": 50,
            "temperature": 0.75,
            "cease": None
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Physique=json.dumps(request),
    ContentType="software/json",
)
response = response_model["Body"].learn()
response = “Amazon SageMaker is AWS's absolutely managed machine studying service that allows builders and knowledge scientists to construct, prepare, and deploy machine studying fashions at scale.”

These code snippets reveal the deployment circulate conceptually. For a whole end-to-end pattern on deploying an LMI container for actual time inference on SageMaker AI, consult with this instance.

Conclusion

The journey from mannequin choice to manufacturing deployment not must depend on trial and error. By combining BentoML’s LLM-Optimizer with Amazon SageMaker AI, organizations can now transfer from speculation to deployment by way of a data-driven, automated optimization loop. This workflow replaces handbook parameter tuning with a repeatable course of that quantifies efficiency trade-offs, aligns with business-level latency and throughput aims, and deploys the most effective configuration immediately right into a managed inference atmosphere. This workflow addresses a important problem in manufacturing LLM deployment: with out systematic optimization, groups face an costly guessing recreation between over-provisioning GPU assets and risking degraded consumer expertise. As demonstrated on this walkthrough, the efficiency variations are substantial—misconfigured setups can require 2-4x extra GPUs whereas delivering 2-3x increased latency. What might historically take an engineer days or even weeks of handbook trial-and-error testing turns into a couple of hours of automated benchmarking. By combining LLM-Optimizer’s clever configuration search with SageMaker AI’s managed infrastructure, groups could make data-driven deployment selections that immediately impression each cloud prices and consumer satisfaction, focusing their efforts on constructing differentiated AI experiences moderately than tuning inference parameters.

The mix of automated benchmarking and managed large-model deployment represents a major step ahead in making enterprise AI each accessible and economically environment friendly. By leveraging LLM-Optimizer for clever configuration search and SageMaker AI for scalable, fault-tolerant internet hosting, groups can give attention to constructing differentiated AI experiences moderately than managing infrastructure or tuning inference stacks manually. In the end, the most effective LLM configuration isn’t simply the one which runs quickest—it’s the one which meets particular latency, throughput, and value targets in manufacturing. With BentoML’s LLM-Optimizer and Amazon SageMaker AI, that steadiness could be found systematically, reproduced persistently, and deployed confidently.

Further assets

In regards to the authors

Josh Longenecker is a Generative AI/ML Specialist Options Architect at AWS, partnering with clients to architect and deploy cutting-edge AI/ML options. He’s a part of the Neuron Information Science Professional TFC and enthusiastic about pushing boundaries within the quickly evolving AI panorama. Exterior of labor, you’ll discover him on the fitness center, outside, or having fun with time together with his household.

Mohammad Tahsin is a Generative AI/ML Specialist Options Architect at AWS, the place he works with clients to design, optimize, and deploy trendy AI/ML options. He’s enthusiastic about steady studying and staying on the frontier of recent capabilities within the subject. In his free time, he enjoys gaming, digital artwork, and cooking.