Wednesday, October 29, 2025

Hosting NVIDIA speech NIM models on Amazon SageMaker AI: Parakeet ASR


This post was written with NVIDIA, and the authors would like to thank Adi Margolin, Eliuth Triana, and Maryam Motamedi for their collaboration.

Organizations today face the challenge of processing large volumes of audio data, from customer calls and meeting recordings to podcasts and voice messages, to unlock valuable insights. Automatic Speech Recognition (ASR) is a critical first step in this process, converting speech to text so that further analysis can be performed. However, running ASR at scale is computationally intensive and can be expensive. This is where asynchronous inference on Amazon SageMaker AI comes in. By deploying state-of-the-art ASR models (like NVIDIA Parakeet models) on SageMaker AI with asynchronous endpoints, you can handle large audio files and batch workloads efficiently. With asynchronous inference, long-running requests can be processed in the background (with results delivered later); it also supports auto scaling to zero when there is no work and handles spikes in demand without blocking other jobs.

In this blog post, we explore how to host the NVIDIA Parakeet ASR model on SageMaker AI and integrate it into an asynchronous pipeline for scalable audio processing. We also highlight the benefits of Parakeet's architecture and the NVIDIA Riva toolkit for speech AI, and discuss how to use NVIDIA NIM for deployment on AWS.

NVIDIA speech AI technologies: Parakeet ASR and the Riva framework

NVIDIA offers a comprehensive suite of speech AI technologies, combining high-performance models with efficient deployment solutions. At its core, the Parakeet ASR model family represents state-of-the-art speech recognition capabilities, achieving industry-leading accuracy with low word error rates (WERs). The model's architecture uses the Fast Conformer encoder with a CTC or transducer decoder, enabling 2.4x faster processing than standard Conformers while maintaining accuracy.

NVIDIA speech NIM is a collection of GPU-accelerated microservices for building customizable speech AI applications. NVIDIA speech models deliver accurate transcription and natural, expressive voices in over 36 languages, ideal for customer service, contact centers, accessibility, and global enterprise workflows. Developers can fine-tune and customize models for specific languages, accents, domains, and vocabularies, supporting accuracy and brand voice alignment.

Seamless integration with LLMs and NVIDIA NeMo Retriever makes NVIDIA models ideal for agentic AI applications, helping your organization stand out with safer, high-performing voice AI. The NIM framework delivers these services as containerized solutions, making deployment straightforward through Docker containers that include the required dependencies and optimizations.

This combination of high-performance models and deployment tools provides organizations with a complete solution for implementing speech recognition at scale.

Solution overview

The architecture illustrated in the diagram showcases a comprehensive asynchronous inference pipeline designed specifically for ASR and summarization workloads. The solution provides a robust, scalable, and cost-effective processing pipeline.

Architecture components

The architecture consists of five key components working together to create an efficient audio processing pipeline. At its core, the SageMaker AI asynchronous endpoint hosts the Parakeet ASR model with auto scaling capabilities that can scale to zero when idle for cost optimization.

  1. The data ingestion process begins when audio files are uploaded to Amazon Simple Storage Service (Amazon S3), triggering AWS Lambda functions that process metadata and initiate the workflow.
  2. For event processing, the SageMaker endpoint automatically sends Amazon Simple Notification Service (Amazon SNS) success and failure notifications through separate topics, enabling proper handling of transcriptions.
  3. Successfully transcribed content on Amazon S3 moves to Amazon Bedrock LLMs for intelligent summarization and additional processing such as classification and insights extraction.
  4. Finally, a comprehensive tracking system using Amazon DynamoDB stores workflow status and metadata, enabling real-time monitoring and analytics of the entire pipeline.

Detailed implementation walkthrough

In this section, we provide a detailed walkthrough of the solution implementation.

SageMaker asynchronous endpoint prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage the resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker async hosting instances. In this example, we need one ml.g5.xlarge SageMaker async hosting instance and one ml.g5.xlarge SageMaker notebook instance. You can also choose a different integrated development environment (IDE), but make sure the environment contains GPU compute resources for local testing.

SageMaker asynchronous endpoint configuration

When you deploy a custom model like Parakeet, SageMaker offers a few options:

  • Use a NIM container provided by NVIDIA
  • Use a large model inference (LMI) container
  • Use a prebuilt PyTorch container

We provide examples for all three approaches.

Using an NVIDIA NIM container

NVIDIA NIM provides a streamlined approach to deploying optimized AI models through containerized solutions. Our implementation takes this concept further by creating a unified SageMaker AI endpoint that intelligently routes between HTTP and gRPC protocols to help maximize both performance and capabilities while simplifying the deployment process.

Innovative dual-protocol architecture

The key innovation is the combined HTTP + gRPC architecture that exposes a single SageMaker AI endpoint with intelligent routing capabilities. This design addresses the common challenge of choosing between protocol efficiency and feature completeness by automatically selecting the optimal transport method. The HTTP route is optimized for simple transcription tasks with files under 5 MB, providing faster processing and lower latency for common use cases. Meanwhile, the gRPC route supports larger files (SageMaker AI real-time endpoints support a maximum payload of 25 MB) and advanced features like speaker diarization with precise word-level timing information. The system's auto-routing functionality analyzes incoming requests to determine file size and requested features, then automatically selects the most appropriate protocol without requiring manual configuration. For applications that need explicit control, the endpoint also supports forced routing through /invocations/http for simple transcription or /invocations/grpc when speaker diarization is required. This flexibility allows both automatic optimization and fine-grained control based on specific application requirements.

Advanced speech recognition and speaker diarization capabilities

The NIM container enables a comprehensive audio processing pipeline that seamlessly combines speech recognition with speaker identification through the NVIDIA Riva built-in capabilities. The container handles audio preprocessing, including format conversion and segmentation, while ASR and speaker diarization processes run concurrently on the same audio stream. Results are automatically aligned using overlapping time segments, with each transcribed segment receiving an appropriate speaker label (for example, Speaker_0, Speaker_1). The inference handler processes audio files through the entire pipeline, initializing both the ASR and speaker diarization services, running them in parallel, and aligning transcription segments with speaker labels. The output includes the full transcription, timestamped segments with speaker attribution, confidence scores, and the total speaker count in a structured JSON format.
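To make that output shape concrete, the following is an illustrative example of what a diarized transcription result might look like; the field names and values are assumptions based on the description above, not the container's exact schema.

# Illustrative (assumed) structure of a diarized transcription result
example_result = {
    "transcription": "good morning everyone thanks for joining the call",
    "segments": [
        {"start": 0.0, "end": 2.1, "speaker": "Speaker_0",
         "text": "good morning everyone", "confidence": 0.96},
        {"start": 2.1, "end": 4.4, "speaker": "Speaker_1",
         "text": "thanks for joining the call", "confidence": 0.93},
    ],
    "speaker_count": 2,
}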

Implementation and deployment

The implementation extends the NVIDIA parakeet-1-1b-ctc-en-us NIM container as the foundation, adding a Python aiohttp server that seamlessly manages the entire NIM lifecycle by automatically starting and monitoring the service. The server handles protocol adaptation by translating SageMaker inference requests into the appropriate NIM APIs, implements the intelligent routing logic that analyzes request characteristics, and provides comprehensive error handling with detailed error messages and fallback mechanisms for robust production deployment. The containerized solution streamlines deployment through standard Docker and AWS CLI commands, featuring a pre-configured Dockerfile with the required dependencies and optimizations. The system accepts multiple input formats, including multipart form-data (recommended for maximum compatibility), JSON with base64 encoding for simple integration scenarios, and raw binary uploads for direct audio processing.
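As a rough sketch of one way to create the endpoint, the snippet below deploys a bring-your-own-container image with the SageMaker Python SDK rather than raw CLI commands; the image URI, IAM role, endpoint name, and instance type are placeholders, and the actual notebook in the samples repository may differ in detail.

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

# BYOC image with the extended NIM container, pushed to Amazon ECR (placeholder URI)
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/parakeet-nim-byoc:latest"

model = Model(
    image_uri=image_uri,
    role=role,
    predictor_cls=Predictor,
    sagemaker_session=session,
)

# Deploy as a real-time endpoint on a GPU instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="parakeet-nim-endpoint",
)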

For detailed implementation instructions and working examples, teams can reference the full implementation and deployment notebook in the AWS samples repository, which provides comprehensive guidance on deploying Parakeet ASR with NIM on SageMaker AI using the bring your own container (BYOC) approach. For organizations with specific architectural preferences, separate HTTP-only and gRPC-only implementations are also available, providing simpler deployment models for teams with well-defined use cases, while the combined implementation offers maximum flexibility and automatic optimization.

AWS customers can deploy these models either as production-grade NVIDIA NIM containers directly from SageMaker Marketplace or JumpStart, or as open source NVIDIA models available on Hugging Face, which can be deployed through custom containers on SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS). This lets organizations choose between fully managed, enterprise-tier endpoints with auto scaling and security, or flexible open source development for research or constrained use cases.

Using an AWS LMI container

LMI containers are designed to simplify hosting large models on AWS. These containers include optimized inference engines like vLLM, FasterTransformer, or TensorRT-LLM that can automatically handle concerns like model parallelism, quantization, and batching for large models. The LMI container is essentially a pre-configured Docker image that runs an inference server (for example, a Python server with these optimizations) and lets you specify model parameters through environment variables.

To use the LMI container for Parakeet, we would typically:

  1. Choose the appropriate LMI image: AWS provides different LMI images for different frameworks. For Parakeet, we would use the DJLServing image for efficient inference. Alternatively, NVIDIA Triton Inference Server (which Riva uses) is an option if we package the model in ONNX or TensorRT format.
  2. Specify the model configuration: With LMI, we typically provide a model_id (if pulling from the Hugging Face Hub) or a path to our model, along with configuration for how to load it (number of GPUs, tensor parallel degree, quantization bits). The container then downloads the model and initializes it with the required settings. We can also download our own model files from Amazon S3 instead of using the Hub.
  3. Define the inference handler: The LMI container might require a small handler script or configuration to tell it how to process requests. For ASR, this could involve reading the audio input, passing it to the model, and returning text.

AWS LMI containers deliver high performance and scalability through advanced optimization techniques, including continuous batching, tensor parallelism, and state-of-the-art quantization methods. LMI containers integrate multiple inference backends (vLLM, TensorRT-LLM) through a single unified configuration, helping you seamlessly experiment with and switch between frameworks to find the optimal performance stack for your specific use case.
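As a minimal sketch of the general LMI pattern only, the snippet below creates a model from an LMI (DJLServing) image configured through environment variables; the image URI, the Hugging Face model ID, and the specific OPTION_* settings are illustrative assumptions and would need to be adapted for an ASR model like Parakeet.

import sagemaker
from sagemaker.model import Model

role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

# LMI (DJLServing) image URI -- look up the current URI for your Region in the AWS documentation
lmi_image_uri = "<lmi-djl-serving-image-uri>"

lmi_model = Model(
    image_uri=lmi_image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "nvidia/parakeet-ctc-1.1b",   # assumed model ID on the Hugging Face Hub
        "OPTION_TENSOR_PARALLEL_DEGREE": "1",        # single-GPU deployment
        "OPTION_DTYPE": "fp16",
    },
)

lmi_model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")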

Using a SageMaker PyTorch container

SageMaker offers PyTorch Deep Learning Containers (DLCs) that come with PyTorch and many common libraries pre-installed. In this example, we demonstrate how to extend the prebuilt container to install the necessary packages for the model. You can download the model directly from Hugging Face during endpoint creation, or download the Parakeet model artifacts, package them with the necessary configuration files into a model.tar.gz archive, and upload it to Amazon S3. Along with the model artifacts, an inference.py script is required as the entry point script to define model loading and inference logic, including audio preprocessing and transcription handling. When you use the SageMaker Python SDK to create a PyTorchModel, the SDK automatically repackages the model archive to include the inference script under /opt/ml/model/code/inference.py, while keeping the model artifacts in /opt/ml/model/ on the endpoint. Once the endpoint is deployed successfully, it can be invoked through the predict API by sending audio files as byte streams to get transcription results.
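For illustration, the skeleton below shows what such an entry point script might look like using the standard SageMaker PyTorch handler interface (model_fn, input_fn, predict_fn, output_fn); the use of the NeMo toolkit and the model name are assumptions, and real audio preprocessing would be more involved than this sketch.

# inference.py -- minimal sketch of a SageMaker PyTorch entry point for Parakeet
import json
import tempfile

import nemo.collections.asr as nemo_asr  # assumes the NeMo toolkit is installed in the container


def model_fn(model_dir):
    # Load the ASR model; from_pretrained downloads the checkpoint,
    # or use restore_from() with a .nemo file packaged in model_dir.
    return nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b")


def input_fn(request_body, content_type="audio/wav"):
    # Persist the incoming audio bytes to a temporary file for the model to read
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.write(request_body)
    tmp.close()
    return tmp.name


def predict_fn(audio_path, model):
    # transcribe() accepts a list of audio file paths and returns one result per file
    return model.transcribe([audio_path])


def output_fn(prediction, accept="application/json"):
    return json.dumps({"transcription": [str(p) for p in prediction]})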

For SageMaker real-time endpoints, the maximum payload size currently allowed is 25 MB, so make sure the container is also configured to accept that maximum request size. However, if you plan to use the same model behind an asynchronous endpoint, the maximum file size the async endpoint supports is 1 GB and the response time can be up to 1 hour, so you should configure the container to be prepared for this payload size and timeout. When using the PyTorch containers, here are some key configuration parameters to consider (a deployment sketch using these settings follows the list):

  • SAGEMAKER_MODEL_SERVER_WORKERS: Set the number of TorchServe workers, which determines how many copies of the model are loaded into GPU memory.
  • TS_DEFAULT_RESPONSE_TIMEOUT: Set the timeout for TorchServe workers; for long audio processing, you can set it to a higher number.
  • TS_MAX_REQUEST_SIZE: Set the byte size limit for requests to 1 GB for async endpoints.
  • TS_MAX_RESPONSE_SIZE: Set the byte size limit for responses.
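The following is a minimal sketch of how these variables might be set when deploying the model to an asynchronous endpoint with the SageMaker Python SDK; the S3 locations, IAM role, framework versions, and timeout values are placeholders and assumptions rather than the exact configuration from the example notebook.

from sagemaker.pytorch import PyTorchModel
from sagemaker.async_inference import AsyncInferenceConfig

role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

pytorch_model = PyTorchModel(
    model_data="s3://<bucket>/parakeet/model.tar.gz",  # placeholder model archive
    role=role,
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_RESPONSE_TIMEOUT": "3600",   # allow long audio processing
        "TS_MAX_REQUEST_SIZE": "1073741824",     # 1 GB request payloads for async inference
        "TS_MAX_RESPONSE_SIZE": "1073741824",
    },
)

async_predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://<bucket>/async-output/",     # placeholder
        failure_path="s3://<bucket>/async-failures/",  # placeholder
    ),
)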

In the example notebook, we also showcase how to use the SageMaker local session provided by the SageMaker Python SDK. It lets you create estimators and run training, processing, and inference jobs locally using Docker containers instead of managed AWS infrastructure, providing a fast way to test and debug your machine learning scripts before scaling to production.
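For completeness, here is a small sketch of what local mode can look like; it assumes Docker is available on the machine, reuses a PyTorchModel definition like the one above, and uses placeholder paths.

from sagemaker.local import LocalSession
from sagemaker.pytorch import PyTorchModel

local_session = LocalSession()
local_session.config = {"local": {"local_code": True}}

local_model = PyTorchModel(
    model_data="file://./model.tar.gz",  # placeholder local archive
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",  # placeholder
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
    sagemaker_session=local_session,
)

# instance_type="local_gpu" runs the container on the local GPU instead of managed infrastructure
local_predictor = local_model.deploy(initial_instance_count=1, instance_type="local_gpu")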

CDK pipeline prerequisites

Before deploying this solution, make sure you have:

  1. AWS CLI configured with appropriate permissions – Installation Guide
  2. AWS Cloud Development Kit (AWS CDK) installed – Installation Guide
  3. Node.js 18+ and Python 3.9+ installed
  4. Docker – Installation Guide
  5. SageMaker endpoint deployed with your ML model (Parakeet ASR models or similar)
  6. Amazon SNS topics created for success and failure notifications

CDK pipeline setup

The solution deployment begins with provisioning the required AWS resources using Infrastructure as Code (IaC) principles. AWS CDK creates the foundational components, including:

  • DynamoDB table: Configured for on-demand capacity to track invocation metadata, processing status, and results
  • S3 buckets: Secure storage for input audio files, transcription outputs, and summarization results
  • SNS topics: Separate topics for success and failure event handling
  • Lambda functions: Serverless functions for metadata processing, status updates, and workflow orchestration
  • IAM roles and policies: Appropriate permissions for cross-service communication and resource access

Environment setup

Clone the repository and install dependencies:

# Install degit, a library for downloading specific subdirectories
npm install -g degit

# Clone just the specific folder
npx degit aws-samples/genai-ml-platform-examples/infrastructure/automated-speech-recognition-async-pipeline-sagemaker-ai/sagemaker-async-batch-inference-cdk sagemaker-async-batch-inference-cdk

# Navigate to the folder
cd sagemaker-async-batch-inference-cdk

# Install Node.js dependencies
npm install

# Set up the Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate
pip install -r requirements.txt

Configuration

Update the SageMaker endpoint configuration in bin/aws-blog-sagemaker.ts:

vim bin/aws-blog-sagemaker.ts 

# Change the endpoint name
sageMakerConfig: { 
    endpointName: 'your-sagemaker-endpoint-name',     
    enableSageMakerAccess: true 
}

If you have followed the notebook to deploy the endpoint, you should have already created the two SNS topics. Otherwise, make sure you create the correct SNS topics using the CLI:

# Create SNS topics
aws sns create-topic --name success-inf
aws sns create-topic --name failed-inf

Build and deploy

Before you deploy the AWS CloudFormation template, make sure Docker is running.

# Compile TypeScript to JavaScript
npm run build

# Bootstrap CDK (first time only)
npx cdk bootstrap

# Deploy the stack
npx cdk deploy

Verify deployment

After successful deployment, note the output values:

  • DynamoDB table name for status tracking
  • Lambda function ARNs for processing and status updates
  • SNS topic ARNs for notifications

Submit audio file for processing

Processing audio files

Update the upload_audio_invoke_lambda.sh script with your Lambda function ARN and S3 bucket:

LAMBDA_ARN="YOUR_LAMBDA_FUNCTION_ARN"
S3_BUCKET="YOUR_S3_BUCKET_ARN"

Run the script:

AWS_PROFILE=default ./scripts/upload_audio_invoke_lambda.sh

This script will:

  • Download a sample audio file
  • Upload the audio file to your S3 bucket
  • Send the bucket path to Lambda and trigger the transcription and summarization pipeline

Monitoring progress

You can check the results in the DynamoDB table using the following command:

aws dynamodb scan --table-name YOUR_DYNAMODB_TABLE_NAME

Check the processing status in the DynamoDB table; the possible status values are listed below, followed by a small programmatic example:

  • submitted: Successfully queued for inference
  • completed: Transcription completed successfully
  • failed: Processing encountered an error
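If you prefer to check the status programmatically, a small boto3 sketch follows; the table name placeholder and the status attribute name are assumptions that should be adjusted to match the deployed stack.

import boto3
from boto3.dynamodb.conditions import Attr

# Placeholder table name from the CDK stack outputs
table = boto3.resource("dynamodb").Table("YOUR_DYNAMODB_TABLE_NAME")

# List items that completed transcription (attribute name is an assumption)
completed = table.scan(FilterExpression=Attr("status").eq("completed"))
for item in completed["Items"]:
    print(item)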

Audio processing and workflow orchestration

The core processing workflow follows an event-driven pattern:

Initial processing and metadata extraction: When audio files are uploaded to S3, the triggered Lambda function analyzes the file metadata, validates format compatibility, and creates detailed invocation records in DynamoDB. This facilitates comprehensive tracking from the moment audio content enters the system.

Asynchronous speech recognition: Audio files are processed by the SageMaker endpoint using optimized ASR models. The asynchronous process can handle various file sizes and durations without timeout concerns. Each processing request is assigned a unique identifier for tracking purposes.
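The submission step performed by the Lambda function can be pictured with the following boto3 sketch; the endpoint name and S3 input location are placeholders, and the actual Lambda code in the repository may differ in detail.

import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Submit an audio file already stored in S3 to the asynchronous endpoint (placeholders)
response = sm_runtime.invoke_endpoint_async(
    EndpointName="your-sagemaker-endpoint-name",
    InputLocation="s3://<bucket>/input-audio/sample.wav",
    ContentType="audio/wav",
    InvocationTimeoutSeconds=3600,  # async requests can run for up to an hour
)

# The unique inference ID and output location are used to track the request in DynamoDB
print(response["InferenceId"], response["OutputLocation"])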

Success path processing: Upon successful transcription, the system automatically initiates the summarization workflow. The transcribed text is sent to Amazon Bedrock, where advanced language models generate contextually appropriate summaries based on configurable parameters such as summary length, focus areas, and output format.
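To illustrate the summarization step, here is a brief sketch using the Amazon Bedrock Converse API; the model ID and prompt are examples only, not the exact configuration used by the pipeline.

import boto3

bedrock = boto3.client("bedrock-runtime")

transcript = "..."  # transcription text retrieved from S3

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [{"text": f"Summarize the following call transcript in five bullet points:\n\n{transcript}"}],
    }],
)

summary = response["output"]["message"]["content"][0]["text"]
print(summary)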

Error handling and recovery: Failed processing attempts trigger dedicated Lambda functions that log detailed error information, update the processing status, and can initiate retry logic for transient failures. This robust error handling results in minimal data loss and provides clear visibility into processing issues.

Real-world applications

Customer service analytics: Organizations can process thousands of customer service call recordings to generate transcriptions and summaries, enabling sentiment analysis, quality assurance, and insights extraction at scale.

Meeting and conference processing: Business teams can automatically transcribe and summarize meeting recordings, creating searchable archives and actionable summaries for participants and stakeholders.

Media and content processing: Media companies can process podcast episodes, interviews, and video content to generate transcriptions and summaries for improved accessibility and content discoverability.

Compliance and legal documentation: Legal and compliance teams can process recorded depositions, hearings, and interviews to create accurate transcriptions and summaries for case preparation and documentation.

Cleanup

Once you have finished using the solution, remove the SageMaker endpoints to prevent incurring additional costs. You can use the provided code to delete the real-time and asynchronous inference endpoints, respectively:

# Delete the real-time inference endpoint
real_time_predictor.delete_endpoint()

# Delete the asynchronous inference endpoint
async_predictor.delete_endpoint()

You should also delete all the resources created by the CDK stack.

# Delete CDK Stack
cdk destroy

Conclusion

The integration of powerful NVIDIA speech AI technologies with AWS cloud infrastructure creates a comprehensive solution for large-scale audio processing. By combining Parakeet ASR's industry-leading accuracy and speed with NVIDIA Riva's optimized deployment framework on the Amazon SageMaker asynchronous inference pipeline, organizations can achieve both high-performance speech recognition and cost-effective scaling. The solution uses AWS managed services (SageMaker AI, Lambda, S3, and Bedrock) to create an automated, scalable pipeline for processing audio content. With features like auto scaling to zero, comprehensive error handling, and real-time monitoring through DynamoDB, organizations can focus on extracting business value from their audio content rather than managing infrastructure complexity. Whether processing customer service calls, meeting recordings, or media content, this architecture delivers reliable, efficient, and cost-effective audio processing capabilities. To experience the full potential of this solution, we encourage you to explore it and reach out to us if you have specific business requirements and would like to customize the solution for your use case.


About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Tony Trinh is a Senior AI/ML Specialist Architect at AWS. With 13+ years of experience in the IT industry, Tony specializes in architecting scalable, compliance-driven AI and ML solutions, particularly in generative AI, MLOps, and cloud-native data platforms. As part of his PhD, he is doing research in multimodal AI and spatial AI. In his spare time, Tony enjoys hiking, swimming, and experimenting with home improvement.

Alick Wong is a Senior Solutions Architect at Amazon Web Services, where he helps startups and digital-native businesses modernize, optimize, and scale their platforms in the cloud. Drawing on his experience as a former startup CTO, he works closely with founders and engineering leaders to drive growth and innovation on AWS.

Andrew Smith is a Sr. Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS, with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Derrick Choo is a Senior AI/ML Specialist Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multimodal systems.

Tim Ma is a Principal Specialist in Generative AI at AWS, where he collaborates with customers to design and deploy cutting-edge machine learning solutions. He also leads go-to-market strategies for generative AI services, helping organizations harness the potential of advanced AI technologies.

Curt Lockhart is an AI Solutions Architect at NVIDIA, where he helps customers deploy language and vision models to build end-to-end AI workflows using NVIDIA's tooling on AWS. He enjoys making complex AI feel approachable and spending his time exploring the art, music, and outdoors of the Pacific Northwest.

Francesco Ciannella is a senior engineer at NVIDIA, where he works on conversational AI solutions built around large language models (LLMs) and audio language models (ALMs). He holds an M.S. in engineering of telecommunications from the University of Rome "La Sapienza" and an M.S. in language technologies from the School of Computer Science at Carnegie Mellon University.
