Run NVIDIA Nemotron and OpenAI GPT OSS models on Amazon Bedrock in AWS GovCloud (US)

July 2, 2026


Government agencies running workloads in AWS GovCloud (US) need AI capabilities that keep pace with the commercial sector. At the same time, they can’t compromise the security and compliance controls their missions require. As open-weight foundation models (FMs) move from experimentation into mission systems, two requirements shape every model decision. First, the model must deliver the capability the mission demands. Second, the inference environment must satisfy the agency’s security, compliance, and data residency obligations. For U.S. government agencies, the defense and intelligence community, and the contractors that serve them, these requirements are non-negotiable. Access to advanced open-weight models is essential for work such as intelligence analysis, mission planning, acquisition and contract document review, security log analysis, and compliance automation. This access must not require moving sensitive data outside the boundary that governs it.

We’re excited to introduce US-based frontier open-weight models in AWS GovCloud (US). With this launch, Amazon Bedrock now supports OpenAI’s open-weight GPT OSS models (120B and 20B) and NVIDIA Nemotron models (Nano 9B v2, Nano 12B v2, Nano 30B, and Super 120B). With these new models, you can build and scale generative AI applications with diverse, high-performance FMs. You can use OpenAI’s and NVIDIA’s latest models alongside other leading AI models through a single, unified API, and select the right model for each use case without changing your application code.

AWS GovCloud (US) provides an isolated set of AWS Regions designed to host sensitive data and regulated workloads. The Regions are physically located in the United States and administered exclusively by U.S. citizens. They help customers meet compliance frameworks including FedRAMP High (Provisional Authority to Operate) and DoD Cloud Computing Security Requirements Guide (SRG) Impact Levels 2, 4, and 5. Additional frameworks include International Traffic in Arms Regulations (ITAR) and Criminal Justice Information Services (CJIS).

Amazon Bedrock is a fully managed service for accessing FMs from independent model providers, with inference running entirely on AWS-operated infrastructure.

With Amazon Bedrock, inference runs inside the AWS GovCloud (US) isolation boundary, on infrastructure operated by U.S. citizens on U.S. soil. For details on how Amazon Bedrock handles your data, refer to Data protection in Amazon Bedrock.

OpenAI’s open-weight GPT OSS models and NVIDIA Nemotron open-weight models are now available on Amazon Bedrock in AWS GovCloud (US). This launch delivers two open-weight model families into the AWS GovCloud (US) Regions: OpenAI gpt-oss-120b and gpt-oss-20b, and the NVIDIA Nemotron 3 family, including Nemotron 3 Super 120B alongside the Nemotron 3 Nano models. With these models, you can build agentic applications and mission workflows such as automated security control assessments, multi-document intelligence synthesis, contract and acquisition analysis, and policy compliance checking. All of this runs within the AWS GovCloud (US) compliance boundary.

In this post, we cover the models currently available in AWS GovCloud (US) and their capabilities, the inference options for data residency, the available service tiers, and how to get started.

About the models

This section introduces the two open-weight model families now available in AWS GovCloud (US) and the capabilities that set each apart.

NVIDIA Nemotron

The NVIDIA Nemotron family delivers both small language model (SLM) and large language model (LLM) capabilities, built for compute efficiency and accuracy in specialized agentic AI systems. NVIDIA describes the two models as follows:

NVIDIA Nemotron 3 Super is an open hybrid mixture-of-experts (MoE) model for complex multi-agent workloads. It has 120 billion total parameters and activates only 12 billion parameters per token. This MoE design delivers up to 5 times higher throughput than the previous generation for cost-efficient inference, and its 1-million-token context window gives agents the long-term memory to stay focused across long, multi-step tasks.
NVIDIA Nemotron 3 Nano is a 30-billion-parameter open model that activates approximately 3 billion parameters per token, delivering 4 times higher throughput than the previous generation and reducing reasoning-token generation by up to 60 percent. Its 1-million-token context window supports long-running, multi-step agent workflows.

For the full list of NVIDIA Nemotron models available in AWS GovCloud (US), refer to NVIDIA models on Amazon Bedrock.

OpenAI GPT OSS

OpenAI’s GPT OSS models are open-weight, text-to-text models designed for reasoning, agentic, and developer tasks, with adjustable reasoning effort and support for external tool integration. This post focuses on two variants:

gpt-oss-120b is OpenAI’s 120-billion-parameter open-weight model, designed for production, general-purpose, and high-reasoning use cases.

gpt-oss-20b is the 20-billion-parameter model, designed for lower latency and local or specialized use cases.

Both models provide a 128K-token context window and up to 16K output tokens, and both accept text input and generate text output. Because the weights are open, organizations can independently evaluate the model architecture, review the published model card, and run their own benchmarks on representative workloads. For government teams, this transparency supports organizational risk assessments, allows customer security teams to evaluate model behavior before deployment, and aligns with the zero-trust principles many U.S. government agencies are adopting.

For the full list of OpenAI models available in AWS GovCloud (US), refer to OpenAI models on Amazon Bedrock.

Serverless inference inside your compliance boundary

NVIDIA Nemotron and GPT OSS models on Amazon Bedrock are served by the next-generation inference engine in Amazon Bedrock. To understand the architecture, it helps to distinguish between the engine and the endpoint. The engine is the underlying serving infrastructure, designed with Model Deployment Account isolation and zero operator access. The bedrock-mantle endpoint is the OpenAI-compatible HTTPS API that applications call to send requests to the engine. For agencies, there’s no infrastructure to provision, no GPUs to manage, and no model-deployment expertise required.

The next-generation inference engine is built on a zero operator access design. No operator, whether from AWS, the customer, or a model provider, can access customer data such as inference prompts or completions. Combined with the AWS GovCloud (US) isolation boundary, this gives government teams a strong data-protection foundation. For the technical details, refer to Exploring the zero operator access design of Mantle.

Amazon Bedrock provides two endpoints for invoking these models. The bedrock-mantle endpoint is the OpenAI-compatible API for the next-generation inference engine, so you can call it with the OpenAI Python and TypeScript SDKs. It uses the Chat Completions and Responses APIs. The bedrock-runtime endpoint uses the Converse and InvokeModel APIs through the AWS SDK, with access to native Amazon Bedrock features such as Guardrails. Code samples for both are in the Getting started section.

Regional availability and data residency

Amazon Bedrock offers several options for where your inference requests are processed. In-Region inference keeps every request within a single Region. Geographic (Geo) cross-Region inference routes requests across Regions within a geography for higher throughput, so your data stays within that geographic boundary. For NVIDIA Nemotron and GPT OSS models in AWS GovCloud (US), the options are as follows:

In-Region inference is available in us-gov-west-1 (AWS GovCloud (US-West)).
Geo cross-Region inference is available through a dedicated AWS GovCloud (US) cross-Region inference ID that routes requests across us-gov-west-1 and us-gov-east-1. Traffic stays within the AWS GovCloud (US) boundary, while you gain resilience across both Regions.

All inference for these models stays within the AWS GovCloud (US) boundary. Global cross-Region inference, which routes requests across commercial AWS Regions worldwide, isn’t available in AWS GovCloud (US). You can choose between single-Region and Geo cross-Region inference based on your requirements.
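A small guard can make the data-residency choice explicit in application code. The following is a minimal sketch, not an official helper: the function name mantle_base_url is hypothetical, the URL pattern is taken from the samples later in this post, and the Region sets are configuration you would maintain yourself based on current availability.

```python
# Regions inside the AWS GovCloud (US) boundary, as named in this post.
GOVCLOUD_REGIONS = {"us-gov-west-1", "us-gov-east-1"}
# Regions where this post lists in-Region inference (assumption: update as needed).
IN_REGION_INFERENCE = {"us-gov-west-1"}


def mantle_base_url(region: str) -> str:
    """Return the bedrock-mantle base URL for an in-Region GovCloud Region."""
    if region not in GOVCLOUD_REGIONS:
        raise ValueError(f"{region} is outside the AWS GovCloud (US) boundary")
    if region not in IN_REGION_INFERENCE:
        raise ValueError(f"in-Region inference is not listed for {region}")
    return f"https://bedrock-mantle.{region}.api.aws/v1"


print(mantle_base_url("us-gov-west-1"))
```

Failing fast on an unexpected Region keeps a misconfigured environment variable from silently pointing a workload at an endpoint outside the intended boundary.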

Service tiers

Amazon Bedrock offers several service tiers to match different workload requirements. For all three models, the Standard, Priority, and Flex tiers are supported.

Service tier	Description	Supported
Standard	Pay-per-token access with no commitment	Yes
Priority	Higher throughput for latency-sensitive traffic	Yes
Flex	Lower-cost access for flexible, non-time-sensitive workloads	Yes
Reserved	Dedicated throughput with a term commitment	Not currently available

By default, requests use on-demand inference on the Standard tier, where you pay per token without reserving capacity upfront. For latency-sensitive, customer-facing workloads, you can route individual requests to the Priority tier. For non-time-sensitive work such as model evaluations or batch summarization, the Flex tier offers a lower-cost option. For scaling guidance and how to handle throttling at production volume, refer to Scaling and throughput best practices and the Getting started section.
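Because the tier is chosen per request, the routing decision can live in a small helper. The sketch below builds Chat Completions arguments and adds the service_tier parameter (referenced in the Clean up section) only for non-default workloads. The workload labels and the helper name chat_kwargs are hypothetical, and the tier strings "priority" and "flex" are assumptions to confirm against the Amazon Bedrock documentation.

```python
from typing import Any

# Assumed tier values for the service_tier request parameter; confirm the
# exact strings in the Amazon Bedrock documentation before relying on them.
TIER_BY_WORKLOAD = {
    "interactive": "priority",  # latency-sensitive, customer-facing traffic
    "batch": "flex",            # evaluations, batch summarization
}


def chat_kwargs(model: str, prompt: str, workload: str = "default") -> dict[str, Any]:
    """Build Chat Completions arguments, adding service_tier only when needed."""
    kwargs: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    tier = TIER_BY_WORKLOAD.get(workload)
    if tier is not None:
        # Omitting service_tier leaves the request on the Standard tier.
        kwargs["service_tier"] = tier
    return kwargs


print(chat_kwargs("openai.gpt-oss-120b", "Summarize this log.", "batch"))
```

With an OpenAI SDK client configured for bedrock-mantle, the result could be passed as client.chat.completions.create(**chat_kwargs(...)).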

Getting started in AWS GovCloud (US)

This section walks through invoking the models, starting with the recommended bedrock-mantle endpoint. The examples use the us-gov-west-1 Region, where in-Region inference is available.

Console playground

Navigate to the Amazon Bedrock console in your AWS GovCloud (US) account.
Choose Playground from the left menu under the Test section.
Choose Select model.
Choose the provider (NVIDIA or OpenAI) from the category list, then select the model (for example, NVIDIA Nemotron 3 Super 120B or gpt-oss-120b).
Choose Apply to load the model.
Enter a prompt to test the model.

Using the bedrock-mantle endpoint (recommended)

To use these models, you need an AWS account in AWS GovCloud (US) with permissions to invoke Amazon Bedrock models. For the bedrock-mantle endpoint, you need an Amazon Bedrock API key or standard AWS credentials. The following is a sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockMantleInference",
            "Effect": "Allow",
            "Action": [
                "bedrock-mantle:CreateInference",
                "bedrock-mantle:Get*",
                "bedrock-mantle:List*"
            ],
            "Resource": "arn:aws-us-gov:bedrock-mantle:us-gov-west-1:111122223333:project/*"
        },
        {
            "Sid": "BedrockMantleCallWithBearerToken",
            "Effect": "Allow",
            "Action": "bedrock-mantle:CallWithBearerToken",
            "Resource": "*"
        }
    ]
}

Replace 111122223333 with your AWS account ID and scope the Region to the AWS GovCloud (US) Regions you use. The code examples in this post authenticate with a Bedrock API key, which requires bedrock-mantle:CallWithBearerToken. This action must be scoped to "Resource": "*", as shown in the second statement. To control which identities can generate or use Amazon Bedrock API keys, refer to Control permissions for generating and using Amazon Bedrock API keys. To restrict your organization to approved models only, use a service control policy (SCP).

The following example uses the OpenAI Python SDK to call the bedrock-mantle endpoint. For production workloads, use short-term API keys, which expire automatically (maximum 12 hours) and inherit the permissions of the IAM role that generated them.

import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    # Use the AWS GovCloud (US) Region in the base URL, e.g. us-gov-west-1
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

response = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Explain the benefits of open-weight models for regulated workloads."}
    ],
    reasoning_effort="medium",  # low | medium | high
    max_completion_tokens=512,
)

print(response.choices[0].message.content)

Note: These examples retrieve the Bedrock API key from AWS Secrets Manager. For local development, you can instead read the key from an environment variable, but avoid that pattern in production. Use AWS Secrets Manager or another secrets store.

To call NVIDIA Nemotron 3 Super 120B instead, change the model parameter to nvidia.nemotron-super-3-120b and remove the reasoning_effort parameter (reasoning effort control is specific to GPT OSS). No other code changes are required.

Controlling reasoning effort

GPT OSS models are reasoning models that expose an adjustable reasoning effort. Set the reasoning_effort parameter on the Chat Completions call to low, medium, or high to trade response latency against reasoning depth. Use low for high-volume, latency-sensitive traffic, and high for complex, multi-step reasoning or agentic planning. For reasoning models, prefer max_completion_tokens to bound the response length (the older max_tokens field is still accepted).

Using the Responses API

In addition to Chat Completions, GPT OSS models support the Responses API, OpenAI’s interface for reasoning-style interactions. It takes a single input rather than a messages array. NVIDIA Nemotron 3 Super 120B doesn’t support the Responses API. Use Chat Completions, Converse, or InvokeModel for that model.

import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

response = client.responses.create(
    model="openai.gpt-oss-120b",
    input="Explain the benefits of open-weight models for regulated workloads.",
)

print(response)
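Printing the whole response object shows every output item, including any reasoning items. The following sketch pulls out only the answer text. It assumes the payload follows the OpenAI Responses API shape (an output list whose message items hold output_text parts) and that bedrock-mantle returns the same structure; the helper name extract_output_text and the sample payload are illustrative.

```python
from typing import Any


def extract_output_text(response: dict[str, Any]) -> str:
    """Join the text of every output_text part in a Responses API payload."""
    parts: list[str] = []
    for item in response.get("output", []):
        # Reasoning items have type "reasoning"; only "message" items carry the answer.
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


# Hand-written payload in the shape of the OpenAI Responses API (illustrative only).
sample = {
    "output": [
        {"type": "reasoning", "summary": []},
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Open weights can be audited."}],
        },
    ]
}
print(extract_output_text(sample))
```

With the OpenAI Python SDK, the helper could be called as extract_output_text(response.model_dump()). The SDK also exposes a response.output_text convenience property that serves the same purpose.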

Streaming responses

For chat and agent use cases where you want to surface tokens to the user as they’re generated, set stream=True. The response becomes an iterator of incremental delta events:

stream = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Write a short summary of mixture-of-experts architectures."}
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

On the bedrock-runtime endpoint, the equivalent capability requires the bedrock:InvokeModelWithResponseStream permission, which the minimal policy shown later already grants.
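Applications often need the complete reply after streaming finishes, for example to log it or append it to the conversation history. The sketch below accumulates the deltas from a Chat Completions stream. The helper name collect_stream is hypothetical, the guard for chunks without choices is a defensive assumption, and the fake chunks stand in for a real stream so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate delta.content from Chat Completions stream chunks."""
    pieces = []
    for chunk in chunks:
        # Defensive: skip chunks that carry no choices (for example, usage-only chunks).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            pieces.append(delta)
    return "".join(pieces)


def _fake_chunk(text):
    """Build a stand-in chunk with the attributes the loop reads (demo helper)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [_fake_chunk("Mixture"), _fake_chunk(None), _fake_chunk("-of-experts")]
print(collect_stream(fake_stream))
```

To print tokens as they arrive and still keep the full text, add the print call from the streaming loop inside the helper before appending each delta.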

Tool calling

NVIDIA Nemotron and GPT OSS open-weight models are designed for agentic workflows, which makes them a natural fit for tool-calling scenarios. In a tool-calling workflow, you define functions (tools) that the model can invoke, the model decides when to call them based on the user’s request, and your application runs the function and returns the result for the model to incorporate into its final response.

The following example demonstrates this pattern end to end. We define a get_weather tool, send a user message, let the model request the tool call, run the function with mock data, and pass the result back so the model can generate a natural-language reply.

import json
import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country (e.g., Seattle, US)"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Step 1: Send the user request with tool definitions
messages = [
    {"role": "user", "content": "What's the weather like in Seattle?"}
]

response = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

assistant_message = response.choices[0].message

# Step 2: Check if the model wants to call a tool
if assistant_message.tool_calls:
    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        # Step 3: Validate the function name and run it
        if function_name == "get_weather":
            location = arguments.get("location", "Unknown")
            unit = arguments.get("unit", "fahrenheit")
            result = {
                "location": location,
                "temperature": 18 if unit == "celsius" else 64,
                "unit": unit,
                "condition": "Partly cloudy",
                "humidity": 72,
            }
        else:
            result = {"error": f"Unknown function: {function_name}"}

        # Step 4: Return the function result to the model
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Step 5: Get the final response incorporating tool results
    final_response = client.chat.completions.create(
        model="openai.gpt-oss-120b",
        messages=messages,
        tools=tools,
    )

    print(final_response.choices[0].message.content)
else:
    print(assistant_message.content)

The example shown here demonstrates client-side tool calling: the model returns a tool call, your application runs the function, and you pass the result back. On bedrock-mantle, GPT OSS models support both client-side and server-side tool calling, while NVIDIA Nemotron 3 Super 120B supports client-side tool calling only. Both model families also support tool calling on the bedrock-runtime endpoint through the Converse API (using toolConfig). Refer to each model’s model card for the full feature matrix.
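If you already have tool definitions in the Chat Completions format, they can be translated into the toolConfig structure that the Converse API expects. The following is a minimal sketch of that mapping. The helper name to_converse_tool_config is hypothetical, and the abbreviated get_weather definition is for illustration only.

```python
from typing import Any


def to_converse_tool_config(openai_tools: list[dict[str, Any]]) -> dict[str, Any]:
    """Translate Chat Completions function tools into a Converse toolConfig."""
    specs = []
    for tool in openai_tools:
        if tool.get("type") != "function":
            continue  # This sketch only maps function tools.
        fn = tool["function"]
        spec: dict[str, Any] = {
            "name": fn["name"],
            # Converse wraps the JSON Schema in an {"json": ...} envelope.
            "inputSchema": {"json": fn.get("parameters", {"type": "object"})},
        }
        if fn.get("description"):
            spec["description"] = fn["description"]
        specs.append({"toolSpec": spec})
    return {"tools": specs}


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}
tool_config = to_converse_tool_config([weather_tool])
print(tool_config["tools"][0]["toolSpec"]["name"])
```

The result can be passed to a bedrock-runtime client as the toolConfig argument of converse, alongside modelId and messages.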

Using the bedrock-runtime endpoint (boto3)

For the bedrock-runtime endpoint, you need AWS credentials configured (AWS Identity and Access Management (IAM) user or role) with permission to invoke the model. The following is a sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws-us-gov:bedrock:us-gov-west-1::foundation-model/openai.gpt-oss-120b-1:0"
        }
    ]
}

For production deployments, scope the Resource to the specific AWS GovCloud (US) Regions and model IDs that you use.
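One way to keep that scoping consistent is to generate the policy document from the list of Regions and model IDs you have approved. The sketch below follows the ARN format of the sample policy above. The helper name invoke_policy is hypothetical, and Geo cross-Region inference may need additional resources (such as an inference profile ARN) that this sketch does not cover.

```python
import json


def invoke_policy(regions: list[str], model_ids: list[str]) -> dict:
    """Build an IAM policy allowing invocation of specific models in specific Regions."""
    resources = [
        # Foundation-model ARNs have an empty account field (note the "::").
        f"arn:aws-us-gov:bedrock:{region}::foundation-model/{model_id}"
        for region in regions
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }


policy = invoke_policy(["us-gov-west-1"], ["openai.gpt-oss-120b-1:0"])
print(json.dumps(policy, indent=4))
```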

The following example sends a single-turn request using the AWS SDK for Python (boto3) with the Converse API. On the bedrock-runtime endpoint, the GPT OSS model IDs include a version suffix (for example, openai.gpt-oss-120b-1:0). Use the exact model ID from each model’s model card. The response contains a reasoning block followed by a text block, so the example selects the text block when printing the answer.

import boto3

client = boto3.client("bedrock-runtime", region_name="us-gov-west-1")

response = client.converse(
    modelId="openai.gpt-oss-120b-1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "What is a mixture-of-experts architecture?"}]
    }],
    inferenceConfig={"maxTokens": 2048, "temperature": 1.0, "topP": 0.95},
)

# Pick the first text block, skipping the reasoning block that precedes it
content_blocks = response["output"]["message"]["content"]
response_text = next(
    (block["text"] for block in content_blocks if "text" in block),
    None
)

if response_text:
    print(response_text)
else:
    print("No text response.")

To call NVIDIA Nemotron 3 Super 120B through bedrock-runtime, use the model ID nvidia.nemotron-super-3-120b (this model ID doesn’t carry a version suffix).
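When you want to log or audit the reasoning block separately from the answer, the Converse content blocks can be split rather than filtered. The sketch below assumes the reasoning block uses the Converse reasoningContent shape shown in the comment; the helper name split_content and the sample blocks are illustrative, not captured output.

```python
from typing import Any


def split_content(blocks: list[dict[str, Any]]) -> tuple[str, str]:
    """Return (reasoning, answer) text gathered from Converse content blocks."""
    reasoning, answer = [], []
    for block in blocks:
        if "reasoningContent" in block:
            # Assumed shape: {"reasoningContent": {"reasoningText": {"text": ...}}}
            text = block["reasoningContent"].get("reasoningText", {}).get("text", "")
            reasoning.append(text)
        elif "text" in block:
            answer.append(block["text"])
    return "".join(reasoning), "".join(answer)


# Hand-written blocks for illustration; real blocks come from client.converse().
sample_blocks = [
    {"reasoningContent": {"reasoningText": {"text": "Recall what MoE means."}}},
    {"text": "A mixture-of-experts model routes each token to a few experts."},
]
thoughts, reply = split_content(sample_blocks)
print(reply)
```

In the Converse example above, the input would be response["output"]["message"]["content"].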

You can also access these models from your terminal using the AWS Command Line Interface (AWS CLI):

aws bedrock-runtime converse \
    --model-id openai.gpt-oss-120b-1:0 \
    --messages '[{"role":"user","content":[{"text":"Type_Your_Prompt_Here"}]}]' \
    --inference-config '{"maxTokens":512}' \
    --region us-gov-west-1

Scaling on-demand inference

On-demand capacity on the Standard tier is shared and allocated per AWS Region, so during periods of high regional demand a request might be briefly queued or throttled. On the bedrock-mantle endpoint, there is no requests-per-minute quota. Throughput is governed by token-based limits. These open-weight models don’t currently have per-account token quotas published in the Service Quotas console, so use retry logic with exponential backoff to handle transient throttling. Amazon Bedrock surfaces two HTTP error codes that indicate when a request can’t be served:

Error code	Meaning	Recommended action
429	The request was denied because it exceeded the account quotas for Amazon Bedrock.	Request a quota increase through the Service Quotas console, and apply client-side throttling.
503	The service is experiencing high demand or temporary capacity constraints.	Retry with exponential backoff and jitter. If throttling is sustained, reduce the request rate and ramp back up gradually.

For transient 503 responses, configure automatic retries in your SDK:

import boto3
from botocore.config import Config

# Retry up to 6 total attempts using the SDK's standard retry mode
config = Config(retries={"total_max_attempts": 6, "mode": "standard"})
client = boto3.client("bedrock-runtime", config=config)

When ramping back up after sustained throttling, hold at a steady state for about 15 minutes between increases rather than stepping straight to the target volume. For a more detailed ramp-up procedure and additional best practices, see Scaling and throughput best practices in the Amazon Bedrock User Guide.
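The botocore retry configuration applies only to boto3 clients. For calls made through other clients, such as the OpenAI SDK against bedrock-mantle, the same idea can be expressed as a small wrapper. The following is a minimal sketch of capped exponential backoff with full jitter; the function name call_with_backoff, its defaults, and the retry predicate are assumptions to tune for your workload.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable failures with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Full jitter: sleep a random amount up to the capped exponential bound.
            bound = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, bound))
    raise RuntimeError("max_attempts must be at least 1")
```

A predicate such as lambda exc: getattr(exc, "status_code", None) == 503 would retry only capacity errors and let 429 quota errors surface immediately, which matches the recommended actions in the table above. The sleep parameter is injectable so the wrapper can be tested without real delays.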

Clean up

These models use on-demand inference, which incurs charges only when you invoke a model, so there’s no endpoint or infrastructure to tear down. To avoid unintended charges after testing:

If you generated short-term Bedrock API keys, they expire automatically (maximum 12 hours). To revoke one sooner, delete it in the Amazon Bedrock console.

If you opted in to the Priority tier for testing, return to Standard pricing for non-latency-sensitive traffic by removing the service_tier parameter from your invocations.

If you stored a Bedrock API key in AWS Secrets Manager for testing, delete the secret to avoid storage charges.

For pricing details by model and tier, refer to Amazon Bedrock pricing.

Pricing and availability

OpenAI GPT OSS and NVIDIA Nemotron models are available today on Amazon Bedrock in AWS GovCloud (US). In-Region inference is available in AWS GovCloud (US-West) (us-gov-west-1), and Geo cross-Region inference routes requests across AWS GovCloud (US-West) and AWS GovCloud (US-East) (us-gov-east-1) while keeping traffic within the AWS GovCloud (US) boundary.

Pricing is per token and varies by model and service tier. On-demand inference on the Standard tier incurs charges when you invoke a model, with no capacity to reserve and no infrastructure to tear down. For current rates, refer to Amazon Bedrock pricing.

Conclusion

OpenAI GPT OSS and NVIDIA Nemotron models are now available on Amazon Bedrock in AWS GovCloud (US), giving government customers access to advanced open-weight models within their compliance boundary. In this post, we covered the available models and their capabilities, the two endpoints for invoking them, the available service tiers, and scaling guidance. Government teams can run these open-weight models for mission workloads while keeping inference inside the AWS GovCloud (US) boundary, on AWS-operated infrastructure.

To get began:

Open the Amazon Bedrock console in your AWS GovCloud (US) account and try the models in the Playground.
Run the bedrock-mantle Python sample from this post against your own data.
Evaluate gpt-oss-120b, gpt-oss-20b, and NVIDIA Nemotron 3 Super 120B on your workloads to choose the model that fits your cost and latency profile.
For production deployment, review Scaling and throughput best practices and consider the Priority tier for latency-sensitive traffic.

Resources

For more information, refer to the following resources: