Wednesday, June 24, 2026

Build a protein research copilot with Amazon Bedrock AgentCore


Protein researchers face a time-consuming problem: manually searching through hundreds of peptide sequences to find structurally similar candidates is slow, error-prone, and requires deep domain expertise to interpret results. Building a protein research copilot can transform how researchers search for structurally similar peptides across large datasets, enabling natural language queries, automated embedding generation, and AI-powered result summarization in a single conversational interface.

This post shows you how to build a conversational protein research assistant that combines three capabilities:

  1. Natural language query parsing to extract structured search parameters.
  2. Vector similarity search over protein embeddings using a specialized language model.
  3. AI-generated scientific summaries of search results.

The system uses the Strands Agents SDK to orchestrate three specialized tools within one agent, deploys to Amazon Bedrock AgentCore for production serving, and stores peptide embeddings in Amazon Aurora PostgreSQL-Compatible Edition with pgvector.

By the end of this post, you will have built an end-to-end agent application that demonstrates how to:

  • Parse natural language user input, such as “Find 10 similar peptides to the dengue virus peptide LPAIVREAI”, into structured tool parameters using the Strands Agents SDK’s tool-use pattern.
  • Deploy a custom ML model (ESM-C 300M) as an Amazon SageMaker AI serverless endpoint with bundled weights for fast cold starts.
  • Combine vector similarity search (pgvector on Amazon Aurora PostgreSQL) with metadata filtering in a single query.
  • Orchestrate multiple specialized tools, including nested LLM agents, within a single Bedrock AgentCore runtime and generate scientific summaries of search results.

Prerequisites

To follow along with this post, you need:

  • An AWS account with access to Amazon Bedrock foundation models (Anthropic Claude Sonnet 4.6).
  • Python 3.12 or later.
  • The AWS Command Line Interface (AWS CLI) configured with appropriate credentials.
  • IAM permissions for Amazon Bedrock, Amazon SageMaker AI, Amazon Aurora, Amazon Elastic Container Service (Amazon ECS), and AWS CodeBuild.
  • bedrock-agentcore-starter-toolkit installed (pip install bedrock-agentcore-starter-toolkit).
  • The IEDB virus epitope dataset.
  • Estimated deployment time: 30–45 minutes; review the AWS pricing pages for Bedrock, SageMaker AI, Aurora Serverless v2, and AWS Fargate for cost estimates.

Solution overview

The copilot follows a tool-use pattern where a single Strands agent orchestrates three specialized tools to handle the complete research workflow. When a researcher submits a natural language query, the agent parses it into structured parameters, searches for similar peptides using protein embeddings, and summarizes the results with scientific context.

The following diagram illustrates the architecture:

This architecture has five components:

  1. A Streamlit frontend running on AWS Fargate provides the conversational interface. It sends queries to the AgentCore runtime and displays results in a structured format with downloadable tables.
  2. A Strands agent running inside a single Amazon Bedrock AgentCore runtime orchestrates the workflow. The agent uses Anthropic Claude Sonnet 4.6 through the Bedrock Converse API and has access to three tools defined with the @tool decorator.
  3. A parser tool that uses a dedicated Strands agent (LLM-as-parser pattern) to extract structured search parameters (sequence, species filter, result limit) from natural language queries.
  4. A searcher tool that generates protein embeddings through an Amazon SageMaker AI serverless endpoint running ESM-C 300M, then performs cosine similarity search against Amazon Aurora PostgreSQL with pgvector.
  5. A summarizer tool that uses another dedicated Strands agent to analyze search results and produce concise scientific summaries with suggestions for further investigation.

This single-runtime, multi-tool design keeps the deployment simple while maintaining clear separation of concerns. Each tool encapsulates a distinct capability, and the orchestrator agent decides when and how to invoke them based on the user’s query.

Protein embeddings with ESM-C 300M

The core of the similarity search is ESM-C 300M, a protein language model from EvolutionaryScale (Built with ESM) that produces 960-dimensional embeddings capturing structural and functional properties of amino acid sequences. Two peptides with similar biological function produce embeddings that are close in vector space, enabling similarity search without requiring sequence alignment.
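
To make "close in vector space" concrete, the following is a minimal sketch of cosine distance, the measure the search query later in this post orders by. The three-dimensional vectors are toy stand-ins for real 960-dimensional embeddings, and the function name is illustrative rather than part of the solution code.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (0 means same direction, 1 means orthogonal)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors (assumptions for illustration, not real ESM-C output).
query = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]
far = [0.0, 1.0, 0.0]

close_distance = cosine_distance(query, close)
far_distance = cosine_distance(query, far)
```

Because the measure depends only on direction, two embeddings that differ in magnitude but point the same way are treated as identical.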

ESM-C 300M is deployed as an Amazon SageMaker AI serverless endpoint, which scales to zero when idle and incurs no cost between invocations. The model weights are bundled into the deployment artifact to avoid downloading from Hugging Face at inference time, which is important for serverless endpoints where cold start latency matters.

The inference handler constructs the model architecture directly and loads pre-packaged weights:

import os

import torch
from esm.models.esmc import ESMC
from esm.tokenization import get_esmc_model_tokenizers

def model_fn(model_dir):
    # Load the weights bundled into model.tar.gz instead of downloading them.
    weights_path = os.path.join(model_dir, "weights", "esmc_300m.pt")
    # Rebuild the ESM-C 300M architecture so the state dict keys line up.
    model = ESMC(
        d_model=960,
        n_heads=15,
        n_layers=30,
        tokenizer=get_esmc_model_tokenizers(),
        use_flash_attn=False,
    )
    state_dict = torch.load(weights_path, map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()
    return model

The predict_fn handler takes a protein sequence, encodes it, and returns the mean-pooled embedding:

from esm.sdk.api import ESMProtein, LogitsConfig

def predict_fn(input_data, model):
    sequence = input_data["sequence"]
    protein = ESMProtein(sequence=sequence)
    protein_tensor = model.encode(protein)
    logits_output = model.logits(
        protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
    )
    embeddings = logits_output.embeddings
    # Drop the first and last token positions, then average over residues.
    mean_embeddings = embeddings[:, 1:-1, :].mean(dim=1)
    return mean_embeddings[0].detach().cpu().tolist()
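
The pooling step can be illustrated without PyTorch. The sketch below mirrors the `[:, 1:-1, :]` slice for a single sequence, on the assumption that the first and last positions hold the tokenizer's start and end tokens; the two-dimensional vectors and the `mean_pool` name are hypothetical.

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-residue vectors, skipping the first and last positions."""
    residues = token_embeddings[1:-1]
    if not residues:
        raise ValueError("need at least one residue between the boundary tokens")
    dim = len(residues[0])
    return [sum(vec[i] for vec in residues) / len(residues) for i in range(dim)]


# Hypothetical embeddings: start token, two residues, end token.
tokens = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(tokens)
```

Pooling gives every peptide a fixed-length vector regardless of sequence length, which is what allows a single `vector(960)` column to hold all of them.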

The endpoint is deployed as a serverless configuration with 6144 MB of memory and a max concurrency of 5, using the PyTorch 2.6.0 CPU inference container. The model packaging script downloads the weights once through from_pretrained, saves the state dict, and bundles it with the inference code into a model.tar.gz with the code/ directory structure that SageMaker AI requires.
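
The archive layout described above could be produced along these lines. This is a sketch under stated assumptions: the weights path matches the one `model_fn` reads, while the handler file name `code/inference.py` and the helper names are assumptions, and placeholder bytes stand in for the real state dict and handler source.

```python
import io
import tarfile


def build_model_archive(weights: bytes, inference_code: bytes) -> bytes:
    """Pack weights and handler code into an in-memory model.tar.gz."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in (
            ("weights/esmc_300m.pt", weights),      # path read by model_fn
            ("code/inference.py", inference_code),  # assumed handler file name
        ):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_archive(archive_bytes: bytes) -> list[str]:
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(archive.getnames())


# Placeholder content, not real weights or handler code.
archive_bytes = build_model_archive(b"fake-weights", b"# handler code")
members = list_archive(archive_bytes)
```

Listing the members before uploading is a cheap check that the weights landed where the handler expects them.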

Vector search with Aurora PostgreSQL and pgvector

Peptide embeddings are stored in Amazon Aurora PostgreSQL-Compatible Edition Serverless v2 with the pgvector extension. The database schema is straightforward:

CREATE TABLE peptides (
    id SERIAL PRIMARY KEY,
    sequence TEXT NOT NULL,
    embedding vector(960),
    properties JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX peptides_embedding_idx
ON peptides USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The properties JSONB column stores biological metadata (species, source organism, source molecule, epitope positions), enabling combined vector and metadata filtering. For example, a query like “Find peptides similar to LPAIVREAI from dengue virus” triggers both a cosine similarity search on the embedding column and a filter on properties->>'species'.

The data loading pipeline reads from the IEDB virus epitope dataset, generates embeddings for each peptide sequence through the SageMaker AI endpoint, and inserts them into the database using the Amazon RDS Data API. The initial load samples 1,000 linear peptides:

def import_peptides(df):
    for i, row in tqdm(df.iterrows(), total=len(df)):
        sequence = row["Epitope_Name"]
        embedding = get_embedding(sequence)  # SageMaker AI endpoint call
        properties = {
            "species": row["Epitope_Species"],
            "source_organism": row["Epitope_Source Organism"],
            "source_molecule": row["Epitope_Source Molecule"],
            # ... more metadata
        }
        run_statement(
            "INSERT INTO peptides (sequence, embedding, properties) "
            "VALUES (:sequence, :embedding::vector, :properties::jsonb)",
            params=[...]
        )

Database access goes through the Amazon Relational Database Service (Amazon RDS) Data API, which means the agent runtime doesn’t need direct network connectivity to the database. It communicates over HTTPS, which simplifies the networking requirements for AgentCore deployment.
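
The loading code above elides the parameter list. One plausible shape for it is sketched here, on the assumption that the embedding is sent as a pgvector text literal and the metadata as a JSON string, both of which the `::vector` and `::jsonb` casts in the INSERT statement then convert. The helper names are hypothetical; the `name`/`value`/`stringValue` structure is the one the Data API's `execute_statement` accepts.

```python
import json


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats as a pgvector text literal such as '[0.5,-0.25,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_insert_parameters(
    sequence: str, embedding: list[float], properties: dict
) -> list[dict]:
    """Build Data API parameters matching :sequence, :embedding, :properties."""
    return [
        {"name": "sequence", "value": {"stringValue": sequence}},
        {"name": "embedding", "value": {"stringValue": to_vector_literal(embedding)}},
        {"name": "properties", "value": {"stringValue": json.dumps(properties)}},
    ]


# Toy three-dimensional embedding; real rows carry 960 values.
insert_params = build_insert_parameters(
    "LPAIVREAI", [0.5, -0.25, 1], {"species": "Dengue virus"}
)
```

Passing values as named parameters rather than interpolating them into the SQL string keeps the statement safe from injection and lets the same statement text be reused for every row.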

Building the agent with Strands Agents SDK

The Strands Agents SDK provides a clean abstraction for building tool-using agents. Each tool is a Python function decorated with @tool, and the agent automatically generates tool descriptions for the LLM from the function’s docstring and type hints.

Tool definitions

The parser tool delegates to a dedicated Strands agent that acts as a structured output extractor:

import json

from strands import Agent, tool
from strands.models import BedrockModel

parser_agent = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6",
                       region_name="us-east-1", streaming=False),
    system_prompt="""You are a peptide query parser. Extract structured search
    parameters from natural language queries. Return ONLY a valid JSON object."""
)

@tool
def parse_peptide_query(query: str) -> str:
    """Parse a natural language peptide query into structured search parameters.

    Args:
        query: The user's natural language query about peptides.

    Returns:
        JSON string with extracted parameters like sequence, species, limit.
    """
    result = parser_agent(f"Parse this query: {query}")
    # Round-trip through json to confirm the agent returned valid JSON.
    parsed = json.loads(str(result))
    return json.dumps(parsed)
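
Even with a "Return ONLY a valid JSON object" instruction, a model can occasionally wrap its answer in a code fence or add a sentence around it, which would make a direct `json.loads` call fail. The sketch below shows one defensive way to pull out the object first. It is an optional hardening step rather than part of the original solution, and `extract_json_object` is a hypothetical helper.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span in text, ignoring fences or extra prose."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


fence = "`" * 3  # built at runtime to avoid a literal code fence in this example
raw = f'Here is the result:\n{fence}json\n{{"sequence": "LPAIVREAI", "limit": 10}}\n{fence}'
parsed_example = extract_json_object(raw)
```

This approach assumes the reply contains a single JSON object; replies with several separate objects would need a stricter parser.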

The searcher tool combines SageMaker AI embedding generation with pgvector similarity search:

@tool
def search_similar_peptides(sequence: str, species: str = "", limit: int = 20) -> str:
    """Search for peptides similar to the given sequence using ESM embeddings.

    Args:
        sequence: The peptide amino acid sequence (e.g., "LPAIVREAI").
        species: Optional species filter (e.g., "Dengue virus").
        limit: Maximum number of results to return.

    Returns:
        JSON string with list of similar peptides and their properties.
    """
    # Get embedding from SageMaker AI
    resp = sagemaker_client.invoke_endpoint(
        EndpointName=endpoint, ContentType="application/json",
        Body=json.dumps({"sequence": sequence}))
    embedding = json.loads(resp["Body"].read().decode())["embedding"]

    # Vector similarity search with optional metadata filter
    sql = "SELECT sequence, properties, "
    sql += "(embedding <=> :query_embedding::vector) AS cosine_distance "
    sql += "FROM peptides"
    if species:
        sql += " WHERE properties->>'species' = :species"
    sql += " ORDER BY cosine_distance LIMIT :limit"

    # Building params and converting rows into peptides is omitted here.
    results = run_sql(sql, params)
    return json.dumps({"results": peptides, "count": len(peptides)})
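
The query construction can be separated from the AWS calls so it is easy to test in isolation. The following sketch builds the same SQL text together with a matching parameter list in the Data API's `name`/`value` format. The `build_search_query` helper and the choice to send the limit as a `longValue` are assumptions, since the original snippet does not show how `params` is assembled.

```python
def build_search_query(
    embedding: list[float], species: str = "", limit: int = 20
) -> tuple[str, list[dict]]:
    """Return the similarity-search SQL and its Data API parameter list."""
    vector_literal = "[" + ",".join(str(float(x)) for x in embedding) + "]"
    sql = (
        "SELECT sequence, properties, "
        "(embedding <=> :query_embedding::vector) AS cosine_distance "
        "FROM peptides"
    )
    params = [
        {"name": "query_embedding", "value": {"stringValue": vector_literal}},
        {"name": "limit", "value": {"longValue": int(limit)}},
    ]
    if species:
        # Only add the filter and its parameter when a species was requested.
        sql += " WHERE properties->>'species' = :species"
        params.append({"name": "species", "value": {"stringValue": species}})
    sql += " ORDER BY cosine_distance LIMIT :limit"
    return sql, params


sql_all, params_all = build_search_query([0.1, 0.2])
sql_filtered, params_filtered = build_search_query(
    [0.1, 0.2], species="Dengue virus", limit=10
)
```

Keeping the filter parameter conditional avoids sending an unused `:species` binding when no species is given.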

The summarizer tool uses another dedicated Strands agent for scientific analysis:

summarizer_agent = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6",
                       region_name="us-east-1", streaming=False),
    system_prompt="""You are a peptide research expert providing concise,
    high-level summaries. Analyze search results and provide a brief,
    insightful summary focusing on key findings and ideas for further
    investigation."""
)

@tool
def summarize_results(original_query: str, search_results_json: str) -> str:
    """Summarize peptide search results with scientific insights.

    Args:
        original_query: The original user query.
        search_results_json: JSON string of search results.

    Returns:
        A concise scientific summary of the search results.
    """
    results = json.loads(search_results_json)
    # Separate the two fields with a newline so they do not run together.
    summary = summarizer_agent(f"Original query: {original_query}\n"
                               f"Results: {results}")
    return str(summary)

Orchestrator agent

The orchestrator ties everything together. It receives the user’s query and decides which tools to call and in what order:

SYSTEM_PROMPT = """You are a peptide research assistant. You have three tools:
1. parse_peptide_query - Parse a natural language query into structured parameters
2. search_similar_peptides - Search for similar peptides using ESM embeddings
3. summarize_results - Summarize search results with scientific insights

For every user query, follow this workflow:
1. First, use parse_peptide_query to extract the sequence and parameters
2. Then, use search_similar_peptides with the extracted sequence
3. Finally, use summarize_results to provide insights

Always complete all three steps."""

strands_agent = Agent(
    model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6",
                       region_name="us-east-1", streaming=False),
    tools=[parse_peptide_query, search_similar_peptides, summarize_results],
    system_prompt=SYSTEM_PROMPT
)

This design uses the “agents-as-tools” pattern: the parser and summarizer are themselves Strands agents, but they’re wrapped in @tool decorators and exposed to the orchestrator as callable tools. The orchestrator doesn’t know or care that these tools internally use LLMs; it calls them as functions. This keeps the orchestration logic clean while allowing each tool to use LLM capabilities where needed.

Deploying to Amazon Bedrock AgentCore

Amazon Bedrock AgentCore provides a managed runtime for hosting AI agents. The agent code runs in a containerized environment built and deployed through AWS CodeBuild, so no local Docker installation is required.

Agent entrypoint

The AgentCore runtime expects an entrypoint function that receives a payload and context:

from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload, context):
    # Accept either key so callers can send "query" or "prompt".
    query = payload.get("query") or payload.get("prompt")
    result = strands_agent(query)
    return {
        "status": "success",
        "original_query": query,
        "parsed_query": _tool_outputs.get("parsed_query", {}),
        "search_results": _tool_outputs.get("search_results", []),
        "summary": _tool_outputs.get("summary", str(result)),
        "session_id": context.session_id
    }

if __name__ == '__main__':
    app.run()

The entrypoint captures tool outputs in a shared dictionary so that the response includes structured data (parsed query, search results table, summary text) instead of the agent’s final text output alone. This structured response is what the Streamlit frontend uses to render tables and expandable sections.
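
The snippet above reads from `_tool_outputs` without showing how it is filled. A minimal sketch of one way the capture could work follows, assuming a module-level dictionary that each tool writes to before returning. The `record_output` and `build_response` helpers are hypothetical and only illustrate the pattern.

```python
import json
from typing import Any

# Shared store that tools write to and the entrypoint reads from.
_tool_outputs: dict[str, Any] = {}


def record_output(key: str, payload_json: str) -> str:
    """Store a tool's decoded JSON output and pass the raw string through."""
    _tool_outputs[key] = json.loads(payload_json)
    return payload_json


def build_response(query: str, final_text: str) -> dict:
    """Assemble the structured response, falling back to defaults if a tool did not run."""
    return {
        "status": "success",
        "original_query": query,
        "parsed_query": _tool_outputs.get("parsed_query", {}),
        "search_results": _tool_outputs.get("search_results", []),
        "summary": _tool_outputs.get("summary", final_text),
    }


# Clear state at the start of each request so earlier results do not leak through.
_tool_outputs.clear()
record_output("parsed_query", '{"sequence": "LPAIVREAI", "limit": 10}')
response = build_response("Find 10 similar peptides to LPAIVREAI", "final text")
```

A module-level dictionary is shared by every request handled in the same process, so clearing it per invocation (or keying it by session) matters once the runtime serves more than one user.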

Infrastructure as code

The deployment uses AWS CloudFormation for all infrastructure. The VPC stack creates private subnets with NAT gateways and VPC endpoints for Amazon Bedrock, the Amazon RDS Data API, and AWS Secrets Manager, helping to ensure the agent runtime can reach all required services without traversing the public internet.

An Amazon Aurora PostgreSQL-Compatible Edition Serverless v2 database is required, with automatic scaling from 0.5 to 4 ACUs (1–8 GB RAM). An AWS Lambda-backed custom resource initializes the pgvector extension and creates the peptides table during stack creation:

DBCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora-postgresql
    EnableHttpEndpoint: true  # Amazon RDS Data API
    ServerlessV2ScalingConfiguration:
      MinCapacity: 0.5
      MaxCapacity: 4

Deploy the solution

The solution requires the following components, deployed in order:

Warning: Complete the deployment steps in order. Skipping steps may result in deployment failures.

  1. VPC and networking: Private subnets with NAT gateways and VPC endpoints for Amazon Bedrock, the Amazon RDS Data API, and AWS Secrets Manager, so the agent runtime can reach all required services without traversing the public internet.
  2. Aurora PostgreSQL database: An Amazon Aurora PostgreSQL-Compatible Edition Serverless v2 cluster with the pgvector extension enabled and the peptides table initialized through a Lambda-backed AWS CloudFormation custom resource.
  3. SageMaker AI endpoint: A serverless endpoint running ESM-C 300M with 6144 MB of memory and a max concurrency of 5, using the PyTorch 2.6.0 CPU inference container.
  4. Peptide data: The IEDB virus epitope dataset is loaded into the database by generating embeddings for each sequence through the SageMaker AI endpoint and inserting them using the Amazon RDS Data API.
  5. AgentCore runtime and Streamlit UI: The Strands agent is deployed to an Amazon Bedrock AgentCore runtime through AWS CodeBuild (no local Docker required), and the Streamlit frontend is deployed to AWS Fargate.

Streamlit frontend

The frontend is a lightweight Streamlit application that communicates with the AgentCore runtime through the bedrock-agentcore boto3 client. It runs on AWS Fargate with a minimal container image that includes only streamlit, pandas, and boto3, with no ML libraries.

import json

import boto3

client = boto3.client('bedrock-agentcore', region_name="us-east-1")

# The payload key "prompt" is one of the two keys the entrypoint accepts.
response = client.invoke_agent_runtime(
    agentRuntimeArn=HOST_RUNTIME_ARN,
    runtimeSessionId=session_id,
    payload=json.dumps({"prompt": query}).encode()
)

The UI displays results in three sections: the parsed query parameters (expandable), a sortable table of similar peptides with cosine distances and metadata, and the AI-generated scientific summary. Users can download results as CSV for further analysis.

The following screenshot shows the search query and the results.
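
The CSV export can be sketched with the standard library alone. The row shape used here (`sequence`, `cosine_distance`, and a nested `properties` dictionary) is an assumption inferred from the search query's SELECT list, and both the sample row and the `results_to_csv` helper are made up for illustration.

```python
import csv
import io

CSV_COLUMNS = ["sequence", "cosine_distance", "species", "source_organism"]


def results_to_csv(results: list[dict]) -> str:
    """Flatten search results (with nested properties) into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in results:
        props = row.get("properties", {})
        writer.writerow(
            {
                "sequence": row.get("sequence", ""),
                "cosine_distance": row.get("cosine_distance", ""),
                "species": props.get("species", ""),
                "source_organism": props.get("source_organism", ""),
            }
        )
    return buffer.getvalue()


# Hypothetical sample row, not real search output.
sample_results = [
    {
        "sequence": "LPAIVREAI",
        "cosine_distance": 0.0,
        "properties": {"species": "Dengue virus", "source_organism": "Dengue virus 2"},
    }
]
csv_text = results_to_csv(sample_results)
```

In a Streamlit app, a string like `csv_text` is the kind of value that would typically be handed to a download button.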

Streamlit frontend showing a peptide similarity search query and results table with cosine distances and metadata

Considerations

Before deploying this solution to production, keep the following design and operational trade-offs in mind:

Cold start latency. The SageMaker AI serverless endpoint takes 2–3 minutes on the first invocation after an idle period while the container initializes and loads model weights. Subsequent invocations within the keep-alive window complete in seconds. For latency-sensitive workloads, consider a provisioned endpoint or setting a higher provisioned concurrency on the serverless configuration.
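
While the endpoint warms up, a caller may see errors or timeouts on the first attempts. A generic retry wrapper with exponential backoff is one way to absorb that. This is a sketch and not part of the original solution: the `call_with_retry` helper is hypothetical, and it catches every exception for brevity, whereas real code would retry only on the specific errors that indicate a warming endpoint.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure and doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise RuntimeError("attempts must be at least 1")


# Simulated endpoint call that fails twice before succeeding.
calls = {"count": 0}
delays: list[float] = []


def flaky_embedding_call() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("endpoint still warming up")
    return "ok"


outcome = call_with_retry(flaky_embedding_call, sleep=delays.append)
```

Injecting the `sleep` function keeps the wrapper testable without real waiting; in production the default `time.sleep` applies.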

Embedding model choice. We use ESM-C 300M for its balance of embedding quality and inference speed on CPU. For higher accuracy on structural similarity tasks, ESM-C 600M or ESM2 models offer larger embedding dimensions at the cost of increased memory and latency. The 960-dimensional embeddings from ESM-C 300M provide strong performance for peptide similarity search in testing.

Scaling the dataset. The initial load uses 1,000 sampled peptides from the IEDB dataset. For production use with larger datasets, consider batch-loading embeddings, increasing the IVFFlat index lists parameter proportionally, and scaling Aurora ACUs accordingly. The Amazon RDS Data API has a 1 MB response size limit, so queries returning large result sets may need pagination.
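
As a starting point for sizing the index, the pgvector documentation suggests roughly rows / 1000 lists for tables up to one million rows and the square root of the row count beyond that, with about sqrt(lists) probes at query time. The sketch below encodes that heuristic; the function names are illustrative, and the values are starting points to tune against measured recall and latency rather than fixed rules.

```python
import math


def suggested_ivfflat_lists(row_count: int) -> int:
    """Starting value for the IVFFlat lists parameter, per the pgvector guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, math.isqrt(row_count))


def suggested_probes(lists: int) -> int:
    """Starting value for ivfflat.probes at query time."""
    return max(1, math.isqrt(lists))


lists_for_million = suggested_ivfflat_lists(1_000_000)
lists_for_four_million = suggested_ivfflat_lists(4_000_000)
probes_for_hundred_lists = suggested_probes(100)
```

For the 1,000-row sample this heuristic yields a much smaller value than the 100 lists in the schema above, which is a reminder that small tables need few lists and that the setting deserves a second look as the dataset grows.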

Cost. The serverless components (SageMaker AI serverless endpoint, Aurora Serverless v2, AgentCore runtime) scale to near-zero when idle, making this architecture cost-effective for research workloads with intermittent usage patterns. The primary ongoing costs during active use are Bedrock LLM inference (three calls per query: parser, orchestrator, summarizer) and SageMaker AI endpoint invocations.

Cleaning up

To avoid ongoing costs, delete the resources in the following order:

Warning: Delete resources in reverse order to avoid dependency errors.

  1. Streamlit UI: Delete the AWS Fargate stack through the AWS CloudFormation console or AWS CLI.
  2. SageMaker AI endpoint: Delete the endpoint, endpoint configuration, and model through the Amazon SageMaker AI console or AWS CLI.
  3. Database: Delete the IEDB dataset and then the Aurora PostgreSQL database stack through the AWS CloudFormation console.
  4. VPC: Delete the VPC stack through the AWS CloudFormation console.
  5. AgentCore runtime: Delete the runtime through the Amazon Bedrock AgentCore console.

Conclusion

This post showed you how to build a protein research copilot that combines protein language model embeddings with LLM-powered analysis in a single conversational interface.

What traditionally requires a researcher to manually query sequence databases, run alignment tools, and interpret results across multiple applications (a process that can take hours per search) is reduced to a single natural language query that returns ranked, summarized results in under a minute (or 2–3 minutes on cold start). This consolidation of parsing, embedding-based search, and scientific summarization into one conversational workflow can significantly accelerate the early stages of peptide research and candidate screening.

The Strands Agents SDK’s tool-use pattern provides a clean way to compose specialized capabilities (parsing, searching, summarizing) into a coherent workflow, while Amazon Bedrock AgentCore handles the operational complexity of hosting and scaling the agent.

The same architecture generalizes beyond peptide research. Domains where researchers need to search over specialized embeddings, filter by structured metadata, and synthesize results, such as genomics, drug design, and materials science, can benefit from this pattern of combining domain-specific embedding models with LLM orchestration. The key design decisions that make this practical are: bundling model weights to avoid cold-start downloads, using the Amazon RDS Data API to simplify networking, and automating the deployment with infrastructure as code.

As next steps, consider exploring larger ESM models for higher embedding accuracy, adding support for batch queries, or extending the metadata schema to include more biological annotations from the IEDB dataset.

References

Vita R, Blazeska N, Marrama D; IEDB Curation Team Members; Duesing S, Bennett J, Greenbaum J, De Almeida Mendes M, Mahita J, Wheeler DK, Cantrell JR, Overton JA, Natale DA, Sette A, Peters B. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids Res. 2025 Jan 6;53(D1):D436-D443. doi: 10.1093/nar/gkae1092. PMID: 39558162; PMCID: PMC11701597.

ESM Team. “ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning.” EvolutionaryScale, 2024. https://evolutionaryscale.ai/blog/esm-cambrian


About the authors

Yuan Tian

Yuan is an Applied Scientist at the AWS Generative AI Innovation Center, where he architects and implements generative AI solutions, from knowledge retrieval to voice AI and agentic systems, for enterprise customers spanning healthcare, life sciences, energy, finance, and more. He brings an interdisciplinary background combining AI/ML with computational biology, and holds a Ph.D. in Immunology from the University of Alabama at Birmingham.

Ganesh Kaliaperoumal

Ganesh is a Senior Cloud Architect at AWS, where he guides enterprise customers through complex cloud migrations and modernization initiatives. His expertise spans containers, serverless architectures, and generative AI solutions. As an AWS Golden Jacket holder who has achieved all active AWS certifications, Ganesh brings comprehensive technical depth to help organizations scale cloud-native applications.

Subhasish Bhaumik

Subhasish is a Senior Data Architect, Data Lake at Amazon Web Services (AWS). He partners with enterprise customers to design and implement high-performance, highly available, cost-effective, resilient, and secure solutions spanning generative AI, data mesh, data lake, and analytics platforms on AWS. Subhasish enables customers to unlock the full value of their data, empowering data-driven decision-making that delivers measurable business outcomes, while guiding them through their digital and data transformation journeys.

Muhammad Zahid Ali

Muhammad is a Senior Delivery Consultant at AWS Professional Services. He helps enterprise-level customers in healthcare and life sciences modernize complex scientific data platforms, build scalable data lakes, and implement real-time analytics solutions on AWS that accelerate regulatory submissions and drive measurable business outcomes. He specializes in generative AI, machine learning, data analytics, and solutions architecture, guiding customers through their digital and data transformation journeys. In his spare time, he enjoys mentoring aspiring cloud engineers and exploring emerging AI technologies.
