an area LLM. Good.
However after the primary few chats, you is perhaps questioning: what else can I do with it?
Properly, how about making the native LLM agentic with some instrument use?
On this put up, we’ll discover find out how to flip an area LLM right into a tool-using agent. Particularly, we’ll use
- Gemma 4 mannequin (edge-friendly variants) as our native LLM
- Ollama for serving the native LLM
- OpenAI Brokers SDK for the agent runtime
- Tavily internet search MCP as one instance of the exterior instrument
We’ll construct a mini deep analysis agent that may search the net, collect the proof, and synthesize a solution with citations, given a person query.
By the top of the put up, you’d have a working native deep analysis agent and a reusable implementation sample for turning an area mannequin into an area AI agent.
If you’re keen on an area coding-agent setup, I beforehand coated Gemma 4 + OpenCode. On this put up, we deal with the extra common sample of connecting an area mannequin to an agent runtime and exterior instruments.
1. Set Up the Native Agent Stack
We have to put together 4 items earlier than we write the code: Ollama, Gemma 4 (particularly the Gemma 4 E4B mannequin), OpenAI Brokers SDK, and Tavily MCP.
First, let’s set up Ollama.
On Home windows, you possibly can obtain the installer from the official Ollama web site:
https://ollama.com/obtain
Or use winget in PowerShell:
winget set up Ollama.Ollama
On Linux, Ollama will be put in with:
"curl -fsSL https://ollama.com/set up.sh | sh"
After set up, please verify:
ollama --version
On Home windows, bear in mind to launch Ollama from the Begin menu. As soon as it’s working, the native API endpoint is accessible.
Subsequent, we pull the native mannequin. Right here, we use Gemma 4 E4B variant:
ollama pull gemma4:e4b
Gemma 4 has a number of variants. The E4B mannequin is an efficient match for our goal, as it’s designed with edge/native agentic workflows in thoughts. My machine has an NVIDIA RTX 2000 Ada Laptop computer GPU with about 8 GB VRAM. In case your machine is extra constrained, you possibly can strive the lighter E2B variant:
ollama pull gemma4:e2b
Subsequent, we want the agent runtime library. For that, we use OpenAI Brokers SDK:
pip set up openai-agents
You’d additionally want the OpenAI-compatible shopper:
pip set up openai
One thing to notice right here: later, we’ll level the shopper to Ollama’s native endpoint, so this doesn’t imply we’re sending mannequin calls to OpenAI.
Lastly, we want a Tavily MCP endpoint. In case you haven’t used it earlier than, Tavily is a search API designed for LLM purposes. On this put up, we use its MCP server so the agent can search the net.
You’d have to first create a Tavily account and get an API key. On the Tavily platform, you possibly can instantly generate a MCP hyperlink with the next form:
https://mcp.tavily.com/mcp/?tavilyApiKey=
Now we’re prepared.
Utilizing Tavily right here just isn’t a sponsored selection; it’s used right here as one handy MCP instrument, the identical sample can work with different MCP-compatible instruments as properly.
Actually, the entire stack right here just isn’t the one choice. As an alternative of utilizing Ollama, you can serve the native mannequin with LM Studio or llama.cpp. As an alternative of Gemma 4 fashions, you can even strive with different fashions from, e.g., Qwen household. For agent framework, we even have choices from Google or Anthropic. You possibly can additionally join completely different MCP instruments as a substitute of Tavily. I take advantage of this mixture just because I’m conversant in that stack. However the necessary takeaway on this case research is the final native agentic sample.
2. Configure the Native Analysis Agent
With OpenAI Brokers SDK, that is the ultimate Agent object we have to compose:
from brokers import Agent
agent = Agent(
identify="Native Analysis Agent",
directions=RESEARCH_AGENT_INSTRUCTIONS,
mannequin=mannequin,
mcp_servers=[tavily_server],
mcp_config={"include_server_in_tool_names": True},
)
Let’s unpack every half.
2.1 The Mannequin
First, the mannequin.
from openai import AsyncOpenAI
from brokers import OpenAIChatCompletionsModel
MODEL_NAME = "gemma4:e4b"
OLLAMA_BASE_URL = "http://localhost:11434/v1"
shopper = AsyncOpenAI(
api_key="ollama",
base_url=OLLAMA_BASE_URL,
)
mannequin = OpenAIChatCompletionsModel(
mannequin=MODEL_NAME,
openai_client=shopper,
)
We begin by making a shopper that factors at Ollama’s native OpenAI-compatible endpoint.
Then, we use OpenAIChatCompletionsModel to wrap the Gemma mannequin right into a mannequin object. This permits the Brokers SDK to make use of that mannequin contained in the agent loop.
Be aware that the api_key="ollama" worth is only a placeholder. Ollama doesn’t actually need an actual OpenAI API key. We use it as a result of the shopper expects this discipline.
2.2 The Instruction
Subsequent, we outline the instruction for the agent with the specified analysis habits:
from datetime import datetime
CURRENT_DATE = datetime.now().strftime("%B %d, %Y")
# Be aware that this instruction is iterated with AI
RESEARCH_AGENT_INSTRUCTIONS = f"""
[Role]
You're a concise analysis assistant.
[Task]
Reply the person's query by turning it right into a small internet analysis activity.
Use the present date when decoding time-sensitive questions: {CURRENT_DATE}.
[Research behavior]
Begin with one focused search question.
For advice or comparability questions, full this analysis loop earlier than answering:
first establish the primary choices, then seek for comparability context, then synthesize a advice.
Use follow-up searches when the primary outcomes are inadequate, conflicting, or solely cowl a part of the query.
Choose related and credible sources, and monitor which supply helps every necessary declare.
Earlier than answering, verify whether or not the gathered proof is sufficient to assist the conclusion.
[Expected output]
Give a direct reply first, then briefly clarify the proof behind it.
Embody supply hyperlinks for key factual claims.
[Rules]
Don't depend on reminiscence for details that will have modified.
Don't invent lacking particulars.
Hold the reply concise.
""".strip()
2.3 The Instruments
Now we equip the agent with the net search instrument. On this case, we use the Tavily search engine by means of MCP:
from brokers import Agent, Runner
from brokers.mcp import MCPServerStreamableHttp
TAVILY_MCP_URL = "YOUR_TAVILY_MCP_URL"
async with MCPServerStreamableHttp(
identify="tavily",
params={"url": TAVILY_MCP_URL},
) as tavily_server:
instruments = await tavily_server.list_tools()
print("Accessible Tavily instruments:")
for instrument in instruments:
description = (instrument.description or "").exchange("n", " ")
print(f"- {instrument.identify}: {description[:120]}")
agent = Agent(
identify="Native Analysis Agent",
directions=RESEARCH_AGENT_INSTRUCTIONS,
mannequin=mannequin,
mcp_servers=[tavily_server],
mcp_config={"include_server_in_tool_names": True},
)
end result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS)
This code block does three issues:
- It opens a connection to Tavily’s MCP server with
async with MCPServerStreamableHttp(...) as tavily_server:As soon as linked, Tavily would expose its out there instruments to the Brokers SDK. - We create the Agent object contained in the MCP context. Be aware that now we have
mcp_servers=[tavily_server], which attaches Tavily’s MCP instruments to the agent. - We lastly run the agent with
end result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS). The context supervisor issues right here as a result of the MCP connection is just energetic contained in theasync withblock.
mcp_config={"include_server_in_tool_names": True}is especially for readability within the hint. With out it, the instrument identify will solely seem astavily_search. With it, the instrument identify will present asmcp_tavily__tavily_search. This makes it clearer that the instrument name got here by means of the Tavily MCP server.
3. Run a Analysis Query
Now that the agent is configured, let’s take a look at it with one concrete query:
“Which June 23, 2026 World Cup match had the largest group-stage stakes, and why?”
To examine what occurred, I print a compact hint:
def compact(worth: object, restrict: int = 220) -> str:
textual content = str(worth).exchange("n", " ")
return textual content if len(textual content) <= restrict else textual content[:limit] + "..."
for step, merchandise in enumerate(end result.new_items, begin=1):
raw_item = getattr(merchandise, "raw_item", None)
raw_type = getattr(raw_item, "kind", "")
raw_name = getattr(raw_item, "identify", "")
raw_output = getattr(raw_item, "output", "")
print(
f"{step:02d} | {kind(merchandise).__name__} | "
f"{raw_type or raw_name} | {compact(raw_output or raw_item)}"
)
In my run, the hint seemed like this:
01 | ToolCallItem | function_call | ResponseFunctionToolCall(arguments='{"question":"World Cup 2026 group stage matches June 23, 2026 stakes"}', identify='mcp_tavily__tavily_search', ...)
02 | ToolCallOutputItem | | {'call_id': ..., 'output': ...}
03 | MessageOutputItem | message | ResponseOutputMessage(... remaining reply ...)
This permits us to see the agentic habits instantly. On this run, the native Gemma mannequin determined to name the Tavily search instrument, the Brokers SDK executed that instrument name, and handed the outcomes again to the mannequin. Then, the mannequin produced the ultimate reply.
To see the ultimate response, we are able to print:
print(end result.final_output)
That is what the agent produced:
The match with the largest group-stage stakes on June 23, 2026, was Colombia vs. DR Congo.
Why:
In response to FIFA reporting, this particular match was highlighted as a essential second the place Colombia superior into the knockout part of the match.
The article notes that Daniel Munoz scored the primary objective for Colombia throughout this Group Okay fixture, which instantly contributed to their development within the competitors.
Proof
- FIFA: An article titled "Colombia v Congo DR Group Okay FIFA World Cup 2026" particularly studies on a key second from this match, stating that Munoz's objective helped hearth Colombia into the knockout part.
Supply: https://digitalhub.fifa.com/remodel/450614d3-72d7-4c1f-85ff-ea0fbee6f28d/Colombia-v-Congo-DR-Group-Okay-FIFA-World-Cup-2026?focuspoint=0.51
- Yahoo Sports activities: Confirms the fixture and end result for that date: Colombia defeated DR Congo.
Supply: https://sports activities.yahoo.com/soccer/article/2026-world-cup-results-standings-and-schedule-live-scores-group-stage-updates-and-how-to-watch-050724193.html
Discover that the agent solely made one search spherical on this run, because the search outcomes already contained sufficient proof for the mannequin to reply. For extra advanced questions, a number of rounds of search and reasoning can be needed, and our present framework naturally helps that.
4. Wrapping Up
An area LLM doesn’t have to remain as a chat mannequin.
On this put up, we deployed a Gemma 4 E4B mannequin domestically by means of Ollama, then we put the mannequin inside an agent runtime offered by OpenAI Brokers SDK, and we gave the agent an internet search instrument in order that it might probably discover info on-line to reply customers’ questions.
From right here, you possibly can simply prolong this sample with stronger analysis directions or construct a extra express planning-reflection workflow, if you wish to hold working within the course of deep analysis, or you possibly can join the agent to extra MCP instruments for a lot of different use instances.
Glad constructing!
Reference
Ollama: https://ollama.com/
Gemma mannequin household: https://ai.google.dev/gemma
OpenAI Brokers SDK: https://openai.github.io/openai-agents-python/
Brokers SDK MCP docs: https://openai.github.io/openai-agents-python/mcp/
Tavily MCP docs: https://docs.tavily.com/documentation/mcp
