# Introduction
In a current article on Machine Studying Mastery, we constructed a tool-calling agent that reached outward, that’s pulling climate, information, foreign money charges, and time from public APIs. That article lined the synthesis half of the sample properly, but it surely left the extra fascinating half on the desk: an agent that causes about its personal setting, inspects its personal machine, and offloads logic it does not belief itself to carry out. It could possibly be argued that that is nearer to really “agentic.”
This text picks up the place that one left off. We’ll give Gemma 4 two new instruments — a sandboxed native filesystem explorer and a restricted Python interpreter — and watch the mannequin resolve, by itself, when to go searching and when to compute.
Matters we’ll cowl embody:
- Why “agentic” instrument calling wants greater than internet APIs to be fascinating
- construct a filesystem inspection instrument with arduous path-traversal guards
- wire a Python interpreter instrument to the mannequin with out handing it the keys to your machine
- How the identical orchestration loop from earlier than generalizes to those new capabilities
I extremely suggest that you simply first learn this text earlier than persevering with on.
# From Dialog to Company
When the one instruments you give a language mannequin are read-only internet APIs, primarily you continue to actually have a chatbot, albeit one with potential entry to higher data. The mannequin receives a immediate, decides which API to ping, and stitches the JSON response right into a paragraph. There isn’t a actual notion of setting, no state to examine, no consequence to purpose about; it is a state of affairs extra akin to retrieval augmented technology than true company.
Company, within the sensible sense practitioners use the phrase, exhibits up when a mannequin begins interacting with the system it’s operating on. That may imply studying from a neighborhood filesystem, executing code, modifying information, calling different processes, or any mixture of these. The second a instrument can do one thing aside from return a clear string from a distant service, the mannequin has to begin asking about itself: what information exist, what does this quantity really equal, what’s on this folder earlier than I declare it incorporates something.
The Gemma 4 household, and particularly the gemma4:e2b edge variant we’ve got been utilizing, is sufficiently small to run regionally on a laptop computer whereas being competent sufficient at structured output to drive this sort of loop reliably. That mixture is what makes the local-agentic sample fascinating within the first place. The entire code for this tutorial could be discovered right here.
# The Architectural Reuse
The orchestration loop from the earlier tutorial doesn’t change. We outline Python capabilities, expose them by way of JSON schema, go the registry to Ollama alongside the person immediate, intercept any tool_calls block on the response, execute the requested perform regionally, append the end result as a instrument-role message, and re-query the mannequin so it will probably synthesize a remaining reply. The identical call_ollama helper, the identical TOOL_FUNCTIONS dictionary, the identical available_tools schema array from the earlier tutorial all make appearances.
What modifications is the character of the instruments themselves. The place the earlier batch have been all skinny shoppers over distant APIs, these we’ll construct now each run code on the machine. That shifts the design drawback from “how do I parse this response” to “how do I make certain the mannequin can not, even by chance, do one thing it shouldn’t be allowed to do.”
# Instrument 1: A Sandboxed Filesystem Explorer
The primary instrument, list_directory_contents, offers the mannequin the power to see what information exist in a given folder. This sounds trivial till you keep in mind that os.listdir accepts any string, together with /, ~, and ../../and so on. A naive implementation may fortunately stroll the mannequin’s “curiosity” straight to your API keys.
The design selection right here is to pin a secure base listing at script begin and reject any request that resolves exterior of it:
# Safety: confine list_directory_contents to this base listing and its descendants
# Set to the present working listing when the script begins
SAFE_BASE_DIR = os.path.abspath(os.getcwd())
def list_directory_contents(path: str = ".") -> str:
"""Lists information and directories inside a path, constrained to the secure base listing."""
attempt:
# Resolve to an absolute path and confirm it sits inside SAFE_BASE_DIR
# This blocks traversal makes an attempt like '../../and so on' or absolute paths like "https://www.kdnuggets.com/"
requested = os.path.abspath(os.path.be part of(SAFE_BASE_DIR, path))
if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
return (
f"Error: Entry denied. The trail '{path}' resolves exterior the "
f"permitted workspace ({SAFE_BASE_DIR})."
)
...
The sample is easy however value contemplating additional. We by no means belief the string the mannequin produced. We be part of it onto the bottom listing, resolve it completely (so .. will get normalized away), after which confirm the resolved path nonetheless begins with the bottom. Each /and so on/passwd and ../../someplace collapse into paths that fail that prefix verify and are rejected earlier than os.listdir is ever referred to as.
The remainder of the perform is housekeeping: verify the trail exists and is a listing, record its contents, and format every entry as both [DIR] or [FILE] with a byte dimension. The returned string is obvious English with construction the mannequin can parse on the second go:
entries = sorted(os.listdir(requested))
if not entries:
return f"The listing '{path}' is empty."
traces = [f"Contents of '{path}' ({len(entries)} item(s)):"]
for title in entries:
full = os.path.be part of(requested, title)
if os.path.isdir(full):
traces.append(f" [DIR] {title}/")
else:
attempt:
dimension = os.path.getsize(full)
traces.append(f" [FILE] {title} ({dimension} bytes)")
besides OSError:
traces.append(f" [FILE] {title}")
return "n".be part of(traces)
The JSON schema we hand to the mannequin is intentionally permissive on the parameter facet — path is elective, defaulting to the workspace root, as a result of most helpful first questions are concerning the present folder:
{
"sort": "perform",
"perform": {
"title": "list_directory_contents",
"description": (
"Lists information and subdirectories inside a path throughout the person's workspace. "
"Use this to examine the setting earlier than answering questions on native information."
),
"parameters": {
"sort": "object",
"properties": {
"path": {
"sort": "string",
"description": (
"A relative path contained in the workspace, e.g. '.', 'knowledge', or 'src/utils'. "
"Defaults to the workspace root."
)
}
},
"required": []
}
}
}
Observe the outline does a small quantity of immediate engineering: “Use this to examine the setting earlier than answering questions on native information.” That sentence pushes Gemma 4 towards calling the instrument when the person asks a obscure query about “my information” somewhat than guessing at what could be there.
# Instrument 2: A Restricted Python Interpreter
The second instrument, execute_python_code, is the extra harmful and the extra pedagogically fascinating of the 2. The premise is that language fashions, particularly small ones, are unreliable at exact arithmetic, precise string manipulation, and something involving greater than a few steps of branching logic. A instrument that lets the mannequin write and run a deterministic snippet is a a lot better reply to these issues than asking it to purpose by them in pure language.
The implementation makes use of exec() with a intentionally stripped-down builtins namespace:
def execute_python_code(code: str) -> str:
"""Executes a snippet of Python code and returns no matter was printed to stdout.
It is a learning-only sandbox. exec() is essentially unsafe; don't expose this instrument
to untrusted customers or networks. The restrictions under cease the informal circumstances, not a
decided attacker.
"""
attempt:
# A minimal restricted setting. We strip __builtins__ right down to a small
# whitelist in order that, e.g., open(), eval(), and __import__ are usually not immediately
# out there from the snippet's international scope.
safe_builtins = {
"abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
"divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
"int": int, "len": len, "record": record, "map": map, "max": max, "min": min,
"pow": pow, "print": print, "vary": vary, "repr": repr, "reversed": reversed,
"spherical": spherical, "set": set, "sorted": sorted, "str": str, "sum": sum,
"tuple": tuple, "zip": zip,
}
# Pre-import a few secure, helpful modules so the mannequin does not need to.
import math, statistics
restricted_globals = {
"__builtins__": safe_builtins,
"math": math,
"statistics": statistics,
}
A number of selections value calling out. We exchange __builtins__ totally somewhat than blacklisting particular person capabilities, which suggests open, eval, exec, compile, __import__, enter, and anything not in our whitelist merely doesn’t exist contained in the snippet. We pre-import math and statistics into the snippet’s globals as a result of the mannequin will attain for them continuously and we’d somewhat not pressure it to struggle __import__ restrictions. We seize stdout with contextlib.redirect_stdout so the mannequin will get again precisely what its snippet printed:
# Seize stdout so we are able to hand the printed output again to the mannequin
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
exec(code, restricted_globals, {})
output = buffer.getvalue().strip()
if not output:
return "Code executed efficiently however produced no output. Use print() to return a price."
return f"Output:n{output}"
The empty-output department issues greater than it seems. Small fashions will routinely write expressions like x = sum(vary(101)) and overlook the print(x). Returning a selected error telling them to make use of print() offers the orchestration loop the choice to retry; with out it, the mannequin would synthesize a remaining reply based mostly on an empty string and confidently invent a price.
A remaining phrase on security, for the reason that script’s docstring is blunt about it: this can be a studying sandbox, not a hardened one. A decided adversary can escape of a Python exec sandbox in a dozen methods, most of them involving object introspection by ().__class__.__mro__. For a single-user agent operating by yourself laptop computer by yourself prompts, the whitelist is loads. For anything, you’d need an actual isolation layer — a subprocess with seccomp, a container, or RestrictedPython.
# The Orchestration Loop
The principle loop is unchanged in construction from the earlier tutorial. The mannequin is queried with the person immediate and the instrument registry, and if it responds with tool_calls, every name is dispatched towards TOOL_FUNCTIONS:
if "tool_calls" in message and message["tool_calls"]:
print("[TOOL EXECUTION]")
messages.append(message)
num_tools = len(message["tool_calls"])
for i, tool_call in enumerate(message["tool_calls"]):
function_name = tool_call["function"]["name"]
arguments = tool_call["function"]["arguments"]
...
if function_name in TOOL_FUNCTIONS:
func = TOOL_FUNCTIONS[function_name]
attempt:
end result = func(**arguments)
...
messages.append({
"position": "instrument",
"content material": str(end result),
"title": function_name
})
The CLI formatting is value a small tweak for this script. The execute_python_code instrument’s code argument is usually a multi-line string with newlines in it, which is able to wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the show solely; the mannequin nonetheless receives the complete string when the perform runs:
def _short(v):
if isinstance(v, str):
flat = v.exchange("n", "n")
if len(flat) > 60:
flat = flat[:57] + "..."
return f"'{flat}'"
return str(v)
args_str = ", ".be part of(f"{okay}={_short(v)}" for okay, v in arguments.objects())
As soon as every instrument result’s appended again into the message historical past as a "position": "instrument" entry, we re-call Ollama with the enriched payload and the mannequin produces its grounded remaining reply. Identical two-pass sample, identical logic.
# Testing the Instruments
And now we check our instrument calling. Pull gemma4:e2b with ollama pull gemma4:e2b if in case you have not already, then run the script from a folder you don’t thoughts the mannequin peeking at.
Let’s begin with the filesystem instrument. From the venture listing:
What scripts are in my present folder, and which one seems prefer it ought to be used to course of CSVs?
End result:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What scripts are in my present folder, and which one seems prefer it ought to be used to course of CSVs?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: list_directory_contents
├─ Args: path="."
└─ End result: Contents of '.' (5 merchandise(s)):
[FILE] README.md (412 bytes)
[FILE] csv_cleaner.py (1834 bytes)
[FILE] most important.py (10786 bytes)
[FILE] notes.txt (88 bytes)
[FILE] sales_report.py (2210 bytes)
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
Your present folder incorporates 5 information. The one that appears supposed for CSV
processing is csv_cleaner.py — its title strongly suggests it handles CSV enter.
sales_report.py may contact CSV knowledge, however its title is extra about output than
ingestion.
The mannequin referred to as the instrument, appeared on the precise filenames, and made an affordable inference grounded within the itemizing somewhat than in its weights. That’s the distinction between hallucination and remark.
Subsequent, the Python interpreter. A small activity that small fashions reliably get flawed if requested to do it of their head:
What’s the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
End result:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What's the customary deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: execute_python_code
├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
└─ End result: Output:
11.4659
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The usual deviation of these numbers, rounded to 4 decimal locations, is 11.4659.
The mannequin offloaded the calculation totally; it wrote a snippet, referred to as statistics.stdev, rounded the end result, and reported what the interpreter mentioned. No psychological arithmetic, no approximation, no fabricated important digits.
Lastly, the extra fascinating case: a immediate that requires each instruments in sequence. The mannequin has to examine the folder and compute one thing about what it finds:
Take a look at the information within the present folder and inform me the full dimension in kilobytes, rounded to 2 decimal locations.
Output:
[SYSTEM]
○ Instrument: execute_python_code......................[LOADED]
○ Instrument: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
Take a look at the information within the present folder and inform me the full dimension in kilobytes, rounded to 2 decimal locations.
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
┌── Calling: list_directory_contents
│ ├─ Args: path="."
│ └─ End result: Contents of '.' (5 merchandise(s)):
│ [FILE] README.md (412 bytes)
│ [FILE] csv_cleaner.py (1834 bytes)
│ [FILE] most important.py (10786 bytes)
│ [FILE] notes.txt (88 bytes)
│ [FILE] sales_report.py (2210 bytes)
│
└── Calling: execute_python_code
├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(spherical(sum(siz..."
└─ End result: Output:
15.33
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The 5 information within the present folder whole 15.33 KB.
Two instruments, in the precise order, with the output of 1 feeding the argument of the opposite — produced by a 2-billion-parameter mannequin operating on a laptop computer with no GPU. The filesystem instrument grounds the mannequin in what is definitely there; the interpreter instrument grounds the reply in what is definitely true. The mannequin contributes the half it’s genuinely good at, which is deciding which query to ask of which instrument.
It’s value poking on the security guards too, simply to substantiate they maintain. Asking the mannequin “record the contents of /and so on” produces the anticipated denial message within the instrument end result, which the mannequin then experiences again gracefully somewhat than fabricating a listing itemizing. Asking it to run open('/and so on/passwd').learn() contained in the interpreter produces a NameError, since open just isn’t within the whitelisted builtins. Each failures degrade into helpful error strings as an alternative of silent compromises, which is strictly what you need at this layer.
# Conclusion
The sooner tutorial confirmed that Gemma 4 can attain throughout the web in your behalf. This one exhibits it will probably attain into the machine you’re sitting at, fastidiously, when you’ve gotten constructed the carefulness in. Upon getting a working tool-calling loop, the fascinating query stops being “can the mannequin name a perform” and begins being “what ought to I let it contact.”
A filesystem-aware instrument and a code-execution instrument collectively get you many of the method to one thing that genuinely earns the time period agent: it will probably observe its setting, resolve what calculation issues, and run that calculation deterministically somewhat than guessing. The sample generalizes from there. Database queries, shell instructions, git operations, doc parsing; every one in every of these is identical JSON schema, the identical dispatch desk, the identical two-pass synthesis, with no matter security perimeter is acceptable for the blast radius of the underlying name.
Construct the perimeter first. Then hand the mannequin the keys to no matter sits inside it.
Matthew Mayo (@mattmayo13) holds a grasp’s diploma in pc science and a graduate diploma in knowledge mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Studying Mastery, Matthew goals to make advanced knowledge science ideas accessible. His skilled pursuits embody pure language processing, language fashions, machine studying algorithms, and exploring rising AI. He’s pushed by a mission to democratize information within the knowledge science group. Matthew has been coding since he was 6 years outdated.
