# Introducing MCP
Every developer building with local AI eventually hits the same wall. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.
The Model Context Protocol (MCP) was designed to solve exactly this. It is an open standard introduced by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.
Qwen3.6-35B-A3B is among the most capable local models for this kind of work right now. It has a 262,144-token context window, a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass (which is why it runs on hardware that should not be able to run a 35B model), and was explicitly trained and evaluated on MCP-based agentic tasks.
This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The model and the MCP servers all run on your hardware, with no cloud inference dependency.
# Understanding Qwen3.6-35B-A3B
Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.
The name encodes the key fact: 35B total parameters, with A3B meaning 3B activated per forward pass. It is an MoE model with 256 experts per layer, routing each token to 8 experts plus 1 shared expert. You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model, although all 35B parameters still have to fit in memory. That trade-off is why it runs on hardware that would collapse under a dense 35B model.
The hidden architecture is where Qwen3.6 diverges most from other MoE models. The 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.
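The 3:1 interleaving can be made concrete with a little arithmetic. The sketch below assumes the pattern is three Gated DeltaNet layers followed by one Gated Attention layer, repeated across the 40-layer stack; the exact ordering inside each block is an assumption for illustration, and `layer_layout` and its labels are hypothetical names, not part of any Qwen library.

```python
def layer_layout(num_layers: int = 40, linear_per_block: int = 3) -> list[str]:
    """Return one label per layer for a (linear_per_block : 1) hybrid stack."""
    block_size = linear_per_block + 1
    if num_layers % block_size != 0:
        raise ValueError("num_layers must be a multiple of the block size")
    layout = []
    for index in range(num_layers):
        # Assumed ordering: the last layer of every block is full attention.
        is_attention = (index % block_size) == block_size - 1
        layout.append("gated_attention" if is_attention else "gated_deltanet")
    return layout


layout = layer_layout()
print(layout.count("gated_deltanet"), layout.count("gated_attention"))
```

Under that assumption, 30 of the 40 layers use linear attention and only 10 pay the quadratic cost, which is where the long-context efficiency comes from.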
The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Many 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
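To see why headroom matters, it helps to do the budget arithmetic. The sketch below estimates how many agent turns fit in a context window. The per-turn token figures are illustrative assumptions, not measurements, and `turns_that_fit` is a hypothetical helper.

```python
def turns_that_fit(context_window: int, system_tokens: int,
                   tokens_per_turn: int, reserve_for_output: int = 4096) -> int:
    """Estimate how many full turns fit before the history overflows the window."""
    usable = context_window - system_tokens - reserve_for_output
    if usable <= 0 or tokens_per_turn <= 0:
        return 0
    return usable // tokens_per_turn


# Assumed workload: each turn adds about 6,000 tokens
# (reasoning trace + tool call + tool result).
for window in (32_768, 262_144):
    print(window, turns_that_fit(window, system_tokens=1_000, tokens_per_turn=6_000))
```

Under these assumptions a 32k window holds about 4 turns, while the native 262,144-token window holds about 42.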
Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:
- Agentic Coding. Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.
- Thinking Preservation. A `preserve_thinking` flag that retains reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, `preserve_thinking=True` keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.
# System Requirements
There are three realistic deployment paths, and which one you use depends entirely on your hardware.
- GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires roughly 70 GB of VRAM. With Q4 quantization, it fits in roughly 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
- CPU/hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to the GPU when one is available and runs the rest on the CPU. With 64 GB of system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.
- Smaller models for tutorial testing. The full MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use `Qwen/Qwen2.5-7B-Instruct` through Ollama (`ollama pull qwen2.5:7b`) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware allows.
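The VRAM figures above follow from simple weight-size arithmetic. The sketch below estimates memory for the weights only; the 4.5 bits-per-parameter figure for a Q4-style format is an assumption (quantized formats carry scale metadata on top of the 4-bit values), and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = total_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# All 35B parameters must be resident, even though only 3B are active per token.
print(weight_memory_gb(35, 16))    # bfloat16 weights
print(weight_memory_gb(35, 4.5))   # assumed ~4.5 bits/param for a Q4-style format
```

The KV cache and activations come on top of the weights, which is why the practical Q4 footprint lands in the 20–24 GB range rather than at the bare weight size.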
Software requirements:
# Python 3.11+ required
python --version
python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate   # macOS / Linux
qwen-mcp-env\Scripts\activate      # Windows
# Core packages
pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"
# Serving framework -- choose one
pip install "vllm>=0.19.0"     # NVIDIA GPU
pip install "sglang>=0.5.10"   # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"    # CPU/hybrid
# Node.js 18+ is required for pre-built MCP servers installed via npx
node --version
# Serving Qwen3.6 Locally with an OpenAI-Compatible API
Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to: the same API surface, just pointed at localhost instead of api.openai.com.
// SGLang (Recommended for Long-Context Agent Workloads)
# Install SGLang with full dependencies
pip install "sglang[all]>=0.5.10"
# Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
# --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
# --tool-call-parser qwen3_coder routes tool call outputs to the right format.
# --enable-prefix-caching is essential for agent workloads -- enables KV cache reuse
# across turns, which is what makes preserve_thinking efficient in practice.
# --tp 2 runs tensor parallel across 2 GPUs; remove it if using a single GPU.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --tp 2
// vLLM
pip install "vllm>=0.19.0"
# vLLM equivalent with the same essential flags
vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --tensor-parallel-size 2
// Smaller Model via Ollama
ollama pull qwen2.5:7b
ollama serve
# Ollama's API is OpenAI-compatible at http://localhost:11434/v1
Once the server is running, verify it before going any further:
# Health check -- should return {"status": "ok"} or similar
curl http://localhost:30000/health
# Test the chat completions endpoint with a simple query
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'
If you get a JSON response with a choices array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you encounter later is easier to debug when you know the serving layer is solid.
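The same readiness check can be scripted so it runs before the agent starts. The sketch below uses only the standard library and assumes the SGLang endpoint from this section; `looks_ready` and `check_server` are hypothetical helper names.

```python
import json
import urllib.request


def looks_ready(payload: dict) -> bool:
    """True when a chat-completions payload has a non-empty choices list."""
    choices = payload.get("choices")
    return isinstance(choices, list) and len(choices) > 0


def check_server(base_url: str, model: str, timeout: float = 30.0) -> bool:
    """POST a tiny request to /v1/chat/completions and validate the response shape."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return looks_ready(json.load(response))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body
        return False


# Example (requires the server from this section to be running):
# check_server("http://localhost:30000", "Qwen/Qwen3.6-35B-A3B")
```

Returning a plain boolean keeps the check easy to call from a startup script that should refuse to launch the agent when the serving layer is down.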
# Understanding MCP and Why It Changes the Agent Architecture
Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a class of bugs that come from treating MCP as just a fancier function-calling API.
MCP is a JSON-RPC 2.0 protocol running over stdio or HTTP transport. When an MCP client connects to a server, it first completes an initialization handshake and then calls `tools/list` to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.
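On the wire, discovery is an ordinary JSON-RPC exchange. The sketch below shows the shape of a `tools/list` request and response, and how a client might convert the result into OpenAI-style tool definitions for the model. The `read_file` tool and the `to_openai_tools` helper are illustrative assumptions, not taken from a specific server.

```python
import json

# JSON-RPC 2.0 request a client sends to discover tools
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Hypothetical response from a server exposing a single tool
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "read_file",
            "description": "Read a UTF-8 text file and return its contents.",
            "inputSchema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }]
    },
}


def to_openai_tools(response: dict) -> list[dict]:
    """Convert an MCP tools/list result into OpenAI-style tool definitions."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool.get("description", ""),
                "parameters": tool["inputSchema"],
            },
        }
        for tool in response["result"]["tools"]
    ]


print(json.dumps(to_openai_tools(list_response), indent=2))
```

The conversion is almost a rename: the MCP `inputSchema` is already JSON Schema, so it can be passed through as the `parameters` field unchanged.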
When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, actually executes the call by sending a `tools/call` request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a tool role message. The model reads the result and decides the next step.
This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
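The division of labor can be sketched end to end without a real server. In the example below, `fake_server` stands in for an MCP server with one hypothetical tool named `add`; the helper names are assumptions for illustration, and the message shapes follow the JSON-RPC `tools/call` pattern described above.

```python
def build_call_request(request_id: int, name: str, arguments: dict) -> dict:
    """Client side: wrap the model's tool call in a JSON-RPC tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def fake_server(request: dict) -> dict:
    """Stand-in for a real MCP server; handles one hypothetical tool, add."""
    params = request["params"]
    if params["name"] != "add":
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32602, "message": f"Unknown tool: {params['name']}"}}
    total = params["arguments"]["a"] + params["arguments"]["b"]
    return {"jsonrpc": "2.0", "id": request["id"],
            "result": {"content": [{"type": "text", "text": str(total)}], "isError": False}}


def to_tool_message(call_id: str, response: dict) -> dict:
    """Client side: turn the server's reply into a tool-role chat message."""
    if "error" in response:
        text = f"Error: {response['error']['message']}"
    else:
        text = "\n".join(part["text"] for part in response["result"]["content"]
                         if part.get("type") == "text")
    return {"role": "tool", "tool_call_id": call_id, "content": text}


# The model emitted this structured call; the client does the rest.
model_call = {"id": "call_1", "name": "add", "arguments": {"a": 2, "b": 3}}
request = build_call_request(1, model_call["name"], model_call["arguments"])
message = to_tool_message(model_call["id"], fake_server(request))
print(message)
```

Nothing in the model-facing half knows how `add` is implemented, which is the point: swapping the server changes the work being done, not the client loop.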
There are two ways to use MCP with Qwen3.6:
- Via Qwen-Agent: the official `qwen_agent` library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
- Via the MCP Python SDK directly: you handle the agentic loop yourself using `mcp.ClientSession`. More code, full visibility into every message, full control over error handling and retry logic. Right for production systems where you need to monitor every step.
This article covers both, starting with Qwen-Agent.
# Building the Local GitHub Developer Assistant
The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.
// Part 1: Environment and MCP Server Setup
# Set your GitHub personal access token
# Required by the GitHub MCP server for API calls
export GITHUB_TOKEN=ghp_your_token_here
# Pre-built MCP servers run via npx -- no separate install step
# npx downloads them on first use when the agent starts the servers
# Verify npx is available:
npx --version
Create a project directory:
mkdir qwen-github-agent
cd qwen-github-agent
// Part 2: Qwen-Agent Implementation
This is the fastest path to a working agent. Qwen-Agent handles the full loop automatically.
# github_agent_qwenagent.py
# Prerequisites: pip install qwen-agent openai
# npm / npx must be installed for the MCP servers
# GITHUB_TOKEN env var must be set
# Local serving endpoint must be running (see previous section)
#
# How to run:
# python github_agent_qwenagent.py
import os
import re

from qwen_agent.agents import Assistant

# ── Server configuration ──────────────────────────────────────────────────────
# Point at your local serving endpoint.
# Change the base_url to match whichever server you started:
# SGLang: http://localhost:30000/v1
# vLLM: http://localhost:8000/v1
# Ollama: http://localhost:11434/v1
LLM_CONFIG = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key": "EMPTY",  # Local servers don't require a real key
    # Thinking mode sampling params (from the official model card best practices)
    "generate_cfg": {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "thought_in_history": True,  # This is the preserve_thinking flag in Qwen-Agent
    },
}

# ── MCP server configuration ──────────────────────────────────────────────────
# Each server key names the server; the value is the stdio launch command.
# Qwen-Agent starts each server as a subprocess and manages the MCP sessions.
MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                # Grant the agent access to the current working directory
                # In production, restrict to the specific repository path
                ".",
            ],
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {
                # A Python dict does not expand "${GITHUB_TOKEN}", so read the
                # value from the environment. The GitHub MCP server reads the
                # token from GITHUB_PERSONAL_ACCESS_TOKEN.
                "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
            },
        },
    }
}

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository
via MCP tools.
When given a repository and task:
1. List open issues to understand what needs fixing
2. Use filesystem tools to read relevant source files and tests
3. Identify the root cause based on the code and the issue description
4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
5. Create a pull request with a clear title and description referencing the issue
Always explain your reasoning at each step. Think through edge cases before writing code.
If you are unsure about a file's purpose, read it before modifying it."""

# ── Agent setup ───────────────────────────────────────────────────────────────
agent = Assistant(
    llm=LLM_CONFIG,
    name="GitHub Developer Assistant",
    description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
    system_message=SYSTEM_PROMPT,
    # Qwen-Agent takes MCP server configs as entries in function_list
    function_list=[MCP_SERVERS],
)

# ── Run the agent ─────────────────────────────────────────────────────────────
def run_agent(task: str):
    """
    Run the agent on a task description and stream the output.
    The agent will make tool calls automatically; Qwen-Agent handles
    the full loop including tool execution and result injection.
    """
    messages = [{"role": "user", "content": task}]
    print(f"Task: {task}\n{'─' * 70}")
    # Qwen-Agent's run() is a generator that yields intermediate steps
    # Each yielded message shows a tool call, a tool result, or the final answer
    for response in agent.run(messages=messages):
        # response is a list of messages representing the conversation so far
        # The last message contains the most recent output
        last = response[-1]
        role = last.get("role", "")
        content = last.get("content", "")
        if role == "assistant" and content:
            # Strip and display the thinking block separately for readability
            thinking = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
            if thinking:
                print(f"[thinking] {thinking.group(1).strip()[:200]}...")
            clean = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
            if clean:
                print(f"[agent] {clean}")
        elif role in ("tool", "function"):
            # Qwen-Agent labels tool results with the "function" role
            tool_name = last.get("name", "unknown_tool")
            print(f"[tool:{tool_name}] result received")

if __name__ == "__main__":
    run_agent(
        "In the repository myorg/my-api-project, find the open issue about "
        "the login endpoint returning 200 for invalid tokens. Read the relevant "
        "code and tests, fix the bug, and open a pull request."
    )
How to run:
python github_agent_qwenagent.py
// Part 3: Raw MCP SDK Implementation
This path is for teams that need full control over every protocol message: custom error handling, per-tool retry logic, and audit logging of every tool call and result.
# github_agent_raw.py
# Prerequisites: pip install mcp openai httpx
# GITHUB_TOKEN env var must be set, local server must be running
#
# How to run:
# python github_agent_raw.py
import asyncio
import json
import os
import re

from openai import AsyncOpenAI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# ── Local serving client ──────────────────────────────────────────────────────
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
MODEL = "Qwen/Qwen3.6-35B-A3B"

# ── Response processing ───────────────────────────────────────────────────────
def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks. Used when we only need the action."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def extract_thinking(text: str) -> str:
    """Extract the content of the thinking block for logging."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def process_response(response, preserve_thinking: bool = True) -> dict:
    """
    Process a chat completion response from Qwen3.6.
    Handles two output formats:
    1. Tool call via the API's tool_calls field (when --tool-call-parser is active)
    2. Tool call embedded in the message content as JSON
    Args:
        response: The OpenAI-compatible completion response
        preserve_thinking: If True, keep thinking content in output for
            the next turn's KV cache benefit
    Returns:
        dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
    """
    message = response.choices[0].message
    content = message.content or ""
    # With --reasoning-parser active, the server may return the trace in a
    # separate reasoning_content field instead of inline <think> tags
    thinking = extract_thinking(content) or (getattr(message, "reasoning_content", None) or "")
    # Path 1: Tool calls in the structured field (preferred -- requires tool-call-parser flag)
    if message.tool_calls:
        tool_calls = [
            {
                "name": tc.function.name,
                "arguments": json.loads(tc.function.arguments),
                "call_id": tc.id,
            }
            for tc in message.tool_calls
        ]
        return {
            "thinking": thinking if preserve_thinking else "",
            "tool_calls": tool_calls,
            "final_answer": "",
            "has_tool_calls": True,
            "is_terminal": False,
        }
    # Path 2: Tool calls embedded in content text (fallback)
    tag_matches = re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL)
    tool_calls = []
    for i, m in enumerate(tag_matches):
        try:
            call = json.loads(m.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict):
            # Embedded calls carry no ID, so assign a synthetic one
            call.setdefault("call_id", f"call_{i}")
            tool_calls.append(call)
    final_answer = strip_thinking(content)
    final_answer = re.sub(r"<tool_call>.*?</tool_call>", "", final_answer, flags=re.DOTALL).strip()
    return {
        "thinking": thinking if preserve_thinking else "",
        "tool_calls": tool_calls,
        "final_answer": final_answer,
        "has_tool_calls": len(tool_calls) > 0,
        "is_terminal": len(tool_calls) == 0 and bool(final_answer),
    }

# ── Core agent loop ───────────────────────────────────────────────────────────
async def run_github_agent(task: str, repo: str, max_turns: int = 20):
    """
    Run the GitHub developer assistant agent.
    Connects to filesystem and GitHub MCP servers, discovers their tools,
    and runs the Qwen3.6 agent loop until the task is complete or max_turns reached.
    """
    # Start both MCP servers and establish sessions
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "."],
    )
    gh_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
        # The GitHub MCP server reads the token from GITHUB_PERSONAL_ACCESS_TOKEN
        env={**os.environ,
             "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", "")},
    )
    async with (
        stdio_client(fs_params) as (fs_read, fs_write),
        ClientSession(fs_read, fs_write) as fs_session,
        stdio_client(gh_params) as (gh_read, gh_write),
        ClientSession(gh_read, gh_write) as gh_session,
    ):
        # Initialize both sessions
        await fs_session.initialize()
        await gh_session.initialize()
        # Discover all available tools from both servers
        fs_tools_result = await fs_session.list_tools()
        gh_tools_result = await gh_session.list_tools()
        # Build the OpenAI-format tool list for the model
        all_tools = []
        tool_to_session = {}  # Maps tool name to the MCP session that owns it
        for session, listing in ((fs_session, fs_tools_result), (gh_session, gh_tools_result)):
            for tool in listing.tools:
                all_tools.append({
                    "type": "function",
                    "function": {
                        "name": tool.name,
                        "description": tool.description,
                        "parameters": tool.inputSchema,
                    },
                })
                tool_to_session[tool.name] = session
        print(f"Tools available: {len(all_tools)} ({len(fs_tools_result.tools)} filesystem, "
              f"{len(gh_tools_result.tools)} GitHub)")
        # Build conversation history
        system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
Use the available tools to investigate issues, read code, write fixes, and create pull requests.
Think step by step. Read before you modify. Minimal changes only."""
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ]
        # ── Agent loop ─────────────────────────────────────────────────────────
        for turn in range(max_turns):
            print(f"\n[Turn {turn + 1}]")
            # Call the model
            response = await client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=all_tools,
                tool_choice="auto",
                # Thinking mode sampling params from the official best practices
                temperature=0.6,
                top_p=0.95,
                max_tokens=4096,
                extra_body={
                    # top_k and min_p are not part of the OpenAI SDK signature,
                    # so they travel in extra_body
                    "top_k": 20,
                    "min_p": 0.0,
                    # preserve_thinking keeps reasoning context across turns
                    # for KV cache efficiency on long agent sessions
                    "preserve_thinking": True,
                },
            )
            result = process_response(response, preserve_thinking=True)
            if result["thinking"]:
                print(f"[thinking] {result['thinking'][:200]}...")
            # Terminal state -- agent has produced a final answer
            if result["is_terminal"]:
                print(f"\n[DONE]\n{result['final_answer']}")
                return result["final_answer"]
            # Tool call state -- execute each tool and inject results
            if result["has_tool_calls"]:
                # Append the assistant's message with tool calls to history,
                # as plain dicts so both parsing paths serialize the same way
                messages.append({
                    "role": "assistant",
                    "content": response.choices[0].message.content or "",
                    "tool_calls": [
                        {
                            "id": c["call_id"],
                            "type": "function",
                            "function": {
                                "name": c["name"],
                                "arguments": json.dumps(c.get("arguments", {})),
                            },
                        }
                        for c in result["tool_calls"]
                    ],
                })
                for call in result["tool_calls"]:
                    tool_name = call["name"]
                    tool_args = call.get("arguments", {})
                    call_id = call.get("call_id", "")
                    print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")
                    session = tool_to_session.get(tool_name)
                    if not session:
                        result_content = f"Error: tool '{tool_name}' not found"
                    else:
                        try:
                            tool_result = await session.call_tool(tool_name, tool_args)
                            # Join text parts; fall back to str() for non-text content
                            result_content = "\n".join(
                                getattr(part, "text", str(part)) for part in tool_result.content
                            )
                            # Truncate very long results to protect context budget
                            if len(result_content) > 12000:
                                result_content = result_content[:12000] + "\n...[truncated]"
                        except Exception as e:
                            result_content = f"Error: {e}"
                    print(f"[result] {result_content[:150]}...")
                    messages.append({
                        "role": "tool",
                        "content": result_content,
                        "tool_call_id": call_id,
                        "name": tool_name,
                    })
        print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")

# ── Entry point ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    asyncio.run(run_github_agent(
        task=(
            "Find the open issue about the login endpoint returning 200 for invalid tokens. "
            "Read src/auth.py and tests/test_auth.py to understand the bug. "
            "Fix the verify_token function and open a pull request with your changes."
        ),
        repo="myorg/my-api-project",
    ))
How to run:
python github_agent_raw.py
The raw SDK path gives you what Qwen-Agent abstracts away: you can see every tool call, every result, and every message injected into the conversation history. The `tool_to_session` routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.
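One caveat with a plain name-to-session dict: if two servers expose a tool with the same name, the later entry silently replaces the earlier one. The sketch below shows one possible way to handle that by prefixing only the colliding names with the server name. The `build_routing_table` helper, the `__` separator, and the tool listings are all hypothetical, and server labels stand in for live sessions.

```python
def build_routing_table(server_tools: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Map an exposed tool name to (server, original tool name).

    Names unique across servers are kept as-is; names exposed by more than
    one server are prefixed with the server name to disambiguate them.
    """
    owners: dict[str, list[str]] = {}
    for server, tools in server_tools.items():
        for tool in tools:
            owners.setdefault(tool, []).append(server)

    table: dict[str, tuple[str, str]] = {}
    for tool, servers in owners.items():
        if len(servers) == 1:
            table[tool] = (servers[0], tool)
        else:
            for server in servers:
                table[f"{server}__{tool}"] = (server, tool)
    return table


# Hypothetical tool listings; both servers expose a tool called "search".
table = build_routing_table({
    "filesystem": ["read_file", "search"],
    "github": ["create_pull_request", "search"],
})
print(table["read_file"])
print(table["github__search"])
```

The exposed name is what goes into the tool list shown to the model; the stored original name is what the client would pass to the owning session when executing the call.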
# Writing a Custom MCP Server
Pre-built MCP servers handle the filesystem and GitHub. When you need something that does not exist yet (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a full `code_quality` server that exposes ruff and pytest as MCP tools.
# code_quality_server.py
# A custom MCP server exposing code quality tools to Qwen3.6.
#
# Prerequisites:
# pip install mcp ruff pytest pytest-json-report
#
# How to run standalone (for testing):
# python code_quality_server.py
#
# To add to the Qwen-Agent config:
# "code_quality": {
#     "command": "python",
#     "args": ["/absolute/path/to/code_quality_server.py"]
# }
import json
import os
import subprocess
import sys
import tempfile

from mcp.server.fastmcp import FastMCP

# FastMCP is a high-level MCP server framework -- reduces boilerplate significantly
mcp = FastMCP("code_quality")

@mcp.tool()
def run_linter(file_path: str, fix: bool = False) -> str:
    """
    Run ruff linter on a Python file and return structured lint results.
    Use this before modifying a file to understand its current quality state,
    and after making changes to verify the fix did not introduce new issues.
    Args:
        file_path: Absolute or relative path to the Python file to lint.
        fix: If true, automatically fix safe issues in place.
    Returns:
        JSON string with issues list, issue count, and whether auto-fix ran.
    """
    # sys.executable runs ruff with the same interpreter as this server
    cmd = [sys.executable, "-m", "ruff", "check", file_path, "--output-format=json"]
    if fix:
        cmd.append("--fix")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        # ruff returns exit code 1 when issues are found -- not an error
        output = result.stdout or result.stderr
        # Parse ruff's JSON output
        try:
            issues = json.loads(output) if output.strip() else []
        except json.JSONDecodeError:
            issues = []
        formatted = [
            {
                "line": issue.get("location", {}).get("row", 0),
                "col": issue.get("location", {}).get("column", 0),
                "code": issue.get("code", ""),
                "message": issue.get("message", ""),
                "fix_available": issue.get("fix") is not None,
            }
            for issue in issues
            if isinstance(issue, dict)
        ]
        return json.dumps({
            "file": file_path,
            "issues": formatted,
            "total_issues": len(formatted),
            "fixed": "auto-fix applied" if fix else "no auto-fix",
        }, indent=2)
    except subprocess.TimeoutExpired:
        return json.dumps({"error": "Linter timed out after 30s", "file": file_path})
    except FileNotFoundError:
        return json.dumps({"error": "ruff not found -- install with: pip install ruff"})

@mcp.tool()
def run_tests(target: str, verbose: bool = False) -> str:
    """
    Run pytest on a module or directory and return structured pass/fail results.
    Use this after writing a fix to verify the fix makes failing tests pass
    without breaking other tests.
    Args:
        target: Path to the test file or directory to run (e.g. tests/, tests/test_auth.py)
        verbose: If true, include full pytest output in the result.
    Returns:
        JSON string with pass count, fail count, failure details, and duration.
    """
    # pytest-json-report writes its report to a file, so point it at a temp path
    with tempfile.TemporaryDirectory() as tmp_dir:
        report_path = os.path.join(tmp_dir, "report.json")
        cmd = [sys.executable, "-m", "pytest", target, "--json-report",
               f"--json-report-file={report_path}", "-q"]
        if verbose:
            cmd.append("-v")
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
            # Parse the pytest-json-report output if available
            try:
                with open(report_path, encoding="utf-8") as f:
                    report = json.load(f)
            except (OSError, json.JSONDecodeError):
                # Fallback: return raw output if the JSON report is not available
                return json.dumps({
                    "target": target,
                    "stdout": result.stdout[:3000],
                    "stderr": result.stderr[:1000],
                    "exit_code": result.returncode,
                })
            summary = report.get("summary", {})
            failures = [
                {
                    "test": t["nodeid"],
                    "message": str(t.get("call", {}).get("longrepr", ""))[:500],
                }
                for t in report.get("tests", [])
                if t.get("outcome") == "failed"
            ]
            return json.dumps({
                "target": target,
                "passed": summary.get("passed", 0),
                "failed": summary.get("failed", 0),
                "errors": summary.get("error", 0),
                "total": summary.get("total", 0),
                # The report stores duration at the top level, not in the summary
                "duration": report.get("duration", 0),
                "failures": failures,
                "stdout": result.stdout[:2000] if verbose else "",
            }, indent=2)
        except subprocess.TimeoutExpired:
            return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})
        except FileNotFoundError:
            return json.dumps({"error": "pytest not found -- install with: pip install pytest"})

if __name__ == "__main__":
    mcp.run(transport="stdio")
Add it to either agent implementation's server config:
# In the Qwen-Agent MCP_SERVERS dict:
"code_quality": {
    "command": "python",
    "args": ["/absolute/path/to/code_quality_server.py"]
}
# In the raw SDK, add a third StdioServerParameters:
cq_params = StdioServerParameters(
    command="python",
    args=["/absolute/path/to/code_quality_server.py"],
)
Test the server standalone before connecting the agent:
# Test the server in MCP inspector mode
npx @modelcontextprotocol/inspector python code_quality_server.py
# Opens a browser UI where you can call run_linter and run_tests directly
# Tuning Thinking Mode and Preserving Reasoning
The thinking mode setting affects latency enough that it is worth treating as an explicit architecture choice, not an afterthought.
In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside <think> tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. Those tokens take time to generate and consume context budget.
When that cost is worth paying:
- Planning steps where the agent decides what to do next.
- Debugging sessions where the problem is genuinely ambiguous.
- Multi-file refactoring where the agent needs to reason about side effects across files.
In these cases, the reasoning trace catches errors before they become tool calls with wrong arguments.
When it is not worth paying: mechanical tool-call loops where each step is unambiguous, such as list directory → read file → write file → commit. The model does not need to think hard about these steps. Non-thinking mode is faster and produces the same quality of output.
Switch modes per request, not globally:
# Thinking mode (planning, debugging, complex multi-file tasks)
THINKING_PARAMS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}
# Non-thinking mode (mechanical loops, quick status checks)
# Pass enable_thinking=False in the chat template, or use the system prompt:
# Add "/no_think" to the system prompt to suppress thinking mode.
NON_THINKING_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
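A per-request switch might look like the sketch below. It picks a parameter set from a step label and builds keyword arguments for an OpenAI-compatible `chat.completions.create` call. The step labels and the `request_kwargs` helper are hypothetical, and passing `enable_thinking` through `chat_template_kwargs` in `extra_body` is an assumption about how the serving framework forwards template options; check your server's documentation before relying on it.

```python
THINKING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}

# Hypothetical step labels an agent loop might attach to each request
REASONING_STEPS = {"plan", "debug", "refactor"}


def request_kwargs(step_kind: str) -> dict:
    """Build keyword arguments for chat.completions.create for one step."""
    thinking = step_kind in REASONING_STEPS
    params = THINKING_PARAMS if thinking else NON_THINKING_PARAMS
    return {
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Fields outside the OpenAI SDK signature go through extra_body
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            # Assumption: the server forwards this to the chat template
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


print(request_kwargs("plan")["extra_body"]["chat_template_kwargs"])
print(request_kwargs("read_file")["temperature"])
```

The returned dict can be unpacked into the request alongside `model`, `messages`, and `tools`, so the agent loop only has to decide which kind of step it is about to run.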
The `preserve_thinking` flag, the Qwen3.6-specific capability that retains reasoning context across turns, directly affects inference efficiency when prefix caching is active. Here is why it matters in practice: in a 10-turn agent session, each turn shares a prefix of the conversation history with the turn before it. When `preserve_thinking=True`, the full reasoning trace from prior turns stays in the history, so that prefix is unchanged from one turn to the next. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it. The effective tokens-per-second rate for long sessions is higher than without it, particularly when serving infrastructure such as SGLang with `--enable-prefix-caching` is running.
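The effect on cache reuse can be illustrated with a toy model. The sketch below treats the serialized history as a list of chunks and counts how many leading chunks are identical between two consecutive requests, which is a simplification of how a prefix cache decides what it can reuse. The chunk labels and the `shared_prefix_length` helper are illustrative assumptions.

```python
def shared_prefix_length(previous: list[str], current: list[str]) -> int:
    """Count leading items that are identical in both sequences."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Toy "tokens": each string stands in for a chunk of the serialized history.
turn_1 = ["system", "user_task", "think_1", "tool_call_1"]

# Reasoning kept: the next request extends the previous one unchanged.
kept = turn_1 + ["tool_result_1", "think_2", "tool_call_2"]

# Reasoning stripped when the history is rebuilt: the prefix diverges early.
stripped = ["system", "user_task", "tool_call_1", "tool_result_1", "think_2", "tool_call_2"]

print(shared_prefix_length(turn_1, kept))
print(shared_prefix_length(turn_1, stripped))
```

In this toy case the kept history reuses all four earlier chunks, while the stripped history reuses only two, because everything after the first removed reasoning block no longer lines up with what the cache holds.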
The practical rule: use `preserve_thinking=True` for agent sessions that will run for more than five turns. Use `preserve_thinking=False` (or non-thinking mode) for single-turn queries and short pipelines where the overhead is wasted.
# Conclusion
Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just how to call them.
MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the MCP ecosystem. The custom `code_quality` server shows the pattern for anything that does not exist yet.
The GitHub developer assistant in this article is one application of the pattern. The same architecture (local model, MCP tools, and agentic loop) works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The MCP ecosystem is growing fast. The local model capability is already there.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.
