Sunday, June 28, 2026

Claude Code on Ollama: How to Run a Local Coding Agent Without Burning API Credits


How to Run Claude Code on Ollama

  1. Install Ollama via Homebrew or the official install script and start the server.
  2. Pull a coding model such as qwen2.5-coder:14b with ollama pull.
  3. Confirm Ollama's OpenAI-compatible endpoint responds at localhost:11434/v1.
  4. Install Claude Code globally via npm install -g @anthropic-ai/claude-code.
  5. Unset any existing ANTHROPIC_API_KEY to prevent unintended API billing.
  6. Export environment variables pointing Claude Code to the local Ollama endpoint.
  7. Launch Claude Code in your project directory and confirm the local model name appears.
  8. Verify local routing by checking for active connections to port 11434 during a session.



Why Your Claude Code API Bill Is a Problem

Running Claude Code against Anthropic's API gets expensive fast. Developers on the Anthropic subreddit and various forums have reported spending between $100 and $200 in a single day of heavy agentic coding sessions. One widely cited, self-reported community account described burning through $175 in just four hours while refactoring a medium-sized codebase (results will vary considerably by task type and codebase size). Even conservative usage patterns, involving periodic prompts for code reviews, test generation, and debugging, can easily generate monthly bills exceeding $500 according to similar anecdotal reports. The token-intensive nature of agentic workflows, where Claude Code reads entire files, reasons across multiple steps, and writes back changes, compounds the cost far beyond what a single chat-style API call would.

Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. The model runs on hardware already sitting on the desk.

This tutorial walks through the complete setup, from installing Ollama and pulling an appropriate coding model, to configuring Claude Code's environment variables, to running real coding tasks against a React and Node.js project. The target reader is a developer with intermediate familiarity with CLI tools, Node.js, and local development environments.

Claude Code version compatibility: Claude Code is under rapid development and its configuration interface, including supported environment variables, may change between releases. This guide documents one approach to local model routing via OpenAI-compatible endpoints. After installation, run claude --version and consult Anthropic's current documentation or claude --help to confirm the exact environment variable names supported by your installed version. If variable names have changed, adapt the instructions accordingly.

What Is Claude Code and Why Go Local?

Claude Code in 60 Seconds

Claude Code is Anthropic's agentic command-line coding tool. Unlike GitHub Copilot, which operates primarily as an inline autocomplete engine, or Cursor, which embeds AI inside a custom IDE fork, Claude Code functions as a standalone CLI agent. It reads project files, reasons about codebases, writes and edits code across multiple files, runs shell commands, and iterates on its own output. Its default operating mode requires an Anthropic API key, routing all requests to Claude Sonnet 4 or Claude Opus, with costs determined by token consumption. A typical multi-step agentic task can consume tens of thousands of tokens per interaction.

The Case for Local Models

Running Claude Code against a local model solves three problems. Privacy and data sovereignty come first: source code never leaves the developer's machine, which matters for proprietary codebases and organizations with strict data handling policies. You also eliminate per-query costs after the one-time hardware investment. And the setup works without an internet connection, so you keep working when connectivity drops.

The trade-offs deserve honest acknowledgment. Local models, even the best open-weight coding models in the 7B to 16B parameter range, do not match Claude Sonnet 4 or Opus in complex multi-file reasoning, nuanced architectural decisions, or large-context understanding. For simple tasks like boilerplate generation, refactoring, and test scaffolding, local models often produce usable output on the first attempt for single-file edits. For tasks requiring deep contextual reasoning across thousands of lines, the quality gap remains significant.

Understanding the Architecture: Claude Code + Ollama + OpenAI-Compatible APIs

How the Pieces Fit Together

Claude Code supports third-party model providers through OpenAI-compatible API endpoints. This is the mechanism that makes local usage possible. Ollama, a local model server, exposes exactly such an endpoint at localhost:11434/v1. When you configure the right environment variables, Claude Code sends its requests to this local endpoint instead of Anthropic's servers.

The request flow is straightforward:

Claude Code CLI  →  http://localhost:11434/v1/chat/completions  →  Ollama Server  →  Local LLM (e.g., qwen2.5-coder:14b)
     [prompt]           [OpenAI-compatible API]                      [inference]         [response]

Claude Code constructs its prompts and tool-use payloads in the OpenAI chat completions format. Ollama receives these, runs inference on the specified local model, and returns the completion. From Claude Code's perspective, it talks to an OpenAI-compatible provider. From the model's perspective, it handles standard chat completion requests.
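To make the request flow concrete, here is a minimal sketch of the kind of chat completions request that travels to Ollama. It is an illustration, not Claude Code's actual implementation: the helper names buildChatRequest and askLocalModel are hypothetical, and the sketch assumes Node.js 18+ (for the built-in fetch) and the placeholder key used elsewhere in this guide.

```javascript
// Base URL of Ollama's OpenAI-compatible API, as used throughout this guide.
const OLLAMA_BASE_URL = 'http://localhost:11434/v1';

// Build the URL and fetch() options for a single-turn chat completion.
// (Hypothetical helper for illustration only.)
function buildChatRequest(model, prompt) {
  return {
    url: `${OLLAMA_BASE_URL}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Ollama does not check the key; OpenAI-style clients usually send one anyway.
        Authorization: 'Bearer not-a-real-key-local-ollama-only',
      },
      body: JSON.stringify({
        model,
        stream: false,
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

// Send the request and return the assistant's text.
// Requires a running Ollama server with the model already pulled.
async function askLocalModel(model, prompt) {
  const { url, init } = buildChatRequest(model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

With Ollama running, askLocalModel('qwen2.5-coder:14b', 'Write a hello world function in JavaScript') resolves to the model's reply. Real agent traffic also carries tool definitions and multi-turn history, which this sketch leaves out.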

Prerequisites and System Requirements

Hardware Considerations

Local LLM inference is memory-bound. The RAM figures below refer to available (free) RAM, not total installed RAM. For 7B parameter models at Q4 quantization, you need at least 16GB of available RAM. Running 13B or 14B parameter models comfortably requires 32GB or more, and models with 30B+ parameters typically demand 64GB of available RAM or a GPU with substantial VRAM. Higher quantization levels (e.g., Q8) roughly double the RAM requirement compared to Q4 variants.
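The sizing guidance above can be written down as a rough rule of thumb. The sketch below is illustrative only: the function name estimateFreeRamGB is hypothetical, and the cut-offs between the quoted model sizes (for example, treating anything up to 8B like a 7B model) are assumptions rather than figures from Ollama.

```javascript
// Rough free-RAM estimate based on the figures quoted in this section.
// paramsBillions: model size in billions of parameters.
// quantization: 'Q4' (default) or 'Q8'.
function estimateFreeRamGB(paramsBillions, quantization = 'Q4') {
  let q4Ram;
  if (paramsBillions <= 8) {
    q4Ram = 16; // 7B-class models
  } else if (paramsBillions < 30) {
    q4Ram = 32; // 13B/14B-class models
  } else {
    q4Ram = 64; // 30B+ models
  }
  // Q8 roughly doubles the requirement compared to Q4.
  return quantization === 'Q8' ? q4Ram * 2 : q4Ram;
}
```

For instance, estimateFreeRamGB(14) evaluates to 32 and estimateFreeRamGB(7, 'Q8') also evaluates to 32, matching the doubling rule. Treat the result as a starting point and confirm against actual memory use on your machine.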

For GPU acceleration, Ollama supports NVIDIA GPUs via CUDA, Apple Silicon via Metal (automatic on macOS), and AMD GPUs via ROCm on Linux. Disk space requirements vary by model: expect 4GB to 10GB per quantized model file.

Software Requirements

The setup requires Node.js 18 or later (with npm), Ollama installed and running as a local server, and the Claude Code CLI installed globally via npm.

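Step 1: Install and Configure Ollama

Installing Ollama

On macOS and Linux, Ollama installs with a single command. Windows users can download the installer from the Ollama website.

# macOS (Homebrew)
brew install ollama

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Verify the installation
ollama --version

# Start the server if it is not already running as a background service
ollama serve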

On macOS, Ollama can run as a background service after a Homebrew installation (brew services start ollama). On Linux, ollama serve starts the server process. Verify it is running by checking that port 11434 is listening.

Pulling the Right Model

Not all models handle code generation equally. The following models are well-suited for coding tasks through Claude Code:

  • For the best balance of quality and resource usage, pull qwen2.5-coder:14b. It handles multi-file edits in Python, TypeScript, and Go with fewer syntax errors than other models in this parameter range.
  • deepseek-coder-v2:16b generates syntactically valid Python and JavaScript in single-file tasks (performance varies by task; evaluate against your own workload).
  • Meta's codellama:13b is a purpose-built coding model released in 2023. It is based on the older Llama 2 architecture, so the newer alternatives above often produce better results.
  • When RAM is tight, llama3.1:8b provides a lighter-weight general-purpose option.

Model choice directly affects output quality. Purpose-built coding models like Qwen 2.5 Coder produce noticeably better structured code, handle edge cases more reliably, and follow coding conventions more consistently than general-purpose models of equal size.

# Download the model weights
ollama pull qwen2.5-coder:14b

# Confirm the model is installed
ollama list

The ollama list command should show the model name, size, and modification date, confirming the weights are downloaded and ready.
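The same check can be scripted against Ollama's native /api/tags endpoint, which returns the installed models as JSON. This is a minimal sketch assuming Node.js 18+ and the default host; hasModel and isModelInstalled are hypothetical helper names introduced for illustration.

```javascript
// Return true when the exact model name (including its tag) appears
// in a parsed /api/tags response of the form { models: [{ name: ... }] }.
function hasModel(tagsResponse, modelName) {
  const models = Array.isArray(tagsResponse?.models) ? tagsResponse.models : [];
  return models.some((m) => m.name === modelName);
}

// Fetch the installed-model list from a running Ollama server and check it.
async function isModelInstalled(modelName, host = 'http://localhost:11434') {
  const res = await fetch(`${host}/api/tags`);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return hasModel(await res.json(), modelName);
}
```

The comparison is deliberately exact: 'qwen2.5-coder' without the :14b tag does not match, which mirrors the tag-sensitivity described later in the troubleshooting section.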

Verifying the Local API

Before configuring Claude Code, confirm that Ollama's OpenAI-compatible endpoint is responding:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer not-a-real-key-local-ollama-only" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "stream": false,
    "messages": [{"role": "user", "content": "Write a hello world function in JavaScript"}]
  }'

A successful response returns a single JSON object containing the model's completion. If this command fails with "connection refused," Ollama is not running. If it returns a model-not-found error, the model name does not match what was pulled.
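The three outcomes described above can be told apart programmatically. The following sketch is a hedged illustration: diagnoseCheck is a hypothetical helper, and the assumption that a missing model produces an HTTP 404 or an error message containing "not found" should be verified against your Ollama version.

```javascript
// Summarize the outcome of the verification request.
// error:  whatever fetch() threw, if the connection failed
// status: HTTP status code of the response
// body:   parsed JSON body of the response
function diagnoseCheck({ error = null, status = 0, body = null } = {}) {
  if (error) return 'Ollama is not reachable; start it with "ollama serve".';
  // Assumed error shape: { error: { message: "..." } }
  const message = body?.error?.message ?? '';
  if (status === 404 || /not found/i.test(message)) {
    return 'Model not found; compare the name with "ollama list".';
  }
  if (status !== 200) return `Unexpected HTTP status ${status}.`;
  const text = body?.choices?.[0]?.message?.content;
  return text
    ? 'OK; the endpoint returned a completion.'
    : 'The response contained no completion text.';
}
```

In practice you would wrap a fetch call in try/catch, pass the caught error or the response status and parsed body to diagnoseCheck, and print the returned message.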

Step 2: Install and Configure Claude Code for Local Use

Installing Claude Code CLI

Install Claude Code globally through npm:

npm install -g @anthropic-ai/claude-code

# Verify the installation
claude --version

This installs the claude command globally. The CLI requires Node.js 18 or later. Note the version number displayed: the environment variables described below are version-dependent. Run claude --help to confirm the supported configuration options for your version.

Configuring Claude Code to Use Ollama

First: if you have ANTHROPIC_API_KEY set in your environment, unset it. Leaving it set may cause Claude Code to route requests to Anthropic's API instead of Ollama, silently incurring costs.

unset ANTHROPIC_API_KEY

You configure Claude Code's third-party provider support with environment variables. The exact variable names depend on your Claude Code version. Run claude --help to confirm the correct names. The variables below represent one documented configuration approach; verify them against the current Anthropic documentation for your installed version:




export OPENAI_API_KEY="not-a-real-key-local-ollama-only"
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export CLAUDE_CODE_USE_OPENAI=1
export CLAUDE_MODEL="qwen2.5-coder:14b"

Version-dependent variables: The variable names CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL, and the choice between ANTHROPIC_BASE_URL and OPENAI_BASE_URL may differ across Claude Code releases. Confirm them with claude --help or the Anthropic documentation for your version. If the variables are incorrect, Claude Code may silently fall back to the Anthropic API, incurring costs.

You set OPENAI_API_KEY to a placeholder string because Ollama does not require authentication, but Claude Code refuses to start without a non-empty key value. ANTHROPIC_BASE_URL points to the local Ollama server's OpenAI-compatible API path. CLAUDE_CODE_USE_OPENAI signals Claude Code to use the OpenAI-compatible provider path rather than the Anthropic API. CLAUDE_MODEL specifies which Ollama model to use and must match the model name exactly as shown by ollama list, including the tag (e.g., :14b).
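Because a misconfigured variable can silently route traffic to the paid API, a small pre-flight check can help. The sketch below is illustrative: checkLocalEnv is a hypothetical helper, and the variable names are the ones used in this guide, which are version-dependent and may not match your Claude Code release.

```javascript
// Inspect an environment object (e.g. process.env) and list likely problems.
// Variable names follow this guide; adapt them if your version differs.
function checkLocalEnv(env) {
  const problems = [];
  if (env.ANTHROPIC_API_KEY) {
    problems.push('ANTHROPIC_API_KEY is set; unset it to avoid accidental API billing.');
  }
  const required = ['OPENAI_API_KEY', 'ANTHROPIC_BASE_URL', 'CLAUDE_CODE_USE_OPENAI', 'CLAUDE_MODEL'];
  for (const name of required) {
    if (!env[name]) problems.push(`${name} is not set.`);
  }
  // The base URL should point at the local Ollama port.
  const base = env.ANTHROPIC_BASE_URL ?? '';
  if (base && !/^http:\/\/(localhost|127\.0\.0\.1):11434(\/|$)/.test(base)) {
    problems.push('ANTHROPIC_BASE_URL does not point at the local Ollama port 11434.');
  }
  return problems;
}
```

Calling checkLocalEnv(process.env) before launching claude returns an empty array when everything in this guide's configuration is in place, and a list of human-readable problems otherwise.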

For persistence, add these exports to ~/.bashrc, ~/.zshrc, or a project-level .env file that you source before launching Claude Code. If using a project-level .env file, ensure it is listed in .gitignore to prevent accidental commits.

Windows users (PowerShell):

$env:OPENAI_API_KEY="not-a-real-key-local-ollama-only"
$env:ANTHROPIC_BASE_URL="http://localhost:11434/v1"
$env:CLAUDE_CODE_USE_OPENAI="1"
$env:CLAUDE_MODEL="qwen2.5-coder:14b"

Launching Claude Code in Local Mode

With the environment variables set, start Claude Code in any project directory:

cd /path/to/your/project
claude

On startup, Claude Code should display the configured model name (e.g., qwen2.5-coder:14b) rather than a Claude Sonnet or Opus identifier. This is an initial indicator that configuration was applied, but displaying the model name alone does not guarantee local routing: the configured variable value could be shown even if routing fails. To definitively confirm that requests reach Ollama, monitor connections during a session:


lsof -i :11434 | grep ESTABLISHED


You should see an active TCP connection to 127.0.0.1:11434. If no connection is shown, requests may be going to Anthropic's servers.
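If you want to automate this check, the lsof output can be scanned for an established connection whose remote side is the Ollama port. This is a minimal sketch: hasEstablishedConnection is a hypothetical helper, and it assumes lsof's usual "local->remote (STATE)" line format, which can differ between platforms.

```javascript
// Return true if any lsof line shows an established TCP connection
// whose remote endpoint is the given port (default: Ollama's 11434).
function hasEstablishedConnection(lsofOutput, port = 11434) {
  const pattern = new RegExp(`->\\S+:${port} \\(ESTABLISHED\\)`);
  return lsofOutput.split('\n').some((line) => pattern.test(line));
}
```

Feed it the captured output of lsof -i :11434 while a Claude Code session is active. A line that only shows the server listening on the port does not count, because it has no remote endpoint.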

Step 3: Take It for a Spin with a React + Node.js Project

Scaffolding a Test Project

Create a minimal project that gives Claude Code real files to work with:

npm create vite@latest test-project -- --template react
cd test-project
npm install
npm install express

Add a minimal Express server at the project root. Because the Vite scaffold creates an ES module project ("type": "module" in package.json), the CommonJS require() syntax will not work by default. Either name the file server.cjs, or place the server in its own directory with a package.json that sets "type": "commonjs", or rewrite it using ES module import syntax. The example below uses the .cjs approach:


// server.cjs
const express = require('express');
const app = express();
const PORT = process.env.PORT ?? 3001;

// Parse JSON request bodies
app.use(express.json());

app.get('/', (req, res) => {
  res.json({ message: 'Server is running' });
});

const server = app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});

// Report startup failures (such as a port conflict) and exit
server.on('error', (err) => {
  if (err.code === 'EADDRINUSE') {
    console.error(`Port ${PORT} is already in use. Set PORT env var to use a different port.`);
  } else {
    console.error('Server failed to start:', err);
  }
  process.exit(1);
});

This provides both a React frontend and a Node.js backend for Claude Code to operate on.

Running Real Coding Tasks

With Claude Code running in the project directory, issue a practical prompt:

Add a /api/health endpoint to server.cjs that returns { status: "healthy", uptime: process.uptime() }
and create a React component called HealthStatus that fetches and displays this data.

With qwen2.5-coder:14b, expect output structured like this (your results will vary based on prompt phrasing and model state):


// server.cjs: new endpoint
app.get('/api/health', (req, res) => {
  res.json({
    status: 'healthy',
    uptime: process.uptime(),
    timestamp: new Date().toISOString()
  });
});

// HealthStatus.jsx: new component
import { useState, useEffect } from 'react';

const API_BASE = import.meta.env.VITE_API_BASE ?? 'http://localhost:3001';

function HealthStatus() {
  const [health, setHealth] = useState(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    // Abort the request if the component unmounts first
    const controller = new AbortController();

    fetch(`${API_BASE}/api/health`, { signal: controller.signal })
      .then((res) => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
      .then((data) => {
        setHealth(data);
        setLoading(false);
      })
      .catch((err) => {
        if (err.name === 'AbortError') return;
        console.error('Failed to fetch health status:', err);
        setError(err.message);
        setLoading(false);
      });

    return () => controller.abort();
  }, []);

  if (loading) return <p>Loading health status...</p>;
  if (error) return <p>Error: {error}</p>;
  if (!health) return <p>Unable to reach server.</p>;

  return (
    <div>
      <h2>Server Health</h2>
      <p>Status: {health.status}</p>
      <p>Uptime: {Math.round(health.uptime)}s</p>
    </div>
  );
}

export default HealthStatus;

Note on fetch URLs: The React frontend runs on the Vite dev server (typically port 5173), while the Express backend runs on port 3001. The component above uses the VITE_API_BASE environment variable to configure the API origin, falling back to http://localhost:3001 for local development. Because the two origins differ, this direct-URL approach also requires the Express server to send CORS headers. For production or containerised deployments, set VITE_API_BASE to the appropriate backend URL. Alternatively, configure a Vite proxy by adding server: { proxy: { '/api': 'http://localhost:3001' } } to vite.config.js and use relative fetch paths, which keeps requests same-origin during development.
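For reference, a complete vite.config.js using the proxy option might look like the sketch below. It assumes the default React template, which ships with the @vitejs/plugin-react plugin; adjust the plugin import if your scaffold differs.

```javascript
// vite.config.js
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      // Forward /api/* requests from the Vite dev server to the Express backend
      '/api': 'http://localhost:3001',
    },
  },
});
```

With this in place, the component can call fetch('/api/health') with a relative path during development, and the browser never makes a cross-origin request.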

Claude Code's agentic capabilities mean it reads the existing server.cjs, identifies where to insert the new endpoint, writes the changes, creates the new component file, and can even update imports in App.jsx if prompted.

Evaluating Output Quality

Local models in the 7B to 14B range handle boilerplate code, CRUD endpoint generation, simple component creation, test scaffolding, and straightforward refactoring well. For single-endpoint handlers and isolated component files, they often produce usable output on the first attempt without manual correction.

Where local models fall short is in complex multi-file reasoning: tracing a bug across multiple interconnected modules, making architectural decisions that require understanding a full codebase's patterns, or producing correct output when the context window fills up. Claude Sonnet 4 handles these scenarios with noticeably higher accuracy. For example, Sonnet correctly traces cross-module type errors that qwen2.5-coder:14b misses after multiple attempts, and it maintains coherence across longer context windows.

Performance Tuning and Optimization

Ollama Configuration for Better Performance

Ollama exposes several environment variables and configuration options that affect inference speed:

# Allow up to two concurrent requests (set this before starting ollama serve)
export OLLAMA_NUM_PARALLEL=2

# Show the context length configured in the model's Modelfile, if one is set
ollama show qwen2.5-coder:14b --modelfile | grep num_ctx

Setting OLLAMA_NUM_PARALLEL above 1 allows concurrent request handling, which matters less for single-user Claude Code sessions but helps if other tools share the same Ollama instance. Increasing the context length allows the model to reason over more code at once, but increases memory consumption significantly; very long contexts can consume considerably more memory than the base model load.
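When calling Ollama directly, the context length can also be set per request through the options field of its native /api/chat endpoint. The sketch below illustrates that for your own scripts; buildNativeChatRequest is a hypothetical helper, the 8192 default is an arbitrary example value, and whether Claude Code itself forwards such options is not something this guide can confirm.

```javascript
// Build a request for Ollama's native chat endpoint with an explicit context length.
// numCtx is passed as options.num_ctx; larger values use more memory.
function buildNativeChatRequest(model, prompt, numCtx = 8192) {
  return {
    url: 'http://localhost:11434/api/chat',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        stream: false,
        messages: [{ role: 'user', content: prompt }],
        options: { num_ctx: numCtx },
      }),
    },
  };
}
```

Passing the returned url and init to fetch sends the request. Lowering numCtx is the quickest way to test whether context size is the cause of memory pressure.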

Choosing the Right Model for the Task

A practical strategy is to keep multiple models pulled and switch between them. Use a smaller model like llama3.1:8b for quick completions and simple edits where speed matters. Switch to qwen2.5-coder:14b or deepseek-coder-v2:16b for tasks requiring higher code quality. Switching models requires only changing the CLAUDE_MODEL environment variable (or the equivalent for your Claude Code version) and restarting Claude Code.

Full Implementation Checklist and Model Comparison Table

Setup Checklist

  1. Install Ollama (brew install ollama or the curl install script) and verify with ollama --version
  2. Start the Ollama server (ollama serve, or brew services start ollama on macOS) and confirm port 11434 is listening
  3. Pull a coding model (ollama pull qwen2.5-coder:14b) and verify with ollama list
  4. Test the API endpoint with curl http://localhost:11434/v1/chat/completions (include "stream": false in the request body)
  5. Install Claude Code (npm install -g @anthropic-ai/claude-code) and verify with claude --version
  6. Unset ANTHROPIC_API_KEY if present (unset ANTHROPIC_API_KEY)
  7. Check claude --help to confirm the correct environment variable names for your version
  8. Set environment variables (OPENAI_API_KEY, ANTHROPIC_BASE_URL, CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL), adapting variable names if your version differs
  9. Launch Claude Code in a project directory and confirm the model name in startup output
  10. Run lsof -i :11434 (or netstat -ano | findstr :11434 on Windows) during a session to verify local routing
  11. Run a test prompt and verify the response comes from the local model

Local Coding Model Comparison Table

Model                 | Size   | Min. Free RAM (Q4) | Coding Quality* | Speed     | Best For
llama3.1:8b           | ~4.7GB | 16GB               | Moderate        | Fast      | Quick completions, simple edits
codellama:13b         | ~7.4GB | 32GB**             | Good            | Moderate  | General code generation
qwen2.5-coder:14b     | ~8.9GB | 32GB               | Very Good       | Moderate  | Best overall for coding tasks
deepseek-coder-v2:16b | ~9.1GB | 32GB               | Very Good       | Moderate  | Complex code generation
codellama:34b         | ~19GB  | 64GB               | Excellent       | Slow      | Maximum local quality
llama3.1:70b          | ~40GB  | 64GB+              | Excellent       | Very Slow | Near-API quality (if hardware permits)

*Coding Quality ratings reflect informal single-file pass rates on HumanEval-style tasks. "Moderate" = frequent manual fixes needed; "Good" = occasional fixes; "Very Good" = first-attempt success on most single-file tasks; "Excellent" = consistent first-attempt success including multi-function files.

**16GB is the technical minimum for codellama:13b; 32GB is recommended for stable inference without swapping. Sizes and RAM figures assume Q4 quantization; Q8 quantization roughly doubles RAM requirements. Verify actual on-disk size with ollama list after pulling.

Best overall pick: qwen2.5-coder:14b offers the strongest balance of code generation quality, reasonable resource requirements, and practical inference speed for iterative development workflows.
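The table lends itself to a simple lookup: given the free RAM on a machine, choose the highest-rated model that fits. The sketch below is one illustrative way to do that. The numeric quality scores (1 for Moderate through 4 for Excellent), the tie-break by table order, and the treatment of "64GB+" as 64 are all assumptions made for this example, and pickModel is a hypothetical name.

```javascript
// Rows copied from the comparison table (Q4 figures).
// quality: 1 = Moderate, 2 = Good, 3 = Very Good, 4 = Excellent (assumed scale).
const MODELS = [
  { name: 'llama3.1:8b', minFreeRamGB: 16, quality: 1 },
  { name: 'codellama:13b', minFreeRamGB: 32, quality: 2 },
  { name: 'qwen2.5-coder:14b', minFreeRamGB: 32, quality: 3 },
  { name: 'deepseek-coder-v2:16b', minFreeRamGB: 32, quality: 3 },
  { name: 'codellama:34b', minFreeRamGB: 64, quality: 4 },
  { name: 'llama3.1:70b', minFreeRamGB: 64, quality: 4 }, // table says "64GB+"
];

// Return the name of the highest-quality model that fits in freeRamGB,
// preferring the earlier table row on ties, or null if nothing fits.
function pickModel(freeRamGB) {
  let best = null;
  for (const m of MODELS) {
    const fits = m.minFreeRamGB <= freeRamGB;
    if (fits && (best === null || m.quality > best.quality)) best = m;
  }
  return best ? best.name : null;
}
```

With 32GB free this selects qwen2.5-coder:14b, in line with the recommendation above. The result could be exported as CLAUDE_MODEL (or your version's equivalent) before launching Claude Code.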

Troubleshooting Common Issues

Connection Refused or Model Not Found

If Claude Code reports connection errors, verify that ollama serve is running and that http://localhost:11434 responds to requests. On macOS, check whether the Homebrew service is already running with brew services list; running ollama serve manually when the service is active causes a port conflict. A "model not found" error means the value in CLAUDE_MODEL does not exactly match the model name shown by ollama list, including the tag (e.g., :14b).

Slow Responses or Out-of-Memory Errors

If inference is unacceptably slow or the system runs out of memory, reduce the context window (via the Modelfile PARAMETER num_ctx or the per-request options field), switch to a smaller quantized model, or verify that GPU offloading is active. On NVIDIA systems, nvidia-smi confirms whether Ollama is using the GPU. On Apple Silicon, Metal acceleration is automatic.

Claude Code Ignoring Local Config

Environment variables can override each other in ways that cause routing errors. If you have an ANTHROPIC_API_KEY set in the shell environment or in a global configuration file, Claude Code may prioritize the Anthropic provider over the OpenAI-compatible path. Unset any Anthropic credentials (unset ANTHROPIC_API_KEY) before launching Claude Code in local mode. Additionally, verify that the environment variable names you are using match those supported by your installed Claude Code version; run claude --help to confirm.

Warning: If environment variables are misconfigured, Claude Code may silently route requests to Anthropic's API, incurring unexpected costs. Always verify local routing by checking for active connections to localhost:11434 during your session.

When to Use Local vs. API: A Practical Framework

Use local models for iterative development, boilerplate generation, test writing, refactoring, and work on private or proprietary codebases where data must not leave the machine. Use the Anthropic API for complex architectural reasoning, large-context multi-file changes that exceed local model capabilities, and code that ships to production without additional human review.

The most practical approach is a hybrid one: default to local for the bulk of daily coding tasks and switch to the API selectively for heavy lifts. This pattern captures the majority of cost savings while preserving access to frontier model quality when it matters.

What Comes Next

This setup eliminates API costs for the majority of routine coding agent interactions. Developers who previously spent $100 or more per day on Anthropic API credits can reserve that spend for tasks that genuinely require frontier model capabilities. Actual savings depend on individual workflow composition and the ratio of local-suitable tasks to those requiring frontier models.

From here, the natural next steps are experimenting with additional models as the open-weight ecosystem evolves and creating task-specific Modelfile configurations tuned for particular programming languages or frameworks. Beyond that, you can integrate local Claude Code sessions into CI workflows for automated code review on private repositories.

