Want to run Claude Cowork with local models, completely free, offline, and private? Since January 2026, Ollama v0.14 ships native Anthropic Messages API compatibility, which means Claude's agentic desktop tool can now talk directly to open-source models running on your hardware. No API key. No subscription. No data leaving your machine.
This guide covers everything: installation, configuration, model selection, performance benchmarks, limitations, and a full comparison chart between cloud Claude and local inference. Whether you're a developer concerned about privacy or someone who wants unlimited Claude Cowork usage at zero cost, this is the definitive setup guide for 2026.
What Is Claude Cowork?
Claude Cowork is Anthropic's agentic desktop tool that brings Claude Code's capabilities to Claude Desktop for knowledge work beyond coding. Instead of responding to prompts one at a time, Claude can take on complex, multi-step tasks and execute them on your behalf: formatting documents, organizing files, synthesizing research, and automating workflows.
Key Capabilities
- Multi-step task execution: Describe an outcome, step away, and come back to finished work
- File system access: Read, write, and organize files on your computer
- Scheduled tasks: Automate recurring work (cloud-only feature)
- Projects: Persistent workspaces with their own files, links, instructions, and memory
- Plugins: Extend functionality with skills, connectors, and sub-agents
- Computer Use: Control desktop apps by seeing, clicking, and typing
Cowork runs directly on your computer in an isolated VM, giving Claude access to files you choose to share. Code executes safely in sandboxed environments while Claude makes real changes to your files.
Why Use Local Models with Claude Cowork?
Running Claude Cowork against cloud APIs costs money and sends your data to external servers. Here's why local models change the equation:
| Factor | Cloud Claude | Local Models |
|---|---|---|
| Cost | $20-200/month (Pro/Max plans) | $0 after hardware |
| Privacy | Data sent to Anthropic servers | Everything stays on your machine |
| Rate Limits | Usage caps, especially on heavy Cowork tasks | Unlimited; run as much as you want |
| Offline | Requires internet | Works completely offline |
| Data Residency | Cross-border transfer concerns | Full GDPR/compliance control |
| Speed | 60-80 tokens/sec | 8-25 tokens/sec (hardware dependent) |
The tradeoff is clear: local models trade speed for privacy, cost savings, and unlimited usage. For many workflows, especially those involving sensitive code, proprietary documents, or air-gapped environments, that tradeoff makes perfect sense.
Prerequisites & Hardware Requirements
Before setting up local models with Claude Cowork, ensure your system meets these requirements:
Software Requirements
- Ollama v0.14.0+ (required for Anthropic Messages API compatibility)
- Claude Code CLI installed via curl -fsSL https://claude.ai/install.sh | bash
- macOS 13+, Windows 10+, or Linux (Ubuntu 20.04+ recommended)
Hardware Requirements
| Tier | Hardware | Best Model | Experience |
|---|---|---|---|
| Minimum Viable | 16GB RAM (M1/M2) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | Usable for single-file tasks. Slower on complex operations. |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | Solid for most coding workflows. Multi-file works but slower. |
| Ideal | 64GB+ RAM (M2/M3/M4 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | Best local experience. Higher quantization, faster throughput. |
Step-by-Step Setup: Ollama + Claude Code
Step 1: Install Ollama
macOS (Homebrew):
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from ollama.com
Verify installation:
ollama --version
# Must be v0.14.0 or later
Step 2: Pull a Local Model
Choose a model with tool calling support (required for Claude Code's agentic features):
# Top pick: 30B MoE, only 3B active params, runs on 16GB RAM
ollama pull glm-4.7-flash
# Alternative: strong coding model
ollama pull qwen3-coder
# Budget option for 8GB machines
ollama pull devstral-small-2
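Before wiring a model into Claude Code, it's worth sanity-checking that it actually advertises tool support. Recent Ollama builds list a model's capabilities in ollama show (the exact output format varies by version, so treat this as a quick sketch):
# Inspect the model card; look for "tools" in the capabilities section
ollama show glm-4.7-flash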
Step 3: Install Claude Code
macOS/Linux:
curl -fsSL https://claude.ai/install.sh | bash
Windows:
irm https://claude.ai/install.ps1 | iex
Step 4: Connect Claude Code to Ollama
Fastest method (one command):
ollama launch claude
This automatically sets ANTHROPIC_AUTH_TOKEN and ANTHROPIC_BASE_URL, then launches Claude Code pointed at your local Ollama instance. Select your model from the list and hit Enter.
Manual method (explicit environment variables):
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_BASE_URL=http://localhost:11434
# Launch Claude Code
claude
Or inline without modifying your shell profile:
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude
Step 5: Verify Connection
Once Claude Code launches, try a simple command:
> Read the current directory and list all files
If the model reads files and responds with actual file listings (not just describing what it would do), tool calling is working correctly.
Setup with LM Studio
LM Studio provides a graphical interface for managing local models:
- Download LM Studio from lmstudio.ai
- Search and download GLM-4.7-Flash or Qwen3-Coder
- Go to the Local Server tab and click Start Server (default port: 1234)
- Configure Claude Code:
export ANTHROPIC_AUTH_TOKEN=lm-studio
export ANTHROPIC_BASE_URL=http://localhost:1234
claude
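To confirm the LM Studio server is actually listening before launching Claude Code, you can hit its OpenAI-compatible model listing endpoint (a quick check, assuming the default port from step 3):
# Should return a JSON list of the models LM Studio has loaded
curl http://localhost:1234/v1/models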
Best Local Models for Claude Cowork
| Model | Parameters | Context | Tool Calling | RAM/VRAM Needed | Best For |
|---|---|---|---|---|---|
| GLM-4.7-Flash (top pick) | 30B MoE (3B active) | 128K | Yes (79.5%) | ~6.5GB (Q4) | Best balance of speed + capability |
| Qwen3-Coder | 30B | 128K | Yes | ~20GB (Q4) | Strong coding tasks |
| GPT-OSS:20B | 20B | 32K | Yes | ~12GB (Q4) | Good general purpose |
| Devstral-Small-2 | 24B | 128K | Yes | ~16GB (Q4) | Code-focused tasks |
| Qwen2.5-Coder:32B | 32B | 128K | Limited | ~24GB (Q4) | Complex coding (needs strong hardware) |
Top recommendation: GLM-4.7-Flash. Its mixture-of-experts architecture means only 3B parameters activate per token despite being a 30B model. That translates to fast inference on modest hardware (16GB RAM) while maintaining 128K context and strong tool-calling abilities (79.5% on agent benchmarks).
Free Cloud Models via Ollama
Don't want to run inference locally? Ollama also proxies free cloud models with generous rate limits:
| Model | Context | Speed | Cost |
|---|---|---|---|
| qwen3.5:cloud | 128K+ | 30-60 tok/s | Free (rate limited) |
| glm-5:cloud | 128K+ | 30-60 tok/s | Free (rate limited) |
| kimi-k2.5:cloud | 128K+ | 30-60 tok/s | Free (rate limited) |
| qwen3-coder:480b-cloud | 128K+ | 30-60 tok/s | Free (rate limited) |
# Use free cloud model through Ollama
ollama launch claude --model qwen3.5:cloud
These models run on remote infrastructure but use the same Ollama interface. Your code still goes to external servers (not truly private), but it's free and significantly faster than local inference.
Full Comparison: Cloud Claude vs Local Models
| Aspect | Cloud Claude (Sonnet/Opus) | Local Models (Ollama) | Ollama Cloud Models |
|---|---|---|---|
| Speed | 60-80 tok/s | 8-25 tok/s | 30-60 tok/s |
| Code Quality | 98% edit accuracy | 70-80% edit accuracy | 85-95% edit accuracy |
| Multi-file Reasoning | Excellent | Fair (degrades with complexity) | Good |
| Tool Calling | Always reliable | Model-dependent (GLM best) | Reliable |
| Monthly Cost | $20-200 | $0 (electricity only) | $0 |
| Privacy | Data sent to Anthropic | 100% local | Data sent to model provider |
| Offline | No | Yes | No |
| Rate Limits | Yes (heavy Cowork tasks consume more) | None | Yes (generous) |
| Scheduled Tasks | Yes | No | No |
| Computer Use | Yes | No | No |
| Plugins | Full support | Limited | Limited |
| Context Window | 200K+ | 32K-128K (model dependent) | 128K+ |
Performance Benchmarks
Real-world numbers from published benchmarks comparing local and cloud inference:
Token Throughput
| Setup | Tokens/sec | Notes |
|---|---|---|
| Claude API (Sonnet 4) | 60-80 | Anthropic's infrastructure |
| Ollama cloud model | 30-60 | Varies by model and load |
| RTX 4070 Ti Super (32B Q4) | 15-25 | $489 GPU, 16GB VRAM |
| M1 Max 64GB (GLM-4.7-Flash) | 10-20 | Apple Silicon unified memory |
| RTX 3060 12GB (GLM-4.7-Flash) | 8-15 | Budget GPU |
Real-World Task Timing
| Task | Cloud Claude | GLM-4.7 Local (M1 Max) | Difference |
|---|---|---|---|
| Simple file read + edit | ~3 seconds | ~15 seconds | 5x slower |
| Multi-file refactoring | ~1 minute | ~12 minutes | 12x slower |
| Full repo analysis | ~1.2 minutes | ~82 minutes | 68x slower |
Coding Quality Scores (50-task benchmark)
| Task Type | GLM-4.7-Flash | Qwen3-Coder | Cloud Claude Sonnet |
|---|---|---|---|
| Function generation | 3.9/5 | 4.1/5 | 4.4/5 |
| Bug detection | 3.5/5 | 3.8/5 | 4.6/5 |
| Refactoring | 3.7/5 | 4.0/5 | 4.3/5 |
| Multi-file context | 2.5/5 | 2.8/5 | 4.5/5 |
| Code explanation | 4.0/5 | 4.2/5 | 4.1/5 |
Cost Analysis
| Option | Upfront | Monthly | 6-Month Total | 12-Month Total |
|---|---|---|---|---|
| Claude Pro Plan | $0 | $20 | $120 | $240 |
| Claude Max Plan | $0 | $100-200 | $600-1,200 | $1,200-2,400 |
| Local GPU (RTX 4070 Ti) | $489 | $8-12 (electricity) | $537-561 | $585-633 |
| Local (Apple Silicon, existing Mac) | $0 | $3-5 (electricity) | $18-30 | $36-60 |
| Ollama Cloud Models | $0 | $0 | $0 | $0 |
Breakeven point: A heavy Claude Max user ($200/month) recoups a $489 GPU in roughly 2.5-3 months. Claude Pro users ($20/month) who already own capable hardware come out ahead almost immediately, since the only ongoing cost is a few dollars of electricity.
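As a quick back-of-the-envelope check (a sketch using the illustrative figures from the cost table above, not a precise forecast):
# Breakeven in months = GPU cost / (subscription avoided - monthly electricity)
echo "scale=1; 489 / (200 - 10)" | bc
# => 2.5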
Limitations of Local Models
Be realistic about what local models cannot do:
- Slower inference (3-68x): Simple tasks take 5x longer. Complex repo analysis can take 68x longer than cloud Claude.
- Lower edit accuracy (70-80% vs 98%): Local models produce patches with wrong line numbers, bad whitespace, and mismatched context. Over a 50-edit session, you'll spend more time fixing broken patches than writing code.
- Weaker multi-file reasoning: Cloud Claude excels at understanding relationships across large codebases. Local models degrade significantly with complexity.
- Tool calling reliability: Not all models support tool calling. Without it, Claude Code becomes a plain text generator that describes actions instead of executing them.
- No scheduled tasks: Recurring automated work only runs with cloud Cowork.
- No Computer Use: Desktop control (clicking, typing in apps) requires cloud Claude.
- No plugins: Most Cowork plugins require cloud infrastructure.
- Context window limits: Local models typically max at 128K tokens vs 200K+ for cloud Claude.
- Streaming tool calls require Ollama 0.14.3-rc1+: The stable release may not handle all tool-use scenarios correctly.
What's Possible with Local Models
Despite the limitations, local models unlock significant capabilities:
- 100% offline development: Write code on planes, in cafes without WiFi, or in restricted networks.
- Complete data privacy: Proprietary code, PII, medical records, and defense contracts never leave your machine.
- GDPR/compliance: Eliminate cross-border data transfer concerns entirely. No DPAs needed.
- Air-gapped environments: Defense, healthcare, and government organizations can use AI coding assistance without network access.
- Unlimited usage: No rate limits, no monthly caps, no throttling during heavy use.
- Custom fine-tuned models: Train models on your codebase for domain-specific assistance.
- Hybrid workflows: Use local for sensitive work, cloud for complex tasks. Switch instantly (see the sketch after this list).
- Zero-cost experimentation: Try different models, approaches, and prompts without watching a billing meter.
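For hybrid workflows, a pair of small shell helpers makes switching explicit. This is only a sketch built on the environment variables from the setup steps above; the function names claude-local and claude-cloud are made up for illustration:
# Add to ~/.zshrc or ~/.bashrc
claude-local() {
  ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude "$@"
}
claude-cloud() {
  # Strip the Ollama overrides so Claude Code falls back to your normal Anthropic login
  env -u ANTHROPIC_AUTH_TOKEN -u ANTHROPIC_BASE_URL claude "$@"
}
Run claude-local inside sensitive repos and claude-cloud everywhere else.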
Troubleshooting
"Connection refused" error
- Ensure Ollama is running: ollama serve
- Check the port isn't blocked: curl http://localhost:11434/api/tags
- Verify the Ollama version: ollama --version (must be 0.14.0+)
Model just talks instead of acting
If Claude Code responds with "I would read the file..." instead of actually reading it, tool calling isn't working:
- Switch to a model with confirmed tool support: GLM-4.7-Flash or any cloud model
- Update Ollama to 0.14.3-rc1+ for streaming tool calls
- Ensure ANTHROPIC_AUTH_TOKEN is set to ollama, not an actual API key
Slow generation (under 5 tok/s)
- Drop to smaller quantization: Q4_K_M instead of Q6_K
- Reduce context: ollama run glm-4.7-flash --num-ctx 32768
- Switch to GLM-4.7-Flash if you're using a dense model (MoE = faster)
- Consider Ollama cloud models: ollama launch claude --model qwen3.5:cloud
"Role model" request failures
Claude Code tries to use "haiku" for background tasks. Fix by setting the small model override in your Claude Code settings to use the same local model.
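If your Claude Code build supports overriding the background model via an environment variable (recent releases have documented ANTHROPIC_SMALL_FAST_MODEL for this, though the variable name has changed over time, so check your version's docs), pointing it at the same local model is usually enough. A minimal sketch:
# Route Claude Code's background ("haiku") requests to the local model as well
export ANTHROPIC_SMALL_FAST_MODEL=glm-4.7-flash
claude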
Frequently Asked Questions
Can I use Claude Cowork completely offline with local models?
Yes. Once you've pulled your model via Ollama, everything runs locally. No internet required for inference. However, some Cowork features (scheduled tasks, plugins, Computer Use) are cloud-only and won't work offline.
Is it really free?
Running local models through Ollama is completely free. No API keys, no billing, no subscription. Ollama's cloud models (like qwen3.5:cloud) are also free with generous rate limits. Your only cost for truly local inference is hardware and electricity.
What's the best model for Claude Code with Ollama?
GLM-4.7-Flash is the top recommendation: 128K context, native tool calling (79.5% benchmark), and it runs on 16GB RAM thanks to mixture-of-experts architecture. For Ollama cloud models, Qwen 3.5 and GLM-5 offer frontier-level quality at zero cost.
How much slower is local compared to cloud?
Expect 3-5x slower for simple tasks and up to 68x slower for complex multi-file analysis. The speed gap is the primary tradeoff. However, for many single-file tasks (code explanation, simple edits, documentation), the delay is tolerable (10-20 seconds vs 3-5 seconds).
Can I switch between local and cloud models?
Yes. Use local models for sensitive/private work and cloud Claude for complex tasks. You can switch by simply changing environment variables or using separate terminal profiles.
Does the quality match cloud Claude?
No. Local models score 85-90% of cloud Claude on single-file tasks but significantly worse on multi-file reasoning (50-60% of cloud quality). Edit accuracy drops from 98% to 70-80%, meaning more manual corrections needed.
The Bottom Line
Claude Cowork with local models is not a replacement for cloud Claude; it's a complement. The ideal workflow in 2026 looks like this:
- Local models: sensitive codebases, unlimited experimentation, offline work, privacy-first environments
- Ollama cloud models: free, faster than local, good quality, acceptable for non-sensitive work
- Cloud Claude: complex multi-file reasoning, scheduled automation, Computer Use, maximum quality
The setup takes 5 minutes. The cost is zero. If you have a Mac with 16GB+ RAM or a GPU with 12GB+ VRAM, there's no reason not to try it. Start with ollama pull glm-4.7-flash and ollama launch claude, and you'll be coding with a local AI agent within minutes.
For more AI coding tools, explore our Claude Opus 4.6 review and our free AI Image Generator.