The OpenAI Codex App is the most powerful AI coding agent available in 2026, but it doesn't have to cost you anything. By connecting Codex to local models via Ollama, LM Studio, Unsloth, or llama.cpp, you get the full agentic coding experience running entirely on your hardware. Zero API costs. Complete privacy. No rate limits.
This guide covers every setup method, the best models to use, a full comparison chart between cloud GPT-5.5 and local models, real performance benchmarks, limitations, and practical workflows. Whether you're on a MacBook with 16GB RAM or a workstation with an RTX 4090, this is your complete reference for running the Codex app with local models in 2026.
What Is the OpenAI Codex App?
OpenAI ships Codex across three surfaces. Understanding the difference matters for local model setup:
| Surface | What It Is | Local Models? |
|---|---|---|
| Codex App (Desktop) | macOS/Windows desktop app. Multi-agent orchestration, worktrees, automations, Computer Use, in-app browser, 90+ plugins. | Yes (via config.toml) |
| Codex CLI | Terminal-based coding agent. Runs in your shell, reads/writes files, executes commands. | Yes (via config.toml or env vars) |
| Codex Agent (ChatGPT) | Cloud-only agent in ChatGPT. Runs in sandboxed environments on OpenAI's servers. | No (cloud only) |
Timeline
- February 2, 2026: Codex App launches on macOS
- March 4, 2026: Windows support added
- April 16, 2026: Major expansion adding Computer Use, in-app browser, image generation, 90+ plugins, memory preview, and automations
The April 16 update transformed Codex from a code-only tool into a full desktop automation platform. GPT-5.5 (codename "Spud") powers the cloud version with improved context handling, coding quality, and token efficiency.
Why Use Local Models with Codex?
- $0 cost: No ChatGPT Plus/Pro subscription needed. Run unlimited coding tasks.
- Privacy: Proprietary code never leaves your machine. Critical for enterprise, defense, healthcare.
- Offline: Code on planes, restricted networks, air-gapped environments.
- No rate limits: Cloud Codex throttles heavy users. Local has no caps.
- Custom models: Use fine-tuned models trained on your codebase.
- Experimentation: Try different models instantly without billing concerns.
Prerequisites & Hardware Requirements
| Tier | Hardware | Best Model | Tokens/sec |
|---|---|---|---|
| Minimum | 16GB RAM (Apple Silicon) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | 8-15 |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | 15-25 |
| Ideal | 64GB+ RAM (M4 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | 20-35 |
Software Requirements
- Codex App or CLI: brew install --cask codex (Mac) or npm install -g @openai/codex (Linux/Windows); a quick verification snippet follows this list
- Local inference server: Ollama, LM Studio, Unsloth Studio, or llama.cpp
- Model with tool calling: GLM-4.7-Flash, Qwen3-Coder, or GPT-OSS recommended
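If you want to confirm the tooling is in place before touching any config, a quick version check is enough. The commands below assume the Homebrew/npm install above plus Ollama; swap in the check for whichever server you actually use.
# Confirm the Codex binary is on PATH
codex --version
# Confirm your inference server is installed (Ollama shown here)
ollama --version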
Method 1: Setup with Ollama
The simplest approach. Ollama handles model management and serves an OpenAI-compatible API.
Step 1: Install Ollama & Pull Model
# Install Ollama
brew install ollama # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh # Linux
# Pull recommended model
ollama pull glm-4.7-flash
# Start Ollama server (if not running)
ollama serve
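Before pointing Codex at it, it's worth a quick check that Ollama's OpenAI-compatible endpoint is answering and that the model you pulled shows up:
# Sanity check: list the models the server exposes
curl http://localhost:11434/v1/models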
Step 2: Configure config.toml
Create or edit ~/.codex/config.toml:
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"
Step 3: Launch Codex
codex --profile local
Or with inline model specification:
codex --model glm-4.7-flash -c model_provider=ollama
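As a quick smoke test of the profile, hand Codex a small throwaway task (the prompt here is arbitrary):
# One-shot task to confirm the local profile resolves and the model answers
codex --profile local "write a one-line shell script that prints hello"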
Method 2: Setup with LM Studio
- Download and install LM Studio
- Search and download GLM-4.7-Flash-GGUF (Q4_K_M quantization recommended)
- Go to the Local Server tab → Load model → Click Start Server
- Note the port (default: 1234)
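Before adding the Codex config, you can confirm LM Studio's server is actually serving on that port (assuming the default 1234):
# List the models LM Studio's local server exposes
curl http://localhost:1234/v1/models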
Add to ~/.codex/config.toml:
[model_providers.lmstudio]
name = "LM Studio"
base_url = "http://localhost:1234/v1"
wire_api = "responses"
[profiles.lmstudio]
model_provider = "lmstudio"
model = "glm-4.7-flash"
codex --profile lmstudio
Method 3: Setup with Unsloth Studio
Unsloth provides a web UI with self-healing tool calling and automatic inference parameter tuning:
Step 1: Launch Unsloth and load your model
Step 2: Export API key
export UNSLOTH_STUDIO_API_KEY=sk-uns...xxxx
Step 3: Configure config.toml
[model_providers.unsloth_api]
name = "Unsloth Studio"
base_url = "http://localhost:8888/v1"
env_key = "UNSLOTH_STUDIO_API_KEY"
wire_api = "responses"
[profiles.unsloth_api]
model_provider = "unsloth_api"
model = "gpt-oss-20b-GGUF"
Step 4: Launch
codex -p unsloth_api
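If Codex can't reach the server, verify it and the key directly. This is a minimal probe assuming Unsloth Studio exposes the standard OpenAI-compatible /v1/models route on port 8888, as the config above implies:
# Probe the server, passing the Unsloth Studio key from the environment
curl http://localhost:8888/v1/models \
  -H "Authorization: Bearer $UNSLOTH_STUDIO_API_KEY"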
Method 4: Setup with llama.cpp
For maximum control and performance tuning, build llama.cpp from source:
Step 1: Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON # Use -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j \
--clean-first --target llama-server
cp llama.cpp/build/bin/llama-server llama.cpp/
Step 2: Download Model
pip install huggingface_hub hf_transfer
python -c "
import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='unsloth/GLM-4.7-Flash-GGUF',
local_dir='models/GLM-4.7-Flash-GGUF',
allow_patterns=['*UD-Q4_K_XL*']
)"
Step 3: Start Server
./llama.cpp/llama-server \
--model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
--alias "unsloth/GLM-4.7-Flash" \
--port 8001 \
--ctx-size 131072 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--batch-size 4096 --ubatch-size 1024 \
--temp 1.0 --top-p 0.95 --min-p 0.01
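llama-server exposes a /health route alongside the OpenAI-compatible API, which makes it easy to wait for the model to finish loading before launching Codex:
# Block until the server reports ready, then confirm the alias is exposed
until curl -sf http://localhost:8001/health; do sleep 2; done
curl http://localhost:8001/v1/models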
Step 4: Configure Codex
[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
[profiles.llama_cpp]
model_provider = "llama_cpp"
model = "unsloth/GLM-4.7-Flash"
codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp
Best Local Models for Codex
| Model | Params | Context | Tool Calling | VRAM/RAM | Verdict |
|---|---|---|---|---|---|
| GLM-4.7-Flash | 30B MoE (3B active) | 128K | Yes (79.5%) | ~6.5GB | Best overall: fast, capable, low requirements |
| Qwen3-Coder | 30B | 128K | Yes | ~20GB | Strong coding quality, needs more hardware |
| GPT-OSS:20B | 20B | 32K | Yes | ~12GB | Good general purpose, smaller context |
| Devstral-Small-2 | 24B | 128K | Yes | ~16GB | Code-focused, solid tool calling |
| Qwen3-Coder-Next | 30B+ | 128K | Yes | ~20GB | Latest iteration, improved reasoning |
Full Comparison: Cloud GPT-5.5 vs Local Models
| Feature | Codex Cloud (GPT-5.5) | Local Models | Ollama Cloud (Free) |
|---|---|---|---|
| Speed | 60-80 tok/s | 8-25 tok/s | 30-60 tok/s |
| Code Quality | Best in class (SWE-bench 90.2%) | 70-85% of cloud quality | 85-95% of cloud quality |
| Computer Use | ✅ Full desktop control | ❌ Not available | ❌ Not available |
| In-App Browser | ✅ Browse and comment | ❌ Not available | ❌ Not available |
| Automations | ✅ Scheduled, recurring | ❌ Not available | ❌ Not available |
| Memory | ✅ Remembers preferences | ❌ Not available | ❌ Not available |
| 90+ Plugins | ✅ Full catalog | ❌ Most unavailable | ❌ Most unavailable |
| Image Generation | ✅ gpt-image-1.5 | ❌ Not available | ❌ Not available |
| Multi-file Reasoning | Excellent | Fair | Good |
| Monthly Cost | $20-200 | $0 | $0 |
| Privacy | Data sent to OpenAI | 100% local | Data sent to provider |
| Offline | No | Yes | No |
| Rate Limits | Yes | None | Yes (generous) |
| Wire API | Responses (native) | Responses (required) | Responses (required) |
Performance Benchmarks
| Setup | Tokens/sec | Cost/Month | Quality Score |
|---|---|---|---|
| GPT-5.5 Cloud (Codex default) | 60-80 | $20-200 | 10/10 |
| Ollama Cloud (qwen3.5:cloud) | 30-60 | $0 | 8.5/10 |
| RTX 4090 (GLM-4.7-Flash) | 20-30 | ~$12 | 7.5/10 |
| RTX 4070 Ti (GLM-4.7-Flash Q4) | 15-25 | ~$10 | 7.5/10 |
| M4 Max 64GB (Qwen3-Coder) | 15-20 | ~$5 | 8/10 |
| M1 Max 32GB (GLM-4.7-Flash) | 10-15 | ~$4 | 7/10 |
| RTX 3060 12GB (GLM-4.7-Flash) | 8-15 | ~$8 | 7/10 |
Limitations
Critical things to understand before going local:
- wire_api = "responses" is mandatory: Codex has deprecated Chat Completions support. Your local server MUST support the OpenAI Responses API at /v1/responses. Ollama, Unsloth, and recent llama.cpp builds support this.
- Computer Use is cloud-only: The desktop automation feature (clicking, typing in apps) requires GPT-5.5 and OpenAI's infrastructure. It will NOT work with local models.
- Automations/scheduling disabled: Recurring tasks, thread reuse, and future work scheduling require cloud connectivity.
- Memory doesn't persist: The "remembers preferences" feature is cloud-only.
- Plugins mostly unavailable: The 90+ plugins (Atlassian, GitLab, CircleCI, etc.) require cloud authentication.
- Slower inference (3-10x): Simple tasks take 3x longer; complex tasks up to 10x longer than cloud.
- Weaker multi-file reasoning: Local models struggle with cross-file dependencies and architectural understanding.
- Edit accuracy drops: Cloud GPT-5.5 has ~98% edit accuracy. Local models land at 70-80%, which means more broken patches that need manual fixing.
- Tool calling can fail: Models without robust tool calling support will generate text descriptions instead of executing actions.
Possibilities
- Free unlimited coding: Run thousands of tasks without watching a billing meter.
- Complete privacy: Trade secrets, proprietary algorithms, and client code all stay local.
- GDPR/HIPAA compliance: Zero cross-border data transfer. No DPAs needed with third parties.
- Hybrid workflow: Use --profile local for sensitive work, --profile cloud for complex tasks. Switch with a single flag.
- Custom fine-tuned models: Train domain-specific models on your codebase and use them through Codex.
- Offline development: Code with AI anywhere, from airports and rural areas to classified facilities.
- Team standardization: Share config.toml across your team for consistent local setups.
- Model A/B testing: Compare different models on the same task instantly.
Cost Analysis
| Option | Upfront | Monthly | 6-Month Total | 12-Month Total | Quality |
|---|---|---|---|---|---|
| ChatGPT Plus (Cloud Codex) | $0 | $20 | $120 | $240 | Best |
| ChatGPT Pro | $0 | $200 | $1,200 | $2,400 | Best + unlimited |
| Local GPU (RTX 4070 Ti) | $489 | ~$10 | $549 | $609 | 70-85% |
| Existing Mac (16GB+) | $0 | ~$4 | $24 | $48 | 70-85% |
| Ollama Cloud Models | $0 | $0 | $0 | $0 | 85-95% |
Best value: Ollama cloud models give 85-95% of cloud quality at $0 cost. If privacy isn't a hard requirement, start here.
Troubleshooting
"type of tool must be function" error
This means your server doesn't support wire_api = "responses" correctly. Update to the latest version of your inference server (Ollama 0.14.3+, latest llama.cpp).
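You can also probe the endpoint directly. This is a minimal request shaped like OpenAI's Responses API (adjust the port and model name to match your setup); a 404 or similar error means the server only speaks Chat Completions:
# Minimal Responses API probe against a local Ollama server
curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "input": "say hi"}'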
Model not found
- Check available models: ollama list or curl http://localhost:8001/v1/models
- Use the exact model name from the API response in your config.toml
Codex hangs or times out
- Add stream_idle_timeout_ms = 10000000 to your model_provider config
- Local models are slower, so Codex may time out waiting for responses on complex tasks
Tool calling not working
- Verify your model supports tool calling (GLM-4.7-Flash recommended)
- Enable Jinja chat templates in llama.cpp: add the --jinja flag (see the relaunch example after this list)
- Check that wire_api = "responses" is set (not "chat")
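For llama.cpp, that means relaunching llama-server with the template flag added; everything else from the Method 4 command stays the same (abbreviated here):
# Relaunch with Jinja chat templates so tool calls come back as structured calls
./llama.cpp/llama-server \
  --model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  --port 8001 --jinja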
Frequently Asked Questions
Can the Codex desktop app use local models?
Yes. The Codex App reads from ~/.codex/config.toml and supports custom model providers pointing to local servers. You configure a model_provider with a local base_url and select it via profiles.
Does Computer Use work with local models?
No. Computer Use (background desktop automation) is exclusively a cloud feature that requires GPT-5.5 and OpenAI's infrastructure. Local models cannot control your desktop.
What's the difference between Codex App and Codex CLI with local models?
Both use the same config.toml and support the same local model providers. The App adds GUI features (worktree visualization, terminal tabs, preview panes) while the CLI is terminal-only. Cloud-exclusive features (Computer Use, automations, plugins) are absent in both when using local models.
Which local model is best for Codex?
GLM-4.7-Flash is the top pick: 128K context, strong tool calling (79.5%), and runs on 16GB RAM thanks to MoE architecture. For raw coding quality, Qwen3-Coder 30B is slightly better but requires 20GB+ VRAM.
Is Chat Completions API still supported?
No. OpenAI deprecated Chat Completions support in Codex. You must use wire_api = "responses" in your config.toml. Servers that only expose /v1/chat/completions will not work.
Can I use Ollama's free cloud models with Codex?
Yes. Ollama proxies models like qwen3.5:cloud and glm-5:cloud with generous free tiers. They run at 30-60 tok/s with no hardware requirements beyond running Ollama itself. Configure them the same way as local models in config.toml.
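Setup mirrors the local flow. Using the model tag mentioned above (and assuming your Ollama install is signed in to its cloud tier), it looks like this:
# Pull the cloud-proxied tag, then point Codex at it through the same Ollama provider
ollama pull qwen3.5:cloud
codex --model qwen3.5:cloud -c model_provider=ollama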
Recommended Workflow
The most productive setup combines local and cloud:
# ~/.codex/config.toml
# Default to local for privacy
model_provider = "ollama"
model = "glm-4.7-flash"
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
[model_providers.cloud]
name = "OpenAI Cloud"
# Uses default OpenAI API
[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"
[profiles.cloud]
model_provider = "cloud"
model = "gpt-5.5"
[profiles.free]
model_provider = "ollama"
model = "qwen3.5:cloud"
Daily usage:
# Private work (sensitive code)
codex --profile local "fix the auth module"
# Complex tasks (need quality)
codex --profile cloud "refactor the entire payment system"
# Free + fast (non-sensitive)
codex --profile free "add documentation to all functions"
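If you bounce between these constantly, a few shell aliases reduce the routing decision to a single word; they're just wrappers around the profiles defined above:
# Add to ~/.zshrc or ~/.bashrc
alias cxl='codex --profile local'   # private, on-device
alias cxc='codex --profile cloud'   # highest quality
alias cxf='codex --profile free'    # free cloud-proxied model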
For a broader comparison of AI coding tools, see our Cursor vs Windsurf vs Claude Code comparison. And if you're building creative projects, check out our free AI Image Generator.