PROMPT SPACE · AI Tools

How to Use OpenAI Codex App with Local Models: Complete Setup Guide (2026)

Run the OpenAI Codex desktop app with local models via Ollama, LM Studio, or llama.cpp. Full config.toml setup, best models, performance benchmarks, and limitations chart.

The OpenAI Codex App is the most powerful AI coding agent available in 2026 — but it doesn't have to cost you anything. By connecting Codex to local models via Ollama, LM Studio, Unsloth, or llama.cpp, you get the full agentic coding experience running entirely on your hardware. Zero API costs. Complete privacy. No rate limits.

This guide covers every setup method, the best models to use, a full comparison chart between cloud GPT-5.5 and local models, real performance benchmarks, limitations, and practical workflows. Whether you're on a MacBook with 16GB RAM or a workstation with an RTX 4090 — this is your complete reference for running Codex app with local models in 2026.

💡 Already using Codex? See our OpenAI Codex macOS App deep-dive for the full cloud feature breakdown.

What Is the OpenAI Codex App?

OpenAI ships Codex across three surfaces. Understanding the difference matters for local model setup:

| Surface | What It Is | Local Models? |
|---|---|---|
| Codex App (Desktop) | macOS/Windows desktop app. Multi-agent orchestration, worktrees, automations, Computer Use, in-app browser, 90+ plugins. | Yes (via config.toml) |
| Codex CLI | Terminal-based coding agent. Runs in your shell, reads/writes files, executes commands. | Yes (via config.toml or env vars) |
| Codex Agent (ChatGPT) | Cloud-only agent in ChatGPT. Runs in sandboxed environments on OpenAI's servers. | No (cloud only) |

Timeline

  • February 2, 2026: Codex App launches on macOS
  • March 4, 2026: Windows support added
  • April 16, 2026: Major expansion — Computer Use, in-app browser, image generation, 90+ plugins, memory preview, automations

The April 16 update transformed Codex from a code-only tool into a full desktop automation platform. GPT-5.5 (codename "Spud") powers the cloud version with improved context handling, coding quality, and token efficiency.

Why Use Local Models with Codex?

  • $0 cost: No ChatGPT Plus/Pro subscription needed. Run unlimited coding tasks.
  • Privacy: Proprietary code never leaves your machine. Critical for enterprise, defense, healthcare.
  • Offline: Code on planes, restricted networks, air-gapped environments.
  • No rate limits: Cloud Codex throttles heavy users. Local has no caps.
  • Custom models: Use fine-tuned models trained on your codebase.
  • Experimentation: Try different models instantly without billing concerns.

Prerequisites & Hardware Requirements

| Tier | Hardware | Best Model | Tokens/sec |
|---|---|---|---|
| Minimum | 16GB RAM (Apple Silicon) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | 8-15 |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | 15-25 |
| Ideal | 64GB+ RAM (M4 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | 20-35 |
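
Before downloading a model, a quick back-of-the-envelope estimate helps confirm it will fit. A common rule of thumb (an approximation, not an exact figure; real usage also depends on context length and KV cache settings) is parameters × bits per weight ÷ 8, plus around 20% overhead:

```shell
# Rough memory estimate for a quantized model (illustrative assumptions):
# bytes ≈ params × bits_per_weight / 8, plus ~20% for KV cache and buffers.
python3 -c "
params = 30e9   # 30B parameters (e.g. Qwen3-Coder 30B)
bits = 4.5      # Q4_K_M averages roughly 4.5 bits per weight
gb = params * bits / 8 / 1e9 * 1.2
print(f'{gb:.2f} GB')"
```

For a 30B model at roughly 4.5 bits per weight, this lands near the ~20GB listed in the table above.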

Software Requirements

  • Codex App or CLI: brew install --cask codex (Mac) or npm install -g @openai/codex (Linux/Windows)
  • Local inference server: Ollama, LM Studio, Unsloth Studio, or llama.cpp
  • Model with tool calling: GLM-4.7-Flash, Qwen3-Coder, or GPT-OSS recommended

Method 1: Setup with Ollama

The simplest approach. Ollama handles model management and serves an OpenAI-compatible API.

Step 1: Install Ollama & Pull Model

```bash
# Install Ollama
brew install ollama   # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh   # Linux

# Pull recommended model
ollama pull glm-4.7-flash

# Start Ollama server (if not running)
ollama serve
```

Step 2: Configure config.toml

Create or edit ~/.codex/config.toml:

```toml
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"
```

Step 3: Launch Codex

```bash
codex --profile local
```

Or with inline model specification:

```bash
codex --model glm-4.7-flash -c model_provider=ollama
```

Method 2: Setup with LM Studio

  1. Download and install LM Studio
  2. Search and download GLM-4.7-Flash-GGUF (Q4_K_M quantization recommended)
  3. Go to Local Server tab → Load model → Click Start Server
  4. Note the port (default: 1234)

Add to ~/.codex/config.toml:

```toml
[model_providers.lmstudio]
name = "LM Studio"
base_url = "http://localhost:1234/v1"
wire_api = "responses"

[profiles.lmstudio]
model_provider = "lmstudio"
model = "glm-4.7-flash"
```

Then launch:

```bash
codex --profile lmstudio
```

Method 3: Setup with Unsloth Studio

Unsloth provides a web UI with self-healing tool calling and automatic inference parameter tuning:

Step 1: Launch Unsloth and load your model

Step 2: Export API key

```bash
export UNSLOTH_STUDIO_API_KEY=sk-uns...xxxx
```

Step 3: Configure config.toml

```toml
[model_providers.unsloth_api]
name = "Unsloth Studio"
base_url = "http://localhost:8888/v1"
env_key = "UNSLOTH_STUDIO_API_KEY"
wire_api = "responses"

[profiles.unsloth_api]
model_provider = "unsloth_api"
model = "gpt-oss-20b-GGUF"
```

Step 4: Launch

```bash
codex -p unsloth_api
```

Method 4: Setup with llama.cpp

For maximum control and performance tuning, build llama.cpp from source:

Step 1: Build llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON  # Use -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j \
    --clean-first --target llama-server
cp llama.cpp/build/bin/llama-server llama.cpp/
```

Step 2: Download Model

```bash
pip install huggingface_hub hf_transfer
python -c "
import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='unsloth/GLM-4.7-Flash-GGUF',
    local_dir='models/GLM-4.7-Flash-GGUF',
    allow_patterns=['*UD-Q4_K_XL*']
)"
```

Step 3: Start Server

```bash
./llama.cpp/llama-server \
    --model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --port 8001 \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --batch-size 4096 --ubatch-size 1024 \
    --temp 1.0 --top-p 0.95 --min-p 0.01
```

Step 4: Configure Codex

```toml
[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000

[profiles.llama_cpp]
model_provider = "llama_cpp"
model = "unsloth/GLM-4.7-Flash"
```

Then launch:

```bash
codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp
```

Best Local Models for Codex

| Model | Params | Context | Tool Calling | VRAM/RAM | Verdict |
|---|---|---|---|---|---|
| GLM-4.7-Flash ⭐ | 30B MoE (3B active) | 128K | Yes (79.5%) | ~6.5GB | Best overall — fast, capable, low requirements |
| Qwen3-Coder | 30B | 128K | Yes | ~20GB | Strong coding quality, needs more hardware |
| GPT-OSS:20B | 20B | 32K | Yes | ~12GB | Good general purpose, smaller context |
| Devstral-Small-2 | 24B | 128K | Yes | ~16GB | Code-focused, solid tool calling |
| Qwen3-Coder-Next | 30B+ | 128K | Yes | ~20GB | Latest iteration, improved reasoning |

Full Comparison: Cloud GPT-5.5 vs Local Models

| Feature | Codex Cloud (GPT-5.5) | Local Models | Ollama Cloud (Free) |
|---|---|---|---|
| Speed | 60-80 tok/s | 8-25 tok/s | 30-60 tok/s |
| Code Quality | Best in class (SWE-bench 90.2%) | 70-85% of cloud quality | 85-95% of cloud quality |
| Computer Use | ✅ Full desktop control | ❌ Not available | ❌ Not available |
| In-App Browser | ✅ Browse and comment | ❌ Not available | ❌ Not available |
| Automations | ✅ Scheduled, recurring | ❌ Not available | ❌ Not available |
| Memory | ✅ Remembers preferences | ❌ Not available | ❌ Not available |
| 90+ Plugins | ✅ Full catalog | ❌ Most unavailable | ❌ Most unavailable |
| Image Generation | ✅ gpt-image-1.5 | ❌ Not available | ❌ Not available |
| Multi-file Reasoning | Excellent | Fair | Good |
| Monthly Cost | $20-200 | $0 | $0 |
| Privacy | Data sent to OpenAI | 100% local | Data sent to provider |
| Offline | No | Yes | No |
| Rate Limits | Yes | None | Yes (generous) |
| Wire API | Responses (native) | Responses (required) | Responses (required) |

Performance Benchmarks

| Setup | Tokens/sec | Cost/Month | Quality Score |
|---|---|---|---|
| GPT-5.5 Cloud (Codex default) | 60-80 | $20-200 | 10/10 |
| Ollama Cloud (qwen3.5:cloud) | 30-60 | $0 | 8.5/10 |
| RTX 4090 (GLM-4.7-Flash) | 20-30 | ~$12 | 7.5/10 |
| RTX 4070 Ti (GLM-4.7-Flash Q4) | 15-25 | ~$10 | 7.5/10 |
| M4 Max 64GB (Qwen3-Coder) | 15-20 | ~$5 | 8/10 |
| M1 Max 32GB (GLM-4.7-Flash) | 10-15 | ~$4 | 7/10 |
| RTX 3060 12GB (GLM-4.7-Flash) | 8-15 | ~$8 | 7/10 |

Limitations

Critical things to understand before going local:

  • wire_api = "responses" is mandatory: Codex has deprecated Chat Completions support. Your local server MUST support the OpenAI Responses API at /v1/responses. Ollama, Unsloth, and recent llama.cpp builds support this.
  • Computer Use is cloud-only: The desktop automation feature (clicking, typing in apps) requires GPT-5.5 and OpenAI's infrastructure. It will NOT work with local models.
  • Automations/scheduling disabled: Recurring tasks, thread reuse, and future work scheduling require cloud connectivity.
  • Memory doesn't persist: The "remembers preferences" feature is cloud-only.
  • Plugins mostly unavailable: The 90+ plugins (Atlassian, GitLab, CircleCI, etc.) require cloud authentication.
  • Slower inference (3-10x): Simple tasks take 3x longer; complex tasks up to 10x longer than cloud.
  • Weaker multi-file reasoning: Local models struggle with cross-file dependencies and architectural understanding.
  • Edit accuracy drops: Cloud GPT-5.5 has ~98% edit accuracy. Local models land at 70-80%, which means more broken patches that need manual fixing.
  • Tool calling can fail: Models without robust tool calling support will generate text descriptions instead of executing actions.
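
In practice, the first item is the one that breaks most setups. A provider stanza that satisfies it looks like this (mirroring the Ollama example from Method 1):

```toml
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"   # required; servers exposing only /v1/chat/completions will not work
```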

Possibilities

  • Free unlimited coding: Run thousands of tasks without watching a billing meter.
  • Complete privacy: Trade secrets, proprietary algorithms, client code — all stays local.
  • GDPR/HIPAA compliance: Zero cross-border data transfer. No DPAs needed with third parties.
  • Hybrid workflow: Use --profile local for sensitive work, --profile cloud for complex tasks. Switch in one flag.
  • Custom fine-tuned models: Train domain-specific models on your codebase and use them through Codex.
  • Offline development: Airports, rural areas, classified facilities — code with AI anywhere.
  • Team standardization: Share config.toml across your team for consistent local setups.
  • Model A/B testing: Compare different models on the same task instantly.

Cost Analysis

| Option | Upfront | Monthly | 6-Month Total | 12-Month Total | Quality |
|---|---|---|---|---|---|
| ChatGPT Plus (Cloud Codex) | $0 | $20 | $120 | $240 | Best |
| ChatGPT Pro | $0 | $200 | $1,200 | $2,400 | Best + unlimited |
| Local GPU (RTX 4070 Ti) | $489 | ~$10 | $549 | $609 | 70-85% |
| Existing Mac (16GB+) | $0 | ~$4 | $24 | $48 | 70-85% |
| Ollama Cloud Models | $0 | $0 | $0 | $0 | 85-95% |

Best value: Ollama cloud models give 85-95% of cloud quality at $0 cost. If privacy isn't a hard requirement, start here.
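
Using the table's own numbers, the GPU route pays for itself slowly: buying a $489 card to replace a $20/month subscription, while paying about $10/month in electricity, nets roughly $10/month in savings:

```shell
# Break-even estimate using the cost table's figures (illustrative arithmetic)
python3 -c "
gpu_upfront = 489    # RTX 4070 Ti purchase price
local_power = 10     # approx. monthly electricity for local inference
cloud_sub = 20       # ChatGPT Plus monthly
months = gpu_upfront / (cloud_sub - local_power)
print(f'{months:.1f} months to break even')"
```

At that rate the card takes roughly four years to break even, which is why the zero-upfront options win on pure cost unless privacy is the driver.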

Troubleshooting

"type of tool must be function" error

This means your server doesn't support wire_api = "responses" correctly. Update to the latest version of your inference server (Ollama 0.14.3+, latest llama.cpp).

Model not found

  • Check available models: ollama list or curl http://localhost:8001/v1/models
  • Use the exact model name from the API response in your config.toml
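
The id field in the server's response is exactly what belongs in the model setting. A sketch of extracting the ids, run here against a canned sample response standing in for a live server:

```shell
# Extract model ids from a /v1/models response.
# The sample JSON below stands in for: curl -s http://localhost:11434/v1/models
cat > /tmp/models.json <<'EOF'
{"object": "list", "data": [{"id": "glm-4.7-flash", "object": "model"}]}
EOF
python3 -c "
import json
with open('/tmp/models.json') as f:
    data = json.load(f)
# Each 'id' is a valid value for model = \"...\" in config.toml
for m in data['data']:
    print(m['id'])"
```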

Codex hangs or times out

  • Add stream_idle_timeout_ms = 10000000 to your model_provider config
  • Local models are slower — Codex may timeout waiting for responses on complex tasks
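
If timeouts persist, raise the stream idle timeout on the provider stanza itself; the value is in milliseconds (10,000,000 ms is roughly 2.8 hours):

```toml
[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000   # ~2.8 hours of idle tolerance
```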

Tool calling not working

  • Verify your model supports tool calling (GLM-4.7-Flash recommended)
  • Enable jinja templates in llama.cpp: add --jinja flag
  • Check that wire_api = "responses" is set (not "chat")

Frequently Asked Questions

Can the Codex desktop app use local models?

Yes. The Codex App reads from ~/.codex/config.toml and supports custom model providers pointing to local servers. You configure a model_provider with a local base_url and select it via profiles.

Does Computer Use work with local models?

No. Computer Use (background desktop automation) is exclusively a cloud feature that requires GPT-5.5 and OpenAI's infrastructure. Local models cannot control your desktop.

What's the difference between Codex App and Codex CLI with local models?

Both use the same config.toml and support the same local model providers. The App adds GUI features (worktree visualization, terminal tabs, preview panes) while the CLI is terminal-only. Cloud-exclusive features (Computer Use, automations, plugins) are absent in both when using local models.

Which local model is best for Codex?

GLM-4.7-Flash is the top pick: 128K context, strong tool calling (79.5%), and runs on 16GB RAM thanks to MoE architecture. For raw coding quality, Qwen3-Coder 30B is slightly better but requires 20GB+ VRAM.

Is Chat Completions API still supported?

No. OpenAI deprecated Chat Completions support in Codex. You must use wire_api = "responses" in your config.toml. Servers that only expose /v1/chat/completions will not work.

Can I use Ollama's free cloud models with Codex?

Yes. Ollama proxies models like qwen3.5:cloud and glm-5:cloud with generous free tiers. They run at 30-60 tok/s with no hardware requirements beyond running Ollama itself. Configure them the same way as local models in config.toml.
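
The minimal configuration is a standard provider stanza plus a profile naming the cloud-proxied model (model name taken from the answer above; verify what your install actually exposes with ollama list):

```toml
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[profiles.free]
model_provider = "ollama"
model = "qwen3.5:cloud"
```

Launch with codex --profile free.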

Recommended Hybrid Setup

The most productive setup combines local and cloud:

```toml
# ~/.codex/config.toml

# Default to local for privacy
model_provider = "ollama"
model = "glm-4.7-flash"

[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[model_providers.cloud]
name = "OpenAI Cloud"
# Uses default OpenAI API

[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"

[profiles.cloud]
model_provider = "cloud"
model = "gpt-5.5"

[profiles.free]
model_provider = "ollama"
model = "qwen3.5:cloud"
```

Daily usage:

```bash
# Private work (sensitive code)
codex --profile local "fix the auth module"

# Complex tasks (need quality)
codex --profile cloud "refactor the entire payment system"

# Free + fast (non-sensitive)
codex --profile free "add documentation to all functions"
```
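
If you switch profiles constantly, a small wrapper function saves typing. This is a hypothetical helper, not part of Codex; it echoes the command it would run so the sketch is safe to try:

```shell
# Hypothetical wrapper: default to the 'local' profile, override via CODEX_PROFILE.
# 'echo' makes the sketch inspectable; drop it to actually invoke codex.
cx() {
  echo codex --profile "${CODEX_PROFILE:-local}" "$@"
}

cx "fix the auth module"
CODEX_PROFILE=cloud cx "refactor the entire payment system"
```

Drop the function into your shell rc file alongside the shared config.toml for a consistent team setup.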

For a broader comparison of AI coding tools, see our Cursor vs Windsurf vs Claude Code comparison. And if you're building creative projects, check out our free AI Image Generator.

Tags: openai codex, codex app, local models, ollama, llama.cpp, ai coding, codex desktop, free ai tools