The OpenAI Codex App is the most powerful AI coding agent available in 2026, but it doesn't have to cost you anything. By connecting Codex to local models via Ollama, LM Studio, Unsloth, or llama.cpp, you get the full agentic coding experience running entirely on your hardware. Zero API costs. Complete privacy. No rate limits.
This guide covers every setup method, the best models to use, a full comparison chart between cloud GPT-5.5 and local models, real performance benchmarks, limitations, and practical workflows. Whether you're on a MacBook with 16GB RAM or a workstation with an RTX 4090, this is your complete reference for running the Codex app with local models in 2026.
What Is the OpenAI Codex App?
OpenAI ships Codex across three surfaces. Understanding the difference matters for local model setup:
| Surface | What It Is | Local Models? |
|---|---|---|
| Codex App (Desktop) | macOS/Windows desktop app. Multi-agent orchestration, worktrees, automations, Computer Use, in-app browser, 90+ plugins. | Yes (via config.toml) |
| Codex CLI | Terminal-based coding agent. Runs in your shell, reads/writes files, executes commands. | Yes (via config.toml or env vars) |
| Codex Agent (ChatGPT) | Cloud-only agent in ChatGPT. Runs in sandboxed environments on OpenAI's servers. | No (cloud only) |
Timeline
- February 2, 2026: Codex App launches on macOS
- March 4, 2026: Windows support added
- April 16, 2026: Major expansion adding Computer Use, in-app browser, image generation, 90+ plugins, memory preview, and automations
The April 16 update transformed Codex from a code-only tool into a full desktop automation platform. GPT-5.5 (codename "Spud") powers the cloud version with improved context handling, coding quality, and token efficiency.
Why Use Local Models with Codex?
- $0 cost: No ChatGPT Plus/Pro subscription needed. Run unlimited coding tasks.
- Privacy: Proprietary code never leaves your machine. Critical for enterprise, defense, healthcare.
- Offline: Code on planes, restricted networks, air-gapped environments.
- No rate limits: Cloud Codex throttles heavy users. Local has no caps.
- Custom models: Use fine-tuned models trained on your codebase.
- Experimentation: Try different models instantly without billing concerns.
Prerequisites & Hardware Requirements
| Tier | Hardware | Best Model | Tokens/sec |
|---|---|---|---|
| Minimum | 16GB RAM (Apple Silicon) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | 8-15 |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | 15-25 |
| Ideal | 64GB+ RAM (M4 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | 20-35 |
Software Requirements
- Codex App or CLI: brew install --cask codex (Mac) or npm install -g @openai/codex (Linux/Windows); a quick verification snippet follows this list
- Local inference server: Ollama, LM Studio, Unsloth Studio, or llama.cpp
- Model with tool calling: GLM-4.7-Flash, Qwen3-Coder, or GPT-OSS recommended
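If you want to confirm the tooling is in place before touching any config, a quick version check is enough. The commands below assume the Homebrew/npm install above plus Ollama; swap in the check for whichever server you actually use.
# Confirm the Codex binary is on PATH
codex --version
# Confirm your inference server is installed (Ollama shown here)
ollama --version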
Method 1: Setup with Ollama
The simplest approach. Ollama handles model management and serves an OpenAI-compatible API.
Step 1: Install Ollama & Pull Model
# Install Ollama
brew install ollama # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh # Linux
# Pull recommended model
ollama pull glm-4.7-flash
# Start Ollama server (if not running)
ollama serve
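Before pointing Codex at it, it's worth a quick check that Ollama's OpenAI-compatible endpoint is answering and that the model you pulled shows up:
# Sanity check: list the models the server exposes
curl http://localhost:11434/v1/models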
Step 2: Configure config.toml
Create or edit ~/.codex/config.toml:
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"
Step 3: Launch Codex
codex --profile local
Or with inline model specification:
codex --model glm-4.7-flash -c model_provider=ollama
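As a quick smoke test of the profile, hand Codex a small throwaway task (the prompt here is arbitrary):
# One-shot task to confirm the local profile resolves and the model answers
codex --profile local "write a one-line shell script that prints hello"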
Method 2: Setup with LM Studio
- Download and install LM Studio
- Search and download GLM-4.7-Flash-GGUF (Q4_K_M quantization recommended)
- Go to the Local Server tab → Load model → Click Start Server
- Note the port (default: 1234)
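Before adding the Codex config, you can confirm LM Studio's server is actually serving on that port (assuming the default 1234):
# List the models LM Studio's local server exposes
curl http://localhost:1234/v1/models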
Add to ~/.codex/config.toml:
[model_providers.lmstudio]
name = "LM Studio"
base_url = "http://localhost:1234/v1"
wire_api = "responses"
[profiles.lmstudio]
model_provider = "lmstudio"
model = "glm-4.7-flash"
codex --profile lmstudio
Method 3: Setup with Unsloth Studio
Unsloth provides a web UI with self-healing tool calling and automatic inference parameter tuning:
Step 1: Launch Unsloth and load your model
Step 2: Export API key
export UNSLOTH_STUDIO_API_KEY=sk-uns...xxxx
Step 3: Configure config.toml
[model_providers.unsloth_api]
name = "Unsloth Studio"
base_url = "http://localhost:8888/v1"
env_key = "UNSLOTH_STUDIO_API_KEY"
wire_api = "responses"
[profiles.unsloth_api]
model_provider = "unsloth_api"
model = "gpt-oss-20b-GGUF"
Step 4: Launch
codex -p unsloth_api
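If Codex can't reach the server, verify it and the key directly. This is a minimal probe assuming Unsloth Studio exposes the standard OpenAI-compatible /v1/models route on port 8888, as the config above implies:
# Probe the server, passing the Unsloth Studio key from the environment
curl http://localhost:8888/v1/models \
  -H "Authorization: Bearer $UNSLOTH_STUDIO_API_KEY"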
Method 4: Setup with llama.cpp
For maximum control and performance tuning, build llama.cpp from source:
Step 1: Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON # Use -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j \
--clean-first --target llama-server
cp llama.cpp/build/bin/llama-server llama.cpp/
Step 2: Download Model
pip install huggingface_hub hf_transfer
python -c "
import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='unsloth/GLM-4.7-Flash-GGUF',
local_dir='models/GLM-4.7-Flash-GGUF',
allow_patterns=['*UD-Q4_K_XL*']
)"
Step 3: Start Server
./llama.cpp/llama-server \
--model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
--alias "unsloth/GLM-4.7-Flash" \
--port 8001 \
--ctx-size 131072 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--batch-size 4096 --ubatch-size 1024 \
--temp 1.0 --top-p 0.95 --min-p 0.01
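llama-server exposes a /health route alongside the OpenAI-compatible API, which makes it easy to wait for the model to finish loading before launching Codex:
# Block until the server reports ready, then confirm the alias is exposed
until curl -sf http://localhost:8001/health; do sleep 2; done
curl http://localhost:8001/v1/models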
Step 4: Configure Codex
[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000
[profiles.llama_cpp]
model_provider = "llama_cpp"
model = "unsloth/GLM-4.7-Flash"
codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp
Best Local Models for Codex
| Model | Params | Context | Tool Calling | VRAM/RAM | Verdict |
|---|---|---|---|---|---|
| GLM-4.7-Flash | 30B MoE (3B active) | 128K | Yes (79.5%) | ~6.5GB | Best overall: fast, capable, low requirements |
| Qwen3-Coder | 30B | 128K | Yes | ~20GB | Strong coding quality, needs more hardware |
| GPT-OSS:20B | 20B | 32K | Yes | ~12GB | Good general purpose, smaller context |
| Devstral-Small-2 | 24B | 128K | Yes | ~16GB | Code-focused, solid tool calling |
| Qwen3-Coder-Next | 30B+ | 128K | Yes | ~20GB | Latest iteration, improved reasoning |
Full Comparison: Cloud GPT-5.5 vs Local Models
| Feature | Codex Cloud (GPT-5.5) | Local Models | Ollama Cloud (Free) |
|---|---|---|---|
| Speed | 60-80 tok/s | 8-25 tok/s | 30-60 tok/s |
| Code Quality | Best in class (SWE-bench 90.2%) | 70-85% of cloud quality | 85-95% of cloud quality |
| Computer Use | ✅ Full desktop control | ❌ Not available | ❌ Not available |
| In-App Browser | ✅ Browse and comment | ❌ Not available | ❌ Not available |
| Automations | ✅ Scheduled, recurring | ❌ Not available | ❌ Not available |
| Memory | ✅ Remembers preferences | ❌ Not available | ❌ Not available |
| 90+ Plugins | ✅ Full catalog | ❌ Most unavailable | ❌ Most unavailable |
| Image Generation | ✅ gpt-image-1.5 | ❌ Not available | ❌ Not available |
| Multi-file Reasoning | Excellent | Fair | Good |
| Monthly Cost | $20-200 | $0 | $0 |
| Privacy | Data sent to OpenAI | 100% local | Data sent to provider |
| Offline | No | Yes | No |
| Rate Limits | Yes | None | Yes (generous) |
| Wire API | Responses (native) | Responses (required) | Responses (required) |
Performance Benchmarks
| Setup | Tokens/sec | Cost/Month | Quality Score |
|---|---|---|---|
| GPT-5.5 Cloud (Codex default) | 60-80 | $20-200 | 10/10 |
| Ollama Cloud (qwen3.5:cloud) | 30-60 | $0 | 8.5/10 |
| RTX 4090 (GLM-4.7-Flash) | 20-30 | ~$12 | 7.5/10 |
| RTX 4070 Ti (GLM-4.7-Flash Q4) | 15-25 | ~$10 | 7.5/10 |
| M4 Max 64GB (Qwen3-Coder) | 15-20 | ~$5 | 8/10 |
| M1 Max 32GB (GLM-4.7-Flash) | 10-15 | ~$4 | 7/10 |
| RTX 3060 12GB (GLM-4.7-Flash) | 8-15 | ~$8 | 7/10 |
Limitations
Critical things to understand before going local:
- wire_api = "responses" is mandatory: Codex has deprecated Chat Completions support. Your local server MUST support the OpenAI Responses API at /v1/responses. Ollama, Unsloth, and recent llama.cpp builds support this.
- Computer Use is cloud-only: The desktop automation feature (clicking, typing in apps) requires GPT-5.5 and OpenAI's infrastructure. It will NOT work with local models.
- Automations/scheduling disabled: Recurring tasks, thread reuse, and future work scheduling require cloud connectivity.
- Memory doesn't persist: The "remembers preferences" feature is cloud-only.
- Plugins mostly unavailable: The 90+ plugins (Atlassian, GitLab, CircleCI, etc.) require cloud authentication.
- Slower inference (3-10x): Simple tasks take 3x longer; complex tasks up to 10x longer than cloud.
- Weaker multi-file reasoning: Local models struggle with cross-file dependencies and architectural understanding.
- Edit accuracy drops: Cloud GPT-5.5 has ~98% edit accuracy. Local models land at 70-80%, which means more broken patches that need manual fixing.
- Tool calling can fail: Models without robust tool calling support will generate text descriptions instead of executing actions.
Possibilities
- Free unlimited coding: Run thousands of tasks without watching a billing meter.
- Complete privacy: Trade secrets, proprietary algorithms, and client code all stay local.
- GDPR/HIPAA compliance: Zero cross-border data transfer. No DPAs needed with third parties.
- Hybrid workflow: Use --profile local for sensitive work, --profile cloud for complex tasks. Switch with a single flag.
- Custom fine-tuned models: Train domain-specific models on your codebase and use them through Codex.
- Offline development: Code with AI anywhere, from airports and rural areas to classified facilities.
- Team standardization: Share config.toml across your team for consistent local setups.
- Model A/B testing: Compare different models on the same task instantly.
Cost Analysis
| Option | Upfront | Monthly | 6-Month Total | 12-Month Total | Quality |
|---|---|---|---|---|---|
| ChatGPT Plus (Cloud Codex) | $0 | $20 | $120 | $240 | Best |
| ChatGPT Pro | $0 | $200 | $1,200 | $2,400 | Best + unlimited |
| Local GPU (RTX 4070 Ti) | $489 | ~$10 | $549 | $609 | 70-85% |
| Existing Mac (16GB+) | $0 | ~$4 | $24 | $48 | 70-85% |
| Ollama Cloud Models | $0 | $0 | $0 | $0 | 85-95% |
Best value: Ollama cloud models give 85-95% of cloud quality at $0 cost. If privacy isn't a hard requirement, start here.
Troubleshooting
"type of tool must be function" error
This means your server doesn't support wire_api = "responses" correctly. Update to the latest version of your inference server (Ollama 0.14.3+, latest llama.cpp).
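You can also probe the endpoint directly. This is a minimal request shaped like OpenAI's Responses API (adjust the port and model name to match your setup); a 404 or similar error means the server only speaks Chat Completions:
# Minimal Responses API probe against a local Ollama server
curl http://localhost:11434/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "input": "say hi"}'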
Model not found
- Check available models: ollama list or curl http://localhost:8001/v1/models
- Use the exact model name from the API response in your config.toml
Codex hangs or times out
- Add stream_idle_timeout_ms = 10000000 to your model_provider config
- Local models are slower, so Codex may time out waiting for responses on complex tasks
Tool calling not working
- Verify your model supports tool calling (GLM-4.7-Flash recommended)
- Enable Jinja chat templates in llama.cpp: add the --jinja flag (see the relaunch example after this list)
- Check that wire_api = "responses" is set (not "chat")
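For llama.cpp, that means relaunching llama-server with the template flag added; everything else from the Method 4 command stays the same (abbreviated here):
# Relaunch with Jinja chat templates so tool calls come back as structured calls
./llama.cpp/llama-server \
  --model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  --port 8001 --jinja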
Frequently Asked Questions
Can the Codex desktop app use local models?
Yes. The Codex App reads from ~/.codex/config.toml and supports custom model providers pointing to local servers. You configure a model_provider with a local base_url and select it via profiles.
Does Computer Use work with local models?
No. Computer Use (background desktop automation) is exclusively a cloud feature that requires GPT-5.5 and OpenAI's infrastructure. Local models cannot control your desktop.
What's the difference between Codex App and Codex CLI with local models?
Both use the same config.toml and support the same local model providers. The App adds GUI features (worktree visualization, terminal tabs, preview panes) while the CLI is terminal-only. Cloud-exclusive features (Computer Use, automations, plugins) are absent in both when using local models.
Which local model is best for Codex?
GLM-4.7-Flash is the top pick: 128K context, strong tool calling (79.5%), and runs on 16GB RAM thanks to MoE architecture. For raw coding quality, Qwen3-Coder 30B is slightly better but requires 20GB+ VRAM.
Is Chat Completions API still supported?
No. OpenAI deprecated Chat Completions support in Codex. You must use wire_api = "responses" in your config.toml. Servers that only expose /v1/chat/completions will not work.
Can I use Ollama's free cloud models with Codex?
Yes. Ollama proxies models like qwen3.5:cloud and glm-5:cloud with generous free tiers. They run at 30-60 tok/s with no hardware requirements beyond running Ollama itself. Configure them the same way as local models in config.toml.
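Setup mirrors the local flow. Using the model tag mentioned above (and assuming your Ollama install is signed in to its cloud tier), it looks like this:
# Pull the cloud-proxied tag, then point Codex at it through the same Ollama provider
ollama pull qwen3.5:cloud
codex --model qwen3.5:cloud -c model_provider=ollama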
Recommended Workflow
The most productive setup combines local and cloud:
# ~/.codex/config.toml
# Default to local for privacy
model_provider = "ollama"
model = "glm-4.7-flash"
[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
[model_providers.cloud]
name = "OpenAI Cloud"
# Uses default OpenAI API
[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"
[profiles.cloud]
model_provider = "cloud"
model = "gpt-5.5"
[profiles.free]
model_provider = "ollama"
model = "qwen3.5:cloud"
Daily usage:
# Private work (sensitive code)
codex --profile local "fix the auth module"
# Complex tasks (need quality)
codex --profile cloud "refactor the entire payment system"
# Free + fast (non-sensitive)
codex --profile free "add documentation to all functions"
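If you bounce between these constantly, a few shell aliases reduce the routing decision to a single word; they're just wrappers around the profiles defined above:
# Add to ~/.zshrc or ~/.bashrc
alias cxl='codex --profile local'   # private, on-device
alias cxc='codex --profile cloud'   # highest quality
alias cxf='codex --profile free'    # free cloud-proxied model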
For a broader comparison of AI coding tools, see our Cursor vs Windsurf vs Claude Code comparison. And if you're building creative projects, check out our free AI Image Generator.