
How to Use the OpenAI Codex App with Local Models: Complete Setup Guide (2026)

Run the OpenAI Codex desktop app with local models through Ollama, LM Studio, or llama.cpp. Complete config.toml setup, the best models, performance benchmarks, and troubleshooting.


OpenAI Codex App is one of the most capable AI coding agents of 2026, and you do not have to spend a single rupee to use it. By connecting Codex to local models through Ollama, LM Studio, Unsloth, or llama.cpp, you get the agentic coding experience on your own hardware. Zero API costs. Full privacy. No rate limits.

This guide covers every setup method, the best models, a full comparison chart between cloud GPT-5.5 and local models, real performance benchmarks, limitations, and practical workflows. Whether you have a MacBook with 16GB RAM or a workstation with an RTX 4090, this is your complete reference for running the Codex app with local models in 2026.

Already using Codex? See our OpenAI Codex macOS App deep-dive for a full breakdown of the cloud features.

What Is the OpenAI Codex App?

OpenAI offers Codex on three surfaces. Understanding the difference between them matters for local model setup:

Surface	What it is	Local Models?
Codex App (Desktop)	macOS/Windows desktop app. Multi-agent orchestration, worktrees, automations, Computer Use, in-app browser, 90+ plugins.	Yes (via config.toml)
Codex CLI	Terminal-based coding agent. Runs in your shell, reads/writes files, executes commands.	Yes (via config.toml or env vars)
Codex Agent (ChatGPT)	Cloud-only agent inside ChatGPT. Runs in sandboxed environments on OpenAI's servers.	No (cloud only)

Timeline

February 2, 2026: Codex App launched on macOS
March 4, 2026: Windows support added
April 16, 2026: Major expansion with Computer Use, in-app browser, image generation, 90+ plugins, memory preview, and automations

The April 16 update turned Codex from a code-only tool into a full desktop automation platform. GPT-5.5 (codename "Spud") powers the cloud version with better context handling, coding quality, and token efficiency.

Why Use Local Models with Codex?

$0 API cost: No ChatGPT Plus/Pro subscription needed. Run unlimited coding tasks.
Privacy: Proprietary code never leaves your machine. Essential for enterprise, defense, and healthcare.
Offline: Write code on planes, on restricted networks, and in air-gapped environments.
No rate limits: Cloud Codex throttles heavy users. Local has no caps.
Custom models: Use models fine-tuned on your own codebase.
Experimentation: Try different models without worrying about billing.

Prerequisites and Hardware Requirements

Tier	Hardware	Best Model	Tokens/sec
Minimum	16GB RAM (Apple Silicon) or RTX 3060 12GB	GLM-4.7-Flash (Q4)	8-15
Recommended	32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB	Qwen3-Coder 30B (Q4)	15-25
Ideal	64GB+ RAM (M4 Max) or RTX 4090 24GB	Qwen2.5-Coder-32B (Q6)	20-35

Software Requirements

Codex App or CLI: brew install --cask codex (Mac) or npm install -g @openai/codex (Linux/Windows)
Local inference server: Ollama, LM Studio, Unsloth Studio, or llama.cpp
A model with tool calling: GLM-4.7-Flash, Qwen3-Coder, or GPT-OSS recommended

Method 1: Setup with Ollama

The easiest route. Ollama handles model management and serves an OpenAI-compatible API.

Step 1: Install Ollama and Pull a Model

terminal

# Install Ollama
brew install ollama   # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh   # Linux

# Pull the recommended model
ollama pull glm-4.7-flash

# Start the Ollama server (if it is not already running)
ollama serve
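
Before pointing Codex at the server, it can help to confirm that it is actually answering. The sketch below polls the OpenAI-style model listing until it responds. It assumes your server exposes GET /v1/models under the base URL used in this guide; the function names fetch_model_ids and wait_for_server are illustrative, not part of Ollama or Codex.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url, timeout=5.0):
    """Return model ids from an OpenAI-compatible GET {base_url}/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_server(base_url, fetch=fetch_model_ids, attempts=10, delay=1.0):
    """Poll the server; return its model ids, or None if it never answers."""
    for _ in range(attempts):
        try:
            return fetch(base_url)
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server not reachable yet: wait and retry.
            time.sleep(delay)
    return None
```

Calling wait_for_server("http://localhost:11434/v1") should return a list that includes the model you pulled, or None if the server never came up. The fetch parameter exists only so the polling logic can be exercised without a live server.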

Step 2: Configure config.toml

Create or edit ~/.codex/config.toml:

terminal

[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"

Step 3: Launch Codex

terminal

codex --profile local

Or with an inline model specification:

terminal

codex --model glm-4.7-flash -c model_provider=ollama

Method 2: Setup with LM Studio

Download and install LM Studio
Search for GLM-4.7-Flash-GGUF and download it (Q4_K_M quantization recommended)
Go to the Local Server tab → load the model → click Start Server
Note the port (default: 1234)

Add to ~/.codex/config.toml:

terminal

[model_providers.lmstudio]
name = "LM Studio"
base_url = "http://localhost:1234/v1"
wire_api = "responses"

[profiles.lmstudio]
model_provider = "lmstudio"
model = "glm-4.7-flash"

terminal

codex --profile lmstudio

Method 3: Setup with Unsloth Studio

Unsloth provides a web UI with self-healing tool calling and automatic inference parameter tuning:

Step 1: Launch Unsloth and load your model

Step 2: Export the API key

terminal

export UNSLOTH_STUDIO_API_KEY=sk-uns...

Step 3: Configure config.toml

[model_providers.unsloth_api]
name = "Unsloth Studio"
base_url = "http://localhost:8888/v1"
env_key = "UNSLOTH_STUDIO_API_KEY"
wire_api = "responses"

[profiles.unsloth_api]
model_provider = "unsloth_api"
model = "gpt-oss-20b-GGUF"



Step 4: Launch

terminal

codex -p unsloth_api
    

Method 4: Setup with llama.cpp

For maximum control and performance tuning, build llama.cpp from source:

Step 1: Build llama.cpp

terminal

git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON  # use -DGGML_CUDA=OFF for CPU/Metal
cmake --build llama.cpp/build --config Release -j \
    --clean-first --target llama-server
cp llama.cpp/build/bin/llama-server llama.cpp/
    

Step 2: Download the Model

terminal

pip install huggingface_hub hf_transfer
python -c "
import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='unsloth/GLM-4.7-Flash-GGUF',
    local_dir='models/GLM-4.7-Flash-GGUF',
    allow_patterns=['*UD-Q4_K_XL*']
)"
    

Step 3: Start the Server

terminal

./llama.cpp/llama-server \
    --model models/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --port 8001 \
    --ctx-size 131072 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --batch-size 4096 --ubatch-size 1024 \
    --temp 1.0 --top-p 0.95 --min-p 0.01
    

Step 4: Configure Codex

terminal

[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://localhost:8001/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000

[profiles.llama_cpp]
model_provider = "llama_cpp"
model = "unsloth/GLM-4.7-Flash"

terminal

codex --model unsloth/GLM-4.7-Flash -c model_provider=llama_cpp
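
All four methods end in the same pair of TOML tables, differing only in the name, port, model, and two optional keys (env_key and stream_idle_timeout_ms). As a rough sketch, a small Python helper can generate that fragment so the shape stays consistent across servers. The function name render_local_profile and its parameters are made up for illustration and are not part of Codex.

```python
def render_local_profile(key, display_name, port, model,
                         env_key=None, idle_timeout_ms=None):
    """Render the provider + profile TOML fragment used throughout this guide."""
    lines = [
        f"[model_providers.{key}]",
        f'name = "{display_name}"',
        f'base_url = "http://localhost:{port}/v1"',
    ]
    if env_key is not None:
        # Name of the environment variable holding the API key.
        lines.append(f'env_key = "{env_key}"')
    lines.append('wire_api = "responses"')
    if idle_timeout_ms is not None:
        lines.append(f"stream_idle_timeout_ms = {idle_timeout_ms}")
    lines += [
        "",
        f"[profiles.{key}]",
        f'model_provider = "{key}"',
        f'model = "{model}"',
    ]
    return "\n".join(lines) + "\n"
```

For example, render_local_profile("llama_cpp", "llama_cpp API", 8001, "unsloth/GLM-4.7-Flash", idle_timeout_ms=10000000) reproduces the llama.cpp configuration from Step 4. The helper does no escaping, so it assumes the values contain no quotes.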
    

Best Local Models for Codex

Model	Params	Context	Tool Calling	VRAM/RAM	Verdict
GLM-4.7-Flash ⭐	30B MoE (3B active)	128K	Yes (79.5%)	~6.5GB	Best overall: fast, capable, low requirements
Qwen3-Coder	30B	128K	Yes	~20GB	Strong coding quality, needs more hardware
GPT-OSS:20B	20B	32K	Yes	~12GB	Good general purpose, small context
Devstral-Small-2	24B	128K	Yes	~16GB	Code-focused, solid tool calling
Qwen3-Coder-Next	30B+	128K	Yes	~20GB	Latest iteration, better reasoning



Full Comparison: Cloud GPT-5.5 vs Local Models

Feature	Codex Cloud (GPT-5.5)	Local Models	Ollama Cloud (Free)
Speed	60-80 tok/s	8-25 tok/s	30-60 tok/s
Code Quality	Best in class (SWE-bench 90.2%)	70-85% of cloud quality	85-95% of cloud quality
Computer Use	✅ Full desktop control	❌ Not available	❌ Not available
In-App Browser	✅ Browse and comment	❌ Not available	❌ Not available
Automations	✅ Scheduled, recurring	❌ Not available	❌ Not available
Memory	✅ Remembers preferences	❌ Not available	❌ Not available
90+ Plugins	✅ Full catalog	❌ Mostly unavailable	❌ Mostly unavailable
Image Generation	✅ gpt-image-1.5	❌ Not available	❌ Not available
Multi-file Reasoning	Excellent	Fair	Good
Monthly Cost	$20-200	$0	$0
Privacy	Data goes to OpenAI	100% local	Data goes to the provider
Offline	No	Yes	No
Rate Limits	Yes	No	Yes (generous)
Wire API	Responses (native)	Responses (required)	Responses (required)



Performance Benchmarks
    



Setup	Tokens/sec	Cost/Month	Quality Score
GPT-5.5 Cloud (Codex default)	60-80	$20-200	10/10
Ollama Cloud (qwen3.5:cloud)	30-60	$0	8.5/10
RTX 4090 (GLM-4.7-Flash)	20-30	~$12	7.5/10
RTX 4070 Ti (GLM-4.7-Flash Q4)	15-25	~$10	7.5/10
M4 Max 64GB (Qwen3-Coder)	15-20	~$5	8/10
M1 Max 32GB (GLM-4.7-Flash)	10-15	~$4	7/10
RTX 3060 12GB (GLM-4.7-Flash)	8-15	~$8	7/10
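
The table does not say how these figures were measured, and throughput varies with quantization, context length, and prompt. To get a number for your own machine, one simple approach is to time a single generation and divide tokens produced by seconds elapsed. In the sketch below, generate is a hypothetical callable you supply (for example, a wrapper around a request to your local server) that returns how many tokens it produced.

```python
import time


def tokens_per_second(generate, prompt, clock=time.perf_counter):
    """Time one call to generate(prompt) and return generated tokens per second.

    generate must return the number of tokens it produced.
    """
    start = clock()
    produced = generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return produced / elapsed
```

A single run is noisy, so averaging several prompts of realistic length gives a figure that is easier to compare against the ranges above.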



Limitations
    

Key things to understand before going local:

wire_api = "responses" is mandatory: Codex has removed Chat Completions support. Your local server must support the OpenAI Responses API at /v1/responses. Ollama, Unsloth, and recent llama.cpp builds support it.
Computer Use is cloud-only: The desktop automation feature (clicking and typing in apps) requires GPT-5.5 and OpenAI's infrastructure. It will not work with local models.
Automations/scheduling are disabled: Recurring tasks, thread reuse, and scheduling future work require cloud connectivity.
Memory does not persist: The feature that remembers your preferences is cloud-only.
Plugins are mostly unavailable: The 90+ plugins (Atlassian, GitLab, CircleCI, etc.) require cloud authentication.
Slower inference (3-10x): Simple tasks take about 3x longer; complex tasks can be up to 10x slower than cloud.
Weaker multi-file reasoning: Local models struggle with cross-file dependencies and architectural understanding.
Edit accuracy drops: Cloud GPT-5.5 has ~98% edit accuracy. Local models stay at 70-80%, which means broken patches that you have to fix manually.
Tool calling can fail: Without strong tool calling support, the model will only generate text descriptions instead of executing actions.
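
The last limitation is the easiest to test before committing to a model. The sketch below builds a minimal Responses-style request that offers a single function tool, then inspects a response for function_call output items. The get_weather tool, the prompt, and both function names are invented for the probe, and it assumes your server follows the OpenAI Responses API shape for tools and output items, which local servers may implement only partially.

```python
def build_tool_probe(model):
    """Build a POST /v1/responses body that offers one function tool."""
    return {
        "model": model,
        "input": "What is the weather in Pune? Use the tool.",
        "tools": [
            {
                "type": "function",
                "name": "get_weather",  # hypothetical tool, for the probe only
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    }


def called_tools(response):
    """Return the names of tools the model tried to call in a response body."""
    return [
        item["name"]
        for item in response.get("output", [])
        if item.get("type") == "function_call"
    ]
```

If called_tools returns an empty list and the model answered in plain text instead, that model or server is likely to describe actions in Codex without carrying them out.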


Possibilities
    


Free unlimited coding: Run thousands of tasks without watching a billing meter.
Full privacy: Trade secrets, proprietary algorithms, and client code all stay local.
GDPR/HIPAA compliance: Zero cross-border data transfer. No DPAs needed with third parties.
Hybrid workflow: --profile local for sensitive work, --profile cloud for complex tasks. Switch with a single flag.
Custom fine-tuned models: Train domain-specific models on your codebase and use them through Codex.
Offline development: Airports, rural areas, classified facilities. Code with AI anywhere.
Team standardization: Share config.toml across your team for consistent local setups.
Model A/B testing: Quickly compare different models on the same task.


Cost Analysis
    



Option	Upfront	Monthly	6-Month Total	12-Month Total	Quality
ChatGPT Plus (Cloud Codex)	$0	$20	$120	$240	Best
ChatGPT Pro	$0	$200	$1,200	$2,400	Best + unlimited
Local GPU (RTX 4070 Ti)	$489	~$10	$549	$609	70-85%
Existing Mac (16GB+)	$0	~$4	$24	$48	70-85%
Ollama Cloud Models	$0	$0	$0	$0	85-95%



Best value: Ollama cloud models deliver 85-95% of cloud quality at $0 cost. If privacy is not a hard requirement, start here.
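
The totals in the table are just upfront cost plus monthly cost times the number of months, which also makes the break-even point easy to work out. Here is a small sketch using the table's own figures (the function names are illustrative):

```python
import math


def total_cost(upfront, monthly, months):
    """Cumulative cost after a number of months."""
    return upfront + monthly * months


def break_even_month(local_upfront, local_monthly, cloud_monthly):
    """First whole month where local cumulative cost is at or below cloud cost.

    Returns None when local never catches up.
    """
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return None
    return math.ceil(local_upfront / saving)


# Figures from the table: RTX 4070 Ti at $489 upfront and ~$10/month.
six_months = total_cost(489, 10, 6)          # 549
twelve_months = total_cost(489, 10, 12)      # 609
vs_plus = break_even_month(489, 10, 20)      # 49 months against ChatGPT Plus
vs_pro = break_even_month(489, 10, 200)      # 3 months against ChatGPT Pro
```

On these numbers, buying a GPU only pays off quickly if the alternative is the $200 Pro plan; against the $20 Plus plan it takes about four years.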

Troubleshooting
    

"type of tool must be function" error
This means your server does not correctly support wire_api = "responses". Update your inference server to the latest version (Ollama 0.14.3+, latest llama.cpp).

Model not found

Check the available models: ollama list or curl http://localhost:8001/v1/models
Use the exact model name shown in the API response in your config.toml
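
Most "model not found" errors come from a near miss, such as a missing tag or different capitalization. As a minimal sketch, Python's difflib can suggest the closest name from whatever the server reports. The function name suggest_model_name and the 0.6 similarity cutoff are arbitrary choices for illustration.

```python
import difflib


def suggest_model_name(configured, available):
    """Return the configured name if it exists, else the closest available one.

    Returns None when nothing is reasonably similar.
    """
    if configured in available:
        return configured
    matches = difflib.get_close_matches(configured, available, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

Feed it the model value from config.toml and the ids returned by the /v1/models call above; if it returns a different string than you configured, that string is probably what belongs in the profile.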


Codex hangs or times out

Add stream_idle_timeout_ms = 10000000 to your model_provider config
Local models are slow, so Codex can time out while waiting for responses on complex tasks


Tool calling is not working

Verify that your model supports tool calling (GLM-4.7-Flash recommended)
Enable jinja templates in llama.cpp: add the --jinja flag
Check that wire_api = "responses" is set (not "chat")


Frequently Asked Questions

Can the Codex desktop app use local models?
Yes. Codex App reads ~/.codex/config.toml and supports custom model providers that point at local servers. You configure a model_provider with a local base_url and select it through profiles.

Does Computer Use work with local models?
No. Computer Use (background desktop automation) is a cloud-only feature that requires GPT-5.5 and OpenAI's infrastructure. Local models cannot control your desktop.

What is the difference between Codex App and Codex CLI with local models?
Both use the same config.toml and support the same local model providers. The App adds GUI features (worktree visualization, terminal tabs, preview panes), while the CLI is terminal-only. Cloud-exclusive features (Computer Use, automations, plugins) are unavailable in both when using local models.

Which local model is best for Codex?
GLM-4.7-Flash is the top pick: 128K context, strong tool calling (79.5%), and it runs on 16GB RAM thanks to its MoE architecture. Qwen3-Coder 30B is slightly better for raw coding quality but needs 20GB+ VRAM.

Is the Chat Completions API still supported?
No. OpenAI has removed Chat Completions support from Codex. You must use wire_api = "responses" in your config.toml. Servers that only expose /v1/chat/completions will not work.

Can I use Ollama's free cloud models with Codex?
Yes. Ollama proxies models such as qwen3.5:cloud and glm-5:cloud with generous free tiers. They run at 30-60 tok/s and have no hardware requirements beyond Ollama itself. Configure them in config.toml the same way as local models.

Recommended Workflow
    

The most productive setup combines local and cloud:

terminal

# ~/.codex/config.toml

# Default to local for privacy
model_provider = "ollama"
model = "glm-4.7-flash"

[model_providers.ollama]
name = "Ollama Local"
base_url = "http://localhost:11434/v1"
wire_api = "responses"

[model_providers.cloud]
name = "OpenAI Cloud"
# Uses the default OpenAI API

[profiles.local]
model_provider = "ollama"
model = "glm-4.7-flash"

[profiles.cloud]
model_provider = "cloud"
model = "gpt-5.5"

[profiles.free]
model_provider = "ollama"
model = "qwen3.5:cloud"
    

Daily use:

terminal

# Private work (sensitive code)
codex --profile local "fix the auth module"

# Complex tasks (quality matters)
codex --profile cloud "refactor the entire payment system"

# Free + fast (non-sensitive)
codex --profile free "add documentation to all functions"
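
The rule behind those three commands is simple: sensitive work stays local, complex non-sensitive work goes to the cloud, and everything else uses the free tier. If you script Codex runs, that rule could be encoded along these lines. This is only a sketch; codex_command is a hypothetical helper, and it uses nothing beyond the --profile flag and the profile names defined above.

```python
def codex_command(task, sensitive, complex_task=False):
    """Pick a profile for a task and return the argv list for the codex CLI."""
    if sensitive:
        profile = "local"   # code must not leave the machine
    elif complex_task:
        profile = "cloud"   # quality matters more than cost
    else:
        profile = "free"    # fast and free for non-sensitive work
    return ["codex", "--profile", profile, task]
```

The returned list can be passed straight to subprocess.run. Note that sensitivity wins over complexity here, so a complex but sensitive task still runs locally.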
    

For a broader comparison of AI coding tools, see our Cursor vs Windsurf vs Claude Code comparison. And if you are building creative projects, check out our free AI Image Generator.


Written by Shahrukh
Creator of PromptSpace · AI Researcher & Prompt Engineer
Building the largest free AI prompt library with 4,000+ prompts. Covering AI image generation, prompt engineering, and tool comparisons since 2024. 159+ articles published.
