# Claude Code for Ollama — Workflow Guide
## The Setup

You are using Ollama to run LLMs locally on your machine for development, testing, and privacy-sensitive applications. Ollama exposes an OpenAI-compatible API alongside its native one, making it easy to swap between local and cloud models. Claude Code can help integrate Ollama, but by default it configures cloud API endpoints and ignores local model management.
## What Claude Code Gets Wrong By Default
- **Points to cloud API endpoints.** Claude configures `https://api.openai.com/v1` as the base URL. Ollama runs locally at `http://localhost:11434`, with its own native API and an OpenAI-compatible endpoint at `/v1`.
- **Uses cloud model names.** Claude references `gpt-4` or `claude-3`. Ollama uses model tags like `llama3.3`, `mistral`, and `codellama`, pulled and managed locally with `ollama pull`.
- **Adds API key authentication.** Claude includes `Authorization: Bearer sk-...` headers. Ollama's local API requires no authentication by default; when using the OpenAI SDK, pass a placeholder `apiKey` (the SDK insists on one) that Ollama simply ignores.
- **Ignores model pulling and management.** Claude assumes models are always available. Ollama models must be pulled first: `ollama pull llama3.3` downloads the model. Claude skips this step and gets "model not found" errors.
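To make the correct configuration concrete, here is a minimal sketch of a request against Ollama's native chat endpoint. The helper `buildChatRequest` is illustrative, not part of any SDK; the endpoint, payload shape, and absence of auth headers follow Ollama's documented `/api/chat` format.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Sketch: build a request for Ollama's native chat API.
// buildChatRequest is a hypothetical helper, not part of any SDK.
function buildChatRequest(
  model: string,
  messages: ChatMessage[],
  host = "http://localhost:11434" // or read OLLAMA_HOST
) {
  return {
    url: `${host}/api/chat`, // local endpoint, not api.openai.com
    body: { model, messages, stream: false }, // local model tag, not gpt-4
    headers: { "Content-Type": "application/json" }, // no Authorization header
  };
}

// Usage (assumes a running Ollama daemon with the model pulled):
// const req = buildChatRequest("llama3.3", [{ role: "user", content: "Hello" }]);
// const res = await fetch(req.url, {
//   method: "POST",
//   headers: req.headers,
//   body: JSON.stringify(req.body),
// });
```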
## The CLAUDE.md Configuration
```markdown
# Ollama Local LLM Project

## AI/LLM
- Runtime: Ollama (local LLM inference)
- API: http://localhost:11434 (native) or /v1 (OpenAI-compatible)
- Models: pulled locally with ollama pull <model>
- No API key required for local access

## Ollama Rules
- Base URL: http://localhost:11434 (NOT cloud endpoints)
- Model names: llama3.3, mistral, codellama (NOT gpt-4)
- No authentication headers needed
- Pull models first: ollama pull <model-name>
- List models: ollama list
- Chat API: POST /api/chat with model and messages
- OpenAI compat: POST /v1/chat/completions (same format)
- Streaming: stream: true returns NDJSON chunks

## Conventions
- Ollama client in lib/ollama.ts
- OLLAMA_HOST env var for non-default host
- Check model availability before requests
- Use OpenAI SDK with baseURL for compatibility
- Modelfile for custom model configurations
- GPU detection: ollama runs on GPU automatically if available
- Fall back to cloud API if Ollama is unavailable
```
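One way to honor the "check model availability before requests" rule is to inspect the `GET /api/tags` listing, which is the same data `ollama list` prints. A hedged sketch; `isModelAvailable` is an illustrative helper, and only the response shape (`models[].name`) is taken from Ollama's API.

```typescript
// Shape of the relevant part of Ollama's GET /api/tags response.
type TagsResponse = { models: { name: string }[] };

// Sketch: check whether a model tag has already been pulled.
// Ollama lists models as "llama3.3:latest"; accept a bare tag too.
function isModelAvailable(tags: TagsResponse, model: string): boolean {
  return tags.models.some(
    (m) => m.name === model || m.name === `${model}:latest`
  );
}

// Usage against a live daemon (assumes Ollama is running):
// const tags: TagsResponse =
//   await (await fetch("http://localhost:11434/api/tags")).json();
// if (!isModelAvailable(tags, "llama3.3")) {
//   console.error("Model missing. Run: ollama pull llama3.3");
// }
```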
## Workflow Example
You want to create a development tool that uses Ollama for code review. Prompt Claude Code:

> "Create a local code review tool that sends diffs to Ollama's Llama model for analysis. Use the OpenAI-compatible API so it can easily switch to a cloud model. Handle the case where Ollama is not running."

Claude Code should configure the OpenAI SDK with `baseURL: 'http://localhost:11434/v1'` and a placeholder API key, use `llama3.3` as the model, send the diff as a user message alongside a code-review system prompt, fail gracefully with a message to start Ollama when the connection is refused, and support streaming for real-time review output.
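A sketch of the pieces that response should contain. The helper names (`buildReviewRequest`, `explainConnectionError`) and the system prompt wording are illustrative assumptions; the endpoint and payload follow the OpenAI-compatible `/v1/chat/completions` format.

```typescript
// Illustrative system prompt for the review persona.
const REVIEW_SYSTEM_PROMPT =
  "You are a code reviewer. Point out bugs, risks, and style issues in the diff.";

// Sketch: build the OpenAI-compatible request for a code review.
function buildReviewRequest(diff: string, model = "llama3.3") {
  return {
    url: "http://localhost:11434/v1/chat/completions", // OpenAI-compatible endpoint
    body: {
      model,
      stream: true, // stream chunks for real-time review output
      messages: [
        { role: "system", content: REVIEW_SYSTEM_PROMPT },
        { role: "user", content: diff },
      ],
    },
  };
}

// Sketch: turn a refused connection into an actionable message.
// fetch() against a stopped daemon typically fails with ECONNREFUSED.
function explainConnectionError(err: unknown): string {
  const msg = err instanceof Error ? err.message : String(err);
  return /ECONNREFUSED|fetch failed/i.test(msg)
    ? "Ollama is not running. Start it with `ollama serve` and try again."
    : msg;
}
```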
## Common Pitfalls
- **Model not pulled before first request.** Claude makes API calls assuming the model exists. Ollama returns a 404 if the model has not been pulled. Check `ollama list`, or catch the error and prompt the user to run `ollama pull <model>`.
- **Memory requirements not considered.** Claude selects large models (70B parameters) without checking available RAM. Each model needs roughly its parameter count in GB of memory (7B ≈ 8 GB, 70B ≈ 48 GB with 4-bit quantization). Choose models that fit the development machine.
- **Context window differences.** Claude sets high `max_tokens` values out of cloud API habit. Local models have smaller context windows (typically 4K-8K tokens for small models). Check the model's context length with `ollama show <model>` and adjust accordingly.
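Adjusting for the context window can be as simple as clamping the completion budget to whatever the prompt leaves free. A sketch, assuming the context length was read from `ollama show <model>` (or `POST /api/show`) and using a rough 4-characters-per-token estimate rather than a real tokenizer:

```typescript
// Sketch: keep max_tokens within what the model's context window can
// hold after the prompt. The 4-chars-per-token ratio is a heuristic
// assumption, not an exact tokenizer.
function clampMaxTokens(
  contextLength: number, // e.g. from `ollama show <model>`
  promptChars: number,   // total characters in system + user messages
  requested: number      // the max_tokens the caller asked for
): number {
  const promptTokens = Math.ceil(promptChars / 4);
  const available = Math.max(contextLength - promptTokens, 0);
  return Math.min(requested, available);
}
```

With an 8K-context model and an 8,000-character (~2,000-token) prompt, a cloud-habit request for 8,192 completion tokens would be clamped to roughly 6K.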