Claude Code for Groq Inference — Workflow Guide
The Setup
You are integrating Groq’s LPU-powered inference API to get very low-latency LLM responses in your application. Groq provides an OpenAI-compatible API that serves open-source models (Llama, Mixtral, Gemma) at very high speeds. Claude Code can write Groq integrations, but by default it tends to misconfigure the OpenAI SDK or to assume features that Groq does not offer.
What Claude Code Gets Wrong By Default
- Uses the OpenAI package without configuring baseURL. Claude writes new OpenAI({ apiKey }), which points to OpenAI’s servers. Groq requires baseURL: 'https://api.groq.com/openai/v1' or the dedicated groq-sdk package.
- Requests GPT model names. Claude uses model: 'gpt-4' or model: 'gpt-3.5-turbo'. Groq runs its own model catalog: llama-3.3-70b-versatile, mixtral-8x7b-32768, gemma2-9b-it. GPT models are not available on Groq.
- Expects function calling on all models. Claude uses tool calling on the assumption that every model supports it. Not all Groq-hosted models do, so check the model’s capabilities before using tools.
- Ignores rate limits and speed advantages. Claude adds generic retry logic and loading states designed for slow APIs. Groq responses arrive quickly (often in under a second), but rate limits are strict: design for fast responses while respecting tokens-per-minute limits.
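The first two fixes in the list above can be sketched in TypeScript as follows. This is a minimal illustration, not Groq's official client: the helper names groqClientOptions and buildChatRequest are hypothetical, the base URL and model name come from the list above, and the /chat/completions path is assumed from the OpenAI REST convention that Groq mirrors.

```typescript
// Base URL taken from the guide; everything else here is an illustrative sketch.
const GROQ_BASE_URL = "https://api.groq.com/openai/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Options in the shape the OpenAI SDK constructor accepts,
// e.g. new OpenAI(groqClientOptions(apiKey)).
function groqClientOptions(apiKey: string): { apiKey: string; baseURL: string } {
  if (apiKey.trim() === "") {
    throw new Error("GROQ_API_KEY is empty");
  }
  return { apiKey, baseURL: GROQ_BASE_URL };
}

// Builds a request for the OpenAI-compatible chat completions route (assumed path).
// Pass the parts to fetch(url, { method, headers, body }) to send it.
function buildChatRequest(
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): PreparedRequest {
  // Guard against the default mistake of requesting OpenAI model names.
  if (model.startsWith("gpt-")) {
    throw new Error(`"${model}" is an OpenAI model name; use a Groq model ID instead`);
  }
  return {
    url: `${GROQ_BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

Keeping the base URL in one constant means the same options object works whether the project uses the openai package with a custom baseURL or sends requests directly.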
The CLAUDE.md Configuration
# Groq Fast Inference Project
## AI Inference
- Provider: Groq (LPU-powered, ultra-fast inference)
- SDK: groq-sdk or openai with custom baseURL
- Models: llama-3.3-70b-versatile, mixtral-8x7b-32768, gemma2-9b-it
- API: OpenAI-compatible REST API
## Groq Rules
- Install groq-sdk or configure openai with baseURL
- API key: GROQ_API_KEY environment variable
- Model names are Groq-specific, NOT OpenAI model names
- Streaming: supported, use for best perceived performance
- Rate limits: tokens-per-minute varies by model, implement backoff
- JSON mode: response_format: { type: 'json_object' } supported
- Tool use: supported on select models only (check docs)
- No image input — text-only models currently
## Conventions
- Groq client in lib/groq.ts as singleton
- Use streaming for chat interfaces (fast TTFT)
- Implement token-based rate limiting, not request-based
- Model selection based on task: 70b for complex, a smaller model (e.g. gemma2-9b-it) for simple
- Fallback to alternative model on rate limit (429)
- Cache responses for identical prompts (Groq is fast but has limits)
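The model-selection and fallback conventions might look like the sketch below. The task categories, the pickModel and nextModelOn429 names, and the fallback order are assumptions made for illustration; the model IDs are the ones listed in the configuration above.

```typescript
type Task = "complex" | "simple";

// Model IDs copied from the configuration above; in a real project, load them
// from an environment variable or config file rather than hardcoding them.
const MODEL_CONFIG: Record<Task, string[]> = {
  // First entry is the preferred model; later entries are fallbacks.
  complex: ["llama-3.3-70b-versatile", "mixtral-8x7b-32768"],
  simple: ["gemma2-9b-it", "llama-3.3-70b-versatile"],
};

// Preferred model for a task.
function pickModel(task: Task): string {
  const [preferred] = MODEL_CONFIG[task];
  if (preferred === undefined) {
    throw new Error(`No model configured for task "${task}"`);
  }
  return preferred;
}

// Next model to try after the current one hit a rate limit (HTTP 429).
// Returns undefined when the status is not 429 or no fallback remains.
function nextModelOn429(task: Task, current: string, status: number): string | undefined {
  if (status !== 429) return undefined;
  const chain = MODEL_CONFIG[task];
  const index = chain.indexOf(current);
  // indexOf returns -1 for an unknown model, so index + 1 restarts at the preferred model.
  return chain[index + 1];
}
```

A caller would start with pickModel(task) and, on a 429, ask nextModelOn429 for the next candidate until it returns undefined.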
Workflow Example
You want to add fast AI-powered text summarization. Prompt Claude Code:
“Create a summarization API endpoint using Groq’s Llama model. Accept text input, generate a concise summary using streaming, and return the result. Include rate limit handling with exponential backoff.”
Claude Code should initialize the Groq client with the API key and create an endpoint that calls groq.chat.completions.create() with model: 'llama-3.3-70b-versatile', stream: true, and a summarization system prompt. It should then handle the streaming response and retry with exponential backoff on 429 status codes.
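A rough sketch of the retry and streaming pieces follows. To stay self-contained it does not import groq-sdk. The StreamChunk type is an assumption that mirrors the OpenAI-style streaming chunk shape (choices[0].delta.content), and createStream is a hypothetical callback that you would implement by calling groq.chat.completions.create() with stream: true. The status check, delay values, and retry count are likewise illustrative.

```typescript
// Assumed chunk shape, modeled on OpenAI-style streaming responses.
interface StreamChunk {
  choices: Array<{ delta: { content?: string | null } }>;
}
type ChunkStream = AsyncIterable<StreamChunk>;

// Treats any thrown object carrying status === 429 as a rate-limit error.
// Adapt this check to however your client reports HTTP status codes.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { status?: unknown }).status === 429;
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Opens a stream, retrying on 429 only, then concatenates the streamed text.
async function summarizeWithRetry(
  createStream: () => Promise<ChunkStream>,
  maxRetries = 3,
): Promise<string> {
  for (let attempt = 0; ; attempt++) {
    try {
      const stream = await createStream();
      let summary = "";
      for await (const chunk of stream) {
        summary += chunk.choices[0]?.delta.content ?? "";
      }
      return summary;
    } catch (err) {
      // Rethrow anything that is not a rate limit, or once retries are exhausted.
      if (!isRateLimit(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

In a real endpoint you would typically forward each chunk to the HTTP response as it arrives instead of concatenating; the accumulation here just keeps the sketch short.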
Common Pitfalls
- Not leveraging Groq’s speed in UX. Claude adds elaborate loading spinners and skeleton screens. Groq responses arrive so fast (often under 1 second) that heavy loading states flash and feel jarring. Use minimal loading indicators or none at all for short prompts.
- Rate limit strategy mismatch. Claude implements per-request rate limiting. Groq limits by tokens-per-minute, not requests-per-minute. A few large requests can exhaust the limit faster than many small ones. Track token usage, not just request count.
- Model availability assumptions. Claude hardcodes a specific model version. Groq regularly updates its model catalog, so models can be deprecated or replaced. Use a config variable for the model name and handle model-not-found errors gracefully.
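Token-based limiting can be approximated with a sliding one-minute window, as in the sketch below. The TokenBudget class, the 6000-token limit in the usage example, and the choice to pass timestamps explicitly are assumptions for illustration. Actual limits vary by model, and the token counts you record would typically come from the usage data in each response or from an estimate made before sending.

```typescript
// Tracks tokens spent during the last 60 seconds against a per-minute limit.
class TokenBudget {
  private entries: Array<{ at: number; tokens: number }> = [];
  private readonly tokensPerMinute: number;

  constructor(tokensPerMinute: number) {
    this.tokensPerMinute = tokensPerMinute;
  }

  // Drops entries older than 60 seconds and returns tokens used inside the window.
  used(now: number): number {
    this.entries = this.entries.filter((e) => now - e.at < 60_000);
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  // True when a request of the given estimated size still fits in the window.
  canSpend(tokens: number, now: number): boolean {
    return this.used(now) + tokens <= this.tokensPerMinute;
  }

  // Records tokens actually consumed by a completed request.
  record(tokens: number, now: number): void {
    this.entries.push({ at: now, tokens });
  }
}

// Usage: timestamps are in milliseconds (pass Date.now() in production).
const exampleBudget = new TokenBudget(6000); // hypothetical limit
exampleBudget.record(5000, 0);
const fitsLargeRequest = exampleBudget.canSpend(2000, 1_000); // false: 7000 > 6000
const fitsSmallRequest = exampleBudget.canSpend(1000, 1_000); // true: exactly 6000
```

A single 5000-token request leaves room for only 1000 more tokens in that minute, no matter how few requests were sent, which is why counting requests alone misses the limit.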