Claude Code for Langfuse LLM Analytics — Guide
The Setup
You are monitoring your LLM application with Langfuse, an open-source observability platform for LLM apps. Langfuse traces every LLM call, tracks costs, measures latency, captures user feedback, and provides analytics dashboards. Claude Code can instrument LLM applications, but it generates basic console.log debugging instead of structured observability with Langfuse.
What Claude Code Gets Wrong By Default
- Logs LLM calls to console. Claude adds console.log(response) for debugging. Langfuse provides structured tracing with spans, generations, and metadata — console logs are unstructured and unsearchable.
- Calculates costs manually. Claude writes token counting and cost calculation code. Langfuse automatically tracks token usage and costs per model — it knows pricing for major providers and calculates costs from usage data.
- Ignores trace hierarchy. Claude treats each LLM call independently. Langfuse uses traces (a complete user interaction), spans (logical steps), and generations (individual LLM calls) — this hierarchy shows how components interact.
- Does not capture user feedback. Claude has no feedback mechanism. Langfuse provides a scores API for capturing thumbs up/down, ratings, or automated evaluation scores linked to specific traces.
The CLAUDE.md Configuration
# Langfuse LLM Observability
## Monitoring
- Platform: Langfuse (open-source LLM observability)
- Tracing: traces, spans, generations hierarchy
- Costs: automatic token and cost tracking
- Feedback: scores API for user ratings
## Langfuse Rules
- SDK: langfuse Python/JS SDK or OpenAI wrapper
- Trace: one trace per user interaction
- Span: logical steps within a trace
- Generation: individual LLM API calls
- Scores: attach feedback to traces
- OpenAI wrapper: from langfuse.openai import openai
- Flush: langfuse.flush() before process exit
## Conventions
- Initialize Langfuse with LANGFUSE_PUBLIC_KEY and SECRET_KEY
- Use @observe() decorator for automatic tracing (Python)
- Use trace.generation() for LLM calls within a trace
- Add metadata: user_id, session_id, tags
- Capture input/output for each generation
- Score traces with user feedback
- Self-hosted: Docker Compose for Langfuse server
Workflow Example
You want to add Langfuse tracing to a RAG chatbot. Prompt Claude Code:
“Add Langfuse observability to our RAG chatbot. Trace each user message as a trace, the retrieval step as a span, and the LLM generation as a generation. Include the retrieved documents as metadata and capture the user’s feedback score. Use the Python SDK.”
Claude Code should initialize the Langfuse client, create a trace with user_id and session_id, add a span for document retrieval with retrieved docs as metadata, create a generation for the LLM call with model name and token usage, and expose a feedback endpoint that calls langfuse.score() linked to the trace ID.
Common Pitfalls
- Not flushing before process exit. Claude does not call langfuse.flush() at the end. Langfuse batches events and sends them asynchronously — if the process exits before flush, traces are lost. Always flush in serverless functions and before shutdown.
- Missing session grouping. Claude creates individual traces without session context. Multi-turn conversations should share a session_id so Langfuse groups them together for analysis.
- Tracing in production without sampling. Claude traces every single request. In high-traffic production, trace a percentage of requests with sample_rate=0.1 to reduce cost and data volume while maintaining statistical significance.