Claude Code for LitServe Lightning (2026)
Claude Code for LitServe Lightning Workflow Guide
LitServe is a blazing-fast AI model serving engine built on top of FastAPI, designed specifically for AI inference. Lightning AI provides a complete platform for building, training, and deploying AI applications. Together, these tools enable developers to productionize AI models efficiently. This guide shows you how to use Claude Code CLI to streamline your LitServe development workflow within the Lightning ecosystem, covering architecture decisions, batching strategies, performance tuning, and production-grade deployment patterns.
Understanding the LitServe Architecture
LitServe extends FastAPI with AI-specific optimizations, making it ideal for serving neural networks, language models, and other ML workloads. The framework handles batch inference, GPU acceleration, and streaming responses out of the box. Lightning AI adds orchestration, deployment, and scaling capabilities on top.
The core abstraction in LitServe is the LitAPI class, which separates the four stages of every inference request into distinct methods:
| Method | Responsibility | When It Runs |
|---|---|---|
setup(device) |
Load model, tokenizer, any static assets | Once at server startup |
decode_request(request) |
Parse and validate incoming JSON | Per request, before batching |
predict(inputs) |
Run model inference | Per batch |
encode_response(output) |
Format model output as JSON | Per request, after inference |
This separation is intentional and powerful. Claude Code can help you keep each method lean and testable, a pattern that pays dividends when debugging production issues where latency spikes or errors need to be isolated to a single stage.
When you combine Claude Code with LitServe, you gain an intelligent development partner that understands both your application logic and the AI serving infrastructure. Claude Code can help you scaffold servers, debug inference issues, optimize batch processing, and generate deployment configurations.
LitServe vs. Alternatives
Before committing to LitServe, it helps to understand the landscape. Claude Code can reason through these tradeoffs with you when you describe your requirements:
| Framework | Throughput | Latency | Setup Complexity | GPU Batching | Streaming | Best For |
|---|---|---|---|---|---|---|
| LitServe | Very High | Low | Low | Native | Yes | AI inference, fast iteration |
| TorchServe | High | Medium | High | Yes | Limited | PyTorch model registry |
| Triton Inference Server | Highest | Lowest | Very High | Advanced | Yes | Large-scale GPU serving |
| BentoML | High | Low | Medium | Yes | Yes | Multi-framework, packaging |
| FastAPI (raw) | High | Low | Low | Manual | Yes | Custom logic, non-ML APIs |
| Ray Serve | Very High | Medium | Medium | Yes | Yes | Distributed, complex pipelines |
LitServe hits the sweet spot for most teams: minimal boilerplate, native batching, and straightforward Lightning AI deployment. Claude Code excels at helping you get a LitServe server production-ready faster than any of the more complex alternatives.
Setting Up Your Development Environment
Before starting, ensure you have Claude Code installed and a Lightning AI account. Create a new project directory and initialize it with the necessary dependencies:
mkdir litserve-lightning-project && cd litserve-lightning-project
uv init
uv pip install litserve lightning
Use Claude Code to verify your setup by running a quick diagnostic:
claude --print "Check that litserve and lightning are properly installed by checking their versions"
Claude Code can generate a comprehensive setup script that ensures all dependencies are compatible. This prevents common version conflicts between PyTorch, CUDA, and the serving libraries.
Recommended Project Structure
Claude Code will typically suggest organizing a LitServe project like this:
litserve-lightning-project/
src/
api.py # LitAPI subclass definitions
models.py # Pydantic request/response schemas
utils.py # Preprocessing helpers
config.py # Server configuration constants
tests/
test_api.py # Unit tests for each LitAPI method
test_e2e.py # End-to-end inference tests
app.py # LitServe server entry point
lightning_app.py # Lightning AI deployment wrapper
Dockerfile
requirements.txt
README.md
You can generate this scaffold by asking Claude Code: “Create a LitServe project scaffold for serving a HuggingFace text classification model with proper separation of concerns.”
Pinning Compatible Versions
A common source of pain in AI serving projects is version incompatibility. Claude Code can generate a reliable requirements.txt:
litserve==0.2.3
lightning==2.2.0
torch==2.2.0+cu121
transformers==4.38.2
fastapi==0.110.0
uvicorn==0.27.1
pydantic==2.6.1
redis==5.0.1
prometheus-client==0.20.0
Creating Your First LitServe Server
A minimal LitServe server requires defining a model loader and the inference logic. Here’s a practical example of serving a text classification model:
from litserve import LitServe, LitAPI
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
class ClassificationLitAPI(LitAPI):
def setup(self, device):
self.device = device
self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
self.model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
self.model.eval()
def decode_request(self, request):
return request["text"]
def predict(self, inputs):
with torch.no_grad():
tokens = self.tokenizer(inputs, return_tensors="pt", padding=True).to(self.device)
outputs = self.model(tokens)
probs = torch.softmax(outputs.logits, dim=-1)
return probs.cpu().numpy()
def encode_response(self, output):
pred_class = output[0].argmax()
confidence = output[0].max()
return {"prediction": "positive" if pred_class == 1 else "negative", "confidence": float(confidence)}
if __name__ == "__main__":
server = LitServe(ClassificationLitAPI(), device="cuda" if torch.cuda.is_available() else "cpu")
server.run(port=8000)
Claude Code excels at expanding this skeleton into production-ready code. You can ask it to add error handling, request validation, logging, and monitoring with simple prompts.
Adding Solid Input Validation
The skeleton above has no input validation, a significant production risk. Claude Code will expand decode_request to catch bad inputs early:
from pydantic import BaseModel, validator, ValidationError
from typing import Optional
import logging
logger = logging.getLogger(__name__)
class ClassificationRequest(BaseModel):
text: str
max_length: Optional[int] = 512
@validator("text")
def text_not_empty(cls, v):
v = v.strip()
if not v:
raise ValueError("text field cannot be empty or whitespace")
return v
@validator("max_length")
def max_length_valid(cls, v):
if v is not None and (v < 10 or v > 512):
raise ValueError("max_length must be between 10 and 512")
return v
class ClassificationLitAPI(LitAPI):
def setup(self, device):
self.device = device
self.tokenizer = AutoTokenizer.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
self.model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
self.model.eval()
logger.info(f"Model loaded on {device}")
def decode_request(self, request):
try:
validated = ClassificationRequest(request)
return {"text": validated.text, "max_length": validated.max_length}
except ValidationError as e:
logger.warning(f"Invalid request: {e}")
raise ValueError(f"Request validation failed: {e}")
def predict(self, inputs):
text = inputs["text"]
max_length = inputs["max_length"]
with torch.no_grad():
tokens = self.tokenizer(
text,
return_tensors="pt",
padding=True,
truncation=True,
max_length=max_length
).to(self.device)
outputs = self.model(tokens)
probs = torch.softmax(outputs.logits, dim=-1)
return probs.cpu().numpy()
def encode_response(self, output):
labels = ["negative", "positive"]
pred_idx = int(output[0].argmax())
return {
"prediction": labels[pred_idx],
"confidence": float(output[0].max()),
"probabilities": {
labels[i]: float(output[0][i]) for i in range(len(labels))
}
}
Integrating Claude Code for Development
Claude Code becomes particularly valuable when you need to add advanced features. For instance, to implement batching for higher throughput:
Ask Claude: "Add dynamic batching to this LitServe server with batch timeout of 0.1s"
Claude will generate the batching configuration:
Claude Code understands LitServe’s batching API and can configure optimal batch sizes based on your GPU memory. It can also generate async inference code, add request queuing, and implement health check endpoints.
When debugging inference issues, provide Claude Code with your error logs and model architecture details. It can identify common problems like tensor shape mismatches, device placement errors, or memory leaks.
Implementing Dynamic Batching
Dynamic batching is one of the highest-impact optimizations for GPU-based serving. Instead of processing one request at a time, LitServe collects requests into batches and processes them together, GPU usage can go from 15% to 85%+ on bursty workloads.
Claude Code can wire up batching with correct handling of the list-of-inputs interface:
from litserve import LitServe, LitAPI
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List
class BatchedClassificationLitAPI(LitAPI):
def setup(self, device):
self.device = device
self.tokenizer = AutoTokenizer.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
self.model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
self.model.eval()
def decode_request(self, request):
# Called once per request before batching
return request["text"]
def batch(self, inputs: List[str]) -> dict:
# Called once with a list of decoded requests
# Tokenize all texts together for efficient padding
return self.tokenizer(
inputs,
return_tensors="pt",
padding=True,
truncation=True,
max_length=512
)
def predict(self, inputs: dict):
# inputs is the batched tokenizer output
inputs = {k: v.to(self.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = self.model(inputs)
probs = torch.softmax(outputs.logits, dim=-1)
return probs.cpu().numpy()
def unbatch(self, outputs):
# Split batch output back into per-request results
return [outputs[i] for i in range(len(outputs))]
def encode_response(self, output):
pred_idx = int(output.argmax())
labels = ["negative", "positive"]
return {
"prediction": labels[pred_idx],
"confidence": float(output.max())
}
if __name__ == "__main__":
server = LitServe(
BatchedClassificationLitAPI(),
device="cuda",
max_batch_size=32, # Maximum items per batch
batch_timeout=0.05, # Wait up to 50ms to fill the batch
workers_per_device=1 # One worker per GPU
)
server.run(port=8000)
Benchmarking Batching Impact
Claude Code can generate a load testing script to quantify the throughput improvement from batching:
import asyncio
import aiohttp
import time
import statistics
async def single_request(session, url, text):
async with session.post(url, json={"text": text}) as resp:
return await resp.json()
async def load_test(url, texts, concurrency=50, total_requests=500):
"""Measure throughput and latency under concurrent load."""
latencies = []
start_time = time.perf_counter()
async with aiohttp.ClientSession() as session:
semaphore = asyncio.Semaphore(concurrency)
async def bounded_request(text):
async with semaphore:
t0 = time.perf_counter()
await single_request(session, url, text)
latencies.append((time.perf_counter() - t0) * 1000)
texts_to_send = [texts[i % len(texts)] for i in range(total_requests)]
await asyncio.gather(*[bounded_request(t) for t in texts_to_send])
total_time = time.perf_counter() - start_time
return {
"throughput_rps": total_requests / total_time,
"p50_ms": statistics.median(latencies),
"p95_ms": sorted(latencies)[int(0.95 * len(latencies))],
"p99_ms": sorted(latencies)[int(0.99 * len(latencies))],
}
test_texts = [
"This movie was absolutely fantastic!",
"I hated every minute of it.",
"An average film with some good moments."
]
results = asyncio.run(load_test("http://localhost:8000/predict", test_texts))
print(f"Throughput: {results['throughput_rps']:.0f} req/sec")
print(f"P50 latency: {results['p50_ms']:.1f}ms")
print(f"P95 latency: {results['p95_ms']:.1f}ms")
print(f"P99 latency: {results['p99_ms']:.1f}ms")
Deploying to Lightning AI
Lightning AI provides several deployment options: Lightning Apps, ServeDeploy, and cloud inference endpoints. For LitServe servers, the recommended approach uses Lightning’s serve capabilities.
Create a app.py for Lightning deployment:
import lightning as L
from litserve import LitServe, LitAPI
class ClassificationLitAPI(LitAPI):
def setup(self, device):
# Model loading logic
pass
def decode_request(self, request):
return request["text"]
def predict(self, inputs):
# Inference logic
pass
def encode_response(self, output):
return output
class LightningLitServe(L.LightningWork):
def __init__(self):
super().__init__(parallel=False)
self._server = None
def run(self):
api = ClassificationLitAPI()
self._server = LitServe(api)
self._server.run()
app = L.LightningApp(
LightningLitServe(),
cloud_compute=L.CloudCompute("gpu-fast")
)
Claude Code can generate this structure and customize it for your specific model. It handles the integration between LitServe and Lightning’s work orchestration system.
To deploy, use the Lightning CLI:
lightning run app app.py --cloud
Claude Code can also help you set up CI/CD pipelines for automatic deployments, configure environment variables securely, and set up monitoring dashboards.
Docker Containerization
For portable deployments outside Lightning AI, Claude Code generates production-ready Dockerfiles:
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
WORKDIR /app
Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
Copy application code
COPY src/ ./src/
COPY app.py .
Pre-download model weights (bakes them into the image)
RUN python -c "from transformers import AutoTokenizer, AutoModelForSequenceClassification; \
AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'); \
AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')"
Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
CMD ["python", "app.py"]
Build and test locally before pushing:
docker build -t my-litserve-app:latest .
docker run --gpus all -p 8000:8000 my-litserve-app:latest
Environment-Specific Configuration
Claude Code will help you manage configuration across dev, staging, and production using environment variables and a config module:
src/config.py
import os
from dataclasses import dataclass
@dataclass
class ServerConfig:
model_name: str = os.getenv("MODEL_NAME", "distilbert-base-uncased-finetuned-sst-2-english")
device: str = os.getenv("DEVICE", "auto")
port: int = int(os.getenv("PORT", "8000"))
max_batch_size: int = int(os.getenv("MAX_BATCH_SIZE", "32"))
batch_timeout_ms: float = float(os.getenv("BATCH_TIMEOUT_MS", "50")) / 1000
workers_per_device: int = int(os.getenv("WORKERS_PER_DEVICE", "1"))
log_level: str = os.getenv("LOG_LEVEL", "INFO")
redis_url: str = os.getenv("REDIS_URL", "")
config = ServerConfig()
Optimizing Performance
Production LitServe deployments require careful performance tuning. Claude Code can analyze your serving patterns and recommend optimizations:
Batch Sizing: Calculate optimal batch sizes based on your model memory footprint and latency requirements. Claude Code can generate scripts that benchmark different configurations.
Caching: Implement request caching for models with repeated inference patterns. Redis or in-memory caches reduce redundant computation.
Streaming: For large language models, enable streaming responses to improve perceived latency:
def encode_response(self, output):
yield from output # Stream tokens as they're generated
Claude Code understands the streaming API and can migrate synchronous endpoints to streaming with minimal code changes.
Redis Caching for Repeated Requests
For production workloads where many users send identical or near-identical queries (search autocomplete, FAQ classification), response caching can dramatically reduce GPU load. Claude Code can add Redis caching to any LitServe server:
import redis
import json
import hashlib
from typing import Optional
class CachedClassificationLitAPI(LitAPI):
def setup(self, device):
self.device = device
# Load model
self.tokenizer = AutoTokenizer.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
self.model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
).to(device)
self.model.eval()
# Connect to Redis cache
redis_url = os.getenv("REDIS_URL", "redis://localhost:6379")
self.cache = redis.from_url(redis_url, decode_responses=True)
self.cache_ttl = 3600 # Cache results for 1 hour
def _cache_key(self, text: str) -> str:
return f"clf:{hashlib.md5(text.encode()).hexdigest()}"
def decode_request(self, request):
text = request["text"].strip()
cache_key = self._cache_key(text)
# Check cache first
cached = self.cache.get(cache_key)
if cached:
return {"text": text, "cached_result": json.loads(cached)}
return {"text": text, "cached_result": None}
def predict(self, inputs):
if inputs["cached_result"] is not None:
# Return a sentinel that encode_response will recognize
return {"__cached__": inputs["cached_result"]}
with torch.no_grad():
tokens = self.tokenizer(
inputs["text"],
return_tensors="pt",
truncation=True,
max_length=512
).to(self.device)
outputs = self.model(tokens)
probs = torch.softmax(outputs.logits, dim=-1)
return {"__probs__": probs.cpu().numpy(), "__text__": inputs["text"]}
def encode_response(self, output):
if "__cached__" in output:
return output["__cached__"]
probs = output["__probs__"][0]
labels = ["negative", "positive"]
result = {
"prediction": labels[int(probs.argmax())],
"confidence": float(probs.max())
}
# Store in cache
cache_key = self._cache_key(output["__text__"])
self.cache.setex(cache_key, self.cache_ttl, json.dumps(result))
return result
Streaming Responses for Language Models
When serving generative models, streaming is essential for good user experience. Claude Code can configure LitServe’s streaming generator pattern:
from litserve import LitServe, LitAPI
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
class StreamingLLMLitAPI(LitAPI):
def setup(self, device):
self.device = device
model_id = "microsoft/phi-2"
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
def decode_request(self, request):
return {
"prompt": request["prompt"],
"max_new_tokens": request.get("max_new_tokens", 200),
"temperature": request.get("temperature", 0.7)
}
def predict(self, inputs):
from transformers import TextIteratorStreamer
import threading
tokens = self.tokenizer(inputs["prompt"], return_tensors="pt").to(self.device)
streamer = TextIteratorStreamer(
self.tokenizer,
skip_prompt=True,
skip_special_tokens=True
)
generate_kwargs = {
tokens,
"streamer": streamer,
"max_new_tokens": inputs["max_new_tokens"],
"temperature": inputs["temperature"],
"do_sample": True,
}
# Run generation in a background thread
thread = threading.Thread(target=self.model.generate, kwargs=generate_kwargs)
thread.start()
# Yield tokens as they arrive
for token in streamer:
yield token
thread.join()
def encode_response(self, output):
# Each yielded token becomes a streaming chunk
for token in output:
yield {"token": token}
if __name__ == "__main__":
server = LitServe(
StreamingLLMLitAPI(),
stream=True,
device="cuda"
)
server.run(port=8000)
Performance Optimization Summary
Claude Code can reason through the performance levers available at each layer of the stack. Here is a prioritized checklist Claude will typically recommend:
| Priority | Optimization | Expected Gain | Effort |
|---|---|---|---|
| 1 | Enable dynamic batching | 3-10x throughput | Low |
| 2 | Use FP16/BF16 for inference | 1.5-2x throughput, 50% memory | Low |
| 3 | Set model.eval() and torch.no_grad() |
Prevents gradient tracking overhead | Very low |
| 4 | Pin model to GPU with device_map="auto" |
Eliminates CPU-GPU transfers | Low |
| 5 | Add response caching (Redis) | Near-zero cost for cached hits | Medium |
| 6 | ONNX or TorchScript export | 20-40% latency reduction | Medium |
| 7 | Increase workers_per_device |
Linear CPU throughput scaling | Low |
| 8 | Enable streaming for generative models | Perceived latency improvement | Medium |
Adding Observability
Production servers need monitoring. Claude Code can instrument your LitServe server with Prometheus metrics and structured logging in a single session:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import logging
import json
Configure structured logging
logging.basicConfig(
level=logging.INFO,
format='{"timestamp": "%(asctime)s", "level": "%(levelname)s", "message": "%(message)s"}'
)
logger = logging.getLogger(__name__)
Prometheus metrics
REQUEST_COUNT = Counter(
"litserve_requests_total",
"Total number of inference requests",
["endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
"litserve_request_duration_seconds",
"Request latency in seconds",
["endpoint"],
buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
BATCH_SIZE = Histogram(
"litserve_batch_size",
"Number of requests processed per batch",
buckets=[1, 2, 4, 8, 16, 32]
)
GPU_MEMORY_USED = Gauge(
"litserve_gpu_memory_bytes",
"GPU memory currently allocated"
)
class MonitoredClassificationLitAPI(LitAPI):
def setup(self, device):
self.device = device
# ... model loading ...
# Start Prometheus metrics server on separate port
start_http_server(9090)
logger.info("Prometheus metrics available on port 9090")
def predict(self, inputs):
batch_size = len(inputs) if isinstance(inputs, list) else 1
BATCH_SIZE.observe(batch_size)
start = time.perf_counter()
try:
result = self._run_inference(inputs)
REQUEST_COUNT.labels(endpoint="predict", status="success").inc()
return result
except Exception as e:
REQUEST_COUNT.labels(endpoint="predict", status="error").inc()
logger.error(json.dumps({"event": "inference_error", "error": str(e)}))
raise
finally:
REQUEST_LATENCY.labels(endpoint="predict").observe(
time.perf_counter() - start
)
if self.device == "cuda":
GPU_MEMORY_USED.set(
torch.cuda.memory_allocated() / (1024 3)
)
def _run_inference(self, inputs):
with torch.no_grad():
tokens = self.tokenizer(
inputs, return_tensors="pt", padding=True, truncation=True
).to(self.device)
outputs = self.model(tokens)
return torch.softmax(outputs.logits, dim=-1).cpu().numpy()
Best Practices for Claude Code + LitServe Workflows
-
Version Control Your Prompts: Store Claude Code interaction history to reproduce and audit development decisions.
-
Modularize Inference Logic: Keep prediction functions separate from serving code for easier testing and Claude Code interaction.
-
Use Type Hints: Claude Code generates more accurate code when you provide type annotations.
-
Implement Health Checks: Add
/healthand/metricsendpoints for Kubernetes readiness probes. -
Monitor GPU Usage: Use Lightning’s built-in observability to track inference latency and throughput.
Writing Testable LitAPI Code
One of the most valuable things Claude Code does is encourage, and generate, unit tests for each method of your LitAPI. Because the four methods are separated by design, you can test each independently:
tests/test_api.py
import pytest
import numpy as np
from unittest.mock import MagicMock, patch
from src.api import ClassificationLitAPI
@pytest.fixture
def api():
"""Create a ClassificationLitAPI instance with mocked model."""
with patch("src.api.AutoTokenizer") as mock_tokenizer_cls, \
patch("src.api.AutoModelForSequenceClassification") as mock_model_cls:
mock_tokenizer = MagicMock()
mock_model = MagicMock()
mock_tokenizer_cls.from_pretrained.return_value = mock_tokenizer
mock_model_cls.from_pretrained.return_value = mock_model
api = ClassificationLitAPI()
api.setup("cpu")
api.tokenizer = mock_tokenizer
api.model = mock_model
api.device = "cpu"
return api
def test_decode_request_valid(api):
result = api.decode_request({"text": "Great product!"})
assert result == "Great product!"
def test_decode_request_empty_raises(api):
with pytest.raises(ValueError, match="empty"):
api.decode_request({"text": " "})
def test_decode_request_missing_field_raises(api):
with pytest.raises((KeyError, ValueError)):
api.decode_request({})
def test_encode_response_positive(api):
probs = np.array([[0.1, 0.9]])
result = api.encode_response(probs)
assert result["prediction"] == "positive"
assert result["confidence"] == pytest.approx(0.9, abs=1e-4)
def test_encode_response_negative(api):
probs = np.array([[0.85, 0.15]])
result = api.encode_response(probs)
assert result["prediction"] == "negative"
assert result["confidence"] == pytest.approx(0.85, abs=1e-4)
Run tests with pytest, and use Claude Code to expand coverage when you add new features. This discipline catches regressions early, especially useful when Claude Code is iterating rapidly on your inference logic.
CI/CD with GitHub Actions
Claude Code can generate a complete GitHub Actions workflow for testing and deploying your LitServe server:
.github/workflows/deploy.yml
name: Test and Deploy LitServe
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- name: Install dependencies
run: uv pip install -r requirements.txt
- name: Run unit tests
run: pytest tests/ -v --tb=short
- name: Run linting
run: ruff check src/ tests/
deploy:
needs: test
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Build Docker image
run: docker build -t $REGISTRY/my-litserve-app:$GITHUB_SHA .
- name: Push to registry
run: docker push $REGISTRY/my-litserve-app:$GITHUB_SHA
- name: Deploy to Lightning AI
run: lightning run app lightning_app.py --cloud
env:
LIGHTNING_API_KEY: ${{ secrets.LIGHTNING_API_KEY }}
Conclusion
Combining Claude Code with LitServe and Lightning AI creates a powerful development workflow for AI serving. Claude Code acts as an intelligent development partner, handling boilerplate generation, debugging, optimization suggestions, and deployment configuration. Start with simple servers, then progressively add batching, streaming, and monitoring as your requirements grow.
The key to success is treating Claude Code as a collaborative partner, provide clear context about your model architecture, performance requirements, and deployment targets. With these inputs, Claude Code transforms complex AI serving tasks into manageable development steps.
Concretely, the workflow that works best is: describe your model and SLA requirements to Claude Code, let it generate the initial LitAPI scaffold with appropriate batching and caching settings, ask it to add observability and tests, then use it to generate the Docker and CI/CD configuration. What would take a senior engineer two to three days to assemble correctly from documentation and Stack Overflow becomes a focused two-to-four hour session. The remaining complexity, tuning batch sizes against your actual traffic patterns, calibrating cache TTLs, setting alert thresholds, requires production data that only your specific deployment can provide.
Related Reading
- AI Assisted Architecture Design Workflow Guide
- AI Assisted Code Review Workflow Best Practices
- Best Way to Integrate Claude Code into Team Workflow
- Claude Code for Difftastic — Workflow Guide
- Claude Code for RisingWave Streaming — Guide
- Claude Code for DBeaver — Workflow Guide
- Claude Code for Architecture Decision Record Workflow
- Claude Code for FluxCD Notification Workflow Guide
- Color Contrast Checking Workflow with Claude Code
- Claude Code Indie Developer Side Project Workflow Guide
- Claude Code for Regula Policy Workflow Guide
Built by theluckystrike. More at zovo.one
Get started → Generate your project setup with our Project Starter.
See Also
Try it: Paste your error into our Error Diagnostic for an instant fix.