# Claude Code for Modal Serverless ML — Guide
## The Setup
You are deploying machine learning workloads with Modal, a serverless platform designed for ML inference and batch processing. Modal provides GPU containers that scale to zero, custom container images defined in Python, and a Python-native SDK for deploying functions. Claude Code can set up ML infrastructure, but by default it generates Docker and Kubernetes configurations rather than using Modal's Python-first approach.
## What Claude Code Gets Wrong By Default
- **Creates Dockerfiles and Kubernetes manifests.** Claude writes `Dockerfile`, `deployment.yaml`, and `service.yaml` for ML serving. Modal replaces all of this with Python decorators: `@app.function(gpu="A100")` is the entire infrastructure definition.
- **Uses Flask/FastAPI for serving.** Claude creates REST API servers for inference. Modal has `@app.web_endpoint()`, which creates HTTPS endpoints directly; no web framework is needed.
- **Manages container images separately.** Claude writes a Dockerfile for the ML environment. Modal defines images in Python: `modal.Image.debian_slim().pip_install("torch", "transformers")`. The image is built and cached by Modal.
- **Provisions static GPU instances.** Claude sets up always-on GPU servers. Modal scales to zero and cold-starts in seconds; you pay only for active compute, not idle GPUs.
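To make the contrast concrete, here is what the entire "Dockerfile + deployment.yaml + service.yaml + Flask server" stack collapses into with Modal's Python SDK. This is a minimal sketch, not a production recipe: it assumes the `modal` package and a Modal account, and the app name, function name, and model choice are all illustrative.

```python
import modal

# The container image is defined in Python -- no Dockerfile.
image = modal.Image.debian_slim().pip_install("torch", "transformers")

app = modal.App("sentiment-demo")

# One decorator stack replaces the Dockerfile, the Kubernetes
# manifests, and the Flask/FastAPI server.
@app.function(gpu="A100", image=image)
@modal.web_endpoint(method="POST")
def classify(text: str) -> dict:
    # Import inside the function so it resolves in the remote container,
    # not on your laptop.
    from transformers import pipeline
    clf = pipeline("sentiment-analysis")
    return clf(text)[0]
```

Running `modal deploy app.py` on a file like this gives you an autoscaling HTTPS endpoint with no YAML anywhere.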
## The CLAUDE.md Configuration
```markdown
# Modal ML Deployment

## Infrastructure
- Platform: Modal (serverless ML compute)
- GPU: A100, H100, T4, L4 on-demand
- Scale: auto-scale to zero, pay per second
- Deploy: modal deploy app.py

## Modal Rules
- App: app = modal.App("my-app")
- Function: @app.function(gpu="A100", image=image)
- Image: modal.Image.debian_slim().pip_install(...)
- Web: @app.web_endpoint() for HTTP endpoints
- Cron: @app.function(schedule=modal.Period(hours=1))
- Volumes: modal.Volume for persistent storage
- Secrets: modal.Secret.from_name("my-secret")

## Conventions
- Define images in Python, not Dockerfiles
- Use @app.cls() for stateful GPU functions (model loading)
- Load model in __enter__, inference in methods
- @modal.enter() for one-time setup per container
- Use volumes for model cache (avoid re-download)
- modal serve app.py for local development
- modal deploy app.py for production
```
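The stateful-class convention from the rules above is worth seeing in code. A sketch, assuming the `modal` package; the app name, model, and cache path are illustrative:

```python
import modal

image = modal.Image.debian_slim().pip_install("torch", "transformers")
volume = modal.Volume.from_name("model-cache", create_if_missing=True)
app = modal.App("stateful-demo")

@app.cls(gpu="A100", image=image, volumes={"/cache": volume})
class Model:
    @modal.enter()  # runs once per container start, not once per request
    def load(self):
        from transformers import pipeline
        # cache_dir points at the Volume, so weights survive cold starts.
        self.pipe = pipeline("text-generation", model="gpt2",
                             model_kwargs={"cache_dir": "/cache"})

    @modal.method()  # each call reuses the already-loaded model
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=32)[0]["generated_text"]
```

The split matters: `@modal.enter()` pays the model-loading cost once per container lifecycle, while `@modal.method()` calls stay fast.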
## Workflow Example
You want to deploy a text-to-image model for inference. Prompt Claude Code:
“Deploy a Stable Diffusion XL model on Modal with a web endpoint. Use an A100 GPU, cache the model weights in a Modal Volume to avoid re-downloading, and return the generated image as a PNG response. Include a health check endpoint.”
Claude Code should create a Modal app with a custom image installing diffusers and torch, a @app.cls(gpu="A100") class that loads the model in @modal.enter() from a Volume cache, an inference method with @modal.web_endpoint() that accepts a prompt and returns a PNG, and a simple health check endpoint.
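One shape the resulting app might take. This is a hedged sketch, not Claude's verbatim output: it assumes the `modal`, `diffusers`, and `torch` packages, and the app name, model id, cache path, and endpoint names are all assumptions.

```python
import io
import modal

image = modal.Image.debian_slim().pip_install(
    "diffusers", "transformers", "torch", "fastapi")
volume = modal.Volume.from_name("sdxl-cache", create_if_missing=True)
app = modal.App("sdxl-demo")

@app.cls(gpu="A100", image=image, volumes={"/cache": volume})
class SDXL:
    @modal.enter()
    def load(self):
        import torch
        from diffusers import StableDiffusionXLPipeline
        # Weights are fetched into the Volume-backed cache on the first
        # cold start only; later containers read them from the Volume.
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/cache",
        ).to("cuda")

    @modal.web_endpoint(method="POST")
    def generate(self, prompt: str):
        from fastapi import Response
        buf = io.BytesIO()
        self.pipe(prompt).images[0].save(buf, format="PNG")
        return Response(content=buf.getvalue(), media_type="image/png")

    @modal.web_endpoint(method="GET")
    def health(self):
        return {"status": "ok"}
```

`modal serve app.py` gives you live-reloading URLs for both endpoints during development; `modal deploy app.py` makes them permanent.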
## Common Pitfalls
- **Re-downloading models on every cold start.** Claude loads models from Hugging Face in the function body. Without a Modal Volume for caching, the model downloads on every container start. Use `modal.Volume` to persist model weights across invocations.
- **Not using `@app.cls` for stateful functions.** Claude loads the model inside the function on every call. Modal's `@app.cls()` with `@modal.enter()` loads the model once per container lifecycle; subsequent calls reuse the loaded model.
- **Image layer cache invalidation.** Claude puts frequently changing code in the image definition. Modal caches image layers: put stable dependencies in the image and dynamic code in the function. Use `modal.Mount` for code that changes between deploys.
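The first pitfall reduces to a download-once guard around the model fetch. Here is the platform-agnostic shape of that guard in plain Python (no Modal required; the function and parameter names are illustrative). In a Modal app, `cache_dir` would be a path mounted from a `modal.Volume`, and `download` would be something like `huggingface_hub.snapshot_download`.

```python
from pathlib import Path

def ensure_weights(cache_dir, model_id, download):
    """Run `download(target)` only if the weights are not already cached."""
    target = Path(cache_dir) / model_id.replace("/", "--")
    if not target.exists():
        target.mkdir(parents=True)
        download(target)  # expensive fetch happens at most once per cache
    return target
```

The second call with the same `model_id` finds the directory and skips the download, which is exactly what the Volume buys you across container cold starts.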