Claude Code for Modal Serverless ML — Guide

Written by Michael Lip · Solo founder of Zovo · $400K+ on Upwork · 100% JSS · Join 50+ builders · More at zovo.one

The Setup

You are deploying machine learning workloads with Modal, a serverless platform designed for ML inference and batch processing. Modal provides GPU containers that scale to zero, custom container images defined in Python, and a Python-native SDK for deploying functions. Claude Code can set up ML infrastructure, but by default it generates Docker and Kubernetes configurations instead of using Modal’s Python-first approach.

What Claude Code Gets Wrong By Default

  1. Creates Dockerfiles and Kubernetes manifests. Claude writes Dockerfile, deployment.yaml, and service.yaml for ML serving. Modal replaces all of this with Python decorators — @app.function(gpu="A100") is the entire infrastructure definition.

  2. Uses Flask/FastAPI for serving. Claude creates REST API servers for inference. Modal has @modal.web_endpoint(), stacked on @app.function(), which turns a function into an HTTPS endpoint directly — no separate web server needed.

  3. Manages container images separately. Claude writes a Dockerfile for the ML environment. Modal defines images in Python: modal.Image.debian_slim().pip_install("torch", "transformers") — the image is built and cached by Modal.

  4. Provisions static GPU instances. Claude sets up always-on GPU servers. Modal scales to zero and cold-starts in seconds — you pay only for active compute, not idle GPUs.
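The four points above can be contrasted with a minimal Modal app. This is a sketch: the app name, model choice, and function bodies are placeholders, but the shape — image defined in Python, GPU requested per function, HTTPS endpoint via a decorator, scale-to-zero billing — is the Modal idiom.

```python
import modal

# The entire "infrastructure" is this one Python file: no Dockerfile,
# no Kubernetes manifests, no Flask/FastAPI server to manage.
image = modal.Image.debian_slim().pip_install("torch", "transformers")

app = modal.App("sentiment-demo", image=image)


@app.function(gpu="T4")  # GPU attached on demand; container scales to zero when idle
def classify(text: str) -> str:
    from transformers import pipeline  # heavy imports stay inside the container

    clf = pipeline("sentiment-analysis")
    return clf(text)[0]["label"]


@app.function()
@modal.web_endpoint()  # Modal provisions an HTTPS URL for this function
def classify_http(text: str):
    return {"label": classify.remote(text)}
```

Deploying is a single command (`modal deploy app.py`); the decorators replace the Dockerfile, deployment.yaml, and service.yaml Claude would otherwise generate.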

The CLAUDE.md Configuration

# Modal ML Deployment

## Infrastructure
- Platform: Modal (serverless ML compute)
- GPU: A100, H100, T4, L4 on-demand
- Scale: auto-scale to zero, pay per second
- Deploy: modal deploy app.py

## Modal Rules
- App: app = modal.App("my-app")
- Function: @app.function(gpu="A100", image=image)
- Image: modal.Image.debian_slim().pip_install(...)
- Web: stack @modal.web_endpoint() on @app.function() for HTTP endpoints
- Cron: @app.function(schedule=modal.Period(hours=1))
- Volumes: modal.Volume for persistent storage
- Secrets: modal.Secret.from_name("my-secret")

## Conventions
- Define images in Python, not Dockerfiles
- Use @app.cls() for stateful GPU functions (model loading)
- Load the model in a @modal.enter() method (one-time setup per container); run inference in regular methods
- Use volumes for model cache (avoid re-download)
- modal serve app.py for local development
- modal deploy app.py for production
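The schedule, secret, and volume rules from the config above can be combined in one function. A sketch, assuming a secret named "my-secret" and a volume named "model-cache" already exist in your Modal workspace (both names are placeholders):

```python
import modal

app = modal.App("nightly-batch")

# Persistent storage shared across containers and invocations
weights = modal.Volume.from_name("model-cache", create_if_missing=True)


@app.function(
    schedule=modal.Period(hours=1),                 # runs on a fixed interval
    secrets=[modal.Secret.from_name("my-secret")],  # injected as environment variables
    volumes={"/cache": weights},                    # volume mounted at /cache
)
def refresh():
    import os

    # Secret values arrive as env vars inside the container
    print("secret keys available:", sorted(os.environ.keys())[:5])
```

`modal serve app.py` hot-reloads this during development; `modal deploy app.py` registers the schedule so it runs without your laptop open.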

Workflow Example

You want to deploy a text-to-image model for inference. Prompt Claude Code:

“Deploy a Stable Diffusion XL model on Modal with a web endpoint. Use an A100 GPU, cache the model weights in a Modal Volume to avoid re-downloading, and return the generated image as a PNG response. Include a health check endpoint.”

Claude Code should create a Modal app with a custom image that installs diffusers and torch; an @app.cls(gpu="A100") class that loads the model in a @modal.enter() method from a Volume cache; an inference method decorated with @modal.web_endpoint() that accepts a prompt and returns a PNG; and a simple health check endpoint.
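A sketch of what a correct result might look like. The volume name and app name are placeholders; `fastapi` is included in the image because Modal web endpoints are built on it:

```python
import modal

image = modal.Image.debian_slim().pip_install(
    "diffusers", "torch", "transformers", "accelerate", "fastapi"
)
app = modal.App("sdxl-demo", image=image)
cache = modal.Volume.from_name("sdxl-weights", create_if_missing=True)


@app.cls(gpu="A100", volumes={"/cache": cache})
class SDXL:
    @modal.enter()  # runs once per container lifecycle, not per request
    def load(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            cache_dir="/cache",  # weights land in the Volume, not ephemeral disk
        ).to("cuda")
        cache.commit()  # persist any newly downloaded weights

    @modal.web_endpoint()
    def generate(self, prompt: str):
        import io

        from fastapi import Response

        img = self.pipe(prompt).images[0]
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return Response(content=buf.getvalue(), media_type="image/png")


@app.function()
@modal.web_endpoint()
def health():
    return {"status": "ok"}
```

The first invocation downloads weights into the Volume; every container after that reads them from /cache, so cold starts skip the multi-gigabyte download.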

Common Pitfalls

  1. Re-downloading models on every cold start. Claude loads models from Hugging Face in the function body. Without a Modal Volume for caching, the model downloads on every container start. Use modal.Volume to persist model weights across invocations.

  2. Not using @app.cls for stateful functions. Claude loads the model inside the function on every call. Modal’s @app.cls() with @modal.enter() loads the model once per container lifecycle — subsequent calls reuse the loaded model.

  3. Image layer cache invalidation. Claude puts frequently changing code in the image definition. Modal caches image layers — put stable dependencies in the image and dynamic code in the function. Use modal.Mount for code that changes between deploys.
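For pitfall 1 specifically, a common pattern is a one-off function that pre-downloads weights into the Volume so inference containers never hit Hugging Face at all. A sketch, with placeholder app and volume names:

```python
import modal

image = modal.Image.debian_slim().pip_install("huggingface_hub")
app = modal.App("weight-cache", image=image)
cache = modal.Volume.from_name("hf-cache", create_if_missing=True)


@app.function(volumes={"/cache": cache})
def download_weights():
    from huggingface_hub import snapshot_download

    # Run once (e.g. `modal run app.py::download_weights`); later containers
    # mount the same Volume and find the weights already on disk.
    snapshot_download(
        "stabilityai/stable-diffusion-xl-base-1.0", cache_dir="/cache"
    )
    cache.commit()  # make the writes visible to other containers
```

Without the explicit `commit()`, writes to a Volume may not be visible to containers that attach later — an easy way to "cache" weights and still re-download them.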