Claude Code for Unstructured IO — Guide
The Setup
You are processing documents for RAG pipelines with Unstructured, a library that extracts and chunks content from PDFs, DOCX, HTML, images, and dozens of other file formats. Unstructured handles OCR, table extraction, and document partitioning into structured elements. Claude Code can process documents, but it writes custom parsing scripts for each file format instead of using Unstructured’s unified API.
What Claude Code Gets Wrong By Default
- **Writes format-specific parsers.** Claude creates separate scripts for PDF (PyPDF2), DOCX (python-docx), and HTML (BeautifulSoup). Unstructured provides `partition()`, which auto-detects and processes any supported format through a single function.
- **Extracts plain text without structure.** Claude dumps entire documents as raw text strings. Unstructured preserves document structure — titles, paragraphs, tables, and lists are returned as typed `Element` objects with metadata.
- **Implements custom chunking logic.** Claude writes character-count-based text splitting. Unstructured has built-in chunking strategies (`chunk_by_title`, `chunk_elements`) that respect document structure — chunks do not split mid-paragraph or mid-table.
- **Ignores OCR for scanned documents.** Claude skips scanned PDFs and images entirely. Unstructured integrates with Tesseract and other OCR engines — scanned documents are processed alongside digital ones.
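The difference between per-format parsers and typed elements can be made concrete with a small mock. This is not the real library: the actual classes live in `unstructured.documents.elements` and `partition()` in `unstructured.partition.auto`; the canned return value below stands in for real parsing.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """Mock of Unstructured's base element: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

# Typed subclasses, mirroring Title / NarrativeText / Table element types.
class Title(Element): pass
class NarrativeText(Element): pass
class Table(Element): pass

def partition(filename: str) -> list[Element]:
    """Stand-in for unstructured.partition.auto.partition(): one call
    for any supported format. Returns canned output here."""
    return [
        Title("Quarterly Report", {"filename": filename, "page_number": 1}),
        NarrativeText("Revenue grew 12% year over year.",
                      {"filename": filename, "page_number": 1}),
        Table("Region | Revenue\nEMEA | 4.2M",
              {"filename": filename, "page_number": 2}),
    ]

elements = partition("report.pdf")
titles = [e for e in elements if isinstance(e, Title)]
print([type(e).__name__ for e in elements])  # ['Title', 'NarrativeText', 'Table']
```

Because every format funnels through the same call and returns the same typed objects, downstream filtering and chunking code is written once, not once per format.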
The CLAUDE.md Configuration
```markdown
# Unstructured Document Processing

## Processing
- Library: Unstructured (document ETL for LLMs)
- Input: PDF, DOCX, HTML, PPTX, images, email, and more
- Output: structured Elements with metadata
- Use case: RAG pipeline document ingestion

## Unstructured Rules
- Partition: partition(filename) or partition_pdf/html/etc.
- Elements: Title, NarrativeText, Table, ListItem types
- Chunking: chunk_by_title for structure-aware splitting
- OCR: auto for scanned PDFs, configure strategy
- API: Unstructured API for hosted processing
- Metadata: source, page_number, coordinates

## Conventions
- Use partition() for auto-format detection
- Filter elements by type for targeted extraction
- chunk_by_title for RAG-ready chunks
- max_characters on chunks for embedding model limits
- Include metadata in vector store for attribution
- Use hi_res strategy for complex PDFs with tables
- Batch process with partition_multiple for directories
```
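The structure-aware chunking rule in the config can be sketched in plain Python. This toy version illustrates the idea only (start a new chunk at each title, respect a size cap); Unstructured's actual `chunk_by_title` operates on `Element` objects and handles tables, overlap, and more.

```python
def chunk_by_title_sketch(elements, max_characters=500):
    """Toy structure-aware chunking: open a new chunk at each Title,
    or when adding an element would exceed max_characters. Elements
    are (category, text) pairs standing in for real Element objects."""
    chunks, current, size = [], [], 0
    for category, text in elements:
        if current and (category == "Title" or size + len(text) > max_characters):
            chunks.append(current)          # flush the finished chunk
            current, size = [], 0
        current.append((category, text))    # whole elements only: no mid-paragraph splits
        size += len(text)
    if current:
        chunks.append(current)
    return ["\n".join(text for _, text in chunk) for chunk in chunks]

docs = [
    ("Title", "Introduction"),
    ("NarrativeText", "First paragraph of the intro."),
    ("Title", "Methods"),
    ("NarrativeText", "A paragraph about methods."),
]
chunks = chunk_by_title_sketch(docs, max_characters=500)
print(len(chunks))  # 2: one chunk per titled section
```

Because elements are appended whole, a chunk boundary can only fall between elements, which is why this style of chunking never splits mid-paragraph or mid-table.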
Workflow Example
You want to build a document ingestion pipeline for a RAG chatbot. Prompt Claude Code:
“Create a Python pipeline that processes a directory of mixed documents (PDFs, DOCX, HTML) using Unstructured. Partition each document, chunk by title with 500 character max, embed each chunk with OpenAI embeddings, and upsert to a vector database. Preserve source file and page number metadata.”
Claude Code should use `partition()` for auto-detection, filter for relevant element types, apply `chunk_by_title(max_characters=500)`, extract metadata (source filename, page_number), generate embeddings, and upsert to the vector store with metadata for source attribution.
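A minimal skeleton of that pipeline, with partitioning, chunking, embedding, and the vector-store upsert all stubbed out so the control flow runs on its own. In the real pipeline the stubs would be replaced by `unstructured.partition.auto.partition`, `unstructured.chunking.title.chunk_by_title`, an OpenAI embeddings call, and your vector database client.

```python
def partition(filename):
    """Stub for unstructured's partition(); returns (category, text)
    pairs instead of real Element objects."""
    return [("Title", "Doc heading"), ("NarrativeText", "Body text.")]

def chunk_by_title(elements, max_characters=500):
    """Stub for unstructured's chunk_by_title; one chunk per document here."""
    return [" ".join(text for _, text in elements)[:max_characters]]

def embed(text):
    """Stub for an embedding API call; returns a placeholder vector."""
    return [float(len(text))]

def ingest(files, store):
    """Partition, chunk, embed, and upsert each document, carrying
    source metadata through for attribution."""
    for name in files:
        elements = partition(name)
        for i, chunk in enumerate(chunk_by_title(elements, max_characters=500)):
            store.append({
                "vector": embed(chunk),
                "metadata": {"source": name, "chunk": i},
            })
    return store

store = ingest(["report.pdf", "notes.docx"], [])
print(len(store))  # 2
```

The key design point is that metadata rides along with each vector from partition time onward, so the chatbot can later cite the source file (and, with the real library, the page number) for every retrieved chunk.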
Common Pitfalls
- **Missing system dependencies for OCR.** Claude uses Unstructured's OCR features without installing Tesseract or Poppler. The `hi_res` strategy requires system packages — install `tesseract-ocr` and `poppler-utils` before processing scanned documents.
- **Using the fast strategy for complex PDFs.** Claude uses the default `fast` strategy for PDFs with tables and images. The `fast` strategy misses tables and embedded content — use `hi_res` for complex documents at the cost of slower processing.
- **Not filtering element types.** Claude sends every element to the embedding pipeline. Unstructured extracts headers, footers, page numbers, and other metadata elements. Filter for `NarrativeText`, `Title`, and `Table` to avoid indexing irrelevant content.
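The element-type filter is a one-line guard before embedding. The elements below are toy (category, text) pairs; real Unstructured elements expose a `.category` attribute that can be checked the same way.

```python
# Keep only content worth embedding; drop headers, footers, page numbers.
KEEP = {"NarrativeText", "Title", "Table"}

elements = [
    ("Header", "ACME Corp Confidential"),
    ("Title", "Results"),
    ("NarrativeText", "Sales exceeded forecast."),
    ("PageNumber", "7"),
    ("Footer", "Page 7 of 12"),
]

to_index = [(c, t) for c, t in elements if c in KEEP]
print(to_index)  # [('Title', 'Results'), ('NarrativeText', 'Sales exceeded forecast.')]
```

Without this guard, boilerplate like "Page 7 of 12" gets embedded and can surface as a retrieval hit, polluting the chatbot's context.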