Document Processing
How Moongraph processes documents — parsing, chunking, and embedding.
This page explains what happens when you upload a document — how it's parsed, split into chunks, and prepared for search.
The processing pipeline
When you upload a document, it goes through several stages:
| Stage | What happens |
|---|---|
| Upload | File is transferred to storage |
| Parsing | Text content is extracted (OCR for images) |
| Chunking | Document is split into semantic segments |
| Embedding | Vectors are generated for semantic search |
| Storage | Data is indexed for retrieval |
How long each stage takes depends on document size and complexity. Scanned PDFs take longer than text-based documents because their text has to be extracted with OCR.
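To make the order of the stages concrete, here is a minimal sketch of a pipeline that runs them in sequence after upload. Every function name, the dict-based document record, and the toy "embedding" are assumptions made for illustration; none of them are Moongraph's actual implementation or API.

```python
from typing import Callable

Doc = dict  # a plain dict stands in for a document record (assumption)

def parse(doc: Doc) -> Doc:
    # Stand-in for text extraction; real parsing may involve OCR.
    return {**doc, "text": doc["raw"].decode("utf-8")}

def chunk(doc: Doc) -> Doc:
    # Stand-in for chunking: split on blank lines.
    return {**doc, "chunks": [p for p in doc["text"].split("\n\n") if p]}

def embed(doc: Doc) -> Doc:
    # Toy "embedding": one number per chunk (its length), not a real model.
    return {**doc, "vectors": [[float(len(c))] for c in doc["chunks"]]}

def store(doc: Doc) -> Doc:
    # Stand-in for indexing the chunks and vectors.
    return {**doc, "indexed": True}

# Stage names mirror the table above; the upload stage is assumed to be done.
STAGES: list[tuple[str, Callable[[Doc], Doc]]] = [
    ("Parsing", parse),
    ("Chunking", chunk),
    ("Embedding", embed),
    ("Storage", store),
]

def run_pipeline(doc: Doc) -> Doc:
    """Run each stage in order, recording the last stage that completed."""
    for name, stage in STAGES:
        doc = stage(doc)
        doc = {**doc, "last_stage": name}
    return doc

result = run_pipeline({"raw": b"First paragraph.\n\nSecond paragraph."})
```

Each stage consumes the output of the previous one, which is why a slow or error-prone parsing step affects everything downstream.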
Smart Analysis (ColPali)
For documents with tables, images, diagrams, or complex layouts, enable Smart Analysis during upload. This uses advanced visual processing that:
- Preserves table structure
- Extracts text from images and diagrams
- Handles multi-column layouts
- Processes handwritten content (with lower accuracy than printed text)
Why chunking
When you upload a document, Moongraph doesn't store it as a single blob. It breaks the document into chunks—smaller segments that can be independently searched and retrieved.
Large language models have limited context windows. When you ask a question, Moongraph can't send your entire document collection to the AI. Instead, it:
- Searches for relevant chunks using your query
- Retrieves the most relevant ones
- Sends those chunks as context to the AI
This is the foundation of RAG (Retrieval-Augmented Generation).
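The retrieve-then-send steps above can be sketched as follows. This is an illustrative outline only: `build_context`, `build_prompt`, the `top_k` and `max_chars` parameters, and the prompt wording are hypothetical, and the relevance scores are assumed to come from a search step that is not shown.

```python
def build_context(
    scored_chunks: list[tuple[float, str]],
    top_k: int = 3,
    max_chars: int = 2000,
) -> str:
    """Pick the highest-scoring chunks that fit a size budget and join them."""
    best = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_k]
    selected: list[str] = []
    used = 0
    for _score, text in best:
        if used + len(text) > max_chars:
            continue  # skip a chunk that would overflow the budget
        selected.append(text)
        used += len(text)
    return "\n---\n".join(selected)

def build_prompt(question: str, context: str) -> str:
    """Combine the retrieved context and the user's question (hypothetical wording)."""
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Scores are made-up values standing in for search results.
scored = [(0.2, "offsite agenda"), (0.9, "budget review"), (0.7, "cost estimates")]
context = build_context(scored, top_k=2)
prompt = build_prompt("What was decided about the budget?", context)
```

Only the two best-scoring chunks reach the prompt here; the low-scoring one is left out, which is how the context stays within the model's window.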
How chunking works
Documents are split based on:
- Paragraph boundaries
- Section breaks
- Page structure
- Semantic coherence (keeping related content together)
The goal is to create chunks that are:
- Small enough to fit in context windows
- Large enough to contain meaningful information
- Aligned with natural content boundaries
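As a rough illustration of boundary-aware splitting, the sketch below packs whole paragraphs into chunks up to a size limit. It is a simplified stand-in, not Moongraph's chunker: it only looks at blank-line paragraph boundaries, and the function name and `max_chars` limit are assumptions.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep related paragraphs together
        else:
            if current:
                chunks.append(current)
            # A paragraph longer than max_chars becomes its own oversized
            # chunk rather than being cut mid-sentence.
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
chunks = chunk_paragraphs(sample, max_chars=70)
```

With a 70-character limit, the first two paragraphs share a chunk and the third starts a new one, so no paragraph is split across chunks.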
Chunk quality matters
Better document quality leads to better chunks, which leads to better search results and AI answers.
What helps:
- Clear text (native PDFs, well-formatted documents)
- Structured content (headings, paragraphs, sections)
- Consistent formatting
What hurts:
- OCR errors from scanned documents
- Walls of unstructured text
- Mixed formatting and layouts
If you're getting poor results, document quality is often the cause.
What embeddings are
Each chunk is converted into an embedding—a numerical vector that captures its semantic meaning.
When you search for "budget discussions," Moongraph doesn't just look for those exact words. It finds chunks that are semantically similar to your query, even if they use different terminology like "financial planning" or "cost analysis."
This is semantic search, and it's different from keyword search. Semantic search understands meaning; keyword search matches strings.
How embeddings work
- Text is processed by an embedding model
- The model outputs a high-dimensional vector (hundreds or thousands of numbers)
- Similar meanings result in similar vectors
- Search compares the query embedding to chunk embeddings
Chunks with embeddings close to your query embedding are considered relevant and retrieved.
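The comparison step can be sketched with cosine similarity, one common way to measure how close two vectors are. This page does not say which measure or model Moongraph uses, so treat this as an assumption; the three-dimensional vectors are hand-made stand-ins for real embeddings.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real chunk embeddings (illustrative only).
chunk_vectors = {
    "financial planning for next year": [0.9, 0.1, 0.0],
    "cost analysis of the new office": [0.8, 0.3, 0.1],
    "team offsite agenda": [0.1, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "budget discussions"

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
    reverse=True,
)
```

The two finance-related chunks rank above the unrelated one even though none of them contains the word "budget", which is the behavior semantic search is meant to provide.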
Trade-offs
Semantic search strengths:
- Finds conceptually related content
- Works with different phrasings
- Understands synonyms and paraphrases
Semantic search weaknesses:
- May miss exact phrase matches
- Less precise for proper nouns and identifiers
- Can retrieve tangentially related content
For exact matches (names, IDs, specific phrases), lexical search may work better. The Agent's retrieve_chunks tool supports both semantic and lexical modes.
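One way to think about choosing between the two modes is a simple heuristic on the query itself. The helper below is hypothetical and is not part of Moongraph or the retrieve_chunks tool; the mode labels it returns are illustrative strings, not documented parameter values.

```python
def choose_mode(query: str) -> str:
    """Return "lexical" for exact-match style queries, otherwise "semantic"."""
    stripped = query.strip()
    # A fully quoted query signals that the exact phrase matters.
    if len(stripped) >= 2 and stripped[0] == '"' and stripped[-1] == '"':
        return "lexical"
    # Digits usually indicate IDs, invoice numbers, or other codes.
    if any(ch.isdigit() for ch in stripped):
        return "lexical"
    return "semantic"

modes = [choose_mode(q) for q in ["INV-2041", '"cost analysis"', "budget discussions"]]
```

A heuristic like this cannot recognize proper nouns on its own, so names may still need an explicit choice of lexical search.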
What affects processing speed
| Factor | Impact |
|---|---|
| Document size | Larger files take longer |
| Page count | More pages take longer to parse |
| Image content | OCR is slower than text extraction |
| Scan quality | Poor scans require more processing |
| Smart Analysis | Visual processing adds time but improves accuracy |
Tips for reliable processing
Document quality:
- Use 300 DPI minimum for scans
- Text-based PDFs process faster than scanned images
- Standard fonts are more reliably extracted
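To check whether an existing scan meets the 300 DPI guideline, you can divide its pixel dimensions by the physical page size in inches. The sketch below assumes a US Letter page (8.5 x 11 inches) by default; the function names are illustrative and you should adjust the page size for other paper formats.

```python
def scan_dpi(
    pixel_width: int,
    pixel_height: int,
    page_width_in: float = 8.5,    # assumed US Letter width
    page_height_in: float = 11.0,  # assumed US Letter height
) -> float:
    """Estimate scan resolution as the lower of horizontal and vertical DPI."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)

def meets_minimum(pixel_width: int, pixel_height: int, min_dpi: float = 300.0) -> bool:
    """True if a US Letter scan with these pixel dimensions reaches min_dpi."""
    return scan_dpi(pixel_width, pixel_height) >= min_dpi

full_res = scan_dpi(2550, 3300)      # 2550 / 8.5 and 3300 / 11 are both 300
low_res_ok = meets_minimum(1700, 2200)  # 200 DPI, below the guideline
```

A US Letter page scanned at 300 DPI is therefore 2550 x 3300 pixels; images noticeably smaller than that were likely scanned at a lower resolution.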
File preparation:
- Remove password protection before uploading
- Split very large documents (500+ pages)
- Use descriptive filenames
Related
- RAG & Retrieval — How the Agent uses chunks
- Files Reference — Processing statuses and limits
- Upload Documents — Step-by-step upload guide