Document Processing

This page explains what happens when you upload a document: how it's parsed, split into chunks, and prepared for search.

The processing pipeline

When you upload a document, it goes through several stages:

Stage	What happens
Upload	File is transferred to storage
Parsing	Text content is extracted (OCR for images)
Chunking	Document is split into semantic segments
Embedding	Vectors are generated for semantic search
Storage	Data is indexed for retrieval

How long each stage takes depends on document size and complexity. Scanned PDFs and image-heavy files take longer than text-based documents because their text has to be extracted with OCR.

Smart Analysis (ColPali)

For documents with tables, images, diagrams, or complex layouts, enable Smart Analysis during upload. Smart Analysis uses visual processing that:

Preserves table structure
Extracts text from images and diagrams
Handles multi-column layouts
Processes handwritten content (with lower accuracy)

Why chunking

When you upload a document, Moongraph doesn't store it as a single blob. It breaks the document into chunks: smaller segments that can be searched and retrieved independently.

Large language models have limited context windows. When you ask a question, Moongraph can't send your entire document collection to the AI. Instead, it:

Searches for relevant chunks using your query
Retrieves the most relevant ones
Sends those chunks as context to the AI

This is the foundation of RAG (Retrieval-Augmented Generation).
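
The three steps above can be sketched in a few lines. This is an illustrative outline only, not Moongraph's implementation: `build_context`, `word_overlap`, and the prompt layout are hypothetical names chosen for this example, and the word-overlap score is a simple stand-in for the real relevance search.

```python
from typing import Callable


def word_overlap(query: str, chunk: str) -> float:
    """Stand-in relevance score: number of lowercase words shared by query and chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def build_context(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float] = word_overlap,
    top_k: int = 2,
) -> str:
    """Rank chunks against the query, keep the top_k, and format them as AI context."""
    # Step 1: search, scoring every chunk against the query.
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    # Step 2: retrieve only the most relevant chunks.
    selected = ranked[:top_k]
    # Step 3: send those chunks, not the whole collection, as context.
    context = "\n---\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Because only `top_k` chunks are included, the prompt size stays bounded no matter how large the document collection grows.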

How chunking works

Documents are split based on:

Paragraph boundaries
Section breaks
Page structure
Semantic coherence (keeping related content together)

The goal is to create chunks that are:

Small enough to fit in context windows
Large enough to contain meaningful information
Aligned with natural content boundaries
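
To make these goals concrete, here is a minimal sketch of one possible strategy: packing whole paragraphs into chunks up to a size limit. It is an assumption for illustration, not the chunker Moongraph uses. The function name `chunk_paragraphs`, the `max_chars` limit, and the blank-line paragraph rule are all hypothetical, and the sketch ignores section breaks, page structure, and semantic coherence.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars characters."""
    # Treat blank lines as paragraph boundaries (an assumption for this sketch).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it is oversized on its own;
            # an oversized paragraph is kept whole rather than cut mid-paragraph.
            current = candidate
        else:
            # Adding the paragraph would exceed the limit: close the chunk here,
            # on a natural boundary, and start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A smaller `max_chars` yields more precise but less self-contained chunks; a larger one keeps more surrounding context at the cost of using more of the context window.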

Chunk quality matters

Better document quality leads to better chunks, which leads to better search results and AI answers.

What helps:

Clear text (native PDFs, well-formatted documents)
Structured content (headings, paragraphs, sections)
Consistent formatting

What hurts:

OCR errors from scanned documents
Walls of unstructured text
Mixed formatting and layouts

If you're getting poor results, document quality is often the cause.

What embeddings are

Each chunk is converted into an embedding: a numerical vector that captures its semantic meaning.

When you search for "budget discussions," Moongraph doesn't just look for those exact words. It finds chunks that are semantically similar to your query, even if they use different terminology like "financial planning" or "cost analysis."

This is semantic search, and it's different from keyword search. Semantic search understands meaning; keyword search matches strings.

How embeddings work

Text is processed by an embedding model
The model outputs a high-dimensional vector (hundreds or thousands of numbers)
Similar meanings result in similar vectors
Search compares the query embedding to chunk embeddings

Chunks with embeddings close to your query embedding are considered relevant and retrieved.
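
The comparison step can be illustrated with cosine similarity, a common way to measure how close two vectors are. This page does not say which metric Moongraph uses, so treat the metric as an assumption. The helper names are hypothetical, and real embeddings come from an embedding model rather than being written by hand.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated.
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (chunk_id, similarity) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, with a query vector of `[1.0, 0.0, 0.0]`, a chunk at `[0.9, 0.1, 0.0]` ranks above a chunk at `[0.0, 1.0, 0.0]` because it points in nearly the same direction.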

Trade-offs

Semantic search strengths:

Finds conceptually related content
Works with different phrasings
Understands synonyms and paraphrases

Semantic search weaknesses:

May miss exact phrase matches
Less precise for proper nouns and identifiers
Can retrieve tangentially related content

For exact matches (names, IDs, specific phrases), lexical search may work better. The Agent's retrieve_chunks tool supports both semantic and lexical modes.
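
To show why lexical search suits identifiers, here is a toy exact-match search. It is a sketch only: `lexical_search` is a hypothetical helper, not the interface of the Agent's retrieve_chunks tool, and real lexical search typically uses tokenization and ranking rather than plain substring matching.

```python
def lexical_search(query: str, chunks: list[str]) -> list[str]:
    """Return the chunks that contain the query as a case-insensitive substring."""
    needle = query.lower()
    return [chunk for chunk in chunks if needle in chunk.lower()]


# Two chunks that differ only in an identifier.
invoices = [
    "Invoice INV-2041 was paid on time.",
    "Invoice INV-2014 is overdue.",
]
paid = lexical_search("INV-2041", invoices)  # matches only the first chunk
```

The two invoice chunks read almost identically, so a meaning-based search may retrieve both of them, while the string match separates them.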

What affects processing speed

Factor	Impact
Document size	Larger files take longer
Page count	More pages take longer to parse
Image content	OCR is slower than text extraction
Scan quality	Poor scans require more processing
Smart Analysis	Visual processing adds time but improves accuracy

Tips for reliable processing

Document quality:

Scan at 300 DPI or higher
Text-based PDFs process faster than scanned images
Standard fonts are more reliably extracted

File preparation:

Remove password protection before uploading
Split very large documents (500+ pages)
Use descriptive filenames
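
When splitting a very large document, it helps to plan the page ranges first. The sketch below only computes the ranges; the function name `split_ranges` and the default of 500 pages per part are assumptions taken from the tip above, and the actual splitting would be done with a PDF tool of your choice.

```python
def split_ranges(total_pages: int, max_pages: int = 500) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

A 1,200-page document, for instance, would be planned as pages 1 to 500, 501 to 1000, and 1001 to 1200.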

RAG & Retrieval — How the Agent uses chunks
Files Reference — Processing statuses and limits
Upload Documents — Step-by-step upload guide
