Document Processing

How Moongraph processes documents — parsing, chunking, and embedding.

This page explains what happens when you upload a document — how it's parsed, split into chunks, and prepared for search.

The processing pipeline

When you upload a document, it goes through several stages:

| Stage | What happens |
| --- | --- |
| Upload | File is transferred to storage |
| Parsing | Text content is extracted (OCR for images) |
| Chunking | Document is split into semantic segments |
| Embedding | Vectors are generated for semantic search |
| Storage | Data is indexed for retrieval |

How long each stage takes depends on document size and complexity. Scanned PDFs and image-heavy files take longer than text-based documents because they need OCR.

Smart Analysis (ColPali)

For documents with tables, images, diagrams, or complex layouts, enable Smart Analysis during upload. This uses advanced visual processing that:

  • Preserves table structure
  • Extracts text from images and diagrams
  • Handles multi-column layouts
  • Processes handwritten content (with lower accuracy)

Why chunking

When you upload a document, Moongraph doesn't store it as a single blob. It breaks the document into chunks—smaller segments that can be independently searched and retrieved.

Large language models have limited context windows. When you ask a question, Moongraph can't send your entire document collection to the AI. Instead, it:

  1. Searches for relevant chunks using your query
  2. Retrieves the most relevant ones
  3. Sends those chunks as context to the AI

This is the foundation of RAG (Retrieval-Augmented Generation).
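
The three steps above can be sketched in a few lines. This is a minimal illustration, not Moongraph's implementation: the `Chunk` class, the word-overlap `score` function, and the prompt wording are all assumptions made for the example. A real system would rank chunks with embeddings rather than word overlap.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc: str   # hypothetical source filename
    text: str  # chunk content


def score(query: str, chunk: Chunk) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def retrieve(query: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Steps 1 and 2: score every chunk, keep the top_k most relevant."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[Chunk]) -> str:
    """Step 3: pass only the retrieved chunks to the model as context."""
    context_block = "\n\n".join(f"[{c.doc}] {c.text}" for c in context)
    return f"Answer using only this context:\n\n{context_block}\n\nQuestion: {query}"


chunks = [
    Chunk("budget.pdf", "The 2024 budget allocates funds to hiring."),
    Chunk("handbook.pdf", "Employees accrue vacation days monthly."),
    Chunk("minutes.pdf", "The board approved the budget in March."),
]
query = "who approved the budget"
top = retrieve(query, chunks, top_k=2)
prompt = build_prompt(query, top)
print(prompt)
```

Only the two retrieved chunks reach the prompt. The unrelated handbook chunk is left out, which is what keeps the request within the model's context window.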

How chunking works

Documents are split based on:

  • Paragraph boundaries
  • Section breaks
  • Page structure
  • Semantic coherence (keeping related content together)

The goal is to create chunks that are:

  • Small enough to fit in context windows
  • Large enough to contain meaningful information
  • Aligned with natural content boundaries
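
As a rough illustration of boundary-aligned chunking, the sketch below packs whole paragraphs into chunks up to a size limit. It is a simplification under stated assumptions: the function name `chunk_paragraphs`, the blank-line paragraph separator, and the 500-character default are hypothetical, and it ignores the section, page, and semantic signals listed above.

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly max_chars."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # Still fits: keep related paragraphs together.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Never split inside a paragraph; a paragraph longer than
            # max_chars becomes its own oversized chunk.
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = ("A" * 40) + "\n\n" + ("B" * 40) + "\n\n" + ("C" * 40)
result = chunk_paragraphs(sample, max_chars=100)
print([len(c) for c in result])
```

The first two paragraphs fit together under the limit, so they share a chunk. The third would push the chunk past the limit, so it starts a new one.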

Chunk quality matters

Better document quality leads to better chunks, which leads to better search results and AI answers.

What helps:

  • Clear text (native PDFs, well-formatted documents)
  • Structured content (headings, paragraphs, sections)
  • Consistent formatting

What hurts:

  • OCR errors from scanned documents
  • Walls of unstructured text
  • Mixed formatting and layouts

If you're getting poor results, document quality is often the cause.

What embeddings are

Each chunk is converted into an embedding—a numerical vector that captures its semantic meaning.

When you search for "budget discussions," Moongraph doesn't just look for those exact words. It finds chunks that are semantically similar to your query, even if they use different terminology like "financial planning" or "cost analysis."

This is semantic search, and it's different from keyword search. Semantic search understands meaning; keyword search matches strings.

How embeddings work

  1. Text is processed by an embedding model
  2. The model outputs a high-dimensional vector (hundreds or thousands of numbers)
  3. Similar meanings result in similar vectors
  4. Search compares the query embedding to chunk embeddings

Chunks with embeddings close to your query embedding are considered relevant and retrieved.
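
To make "close" concrete, the sketch below ranks chunks by cosine similarity, a common way to compare embeddings. Whether Moongraph uses cosine similarity is an assumption. The three-dimensional vectors are hand-made stand-ins, since real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, not output from a real embedding model.
chunk_vectors = {
    "financial planning for next year": [0.9, 0.1, 0.0],
    "cost analysis of the new office": [0.8, 0.3, 0.1],
    "team offsite travel itinerary": [0.1, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the query "budget discussions"

ranked = sorted(
    chunk_vectors,
    key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, chunk_vectors[text]):.3f}  {text}")
```

The two finance-related chunks rank above the travel chunk even though none of them contains the word "budget", because their vectors point in nearly the same direction as the query vector.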

Trade-offs

Semantic search strengths:

  • Finds conceptually related content
  • Works with different phrasings
  • Understands synonyms and paraphrases

Semantic search weaknesses:

  • May miss exact phrase matches
  • Less precise for proper nouns and identifiers
  • Can retrieve tangentially related content

For exact matches (names, IDs, specific phrases), lexical search may work better. The Agent's retrieve_chunks tool supports both semantic and lexical modes.
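
The sketch below shows why lexical matching suits identifiers. It is a toy case-insensitive substring search, not the behavior or signature of the retrieve_chunks tool. The `lexical_search` function and the invoice numbers are invented for illustration.

```python
def lexical_search(query: str, chunks: list[str]) -> list[str]:
    """Return chunks that contain the query string, ignoring case."""
    needle = query.casefold()
    return [chunk for chunk in chunks if needle in chunk.casefold()]


chunks = [
    "Invoice INV-2024-0117 was paid on 3 March.",
    "Invoice INV-2024-0171 is still outstanding.",
    "Payment terms are net 30 for all invoices.",
]
hits = lexical_search("inv-2024-0117", chunks)
print(hits)
```

The two invoice numbers differ by a transposed digit. A string match returns only the exact one, whereas a semantic search could reasonably consider all three chunks related to invoices.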

What affects processing speed

| Factor | Impact |
| --- | --- |
| Document size | Larger files take longer |
| Page count | More pages = longer parsing |
| Image content | OCR is slower than text extraction |
| Scan quality | Poor scans require more processing |
| Smart Analysis | Visual processing adds time but improves accuracy |

Tips for reliable processing

Document quality:

  • Scan at 300 DPI or higher
  • Prefer text-based PDFs, which process faster than scanned images
  • Use standard fonts, which are extracted more reliably

File preparation:

  • Remove password protection before uploading
  • Split very large documents (500+ pages)
  • Use descriptive filenames
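
When splitting a very large document, it helps to work out the page ranges first. The helper below is a small sketch: the name `page_ranges` and the 500-page part size are assumptions taken from the guideline above, not a Moongraph limit. The resulting ranges can be fed to whichever PDF tool you use to do the actual split.

```python
def page_ranges(total_pages: int, max_pages: int = 500) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges per part."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


parts = page_ranges(1200)
print(parts)
```

For a 1,200-page document this yields three parts, with the last one shorter than the others.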
