Structured Extraction
How Moongraph extracts structured data from documents using schemas.
Structured extraction lets you pull specific information out of documents automatically. Instead of reading through hundreds of pages to find details, you define what you're looking for and the system finds and organizes it for you.
What extraction does
You define a schema that describes what to extract: field names such as "Author Name" or "Publication Date", each with a type. You then run extraction on the documents you select, and the system produces a spreadsheet-like table of results.
Example: You have 50 research papers. You want to extract author names, publication dates, and key findings from each. Extraction processes all 50 and gives you a structured table you can filter, sort, and export.
Key concepts
Schemas
A schema is a template describing what to extract:
- Field names — What to call each piece of data
- Field types — What kind of data it is (text, number, enum, etc.)
- Instructions — Optional guidance for accuracy
Schemas can be reused across multiple extraction runs.
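To make the three parts of a schema concrete, here is a minimal sketch in Python. It is not Moongraph's actual schema format or API; the `SchemaField` and `Schema` classes and their attribute names are hypothetical and only mirror the concepts listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaField:
    """One field to extract (hypothetical structure, for illustration only)."""
    name: str                           # what to call the piece of data
    type: str                           # e.g. "text", "number", "enum", "array"
    instructions: Optional[str] = None  # optional guidance for accuracy


@dataclass
class Schema:
    """A reusable template made of fields (hypothetical structure)."""
    name: str
    fields: List[SchemaField] = field(default_factory=list)


# A schema for the research-paper example used on this page.
paper_schema = Schema(
    name="Research papers",
    fields=[
        SchemaField("Author Name", "array"),
        SchemaField(
            "Publication Date",
            "text",
            instructions="Use the date printed on the paper itself.",
        ),
        SchemaField("Key Findings", "text"),
    ],
)

for f in paper_schema.fields:
    print(f.name, f.type, f.instructions)
```

Because the schema is a standalone object, the same definition can be passed to any number of runs.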
Runs
A run is an extraction job. You select a schema and documents, and the system processes each document in the background. Results appear as they complete.
Processing runs at roughly 10 pages per minute. A run of 50 documents with 20 pages each (about 1,000 pages in total) therefore takes about 100 minutes.
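The estimate is simple arithmetic: total pages divided by the processing rate. A small sketch, assuming the rough 10-pages-per-minute figure holds for your documents (the function name is illustrative, not part of Moongraph):

```python
def estimate_run_minutes(page_counts, pages_per_minute=10):
    """Estimate run duration from per-document page counts.

    The default rate is the rough figure quoted above, not a guarantee.
    """
    total_pages = sum(page_counts)
    return total_pages / pages_per_minute


# 50 documents with 20 pages each: 1,000 pages in total.
minutes = estimate_run_minutes([20] * 50)
print(minutes)  # 100.0
```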
Provenance
Every extracted value includes provenance — a direct quote from the source document showing where the data came from. This lets you:
- Verify accuracy
- Check original context
- Cite sources
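Since provenance is a direct quote, one way to verify accuracy is to confirm that the quote really occurs in the source text. The sketch below assumes you have an extracted value and the document text available as plain strings; the `ExtractedValue` class and its attributes are hypothetical, not Moongraph's export format.

```python
from dataclasses import dataclass


@dataclass
class ExtractedValue:
    """An extracted value with its supporting quote (hypothetical structure)."""
    field_name: str
    value: str
    quote: str  # the provenance: a direct quote from the source document


def normalize(text):
    """Collapse whitespace and lowercase so line breaks do not cause mismatches."""
    return " ".join(text.split()).lower()


def quote_appears_in(source_text, extracted):
    """Return True if the provenance quote occurs in the source text."""
    return normalize(extracted.quote) in normalize(source_text)


source = "The study was\nled by Dr. A. Example and published in 2020."
good = ExtractedValue("Author Name", "Dr. A. Example", "led by Dr. A. Example")
bad = ExtractedValue("Author Name", "Dr. B. Sample", "led by Dr. B. Sample")

print(quote_appears_in(source, good))  # True
print(quote_appears_in(source, bad))   # False
```

A substring check like this only confirms that the quote exists. Whether the quote actually supports the value still requires reading it in context.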
Field types
| Type | What it captures | Use case |
|---|---|---|
| Text | Free-form text | Names, descriptions, quotes |
| Number | Numeric values | Years, counts, measurements |
| Boolean | Yes/No answer | "Is peer reviewed?", "Contains tables?" |
| Enum | One choice from a predefined list | Categories, ratings, classifications |
| Object | Grouped sub-fields | Addresses (city, state, zip) |
| Array | List of multiple values | Multiple authors, keywords |
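As a rough illustration of what each type means for a value, the sketch below maps the six field types to Python checks. The lowercase type names and the `matches_type` helper are assumptions made for this example; they do not describe how Moongraph validates results internally.

```python
def matches_type(value, field_type, choices=None):
    """Check whether a Python value fits one of the field types in the table."""
    if field_type == "text":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "enum":
        # An enum value must be one of the predefined choices.
        return choices is not None and value in choices
    if field_type == "object":
        return isinstance(value, dict)
    if field_type == "array":
        return isinstance(value, list)
    raise ValueError(f"Unknown field type: {field_type}")


print(matches_type(2021, "number"))                             # True
print(matches_type(True, "number"))                             # False
print(matches_type("High", "enum", choices=["Low", "High"]))    # True
print(matches_type({"city": "Oslo", "zip": "0150"}, "object"))  # True
print(matches_type(["nlp", "extraction"], "array"))             # True
```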
Row granularity
What determines one row in results?
Extraction infers the answer from the structure of your schema. For example, with the fields "Author", "Book", and "Character":
- Flat fields → One row per character (author/book repeated)
- Characters as Array → One row per book (characters listed together)
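The difference between the two layouts is easiest to see on sample data. The sketch below uses made-up placeholder values and plain Python dictionaries; it illustrates the two row shapes and is not Moongraph output.

```python
# One book with two characters (placeholder data).
book = {
    "Author": "Author A",
    "Book": "Book One",
    "Characters": ["Character X", "Character Y"],
}


def rows_per_character(record):
    """Flat fields: one row per character, with author and book repeated."""
    return [
        {"Author": record["Author"], "Book": record["Book"], "Character": name}
        for name in record["Characters"]
    ]


def rows_per_book(record):
    """Characters as an array: one row per book, characters listed together."""
    return [
        {
            "Author": record["Author"],
            "Book": record["Book"],
            "Characters": list(record["Characters"]),
        }
    ]


for row in rows_per_character(book):
    print(row)
for row in rows_per_book(book):
    print(row)
```

The flat layout is convenient for filtering by character, while the array layout keeps one row per book and avoids repeated values.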
Use extraction instructions to be explicit:
"Extract one row for each book. List all characters in the Characters field."
Visibility
Visibility controls who can see your schema:
| Level | Who can see it |
|---|---|
| Private | Only you |
| Shared | Your team |
| Public | All users |
| Template | Admin-created, available to everyone, cannot be edited |
When to use extraction
Extraction works well for:
- Pulling consistent fields from similar documents (research papers, contracts, reports)
- Building datasets from document collections
- Extracting facts for analysis or comparison
Extraction is less suited for:
- Documents with highly variable structure
- Very open-ended information needs
- One-off questions (use Agent instead)