Structured Extraction

Structured extraction lets you pull specific information out of documents automatically. Instead of reading through hundreds of pages to find details, you define what you're looking for and the system finds and organizes it for you.

What extraction does

You define a schema describing what you want to extract: field names such as "Author Name" or "Publication Date", each with a type. You then run extraction on selected documents, and the system produces a spreadsheet-like table of results.

Example: You have 50 research papers. You want to extract author names, publication dates, and key findings from each. Extraction processes all 50 and gives you a structured table you can filter, sort, and export.

Key concepts

Schemas

A schema is a template describing what to extract:

Field names — What to call each piece of data
Field types — What kind of data it is (text, number, enum, etc.)
Instructions — Optional guidance for accuracy

Schemas can be reused across multiple extraction runs.
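
As a rough illustration, a schema can be thought of as plain data. The sketch below is not the product's actual API; the class and attribute names (Field, Schema, instructions) are hypothetical and only mirror the three parts listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """One piece of data to extract (hypothetical structure)."""
    name: str                           # what to call the piece of data
    type: str                           # e.g. "text", "number", "enum", "array"
    instructions: Optional[str] = None  # optional guidance for accuracy


@dataclass
class Schema:
    """A reusable template: a named list of fields."""
    name: str
    fields: List[Field] = field(default_factory=list)


# Example schema for the research-paper scenario described earlier.
paper_schema = Schema(
    name="Research papers",
    fields=[
        Field("Author Name", "text"),
        Field("Publication Date", "text", "Use the date printed on the first page."),
        Field("Key Findings", "array"),
    ],
)
```

Because the schema is defined separately from any documents, the same object could be passed to several runs, which is the reuse described above.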

Runs

A run is an extraction job. You select a schema and a set of documents, and the system processes each document in the background. Results appear as each document completes.

Processing runs at roughly 10 pages per minute. For example, a run of 50 documents with 20 pages each (about 1,000 pages in total) takes about 100 minutes.
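
The same arithmetic can be written as a small helper for planning runs. This is only a back-of-the-envelope sketch: the function name is made up, and the rate is the approximate figure quoted above, not a guarantee.

```python
PAGES_PER_MINUTE = 10  # approximate rate quoted in this section


def estimate_minutes(num_documents: int, pages_per_document: int) -> float:
    """Estimate run time from document count and average page count."""
    total_pages = num_documents * pages_per_document
    return total_pages / PAGES_PER_MINUTE


print(estimate_minutes(50, 20))  # 100.0
```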

Provenance

Every extracted value includes provenance — a direct quote from the source document showing where the data came from. This lets you:

Verify accuracy
Check original context
Cite sources
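
If you export results and want to spot-check provenance yourself, one simple approach is to confirm that each quote really occurs in the source text. The sketch below assumes you have the document text as a string; the function names are hypothetical, and the whitespace and case normalization is an assumption about how quotes might differ from the raw text.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_found(quote: str, document_text: str) -> bool:
    """Return True if the provenance quote occurs in the document text."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(document_text)


document = "The study was\npublished in 2021 by J. Smith."
print(quote_found("published in 2021", document))  # True
print(quote_found("published in 2019", document))  # False
```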

Field types

Type	What it captures	Use case
Text	Free-form text	Names, descriptions, quotes
Number	Numeric values	Years, counts, measurements
Boolean	Yes/No answer	"Is peer reviewed?", "Contains tables?"
Enum	One choice from a predefined list	Categories, ratings, classifications
Object	Grouped sub-fields	Addresses (city, state, zip)
Array	List of multiple values	Multiple authors, keywords
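
To make the table concrete, the sketch below checks whether a Python value fits each field type. It is an illustrative mapping only; the lowercase type names and the matches_type function are assumptions, not the system's actual validation logic.

```python
def matches_type(value, field_type, options=None):
    """Return True if value is a plausible instance of the given field type."""
    if field_type == "text":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "enum":
        # exactly one choice from a predefined list
        return options is not None and value in options
    if field_type == "object":
        return isinstance(value, dict)   # grouped sub-fields
    if field_type == "array":
        return isinstance(value, list)   # list of multiple values
    raise ValueError(f"unknown field type: {field_type}")


print(matches_type(2021, "number"))                              # True
print(matches_type("high", "enum", ["low", "medium", "high"]))   # True
print(matches_type({"city": "Austin", "state": "TX"}, "object")) # True
```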

Row granularity

What determines one row in results?

The system infers row granularity from your schema. For example, with the fields "Author", "Book", and "Character":

Flat fields → One row per character (author/book repeated)
Characters as Array → One row per book (characters listed together)
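
The difference between the two shapes can be sketched with ordinary Python data. The helper names are hypothetical, and the code only illustrates the resulting row layout, not how the system itself decides.

```python
record = {
    "Author": "Jane Austen",
    "Book": "Emma",
    "Characters": ["Emma Woodhouse", "Mr. Knightley"],
}


def rows_per_character(rec):
    """Flat fields: one row per character, with author and book repeated."""
    return [
        {"Author": rec["Author"], "Book": rec["Book"], "Character": name}
        for name in rec["Characters"]
    ]


def row_per_book(rec):
    """Characters as an array: one row per book, characters listed together."""
    return [dict(rec)]


print(len(rows_per_character(record)))  # 2
print(len(row_per_book(record)))        # 1
```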

Use extraction instructions to be explicit:

"Extract one row for each book. List all characters in the Characters field."

Visibility

Control who can use your schema:

Level	Who can see it
Private	Only you
Shared	Your team
Public	All users
Template	Admin-created, available to everyone, cannot be edited
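
Read as a rule, the table amounts to a small visibility check. The sketch below is an assumed reading, not the product's access-control code: the can_see function and its parameters are hypothetical, and it assumes the owner can always see their own schema.

```python
def can_see(level: str, is_owner: bool, same_team: bool) -> bool:
    """Decide whether a user can see a schema at the given visibility level."""
    level = level.lower()
    if level == "private":
        return is_owner                 # only you
    if level == "shared":
        return is_owner or same_team    # your team
    if level in ("public", "template"):
        return True                     # all users
    raise ValueError(f"unknown visibility level: {level}")


print(can_see("Private", is_owner=False, same_team=True))    # False
print(can_see("Shared", is_owner=False, same_team=True))     # True
print(can_see("Template", is_owner=False, same_team=False))  # True
```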

When to use extraction

Extraction works well for:

Pulling consistent fields from similar documents (research papers, contracts, reports)
Building datasets from document collections
Extracting facts for analysis or comparison

Extraction is less suited for:

Documents with highly variable structure
Very open-ended information needs
One-off questions (use Agent instead)
