Custom Extractor

Identify and extract specific data from your documents. Train a custom ML model to identify and extract custom entities, checkboxes, and other form elements from documents with as few as 10 examples.

Overview

The Custom Extractor trains field-level extraction models on your documents. Define the fields you need, label a few examples, and M3 Forge trains a model that learns your specific document layouts and terminology.

Use cases:

Invoice fields (number, date, vendor, totals)
Contract terms and clauses
Form field values (names, addresses, dates)
Table data extraction
Checkbox and radio button detection
Custom entity recognition

Creating a Custom Extractor

Create Processor

Navigate to Processors → Custom Processors and click Create processor on the Custom Extractor card. Enter a descriptive name (e.g., “ACME Invoice Extractor”) in the slide panel and click Create.

Define Fields

Open your extractor and navigate to the Field Management tab. This is where you define what data to extract.

Figure: Field Management tab showing the extraction fields table with data types, occurrence patterns, and color indicators.

Click Add Field to configure each extraction target:

Setting	Description	Example
Field Name	Machine-friendly identifier	`invoice_number`
Display Name	Human-readable label	“Invoice Number”
Data Type	Value type	PLAIN_TEXT, DATETIME, NUMBER, CURRENCY, ADDRESS, CHECKBOX
Occurrence	How often the field appears	OPTIONAL_ONCE, OPTIONAL_MULTIPLE, REQUIRED_ONCE, REQUIRED_MULTIPLE
Method	Extraction approach	EXTRACT, NORMALIZE, CLASSIFY, DETECT
Prompt Hint	Guidance for AI extraction	“The invoice number usually appears in the top-right corner”
Color	Visual indicator in labeling UI	Color swatch

Fields support parent/child hierarchies for nested extraction (e.g., a “Line Items” parent with “Description”, “Quantity”, “Price” children).
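
One way to picture these settings is as a small record type with optional children. The sketch below is illustrative only: the `FieldDef` class, its attribute names, and the `flatten` helper are hypothetical and are not part of M3 Forge's API. Only the enum-like string values come from the table above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldDef:
    """Hypothetical in-memory mirror of one extraction field definition."""
    name: str                          # machine-friendly identifier
    display_name: str                  # human-readable label
    data_type: str = "PLAIN_TEXT"
    occurrence: str = "OPTIONAL_ONCE"
    method: str = "EXTRACT"
    prompt_hint: Optional[str] = None
    children: List["FieldDef"] = field(default_factory=list)

    def flatten(self, prefix: str = "") -> List[str]:
        """Return dotted paths for this field and all of its descendants."""
        path = f"{prefix}.{self.name}" if prefix else self.name
        paths = [path]
        for child in self.children:
            paths.extend(child.flatten(path))
        return paths


# A "Line Items" parent with three child fields, as in the example above.
line_items = FieldDef(
    name="line_items",
    display_name="Line Items",
    occurrence="OPTIONAL_MULTIPLE",
    children=[
        FieldDef("description", "Description"),
        FieldDef("quantity", "Quantity", data_type="NUMBER"),
        FieldDef("price", "Price", data_type="CURRENCY"),
    ],
)

print(line_items.flatten())
```

The dotted paths are just one possible way to refer to nested fields; the product may identify child fields differently.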

Document Prompt: Optionally add a document-level prompt that provides context for the AI about what kind of document it’s processing and what to look for.

Generate Schema with AI

For faster setup, use the Schema Generator panel:

Click Generate Schema in the field management toolbar
Upload up to 5 representative documents (drag-and-drop or S3 import)
AI analyzes documents and suggests extraction fields
Review the generated fields: each shows its name, display name, data type, occurrence, and description
Select which fields to import using checkboxes
Preview the JSON schema of selected fields
Click Apply to create all selected fields at once

Figure: Schema Generator panel showing AI-suggested fields with checkboxes for selection and a JSON schema preview.

The Schema Generator is particularly useful when starting with a new document type — it bootstraps your field definitions from real documents.
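
The select-then-preview step can be pictured as filtering suggestions before they are imported. This is a hedged sketch: the dictionary keys, the `selected` flag, the sample field values, and the `{"fields": [...]}` envelope are all assumptions made for illustration. The JSON schema shown in the actual preview may be shaped differently.

```python
import json

# Hypothetical suggestions, using the attributes the review step lists.
suggestions = [
    {"name": "invoice_number", "display_name": "Invoice Number",
     "data_type": "PLAIN_TEXT", "occurrence": "REQUIRED_ONCE",
     "description": "Identifier printed on the invoice", "selected": True},
    {"name": "po_number", "display_name": "PO Number",
     "data_type": "PLAIN_TEXT", "occurrence": "OPTIONAL_ONCE",
     "description": "Purchase order reference", "selected": False},
    {"name": "invoice_date", "display_name": "Invoice Date",
     "data_type": "DATETIME", "occurrence": "REQUIRED_ONCE",
     "description": "Date the invoice was issued", "selected": True},
]


def build_schema(items):
    """Keep only checked suggestions and drop the UI-only 'selected' flag."""
    return {
        "fields": [
            {key: value for key, value in item.items() if key != "selected"}
            for item in items
            if item["selected"]
        ]
    }


schema = build_schema(suggestions)
print(json.dumps(schema, indent=2))
```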

Import Training Documents

Navigate to the Documents tab and click Import Documents. Upload your training documents:

Supported formats: PDF, PNG, JPEG, TIFF
Recommended minimum: 10 examples for initial training
Best practice: Include diverse document variations (different vendors, layouts, edge cases)

Documents can be uploaded via file picker, drag-and-drop, or imported from S3 storage.
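
Before importing, it can help to sanity-check a batch of files locally. The following is a minimal sketch, not an M3 Forge utility: the function name is hypothetical, and the mapping from the listed formats to file suffixes (for example `.jpg` and `.tif`) is an assumption.

```python
from pathlib import Path

# Assumed suffixes for the listed formats: PDF, PNG, JPEG, TIFF.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
RECOMMENDED_MINIMUM = 10


def check_training_files(paths):
    """Split paths by suffix support and flag a set that is too small."""
    supported, unsupported = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path)
        else:
            unsupported.append(path)
    return {
        "supported": supported,
        "unsupported": unsupported,
        "enough_examples": len(supported) >= RECOMMENDED_MINIMUM,
    }


files = [f"invoice_{i:02d}.pdf" for i in range(9)] + ["scan.TIFF", "notes.docx"]
report = check_training_files(files)
print(report["unsupported"], report["enough_examples"])
```

A suffix check says nothing about layout diversity, so it complements rather than replaces the best-practice advice above.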

Label Examples

Click Start Annotating or select a document to open the labeling interface.

The labeling interface provides a two-panel layout:

Figure: Extractor labeling interface with the document canvas showing bounding boxes and the right-panel field list with annotation counts.

Left panel — Document canvas:

Multi-page document viewer with zoom (0.25x to 3x)
Three drawing modes: Draw, Select, Move
Draw bounding boxes around field values
Page navigation with previous/next buttons

Right panel — Annotation panel:

Field list with assigned colors and annotation counts
Select a field, then draw a box on the document to annotate
Keyboard shortcuts: 1-9 to quickly select fields
Delete annotations with the Delete or Backspace key

AI-assisted labeling:

Auto-annotate — AI suggests field values from the document
Auto-layout — AI positions annotations on the page
View raw AI extraction results for comparison

Workflow per document:

Select a field from the annotation panel (or press 1-9)
Draw a bounding box around the field value on the document
Repeat for all fields on all pages
Click Complete & Next to save and move to the next document
Skip documents with no extractable data

Train Model

Navigate to Training Jobs and click Start Training. See Training for detailed configuration.

Evaluate Results

After training, the Evaluation dashboard shows:

Overall metrics:

Accuracy, Precision, Recall, F1 Score (as percentages)

Figure: Extractor evaluation dashboard showing per-field accuracy, precision, recall, and F1 scores with color-coded badges.

Per-field breakdown table:

Field	Accuracy	Precision	Recall	F1	TP	FP	FN
invoice_number	95.2%	94.8%	95.6%	95.2%	43	2	2
date	92.1%	91.5%	92.7%	92.1%	38	4	3

Color-coded badges indicate metric strength: green (≥90%), yellow (70% to <90%), red (<70%).
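
For readers who want to relate the TP/FP/FN columns to the percentage columns, here is a small sketch using the standard definitions of precision, recall, and F1. It is an assumption that the dashboard uses these same definitions, and accuracy is left out because its definition is not stated here. The counts are hypothetical and are not taken from the table above. `badge_color` simply restates the badge thresholds.

```python
def prf1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts (as fractions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def badge_color(percent):
    """Map a metric percentage to the badge color thresholds."""
    if percent >= 90:
        return "green"
    if percent >= 70:
        return "yellow"
    return "red"


# Hypothetical counts for one field.
precision, recall, f1 = prf1(tp=45, fp=5, fn=15)
for name, value in (("precision", precision), ("recall", recall), ("f1", f1)):
    print(f"{name}: {value * 100:.1f}% ({badge_color(value * 100)})")
```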

Training progress:

Epoch-by-epoch metrics: Train Loss, Validation Loss, Train Accuracy, Validation Accuracy, Learning Rate
Use these metrics to assess convergence and detect overfitting, for example validation loss rising while training loss keeps falling
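
A simple heuristic for reading the epoch-by-epoch numbers is sketched below. It is not how M3 Forge itself evaluates training runs: the function names, the `patience` parameter, and the loss values are illustrative assumptions.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def looks_overfit(train_losses, val_losses, patience=2):
    """True if validation loss has not improved for `patience` epochs
    while training loss kept decreasing since the best epoch."""
    best = best_epoch(val_losses)
    epochs_since_best = len(val_losses) - 1 - best
    if epochs_since_best < patience:
        return False
    return train_losses[-1] < train_losses[best]


# Hypothetical run: validation loss bottoms out at epoch index 3.
train = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15]
val = [0.95, 0.70, 0.55, 0.52, 0.58, 0.66]
print(best_epoch(val), looks_overfit(train, val))
```

If the heuristic fires, the version from the best validation epoch is usually the more trustworthy one to evaluate further.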

Deploy

Activate the trained version for production use. See Training — Production Deployment.

Dashboard

The extractor dashboard provides an at-a-glance overview:

Dataset overview — Total documents, annotated count, unlabeled, auto-labeled
Annotation progress — Percentage bar showing labeling completion
Train/Test split — Three-way view showing Training, Test, and Unassigned document counts
Per-field statistics — Color-coded annotation counts per field
Processor details — ID, creation date, last updated
Quick actions — Test, View Logs, Configure
Version history — Collapsible list with active version badge

Field Data Types

Type	Description	Example Values
PLAIN_TEXT	Free-form text string	“INV-2024-001”, “Acme Corp”
DATETIME	Date and/or time value	“2024-03-15”, “March 15, 2024”
NUMBER	Numeric value	“42”, “3.14”, “1,000”
CURRENCY	Monetary amount	“$1,234.56”, “EUR 500.00”
ADDRESS	Physical address	“123 Main St, City, ST 12345”
CHECKBOX	Boolean checked/unchecked	Checked box, empty box
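
Downstream code often needs typed values rather than the raw strings shown in the examples. The sketch below shows one way to parse the NUMBER and CURRENCY example values. It is not part of the product: the helper names are hypothetical, and it assumes commas are thousands separators and the period is the decimal mark.

```python
import re
from decimal import Decimal


def parse_number(text):
    """Parse values such as '42', '3.14', or '1,000' into a Decimal."""
    return Decimal(text.replace(",", "").strip())


def parse_currency(text):
    """Split '$1,234.56' or 'EUR 500.00' into (unit, Decimal amount)."""
    match = re.fullmatch(r"\s*([^\d\s.,-]*)\s*(-?[\d,]+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized amount: {text!r}")
    unit, amount = match.groups()
    return unit, Decimal(amount.replace(",", ""))


print(parse_number("1,000"))
print(parse_currency("$1,234.56"))
print(parse_currency("EUR 500.00"))
```

`Decimal` is used instead of `float` so that monetary amounts are not subject to binary rounding.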

Field Occurrence Patterns

Pattern	Description	When to Use
OPTIONAL_ONCE	Field may or may not appear, at most once	Optional reference numbers, notes
OPTIONAL_MULTIPLE	Field may appear zero or more times	Variable line items
REQUIRED_ONCE	Field must appear exactly once	Invoice number, date
REQUIRED_MULTIPLE	Field must appear one or more times	At least one line item required
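
The four patterns amount to a minimum and maximum count per field. A minimal sketch of checking an extraction result against them follows. The `OCCURRENCE_BOUNDS` table and `occurrence_ok` function are illustrative names, not an M3 Forge API.

```python
# (minimum, maximum) occurrences per pattern; None means no upper bound.
OCCURRENCE_BOUNDS = {
    "OPTIONAL_ONCE": (0, 1),
    "OPTIONAL_MULTIPLE": (0, None),
    "REQUIRED_ONCE": (1, 1),
    "REQUIRED_MULTIPLE": (1, None),
}


def occurrence_ok(pattern, count):
    """Check whether `count` extracted values satisfy the pattern."""
    low, high = OCCURRENCE_BOUNDS[pattern]
    return count >= low and (high is None or count <= high)


# A missing invoice number violates REQUIRED_ONCE; zero line items are
# acceptable under OPTIONAL_MULTIPLE.
print(occurrence_ok("REQUIRED_ONCE", 0))
print(occurrence_ok("OPTIONAL_MULTIPLE", 0))
```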

Extraction Methods

Method	Description	When to Use
EXTRACT	Pull the raw field value from the document	Most fields — text, numbers, dates
NORMALIZE	Extract and standardize format	Dates to ISO format, phone numbers
CLASSIFY	Categorize the field value	Document type indicators, status fields
DETECT	Detect presence/absence	Checkboxes, signatures, stamps
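
To make the difference between EXTRACT and NORMALIZE concrete, the sketch below turns a raw date string into ISO format. It only illustrates the idea and is not the product's normalizer: `normalize_date` and the list of accepted formats are assumptions, and month-name parsing with `%B` depends on an English locale.

```python
from datetime import datetime

# Assumed input formats; extend to match the documents being processed.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize_date(text):
    """Return the date in ISO 8601 form, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


raw_value = "March 15, 2024"               # what EXTRACT would return
normalized = normalize_date(raw_value)      # what NORMALIZE aims for
print(raw_value, "->", normalized)
```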

Backend Architecture

Custom extractors in M3 Forge map to the marie-ai extraction pipeline:

Foundation mode — Zero-shot LLM extraction using field definitions as prompts. No training required.
Fine-tune mode — LayoutLMv3-based models trained on your labeled examples for high accuracy.
Hybrid mode — FAISS-based semantic matching combined with fuzzy string matching for robust extraction.
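
As a rough picture of Foundation mode, field definitions and prompt hints can be assembled into a single instruction for an LLM. The sketch below is an assumption about the general shape only: `build_prompt`, the dictionary keys, and the prompt wording are hypothetical, and the actual prompt templates used by the marie-ai pipeline are not shown in this page.

```python
def build_prompt(document_prompt, fields):
    """Combine a document-level prompt with one line per field definition."""
    lines = [document_prompt.strip(), "", "Extract the following fields:"]
    for field_def in fields:
        line = f"- {field_def['name']} ({field_def['data_type']})"
        hint = field_def.get("prompt_hint")
        if hint:
            line += f": {hint}"
        lines.append(line)
    return "\n".join(lines)


fields = [
    {"name": "invoice_number", "data_type": "PLAIN_TEXT",
     "prompt_hint": "The invoice number usually appears in the top-right corner"},
    {"name": "total", "data_type": "CURRENCY"},
]
prompt = build_prompt("This is a vendor invoice.", fields)
print(prompt)
```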

The extraction pipeline supports multiple annotator types that can be combined:

Annotator	Approach	Best For
LLM	Generative AI with prompt templates	Complex extraction, varied layouts
Embedding	FAISS hybrid semantic + fuzzy matching	Consistent field labels, OCR text
Regex	Deterministic pattern matching	Structured IDs, codes, standardized formats
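
Two of these approaches are easy to illustrate with the standard library. The sketch below is not the marie-ai implementation. The regex pattern is an assumed ID format based on the “INV-2024-001” example, and `difflib.SequenceMatcher` stands in for the fuzzy-matching half of the Embedding annotator. The FAISS semantic search is not shown.

```python
import re
from difflib import SequenceMatcher

# Assumed invoice ID shape: INV-<4 digits>-<3 digits>.
INVOICE_ID = re.compile(r"\bINV-\d{4}-\d{3}\b")


def regex_annotate(text):
    """Deterministic matching: return every invoice-like ID in the text."""
    return INVOICE_ID.findall(text)


def fuzzy_find_label(label, ocr_tokens, threshold=0.8):
    """Return the OCR token most similar to `label`, or None if no token
    reaches the similarity threshold."""
    best_token, best_score = None, 0.0
    for token in ocr_tokens:
        score = SequenceMatcher(None, label.lower(), token.lower()).ratio()
        if score > best_score:
            best_token, best_score = token, score
    return best_token if best_score >= threshold else None


print(regex_annotate("Ref INV-2024-001, due on receipt"))
# OCR often confuses similar glyphs, so the label is matched approximately.
print(fuzzy_find_label("Invoice Number", ["lnvoice Numbcr", "Total Due", "Date"]))
```

Regex matching is exact and fast but brittle to OCR noise, while fuzzy matching tolerates small character errors at the cost of a tunable threshold.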

The Schema Generator uses Foundation mode (zero-shot LLM) to suggest fields. For production accuracy, train a fine-tuned model with labeled examples.

Best Practices

Start with Schema Generator — Upload 3-5 representative documents to bootstrap field definitions
Label diverse examples — Include different vendors, layouts, and edge cases
Use prompt hints — Guide the AI with field-specific context (“usually in top-right corner”)
Set correct occurrence — Use REQUIRED for mandatory fields to catch extraction failures
Leverage auto-annotate — Let AI suggest initial annotations, then correct mistakes
Review per-field metrics — Focus improvement efforts on fields with lowest F1 scores
Add child fields — Use hierarchies for table extraction (parent = table, children = columns)

Next Steps

Configure Annotators for advanced extraction pipelines
Learn about Training job management and hyperparameters
Route low-confidence extractions to HITL review
Integrate extractors into Workflows