Custom Extractor
Identify and extract specific data from your documents. Train a custom ML model to identify and extract custom entities, checkboxes, and other form elements from documents with as few as 10 examples.
Overview
The Custom Extractor trains field-level extraction models on your documents. Define the fields you need, label a few examples, and M3 Forge trains a model that learns your specific document layouts and terminology.
Use cases:
- Invoice fields and line items (number, date, vendor, totals)
- Contract terms and clauses
- Form field values (names, addresses, dates)
- Table data extraction
- Checkbox and radio button detection
- Custom entity recognition
Creating a Custom Extractor
Create Processor
Navigate to Processors → Custom Processors and click Create processor on the Custom Extractor card. Enter a descriptive name (e.g., “ACME Invoice Extractor”) in the slide panel and click Create.
Define Fields
Open your extractor and navigate to the Field Management tab. This is where you define what data to extract.

Click Add Field to configure each extraction target:
| Setting | Description | Example |
|---|---|---|
| Field Name | Machine-friendly identifier | invoice_number |
| Display Name | Human-readable label | “Invoice Number” |
| Data Type | Value type | PLAIN_TEXT, DATETIME, NUMBER, CURRENCY, ADDRESS, CHECKBOX |
| Occurrence | How often the field appears | OPTIONAL_ONCE, OPTIONAL_MULTIPLE, REQUIRED_ONCE, REQUIRED_MULTIPLE |
| Method | Extraction approach | EXTRACT, NORMALIZE, CLASSIFY, DETECT |
| Prompt Hint | Guidance for AI extraction | “The invoice number usually appears in the top-right corner” |
| Color | Visual indicator in labeling UI | Color swatch |
Fields support parent/child hierarchies for nested extraction (e.g., a “Line Items” parent with “Description”, “Quantity”, “Price” children).
Document Prompt: Optionally add a document-level prompt that provides context for the AI about what kind of document it’s processing and what to look for.
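As a rough illustration of the settings above, the sketch below models a field definition and a parent/child hierarchy in plain Python. The class and function names (`FieldDef`, `flatten`) are hypothetical and are not part of any M3 Forge API; only the enum values come from the table above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataType(str, Enum):
    PLAIN_TEXT = "PLAIN_TEXT"
    DATETIME = "DATETIME"
    NUMBER = "NUMBER"
    CURRENCY = "CURRENCY"
    ADDRESS = "ADDRESS"
    CHECKBOX = "CHECKBOX"


class Occurrence(str, Enum):
    OPTIONAL_ONCE = "OPTIONAL_ONCE"
    OPTIONAL_MULTIPLE = "OPTIONAL_MULTIPLE"
    REQUIRED_ONCE = "REQUIRED_ONCE"
    REQUIRED_MULTIPLE = "REQUIRED_MULTIPLE"


@dataclass
class FieldDef:
    """Hypothetical in-memory model of one extraction field."""
    name: str                      # machine-friendly identifier
    display_name: str              # human-readable label
    data_type: DataType = DataType.PLAIN_TEXT
    occurrence: Occurrence = Occurrence.OPTIONAL_ONCE
    prompt_hint: Optional[str] = None
    children: List["FieldDef"] = field(default_factory=list)


def flatten(f: FieldDef, prefix: str = "") -> List[str]:
    """Return dotted paths for a field and all of its descendants."""
    path = f"{prefix}.{f.name}" if prefix else f.name
    paths = [path]
    for child in f.children:
        paths.extend(flatten(child, path))
    return paths


# A "Line Items" parent with three child columns, as in the example above.
line_items = FieldDef(
    name="line_items",
    display_name="Line Items",
    occurrence=Occurrence.OPTIONAL_MULTIPLE,
    children=[
        FieldDef("description", "Description"),
        FieldDef("quantity", "Quantity", DataType.NUMBER),
        FieldDef("price", "Price", DataType.CURRENCY),
    ],
)

print(flatten(line_items))
```

Dotted paths such as `line_items.price` are one possible way to address nested fields; the product may use a different naming scheme.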
Generate Schema with AI
For faster setup, use the Schema Generator panel:
- Click Generate Schema in the field management toolbar
- Upload up to 5 representative documents (drag-and-drop or S3 import)
- AI analyzes documents and suggests extraction fields
- Review generated fields — each shows name, display name, data type, occurrence, and description
- Select which fields to import using checkboxes
- Preview the JSON schema of selected fields
- Click Apply to create all selected fields at once
The Schema Generator is particularly useful when starting with a new document type — it bootstraps your field definitions from real documents.
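The exact JSON schema shown in the preview is not documented here. As a sketch only, the following shows one plausible way selected fields could be expressed as a JSON Schema object. The type mapping, the dictionary keys, and the `to_json_schema` helper are all assumptions made for illustration.

```python
import json

# Assumed mapping from the documented data types to JSON Schema types.
JSON_TYPES = {
    "PLAIN_TEXT": "string",
    "DATETIME": "string",
    "NUMBER": "number",
    "CURRENCY": "string",
    "ADDRESS": "string",
    "CHECKBOX": "boolean",
}


def to_json_schema(fields):
    """Build a JSON Schema-style dict from a list of field dicts."""
    properties, required = {}, []
    for f in fields:
        prop = {"type": JSON_TYPES[f["data_type"]], "title": f["display_name"]}
        # *_MULTIPLE fields can repeat, so wrap them in an array.
        if f["occurrence"].endswith("_MULTIPLE"):
            prop = {"type": "array", "items": prop}
        properties[f["name"]] = prop
        if f["occurrence"].startswith("REQUIRED"):
            required.append(f["name"])
    return {"type": "object", "properties": properties, "required": required}


selected = [
    {"name": "invoice_number", "display_name": "Invoice Number",
     "data_type": "PLAIN_TEXT", "occurrence": "REQUIRED_ONCE"},
    {"name": "line_total", "display_name": "Line Total",
     "data_type": "CURRENCY", "occurrence": "OPTIONAL_MULTIPLE"},
]

schema = to_json_schema(selected)
print(json.dumps(schema, indent=2))
```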
Import Training Documents
Navigate to the Documents tab and click Import Documents. Upload your training documents:
- Supported formats: PDF, PNG, JPEG, TIFF
- Minimum: 10 examples recommended for initial training
- Best practice: Include diverse document variations (different vendors, layouts, edge cases)
Documents can be uploaded via file picker, drag-and-drop, or imported from S3 storage.
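Before uploading, it can help to check a batch locally against the format list and the recommended minimum. The sketch below is a hypothetical pre-flight check, not an M3 Forge feature; the extension set (including `.jpg` and `.tif` as aliases) is an assumption.

```python
from pathlib import Path

# Assumed file extensions for the documented formats (PDF, PNG, JPEG, TIFF).
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
MIN_RECOMMENDED = 10  # recommended minimum for initial training


def preflight(paths):
    """Split file names into supported and unsupported by extension."""
    supported, unsupported = [], []
    for p in paths:
        if Path(p).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(p)
        else:
            unsupported.append(p)
    return {
        "supported": supported,
        "unsupported": unsupported,
        "meets_minimum": len(supported) >= MIN_RECOMMENDED,
    }


report = preflight(["a.pdf", "b.PNG", "notes.docx"])
print(report)
```

The check looks only at file names; it does not open the files or verify their contents.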
Label Examples
Click Start Annotating or select a document to open the labeling interface.
The labeling interface provides a two-panel layout:

Left panel — Document canvas:
- Multi-page document viewer with zoom (0.25x to 3x)
- Three drawing modes: Draw, Select, Move
- Draw bounding boxes around field values
- Page navigation with previous/next buttons
Right panel — Annotation panel:
- Field list with assigned colors and annotation counts
- Select a field, then draw a box on the document to annotate
- Keyboard shortcuts: 1-9 to quickly select fields
- Delete annotations with Delete/Backspace key
AI-assisted labeling:
- Auto-annotate — AI suggests field values from the document
- Auto-layout — AI positions annotations on the page
- View raw AI extraction results for comparison
Workflow per document:
- Select a field from the annotation panel (or press 1-9)
- Draw a bounding box around the field value on the document
- Repeat for all fields on all pages
- Click Complete & Next to save and move to the next document
- Skip documents with no extractable data
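Because boxes are drawn on a zoomable canvas, an annotation is typically stored in page coordinates rather than screen coordinates. The sketch below shows that conversion under simple assumptions (uniform zoom, no scroll offset). The `Box` type and `to_page_coords` function are hypothetical and do not describe how M3 Forge stores annotations.

```python
from dataclasses import dataclass

MIN_ZOOM, MAX_ZOOM = 0.25, 3.0  # zoom range of the document viewer


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box on a given page."""
    page: int
    x: float
    y: float
    width: float
    height: float


def to_page_coords(canvas_box: Box, zoom: float) -> Box:
    """Convert a box drawn at `zoom` into zoom-independent page units."""
    if not MIN_ZOOM <= zoom <= MAX_ZOOM:
        raise ValueError(f"zoom {zoom} outside [{MIN_ZOOM}, {MAX_ZOOM}]")
    return Box(
        page=canvas_box.page,
        x=canvas_box.x / zoom,
        y=canvas_box.y / zoom,
        width=canvas_box.width / zoom,
        height=canvas_box.height / zoom,
    )


drawn = Box(page=1, x=200, y=100, width=80, height=20)
stored = to_page_coords(drawn, zoom=2.0)
print(stored)
```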
Train Model
Navigate to Training Jobs and click Start Training. See Training for detailed configuration.
Evaluate Results
After training, the Evaluation dashboard shows:
Overall metrics:
- Accuracy, Precision, Recall, F1 Score (as percentages)

Per-field breakdown table:
| Field | Accuracy | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| invoice_number | 95.2% | 94.8% | 95.6% | 95.2% | 43 | 2 | 2 |
| date | 92.1% | 91.5% | 92.7% | 92.1% | 38 | 4 | 3 |
Color-coded badges indicate metric strength: green (≥90%), yellow (70% to <90%), red (<70%).
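For reference, precision, recall, and F1 follow from the TP, FP, and FN counts using the standard definitions, sketched below together with the badge thresholds. The counts in the example are made up, and the function names are illustrative. How the dashboard computes per-field accuracy is not specified here, so it is left out.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions; each metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def badge(percent: float) -> str:
    """Map a percentage to the badge color thresholds described above."""
    if percent >= 90:
        return "green"
    if percent >= 70:
        return "yellow"
    return "red"


# Hypothetical counts for one field.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%} badge={badge(round(f1 * 100, 1))}")
```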
Training progress:
- Epoch-by-epoch metrics: Train Loss, Validation Loss, Train Accuracy, Validation Accuracy, Learning Rate
- Use these metrics to assess convergence and detect overfitting
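A common sign of overfitting is validation loss rising while training loss keeps falling. The sketch below applies that heuristic to per-epoch loss lists. The function name, the `patience` parameter, and the sample numbers are illustrative assumptions, not values or logic taken from M3 Forge.

```python
def first_overfit_epoch(train_loss, val_loss, patience=2):
    """Return the 1-based epoch where validation loss starts a run of
    `patience` consecutive increases while training loss decreases,
    or None if no such run exists."""
    streak = 0
    for i in range(1, len(val_loss)):
        diverging = val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            # i is the 0-based index of the last epoch in the run;
            # convert to the 1-based epoch where the run began.
            return i - patience + 2
    return None


# Made-up curves: validation loss bottoms out at epoch 3, then rises.
train = [1.00, 0.70, 0.50, 0.40, 0.30]
val = [1.10, 0.80, 0.70, 0.75, 0.80]
print(first_overfit_epoch(train, val))
```

If the function returns an epoch, the checkpoint just before it is usually a reasonable candidate to keep.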
Deploy
Activate the trained version for production use. See Training — Production Deployment.
Dashboard
The extractor dashboard provides an at-a-glance overview:

- Dataset overview — Total documents, annotated count, unlabeled, auto-labeled
- Annotation progress — Percentage bar showing labeling completion
- Train/Test split — Three-way view showing Training, Test, and Unassigned document counts
- Per-field statistics — Color-coded annotation counts per field
- Processor details — ID, creation date, last updated
- Quick actions — Test, View Logs, Configure
- Version history — Collapsible list with active version badge
Field Data Types
| Type | Description | Example Values |
|---|---|---|
| PLAIN_TEXT | Free-form text string | “INV-2024-001”, “Acme Corp” |
| DATETIME | Date and/or time value | “2024-03-15”, “March 15, 2024” |
| NUMBER | Numeric value | “42”, “3.14”, “1,000” |
| CURRENCY | Monetary amount | “$1,234.56”, “EUR 500.00” |
| ADDRESS | Physical address | “123 Main St, City, ST 12345” |
| CHECKBOX | Boolean checked/unchecked | Checked box, empty box |
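Extracted NUMBER and CURRENCY values arrive as text, so downstream code usually has to parse them. The sketch below handles the example values from the table, assuming US-style separators (comma for thousands, period for decimals). The helper names are hypothetical.

```python
import re
from decimal import Decimal


def parse_number(text: str) -> Decimal:
    """Parse values such as "42", "3.14", or "1,000"."""
    return Decimal(text.replace(",", "").strip())


# Optional currency symbol or code, then an amount.
_CURRENCY = re.compile(r"^\s*(?P<unit>[^\d\s.,-]+)?\s*(?P<amount>-?[\d,]*\.?\d+)\s*$")


def parse_currency(text: str):
    """Return (unit, amount); unit is "" when no symbol or code is present."""
    m = _CURRENCY.match(text)
    if m is None:
        raise ValueError(f"not a currency value: {text!r}")
    return m.group("unit") or "", Decimal(m.group("amount").replace(",", ""))


print(parse_number("1,000"))
print(parse_currency("$1,234.56"))
print(parse_currency("EUR 500.00"))
```

`Decimal` is used instead of `float` so monetary amounts are not subject to binary rounding.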
Field Occurrence Patterns
| Pattern | Description | When to Use |
|---|---|---|
| OPTIONAL_ONCE | Field may or may not appear, at most once | Optional reference numbers, notes |
| OPTIONAL_MULTIPLE | Field may appear zero or more times | Variable line items |
| REQUIRED_ONCE | Field must appear exactly once | Invoice number, date |
| REQUIRED_MULTIPLE | Field must appear one or more times | At least one line item required |
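The four patterns reduce to a minimum and maximum count per field. The sketch below expresses that as a lookup table and a check that could be run over extraction results. The names `OCCURRENCE_BOUNDS` and `check_occurrence` are illustrative only.

```python
# (minimum, maximum) occurrences; None means no upper bound.
OCCURRENCE_BOUNDS = {
    "OPTIONAL_ONCE": (0, 1),
    "OPTIONAL_MULTIPLE": (0, None),
    "REQUIRED_ONCE": (1, 1),
    "REQUIRED_MULTIPLE": (1, None),
}


def check_occurrence(pattern: str, count: int) -> bool:
    """Return True if `count` extracted values satisfies `pattern`."""
    low, high = OCCURRENCE_BOUNDS[pattern]
    return count >= low and (high is None or count <= high)


print(check_occurrence("REQUIRED_ONCE", 0))      # missing mandatory field
print(check_occurrence("OPTIONAL_MULTIPLE", 7))  # any number of line items
```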
Extraction Methods
| Method | Description | When to Use |
|---|---|---|
| EXTRACT | Pull the raw field value from the document | Most fields — text, numbers, dates |
| NORMALIZE | Extract and standardize format | Dates to ISO format, phone numbers |
| CLASSIFY | Categorize the field value | Document type indicators, status fields |
| DETECT | Detect presence/absence | Checkboxes, signatures, stamps |
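To make the difference between EXTRACT and NORMALIZE concrete, the sketch below takes a raw date string, as EXTRACT might return it, and standardizes it to ISO 8601. The list of accepted formats is an assumption, and `normalize_date` is a hypothetical helper rather than the product's implementation.

```python
from datetime import datetime

# Assumed input formats; extend this tuple to match your documents.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def normalize_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("March 15, 2024"))
print(normalize_date("2024-03-15"))
```

Month names are parsed using the process locale, so `%B` matches English names only under the default or an English locale.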
Backend Architecture
Custom extractors in M3 Forge map to the marie-ai extraction pipeline:
- Foundation mode — Zero-shot LLM extraction using field definitions as prompts. No training required.
- Fine-tune mode — LayoutLMv3-based models trained on your labeled examples for high accuracy.
- Hybrid mode — FAISS-based semantic matching combined with fuzzy string matching for robust extraction.
The extraction pipeline supports multiple annotator types that can be combined:
| Annotator | Approach | Best For |
|---|---|---|
| LLM | Generative AI with prompt templates | Complex extraction, varied layouts |
| Embedding | FAISS hybrid semantic + fuzzy matching | Consistent field labels, OCR text |
| Regex | Deterministic pattern matching | Structured IDs, codes, standardized formats |
The Schema Generator uses Foundation mode (zero-shot LLM) to suggest fields. For production accuracy, train a fine-tuned model with labeled examples.
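The sketch below illustrates two of the annotator ideas in miniature: deterministic regex matching for structured IDs, and fuzzy string matching to find a field label despite OCR errors. It uses only the Python standard library (`re` and `difflib`) as a stand-in. It does not use FAISS or the marie-ai API, and the pattern, threshold, and function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumed ID format, based on the "INV-2024-001" example value.
INVOICE_ID = re.compile(r"\bINV-\d{4}-\d{3}\b")


def regex_annotate(text: str):
    """Return every invoice-ID-shaped token in the text."""
    return INVOICE_ID.findall(text)


def fuzzy_find_label(lines, label: str, threshold: float = 0.8):
    """Return the line most similar to `label`, or None below `threshold`."""
    best_line, best_score = None, 0.0
    for line in lines:
        score = SequenceMatcher(None, label.lower(), line.lower()).ratio()
        if score > best_score:
            best_line, best_score = line, score
    return best_line if best_score >= threshold else None


ocr_lines = ["Invoce Number", "Total Due"]  # "Invoce" simulates an OCR typo
print(regex_annotate("Ref INV-2024-001 due on receipt"))
print(fuzzy_find_label(ocr_lines, "Invoice Number"))
```

Regex matching is exact and cheap, which suits standardized formats; fuzzy matching tolerates small text errors at the cost of a tunable threshold.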
Best Practices
- Start with Schema Generator — Upload 3-5 representative documents to bootstrap field definitions
- Label diverse examples — Include different vendors, layouts, and edge cases
- Use prompt hints — Guide the AI with field-specific context (“usually in top-right corner”)
- Set correct occurrence — Use REQUIRED_ONCE or REQUIRED_MULTIPLE for mandatory fields so that missing values surface as extraction failures
- Leverage auto-annotate — Let AI suggest initial annotations, then correct mistakes
- Review per-field metrics — Focus improvement efforts on fields with lowest F1 scores
- Add child fields — Use hierarchies for table extraction (parent = table, children = columns)
Next Steps
- Configure Annotators for advanced extraction pipelines
- Learn about Training job management and hyperparameters
- Route low-confidence extractions to HITL review
- Integrate extractors into Workflows