
Custom Extractor

Identify and extract specific data from your documents. Train a custom ML model to identify and extract custom entities, checkboxes, and other form elements from documents with as few as 10 examples.

Overview

The Custom Extractor trains field-level extraction models on your documents. Define the fields you need, label a few examples, and M3 Forge trains a model that learns your specific document layouts and terminology.

Use cases:

  • Invoice fields and line items (number, date, vendor, totals)
  • Contract terms and clauses
  • Form field values (names, addresses, dates)
  • Table data extraction
  • Checkbox and radio button detection
  • Custom entity recognition

Creating a Custom Extractor

Create Processor

Navigate to Processors > Custom Processors and click Create processor on the Custom Extractor card. Enter a descriptive name (e.g., “ACME Invoice Extractor”) in the slide panel and click Create.

Define Fields

Open your extractor and navigate to the Field Management tab. This is where you define what data to extract.

[Figure: Field Management tab showing extraction fields table with data types, occurrence patterns, and color indicators]

Click Add Field to configure each extraction target:

| Setting | Description | Example |
|---|---|---|
| Field Name | Machine-friendly identifier | invoice_number |
| Display Name | Human-readable label | “Invoice Number” |
| Data Type | Value type | PLAIN_TEXT, DATETIME, NUMBER, CURRENCY, ADDRESS, CHECKBOX |
| Occurrence | How often the field appears | OPTIONAL_ONCE, OPTIONAL_MULTIPLE, REQUIRED_ONCE, REQUIRED_MULTIPLE |
| Method | Extraction approach | EXTRACT, NORMALIZE, CLASSIFY, DETECT |
| Prompt Hint | Guidance for AI extraction | “The invoice number usually appears in the top-right corner” |
| Color | Visual indicator in labeling UI | Color swatch |

Fields support parent/child hierarchies for nested extraction (e.g., a “Line Items” parent with “Description”, “Quantity”, “Price” children).

Document Prompt: Optionally add a document-level prompt that provides context for the AI about what kind of document it’s processing and what to look for.
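
The field settings and parent/child hierarchy described above can be pictured as a small data model. The following is a minimal sketch only: the class names (`FieldDef`, `DataType`, `Occurrence`, `Method`), the `flatten` helper, and the dotted-path convention are hypothetical and are not M3 Forge's actual API. The enum values are taken from the settings table. The data type chosen for the parent field, the occurrence values for the children, and the wording of the document prompt are also assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataType(Enum):
    PLAIN_TEXT = "PLAIN_TEXT"
    DATETIME = "DATETIME"
    NUMBER = "NUMBER"
    CURRENCY = "CURRENCY"
    ADDRESS = "ADDRESS"
    CHECKBOX = "CHECKBOX"


class Occurrence(Enum):
    OPTIONAL_ONCE = "OPTIONAL_ONCE"
    OPTIONAL_MULTIPLE = "OPTIONAL_MULTIPLE"
    REQUIRED_ONCE = "REQUIRED_ONCE"
    REQUIRED_MULTIPLE = "REQUIRED_MULTIPLE"


class Method(Enum):
    EXTRACT = "EXTRACT"
    NORMALIZE = "NORMALIZE"
    CLASSIFY = "CLASSIFY"
    DETECT = "DETECT"


@dataclass
class FieldDef:
    """Hypothetical in-memory form of one row in the Field Management table."""
    name: str
    display_name: str
    data_type: DataType
    occurrence: Occurrence
    method: Method = Method.EXTRACT
    prompt_hint: Optional[str] = None
    children: List["FieldDef"] = field(default_factory=list)


def flatten(fields: List[FieldDef], prefix: str = "") -> List[str]:
    """Return a dotted path for every field, parents before their children."""
    paths = []
    for f in fields:
        path = prefix + f.name
        paths.append(path)
        paths.extend(flatten(f.children, prefix=path + "."))
    return paths


# Illustrative schema: one flat field plus a "Line Items" parent with children.
schema = [
    FieldDef("invoice_number", "Invoice Number", DataType.PLAIN_TEXT,
             Occurrence.REQUIRED_ONCE,
             prompt_hint="The invoice number usually appears in the top-right corner"),
    FieldDef("line_items", "Line Items", DataType.PLAIN_TEXT,
             Occurrence.OPTIONAL_MULTIPLE,
             children=[
                 FieldDef("description", "Description", DataType.PLAIN_TEXT,
                          Occurrence.REQUIRED_ONCE),
                 FieldDef("quantity", "Quantity", DataType.NUMBER,
                          Occurrence.REQUIRED_ONCE),
                 FieldDef("price", "Price", DataType.CURRENCY,
                          Occurrence.REQUIRED_ONCE),
             ]),
]

# Assumed wording; the document prompt is free text.
document_prompt = "This is a vendor invoice. Extract header fields and every line item."

print(flatten(schema))
```

Flattening the hierarchy this way is one convenient way to check that every child field has a unique path before labeling starts.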

Generate Schema with AI

For faster setup, use the Schema Generator panel:

  1. Click Generate Schema in the field management toolbar
  2. Upload up to 5 representative documents (drag-and-drop or S3 import)
  3. AI analyzes documents and suggests extraction fields
  4. Review generated fields: each shows name, display name, data type, occurrence, and description
  5. Select which fields to import using checkboxes
  6. Preview the JSON schema of selected fields
  7. Click Apply to create all selected fields at once

[Figure: Schema Generator panel showing AI-suggested fields with checkboxes for selection and JSON schema preview]

The Schema Generator is particularly useful when starting with a new document type — it bootstraps your field definitions from real documents.

Import Training Documents

Navigate to the Documents tab and click Import Documents. Upload your training documents:

  • Supported formats: PDF, PNG, JPEG, TIFF
  • Minimum: 10 examples recommended for initial training
  • Best practice: Include diverse document variations (different vendors, layouts, edge cases)

Documents can be uploaded via file picker, drag-and-drop, or imported from S3 storage.
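
Before uploading, it can help to check a batch against the format list and the recommended minimum above. This is a minimal local sketch, not part of M3 Forge: the function name `check_training_files` is hypothetical, and the mapping from the listed formats to file extensions (for example treating both `.jpg` and `.jpeg` as JPEG) is an assumption.

```python
from pathlib import Path

# Assumed extensions for the formats listed above (PDF, PNG, JPEG, TIFF).
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}
RECOMMENDED_MINIMUM = 10  # "10 examples recommended for initial training"


def check_training_files(filenames):
    """Split filenames by supported extension and flag whether the batch is large enough."""
    supported, rejected = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(name)
        else:
            rejected.append(name)
    return {
        "supported": supported,
        "rejected": rejected,
        "enough": len(supported) >= RECOMMENDED_MINIMUM,
    }


report = check_training_files(["acme_001.pdf", "scan_02.PNG", "notes.docx"])
print(report)
```

The check only looks at file names; it does not verify that a file's contents match its extension.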

Label Examples

Click Start Annotating or select a document to open the labeling interface.

The labeling interface provides a two-panel layout:

[Figure: Extractor labeling interface with document canvas showing bounding boxes and right-panel field list with annotation counts]

Left panel — Document canvas:

  • Multi-page document viewer with zoom (0.25x to 3x)
  • Three drawing modes: Draw, Select, Move
  • Draw bounding boxes around field values
  • Page navigation with previous/next buttons

Right panel — Annotation panel:

  • Field list with assigned colors and annotation counts
  • Select a field, then draw a box on the document to annotate
  • Keyboard shortcuts: 1-9 to quickly select fields
  • Delete annotations with Delete/Backspace key

AI-assisted labeling:

  • Auto-annotate — AI suggests field values from the document
  • Auto-layout — AI positions annotations on the page
  • View raw AI extraction results for comparison

Workflow per document:

  1. Select a field from the annotation panel (or press 1-9)
  2. Draw a bounding box around the field value on the document
  3. Repeat for all fields on all pages
  4. Click Complete & Next to save and move to the next document
  5. Skip documents with no extractable data
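
Because boxes are drawn on a zoomable canvas, each annotation presumably has to be stored in page coordinates that do not depend on the current zoom level. The sketch below illustrates that idea under stated assumptions: the `Box` and `Annotation` types and the `canvas_to_page` helper are hypothetical, and the conversion assumes canvas coordinates scale linearly with zoom with no pan offset. Only the zoom range (0.25x to 3x) comes from the description above.

```python
from dataclasses import dataclass

MIN_ZOOM, MAX_ZOOM = 0.25, 3.0  # zoom range of the document viewer


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Annotation:
    field_name: str  # field selected in the annotation panel
    page: int        # 1-based page number
    box: Box         # bounding box in unzoomed page coordinates


def clamp_zoom(zoom: float) -> float:
    """Keep a zoom factor inside the viewer's supported range."""
    return max(MIN_ZOOM, min(MAX_ZOOM, zoom))


def canvas_to_page(box: Box, zoom: float) -> Box:
    """Convert a box drawn at the given zoom into unzoomed page coordinates."""
    z = clamp_zoom(zoom)
    return Box(box.x / z, box.y / z, box.width / z, box.height / z)


# A box drawn while zoomed to 2x covers half as many page units per side.
drawn = Box(200.0, 100.0, 300.0, 40.0)
annotation = Annotation("invoice_number", page=1, box=canvas_to_page(drawn, zoom=2.0))
print(annotation)
```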

Train Model

Navigate to Training Jobs and click Start Training. See Training for detailed configuration.

Evaluate Results

After training, the Evaluation dashboard shows:

Overall metrics:

  • Accuracy, Precision, Recall, F1 Score (as percentages)

[Figure: Extractor evaluation dashboard showing per-field accuracy, precision, recall, F1 scores with color-coded badges]

Per-field breakdown table:

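| Field | Accuracy | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| invoice_number | 95.2% | 94.8% | 95.6% | 95.2% | 43 | 2 | 2 |
| date | 92.1% | 91.5% | 92.7% | 92.1% | 38 | 4 | 3 |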

Color-coded badges indicate metric strength: green (≥90%), yellow (≥70%), red (<70%).
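
For readers who want to relate the TP, FP, and FN columns to the percentage columns, the sketch below uses the textbook definitions of precision, recall, and F1, together with the badge thresholds stated above. It is an illustration only: the function names are hypothetical, the counts are made up, and the dashboard's exact computation (including how it derives Accuracy) is not described here, so its figures may be calculated differently.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Textbook precision, recall, and F1 as fractions in [0, 1]."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def badge_color(percent: float) -> str:
    """Map a metric percentage to the badge colors described above."""
    if percent >= 90:
        return "green"
    if percent >= 70:
        return "yellow"
    return "red"


# Hypothetical counts for one field.
p, r, f1 = precision_recall_f1(tp=46, fp=4, fn=14)
for name, value in (("precision", p), ("recall", r), ("f1", f1)):
    print(f"{name}: {value * 100:.1f}% -> {badge_color(value * 100)}")
```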

Training progress:

  • Epoch-by-epoch metrics: Train Loss, Validation Loss, Train Accuracy, Validation Accuracy, Learning Rate
  • Use to assess convergence and detect overfitting
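
One common way to read the epoch-by-epoch metrics is to look for the point where validation loss starts rising while training loss keeps falling. The sketch below applies that heuristic to two lists of losses. The function name, the `patience` parameter, and the sample numbers are all hypothetical; this is a generic rule of thumb, not how M3 Forge itself flags overfitting.

```python
def first_overfit_epoch(train_loss, val_loss, patience=2):
    """Return the 1-based epoch at which validation loss has risen for
    `patience` consecutive epochs while training loss kept falling,
    or None if that never happens."""
    streak = 0
    for i in range(1, min(len(train_loss), len(val_loss))):
        val_rising = val_loss[i] > val_loss[i - 1]
        train_falling = train_loss[i] < train_loss[i - 1]
        if val_rising and train_falling:
            streak += 1
            if streak >= patience:
                return i + 1
        else:
            streak = 0
    return None


# Made-up losses: validation loss turns upward after epoch 3.
train = [0.9, 0.6, 0.4, 0.3, 0.2]
val = [1.0, 0.7, 0.6, 0.65, 0.7]
print(first_overfit_epoch(train, val))
```

A larger `patience` tolerates more noise in the validation curve at the cost of flagging the divergence later.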

Deploy

Activate the trained version for production use. See Training — Production Deployment.

Dashboard

The extractor dashboard provides an at-a-glance overview:

[Figure: Extractor dashboard showing dataset overview, annotation progress bar, train/test split, and per-field statistics]

  • Dataset overview — Total documents, annotated count, unlabeled, auto-labeled
  • Annotation progress — Percentage bar showing labeling completion
  • Train/Test split — Three-way view showing Training, Test, and Unassigned document counts
  • Per-field statistics — Color-coded annotation counts per field
  • Processor details — ID, creation date, last updated
  • Quick actions — Test, View Logs, Configure
  • Version history — Collapsible list with active version badge

Field Data Types

| Type | Description | Example Values |
|---|---|---|
| PLAIN_TEXT | Free-form text string | “INV-2024-001”, “Acme Corp” |
| DATETIME | Date and/or time value | “2024-03-15”, “March 15, 2024” |
| NUMBER | Numeric value | “42”, “3.14”, “1,000” |
| CURRENCY | Monetary amount | “$1,234.56”, “EUR 500.00” |
| ADDRESS | Physical address | “123 Main St, City, ST 12345” |
| CHECKBOX | Boolean checked/unchecked | Checked box, empty box |
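
Extracted NUMBER and CURRENCY values arrive as text, so downstream code typically has to parse them. The sketch below handles the example values from the table using only the standard library. It is a hedged illustration: the function names are hypothetical, US-style thousands separators are assumed, and mapping "$" to USD is an assumption, since the symbol alone does not identify a currency.

```python
import re
from decimal import Decimal

# Assumed symbol mapping; "$" is ambiguous in practice.
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

# A three-letter code or a single symbol, followed by an amount such as 1,234.56.
_CURRENCY = re.compile(
    r"^\s*(?P<code>[A-Z]{3}|[$€£])\s*(?P<amount>[\d,]+(?:\.\d+)?)\s*$"
)


def parse_number(text: str) -> Decimal:
    """Parse a number such as "1,000" or "3.14", treating commas as thousands separators."""
    return Decimal(text.replace(",", "").strip())


def parse_currency(text: str):
    """Parse a value such as "$1,234.56" or "EUR 500.00" into (code, amount)."""
    match = _CURRENCY.match(text)
    if match is None:
        raise ValueError(f"unrecognized currency value: {text!r}")
    code = match.group("code")
    return SYMBOL_TO_CODE.get(code, code), parse_number(match.group("amount"))


print(parse_number("1,000"))
print(parse_currency("$1,234.56"))
print(parse_currency("EUR 500.00"))
```

`Decimal` is used instead of `float` so that monetary amounts keep their exact decimal digits.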

Field Occurrence Patterns

| Pattern | Description | When to Use |
|---|---|---|
| OPTIONAL_ONCE | Field may or may not appear, at most once | Optional reference numbers, notes |
| OPTIONAL_MULTIPLE | Field may appear zero or more times | Variable line items |
| REQUIRED_ONCE | Field must appear exactly once | Invoice number, date |
| REQUIRED_MULTIPLE | Field must appear one or more times | At least one line item required |
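
Each occurrence pattern amounts to a lower and upper bound on how many values a field may have in one document. The sketch below encodes the four patterns that way and checks a count against them. The `OCCURRENCE_BOUNDS` table and the `occurrence_ok` function are hypothetical names introduced for illustration; only the pattern names and their meanings come from the table above.

```python
# (minimum, maximum) number of values per document; None means no upper limit.
OCCURRENCE_BOUNDS = {
    "OPTIONAL_ONCE": (0, 1),
    "OPTIONAL_MULTIPLE": (0, None),
    "REQUIRED_ONCE": (1, 1),
    "REQUIRED_MULTIPLE": (1, None),
}


def occurrence_ok(pattern: str, count: int) -> bool:
    """Return True if `count` extracted values satisfy the occurrence pattern."""
    low, high = OCCURRENCE_BOUNDS[pattern]
    return count >= low and (high is None or count <= high)


# A REQUIRED_ONCE field with no extracted value signals an extraction failure.
print(occurrence_ok("REQUIRED_ONCE", 0))
print(occurrence_ok("OPTIONAL_MULTIPLE", 7))
```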

Extraction Methods

| Method | Description | When to Use |
|---|---|---|
| EXTRACT | Pull the raw field value from the document | Most fields: text, numbers, dates |
| NORMALIZE | Extract and standardize format | Dates to ISO format, phone numbers |
| CLASSIFY | Categorize the field value | Document type indicators, status fields |
| DETECT | Detect presence/absence | Checkboxes, signatures, stamps |
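
To make the difference between EXTRACT and NORMALIZE concrete, the sketch below takes raw date strings like the DATETIME examples in this guide and standardizes them to ISO 8601. It only illustrates the idea of normalization: the function name and the list of accepted formats are assumptions, and the sketch says nothing about how M3 Forge performs normalization internally.

```python
from datetime import datetime

# Assumed input formats, based on the example values "2024-03-15" and "March 15, 2024".
# "%B" matches month names in the current locale; an English locale is assumed.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def normalize_date(raw: str):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("March 15, 2024"))
print(normalize_date("2024-03-15"))
print(normalize_date("sometime next week"))
```

Returning `None` for unrecognized input, rather than guessing, leaves it to the caller to decide whether the raw extracted value should be kept or sent for review.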

Backend Architecture

Custom extractors in M3 Forge map to the marie-ai extraction pipeline:

  • Foundation mode — Zero-shot LLM extraction using field definitions as prompts. No training required.
  • Fine-tune mode — LayoutLMv3-based models trained on your labeled examples for high accuracy.
  • Hybrid mode — FAISS-based semantic matching combined with fuzzy string matching for robust extraction.

The extraction pipeline supports multiple annotator types that can be combined:

| Annotator | Approach | Best For |
|---|---|---|
| LLM | Generative AI with prompt templates | Complex extraction, varied layouts |
| Embedding | FAISS hybrid semantic + fuzzy matching | Consistent field labels, OCR text |
| Regex | Deterministic pattern matching | Structured IDs, codes, standardized formats |
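
To give a feel for two of the approaches in the table, the sketch below shows deterministic pattern matching and fuzzy string matching using only the Python standard library. It is not the marie-ai implementation: the function names are hypothetical, the invoice pattern is inferred from the single example value "INV-2024-001", the 0.8 threshold is arbitrary, and the FAISS-based semantic half of the Embedding annotator is not shown.

```python
import re
from difflib import SequenceMatcher

# Assumed ID format, inferred from the example value "INV-2024-001".
INVOICE_ID = re.compile(r"\bINV-\d{4}-\d{3}\b")


def regex_annotate(text: str):
    """Return every substring that matches the invoice ID pattern."""
    return INVOICE_ID.findall(text)


def fuzzy_find_label(target: str, ocr_tokens, threshold: float = 0.8):
    """Return the OCR token most similar to `target`, or None if none reaches the threshold."""
    best_token, best_score = None, 0.0
    for token in ocr_tokens:
        score = SequenceMatcher(None, target.lower(), token.lower()).ratio()
        if score > best_score:
            best_token, best_score = token, score
    return best_token if best_score >= threshold else None


print(regex_annotate("Reference INV-2024-001, due on receipt"))
# An OCR misread ("Numbcr") still matches the expected label.
print(fuzzy_find_label("invoice number", ["Invoice Numbcr", "Date", "Total"]))
```

The regex approach is exact but brittle when formats vary, while fuzzy matching tolerates OCR noise at the risk of false matches, which is one reason the pipeline allows annotators to be combined.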

The Schema Generator uses Foundation mode (zero-shot LLM) to suggest fields. For production accuracy, train a fine-tuned model with labeled examples.

Best Practices

  1. Start with Schema Generator — Upload 3-5 representative documents to bootstrap field definitions
  2. Label diverse examples — Include different vendors, layouts, and edge cases
  3. Use prompt hints — Guide the AI with field-specific context (“usually in top-right corner”)
  4. Set correct occurrence — Use REQUIRED for mandatory fields to catch extraction failures
  5. Leverage auto-annotate — Let AI suggest initial annotations, then correct mistakes
  6. Review per-field metrics — Focus improvement efforts on fields with lowest F1 scores
  7. Add child fields — Use hierarchies for table extraction (parent = table, children = columns)

Next Steps

  • Configure Annotators for advanced extraction pipelines
  • Learn about Training job management and hyperparameters
  • Route low-confidence extractions to HITL review
  • Integrate extractors into Workflows