Custom Classifier

Group your documents into categories. Train a custom ML model to classify documents according to labels that you define.

Overview

The Custom Classifier categorizes documents into predefined classes. Define your label taxonomy, label example documents, and train a LayoutLMv3-based model that learns to recognize document types from both text content and visual layout.

Use cases:

Document routing (invoices vs. contracts vs. forms)
Type detection (W-2 vs. 1099 vs. W-9)
Topic classification (complaint vs. inquiry vs. feedback)
Compliance sorting (regulated vs. non-regulated)
Triage and prioritization

Creating a Custom Classifier

Create Processor

Navigate to Processors → Custom Processors and click Create processor on the Custom Classifier card. Enter a name (e.g., “Document Type Classifier”) and click Create.

Define Labels

Open your classifier and navigate to the Label Management tab. Create labels for each document class:

Figure: Label Management tab showing the label list with colors, training counts, and status badges.

Setting	Description	Example
Label Name	Machine-friendly identifier	`invoice`
Display Name	Human-readable label	“Invoice”
Description	What documents belong to this class	“Purchase invoices from vendors”
Color	Visual indicator in labeling UI	Blue swatch

Reorder labels by dragging the handle, and enable or disable a label with its toggle. A label's status badge turns green once it has 10 or more training samples.

Aim for at least 10 labeled examples per class for initial training. More examples and balanced class sizes produce better models.
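The label settings in the table above map naturally onto a small record type. The following is a minimal sketch, not the product's actual schema: `LabelDef`, its field names, the hex color format, and the snake_case naming rule are assumptions made for illustration. Only the 10-sample threshold comes from the text above.

```python
import re
from dataclasses import dataclass

MIN_SAMPLES = 10  # sample count at which the status badge turns green

# Assumed rule for a "machine-friendly" name: lowercase snake_case.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


@dataclass
class LabelDef:
    name: str            # machine-friendly identifier, e.g. "invoice"
    display_name: str    # human-readable label, e.g. "Invoice"
    description: str     # which documents belong to this class
    color: str           # visual indicator; a hex string is an assumption
    enabled: bool = True
    training_count: int = 0

    def __post_init__(self):
        if not NAME_PATTERN.match(self.name):
            raise ValueError(f"label name is not machine-friendly: {self.name!r}")

    @property
    def is_ready(self) -> bool:
        """True once the label has enough samples for initial training."""
        return self.training_count >= MIN_SAMPLES


invoice = LabelDef("invoice", "Invoice", "Purchase invoices from vendors",
                   "#1f77b4", training_count=12)
contract = LabelDef("contract", "Contract", "Signed agreements",
                    "#2ca02c", training_count=4)
print(invoice.is_ready, contract.is_ready)
```

Here `invoice` already meets the threshold while `contract` still needs more labeled examples before its badge would turn green.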

Import Training Documents

Navigate to the Documents tab and import example documents. Include a mix of all document types you want to classify.

Supported formats: PDF, PNG, JPEG, TIFF, DOCX
Upload via file picker, drag-and-drop, or S3 import
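Before a bulk upload it can help to check which files match the supported formats. The sketch below is a local pre-check only and does not call any importer API; the set of extension spellings and the `split_by_support` helper are assumptions.

```python
from pathlib import Path

# Extensions for the formats listed above; the exact spellings the
# importer accepts (e.g. .jpg vs .jpeg) are an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".docx"}


def split_by_support(paths):
    """Partition file paths into (importable, rejected) names by extension."""
    importable, rejected = [], []
    for p in map(Path, paths):
        target = importable if p.suffix.lower() in SUPPORTED_EXTENSIONS else rejected
        target.append(p.name)
    return importable, rejected


ok, skipped = split_by_support(["inv_001.PDF", "scan.tiff", "notes.txt", "w2.jpeg"])
print(ok)       # ['inv_001.PDF', 'scan.tiff', 'w2.jpeg']
print(skipped)  # ['notes.txt']
```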

Label Documents

Click a document to open the Labeling Interface, which uses a two-panel layout:

Figure: Classifier labeling interface with the document preview on the left and a clickable label list on the right.

Left panel — Document preview (3/4 width):

Full-sized document image
Zoom controls (0.25x to 3x)
Rotate document button
Refresh/reload document

Right panel — Label assignment (1/4 width):

Scrollable list of all labels with color indicators
Click a label to assign it to the current document
The selected label is highlighted with a border and background
Each label shows its keyboard shortcut (1-9)

Keyboard shortcuts:

1-9 — Toggle label assignment
Enter — Confirm label and move to next unlabeled document
N or → — Next document (without saving)
P or ← — Previous document
Esc — Close labeling interface
? — Show all shortcuts

The labeling workflow is designed for speed: assign a label with a number key, then press Enter to confirm it and jump to the next unlabeled document.

Review Dataset

The Dataset Overview tab shows:

Figure: Dataset Overview showing document counts, the labeling progress bar, and per-label statistics.

Total documents — All imported documents
Labeled count — Documents with assigned labels
Unlabeled count — Documents still needing labels
Needs Review — Documents flagged for review
Labeling progress — Percentage bar
Per-label statistics — Document count per label with color indicators
Train/Test split — 80/20 split showing assignment counts

Click Auto-assign to randomly distribute unassigned documents into training and test sets.
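The random 80/20 assignment can be pictured as a seeded shuffle followed by a cut. This is a minimal sketch of that idea, not the product's implementation: the `auto_assign` function, its parameters, and the choice to leave existing assignments untouched are assumptions.

```python
import random


def auto_assign(doc_ids, assigned=None, test_fraction=0.2, seed=0):
    """Randomly place unassigned documents into 'train' or 'test'.

    Documents already present in `assigned` keep their split.
    """
    result = dict(assigned or {})
    unassigned = [d for d in doc_ids if d not in result]
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    rng.shuffle(unassigned)
    n_test = round(len(unassigned) * test_fraction)
    for i, doc_id in enumerate(unassigned):
        result[doc_id] = "test" if i < n_test else "train"
    return result


docs = [f"doc_{i:03d}" for i in range(50)]
split = auto_assign(docs, assigned={"doc_000": "train"})
counts = {s: sum(1 for v in split.values() if v == s) for s in ("train", "test")}
print(counts)  # {'train': 40, 'test': 10}
```

Note that this sketch splits purely at random. With small or unbalanced classes, a per-label (stratified) split would keep every label represented in both sets.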

Train Model

Navigate to Training Jobs and click Start Training. The classifier trains a LayoutLMv3-based model that learns from both text content and document layout.

See Training for detailed configuration.

Evaluate Results

After training, the Evaluation dashboard provides:

Summary metrics (4-column grid):

Final Accuracy (green)
F1 Score (blue)
Precision (info)
Recall (yellow)

Tabs:

Tab	Content
Overview	Training summary, epochs, loss reduction, metrics table
Training Curves	Bar charts for Loss and Accuracy over epochs
Confusion Matrix	Heatmap showing Actual vs. Predicted labels with intensity scaling
Per-Class	Per-label breakdown with support count, metrics, and mini progress bars

Figure: Classifier evaluation showing the confusion matrix heatmap and per-class metrics with progress bars.

The confusion matrix helps identify commonly confused document types. Click a cell to see the misclassified examples behind it.
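To make the dashboard numbers concrete, the sketch below computes accuracy, a confusion matrix, and per-class precision, recall, F1, and support from paired actual and predicted labels. It uses the standard definitions in plain Python with made-up data; how the product averages these into its summary metrics is not stated in this guide, so that step is left out.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall, f1, support)}."""
    cm = confusion_counts(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(cm[(a, label)] for a in labels if a != label)
        fn = sum(cm[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1, tp + fn)
    return metrics


# Made-up evaluation data for illustration.
actual    = ["w2", "w2", "w2", "1099", "1099", "w9", "w9", "w9"]
predicted = ["w2", "w2", "1099", "1099", "1099", "w9", "w9", "w2"]

cm = confusion_counts(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
# Off-diagonal cells are the misclassifications worth inspecting.
confused = {pair: n for pair, n in cm.items() if pair[0] != pair[1]}
print(accuracy)   # 0.75
print(confused)
print(per_class_metrics(actual, predicted)["w2"])
```

In this toy data one W-2 was predicted as a 1099 and one W-9 as a W-2, which is exactly the kind of pattern the heatmap surfaces.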

Deploy

Activate the trained version for production use. See Training — Production Deployment.

Dashboard

The classifier dashboard provides:

Dataset metrics — Total documents, labeled, unlabeled, needs review
Annotation progress — Percentage bar
Train/Test split — 80/20 split view with assignment counts
Per-label statistics — Document counts per label with color badges
Quick actions — Import Documents, Start Annotating

Backend Architecture

Custom classifiers use the TransformersDocumentClassifier in marie-ai:

Model: LayoutLMv3 via Hugging Face AutoModelForSequenceClassification
Input: Document images + OCR text + bounding boxes
Processing: Per-page classification with batch support
Output: Predicted label with confidence score per page

Because the classifier uses both visual layout and text content, it is more robust to OCR noise and layout variations than a model that relies on text alone.
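A sequence-classification model emits one logit per label, and a per-page label with a confidence score is commonly derived by applying softmax and taking the highest-probability class. The sketch below shows only that post-processing step in plain Python with made-up logits. It does not call marie-ai or the real model, and `ID2LABEL` and `classify_pages` are hypothetical names.

```python
import math

ID2LABEL = {0: "invoice", 1: "contract", 2: "form"}  # hypothetical label map


def softmax(logits):
    """Numerically stable softmax over one page's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pages(batch_logits):
    """Turn a batch of per-page logits into (label, confidence) pairs."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((ID2LABEL[best], probs[best]))
    return results


# Made-up logits for a three-page document.
pages = [[4.0, 0.5, 0.1], [0.2, 3.1, 0.3], [1.0, 1.1, 0.9]]
for page_no, (label, conf) in enumerate(classify_pages(pages), start=1):
    print(f"page {page_no}: {label} ({conf:.2f})")
```

The third page has nearly equal logits, so its top label comes with a low confidence score. Pages like that are natural candidates for human review.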

Best Practices

Balance your classes — Aim for similar numbers of examples per label
Include edge cases — Add documents that are hard to classify
Label consistently — Use the same criteria across all documents
Start with 10+ per class — More examples improve accuracy
Review the confusion matrix — Focus on commonly confused classes
Use keyboard shortcuts — Process 50+ documents per hour with 1-9 + Enter
Auto-assign train/test — Use the auto-assign button for random splitting
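The first and fourth practices above can be checked mechanically before starting a training job. This is a minimal sketch assuming you can export one label per labeled document; `balance_report` and the `max_ratio` tolerance of 3.0 are illustrative choices, not product settings.

```python
from collections import Counter

MIN_PER_CLASS = 10  # starting point recommended above


def balance_report(doc_labels, max_ratio=3.0):
    """Summarize label counts and flag thin or dominant classes.

    max_ratio is an assumed tolerance for largest/smallest class size.
    """
    counts = Counter(doc_labels)
    if not counts:
        raise ValueError("no labeled documents")
    smallest, largest = min(counts.values()), max(counts.values())
    ratio = largest / smallest
    return {
        "counts": dict(counts),
        "under_minimum": sorted(l for l, n in counts.items() if n < MIN_PER_CLASS),
        "imbalance_ratio": ratio,
        "balanced": ratio <= max_ratio,
    }


# Made-up label counts for illustration.
labels = ["invoice"] * 40 + ["contract"] * 12 + ["form"] * 6
report = balance_report(labels)
print(report["under_minimum"])  # ['form']
print(report["balanced"])       # False
```

Labels with no documents at all do not appear in the export, so compare the report against your full label list as well.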

Next Steps

Learn about Training job management and evaluation
Route low-confidence classifications to HITL review
Use classifiers to route documents in Workflows
Combine with Custom Splitter for split-then-classify pipelines
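As a preview of low-confidence routing, the idea reduces to a threshold on the confidence score. The sketch below is illustrative only: the `route` function, the tuple shape, and the 0.8 threshold are assumptions, and the actual HITL configuration is covered in its own guide.

```python
def route(predictions, threshold=0.8):
    """Split (doc_id, label, confidence) tuples into accepted and review queues."""
    accepted, review = [], []
    for doc_id, label, confidence in predictions:
        queue = accepted if confidence >= threshold else review
        queue.append((doc_id, label))
    return accepted, review


# Made-up predictions for illustration.
preds = [("doc_1", "invoice", 0.97), ("doc_2", "contract", 0.55), ("doc_3", "form", 0.81)]
accepted, review = route(preds)
print(accepted)  # [('doc_1', 'invoice'), ('doc_3', 'form')]
print(review)    # [('doc_2', 'contract')]
```

A sensible threshold depends on the cost of a wrong automatic decision versus the cost of a manual review, so treat 0.8 as a placeholder to tune against your evaluation results.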