Custom Splitter

Train a custom ML model to identify document boundaries and separate the pages of large bundle PDFs.

Overview

The Custom Splitter detects where one document ends and another begins inside multi-document files. Define page types, configure split rules, mark boundaries on example documents, and train a model that learns to split your specific document bundles accurately.

Use cases:

Batch scan separation (scanner produces one large PDF)
Email attachment bundles (multiple documents in one file)
Lending document packages (dozens of document types in one bundle)
Procurement document bundles
Medical record compilations

Creating a Custom Splitter

Create Processor

Navigate to Processors → Custom Processors and click Create processor on the Custom Splitter card. Enter a name (e.g., “Lending Doc Splitter”) and click Create.

Define Page Types

Open your splitter and navigate to the Page Types tab. Create page type classes to categorize individual pages:

Setting	Description	Example
Page Type Name	Machine-friendly identifier	`check_page`
Display Name	Human-readable label	“Check Page”
Description	What pages belong to this type	“Pages containing bank checks”
Color	Visual indicator	Green swatch

Examples of page types: “Cover Sheet”, “Check Page”, “Denial Letter”, “Application Form”, “Signature Page”.

Page types can be reordered by dragging. Status badges show green when a type has sufficient training samples.
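
As an illustration, the settings in the Page Types table could be represented as a small record. This is a minimal sketch: the `PageType` class, its field names, the snake_case check, and the hex color format are all hypothetical and do not reflect the product's actual storage format or API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PageType:
    """Hypothetical record mirroring the Page Types settings table."""

    name: str          # machine-friendly identifier, e.g. "check_page"
    display_name: str  # human-readable label
    description: str   # what pages belong to this type
    color: str         # visual indicator; a hex color is an assumed format

    def __post_init__(self):
        # Assumption: "machine-friendly" means lowercase snake_case.
        if not re.fullmatch(r"[a-z][a-z0-9_]*", self.name):
            raise ValueError(f"not a machine-friendly identifier: {self.name!r}")


check_page = PageType(
    name="check_page",
    display_name="Check Page",
    description="Pages containing bank checks",
    color="#2e7d32",
)
```

Keeping the identifier separate from the display name lets the label change without affecting anything keyed on the identifier.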

Define Document Types

Navigate to the Document Types tab to categorize the logical documents that result from splitting.

Document types represent the complete documents that emerge after splitting — e.g., “Check Bundle”, “Denial Package”, “Application”. Each split segment gets assigned a document type.

Configure Split Rules

Navigate to the Split Rules tab. Rules define the logic for detecting document boundaries:

Figure: Split Rules tab showing the rule list with type badges, conditions, priority, and enable toggles.

Rule Type	Description	Example
PAGE_CLASSIFICATION	Split when transitioning from one page type to another	Split when “Check Page” follows “Cover Sheet”
SEQUENCE_PATTERN	Split when a specific page type sequence occurs	Split when “Cover Sheet” → “Application” pattern detected
CONTENT_BASED	Split when page contains specific content pattern	Split when page contains “Document ID:”
ALWAYS_SPLIT	Always split at a specific page type	Always split before “Cover Sheet” pages
NEVER_SPLIT	Never split a specific page type from the previous page	Never split “Page 2 of 3” from the previous page

Rule configuration fields:

Setting	Description
Rule Name	Descriptive name
Rule Type	One of the five types above
Source Page Type	The page type before the potential boundary
Target Page Type	The page type after the potential boundary
Content Pattern	Text pattern to match (for CONTENT_BASED rules)
Priority	Evaluation order (rules with lower numbers are evaluated first)
Split Before	Whether to split before or after the target page
Enabled	Toggle rule on/off

The Condition column shows an auto-generated, human-readable description of each rule.

Start with ALWAYS_SPLIT rules for clear document starters (like cover sheets), then add NEVER_SPLIT rules to prevent false positives.
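
To make the interaction between rule types and priority concrete, here is a minimal sketch of how such rules might be evaluated. It assumes that the first enabled rule that matches, in ascending priority order, decides the outcome, and that every split happens before the current page. The `SplitRule` class and `decide_split` function are hypothetical, not the product's implementation, and SEQUENCE_PATTERN rules are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitRule:
    """Hypothetical in-memory form of one row in the Split Rules tab."""

    name: str
    rule_type: str                 # one of the rule types in the table above
    priority: int                  # lower number = evaluated first
    source: Optional[str] = None   # page type before the potential boundary
    target: Optional[str] = None   # page type after the potential boundary
    pattern: Optional[str] = None  # text to look for (CONTENT_BASED only)
    enabled: bool = True


def decide_split(rules, prev_type, curr_type, curr_text=""):
    """Return True if a boundary should be placed before the current page.

    Assumed semantics: the first enabled matching rule wins; no match means
    no split.
    """
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.enabled:
            continue
        if rule.rule_type == "NEVER_SPLIT" and rule.target == curr_type:
            return False
        if rule.rule_type == "ALWAYS_SPLIT" and rule.target == curr_type:
            return True
        if (rule.rule_type == "PAGE_CLASSIFICATION"
                and (rule.source, rule.target) == (prev_type, curr_type)):
            return True
        if (rule.rule_type == "CONTENT_BASED"
                and rule.pattern and rule.pattern in curr_text):
            return True
    return False


rules = [
    SplitRule("Keep continuation pages", "NEVER_SPLIT", priority=1,
              target="continuation_page"),
    SplitRule("Cover sheet starts a document", "ALWAYS_SPLIT", priority=2,
              target="cover_sheet"),
    SplitRule("Document ID marker", "CONTENT_BASED", priority=3,
              pattern="Document ID:"),
]
```

Under these assumptions, giving the NEVER_SPLIT rule the lowest priority number lets it veto a content match on a continuation page, which is how it prevents false positives.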

Import Training Documents

Navigate to the Documents tab and import multi-document bundle files. Use the dedicated splitter import interface for multi-page PDFs and TIFFs.

Mark Boundaries

Click a document to open the Boundary Marking Interface — a specialized two-panel labeling tool:

Figure: Boundary Marking interface with the page thumbnail grid (scissors indicators mark split points) and the large page preview.

Left panel — Page grid (resizable, 400-900px):

Thumbnail grid of all pages (toggle between 4 or 6 columns)
Pages displayed at 8.5:11 aspect ratio
Split indicator: scissors icon on left edge when a boundary is marked
Click the circle indicator to toggle split before a page
Page number and assigned label displayed below each thumbnail
Selected page highlighted with border

Right panel — Page preview:

High-resolution view of the selected page
Zoom controls: +/- buttons, 0.5x to 2x range, Fit Width, Fit Page, Reset
Zoom percentage display

Labeling workflow:

Navigate through pages using the grid or arrow keys
Assign page types using number keys (1-9) matching your defined page types
Toggle document boundaries by clicking the split indicator or pressing B
Assign document types to each split segment
Click Mark as Labeled to save and move to the next document
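
The relationship between marked boundaries and split segments can be sketched as follows. This is an illustrative helper only: the function name and the representation of boundaries as a set of 1-based page numbers are assumptions, not the tool's internal format.

```python
def segments_from_boundaries(num_pages, split_before):
    """Group 1-based page numbers into document segments.

    `split_before` is the set of pages that have a boundary marked before
    them. Page 1 always starts the first segment.
    """
    segments = []
    for page in range(1, num_pages + 1):
        if page == 1 or page in split_before:
            segments.append([])  # start a new document segment
        segments[-1].append(page)
    return segments


# A 7-page bundle with boundaries marked before pages 3 and 6.
bundle = segments_from_boundaries(7, {3, 6})
# bundle == [[1, 2], [3, 4, 5], [6, 7]]
```

Each resulting segment is what receives a document type in the workflow above.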

Keyboard shortcuts:

Key	Action
`1-9`	Label current page with page type
`B`	Toggle boundary before current page
`Enter`	Mark complete and go to next document
`←` / `→`	Navigate pages
`N` / `P`	Next/previous page
`?`	Show keyboard help

Train Model

Navigate to Training Jobs and click Start Training. The splitter trains a LayoutLMv3-based model for boundary detection.

See Training for detailed configuration.

Evaluate Results

The Evaluation dashboard shows:

Boundary detection accuracy — Precision, recall, and F1 for split point detection
Page type classification metrics — Per-type accuracy
Split precision — Correct splits / total predicted splits
Split recall — Correct splits / actual boundaries
Training progress — Epoch-by-epoch metrics
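
The split precision and recall definitions above can be worked through in a few lines. This sketch assumes a predicted split counts as correct only when it lands on exactly the same page as an actual boundary; the dashboard's own matching criteria are not documented here, and `split_metrics` is a hypothetical helper.

```python
def split_metrics(predicted, actual):
    """Compute precision, recall, and F1 for predicted split points.

    Both arguments are collections of page numbers at which a new document
    starts. Exact-position matching is an assumption of this sketch.
    """
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 3 predicted splits, 4 actual boundaries, 2 of them in common.
precision, recall, f1 = split_metrics(predicted={3, 6, 9}, actual={3, 6, 8, 12})
# precision = 2/3, recall = 2/4, f1 = 4/7
```

Low precision means the model over-splits and documents get fragmented. Low recall means it misses boundaries and documents stay merged.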

Deploy

Activate the trained version for production use.

Dashboard

The splitter dashboard provides seven tabs:

Tab	Purpose
Page Types	Define and manage page type classes
Document Types	Define logical document categories
Split Rules	Configure boundary detection logic
Dataset	Training data statistics and split assignment
Documents	Import and manage training documents
Training	Launch and monitor training jobs
Evaluate	Review split accuracy and boundary metrics

Backend Architecture

Custom splitters use the TransformersDocumentSplitter in marie-ai:

Model: LayoutLMv3 via HuggingFace AutoModelForSequenceClassification
Input: Document page images + OCR text + bounding boxes
Processing: Per-page boundary prediction with batch support
Output: Split points between pages with confidence scores

The splitter processes both visual cues (layout changes, cover sheets) and text content (document IDs, headers) to detect boundaries.
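
Since the output is a set of split points with confidence scores, a consumer typically needs to decide which scores count as splits. The sketch below shows one plausible post-processing step. It is not the `TransformersDocumentSplitter` API: the function name, the score format, and both threshold values are assumptions chosen for illustration.

```python
def select_split_points(boundary_scores, split_threshold=0.8, review_threshold=0.5):
    """Turn per-page boundary confidences into split decisions.

    `boundary_scores` maps a page number to the confidence that a new
    document starts at that page. Scores at or above `split_threshold`
    become splits; scores between the two thresholds are flagged for
    manual review. Both thresholds are illustrative defaults.
    """
    splits, needs_review = [], []
    for page, score in sorted(boundary_scores.items()):
        if score >= split_threshold:
            splits.append(page)
        elif score >= review_threshold:
            needs_review.append(page)
    return splits, needs_review


scores = {2: 0.03, 3: 0.97, 4: 0.10, 5: 0.62, 6: 0.91}
splits, needs_review = select_split_points(scores)
# splits == [3, 6], needs_review == [5]
```

A review band of this kind is one way to send uncertain boundaries to a human instead of accepting or rejecting them automatically.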

Best Practices

Define clear page types — Each type should be visually or textually distinct
Start with ALWAYS_SPLIT rules — Identify reliable split points first
Add NEVER_SPLIT to prevent false positives — Pages that should stay together
Mark boundaries carefully — Boundary accuracy directly impacts model quality
Include varied bundle sizes — Train on bundles with different numbers of documents
Test on real bundles — Verify split accuracy on production-like data
Use keyboard shortcuts — B for boundaries, 1-9 for page types, Enter to complete

Next Steps

Combine with Custom Classifier for split-then-classify pipelines
Learn about Training job management
Route uncertain splits to HITL review
Integrate splitters into Workflows