Custom Splitter
Identify document boundaries in a large file. Train a custom ML model to identify document boundaries for page separation in large bundle PDFs.
Overview
The Custom Splitter detects where one document ends and another begins inside multi-document files. Define page types, configure split rules, mark boundaries on example documents, and train a model that learns to split your specific document bundles accurately.
Use cases:
- Batch scan separation (scanner produces one large PDF)
- Email attachment bundles (multiple documents in one file)
- Lending document packages (dozens of document types in one bundle)
- Procurement document bundles
- Medical record compilations
Creating a Custom Splitter
Create Processor
Navigate to Processors → Custom Processors and click Create processor on the Custom Splitter card. Enter a name (e.g., “Lending Doc Splitter”) and click Create.
Define Page Types
Open your splitter and navigate to the Page Types tab. Create page type classes to categorize individual pages:
| Setting | Description | Example |
|---|---|---|
| Page Type Name | Machine-friendly identifier | check_page |
| Display Name | Human-readable label | “Check Page” |
| Description | What pages belong to this type | “Pages containing bank checks” |
| Color | Visual indicator | Green swatch |
Examples of page types: “Cover Sheet”, “Check Page”, “Denial Letter”, “Application Form”, “Signature Page”.
Page types can be reordered by dragging. Status badges show green when a type has sufficient training samples.
Define Document Types
Navigate to the Document Types tab to define the logical documents that result from splitting.
Document types represent the complete documents that emerge after splitting, e.g., “Check Bundle”, “Denial Package”, “Application”. Each split segment is assigned a document type.
Configure Split Rules
Navigate to the Split Rules tab. Rules define the logic for detecting document boundaries:

| Rule Type | Description | Example |
|---|---|---|
| PAGE_CLASSIFICATION | Split when transitioning from one page type to another | Split when “Check Page” follows “Cover Sheet” |
| SEQUENCE_PATTERN | Split when a specific page type sequence occurs | Split when “Cover Sheet” → “Application” pattern detected |
| CONTENT_BASED | Split when page contains specific content pattern | Split when page contains “Document ID:” |
| ALWAYS_SPLIT | Always split at a specific page type | Always split before “Cover Sheet” pages |
| NEVER_SPLIT | Never split a specific page type from the previous | Never split “Page 2 of 3” from previous page |
Rule configuration fields:
| Setting | Description |
|---|---|
| Rule Name | Descriptive name |
| Rule Type | One of the five types above |
| Source Page Type | The page type before the potential boundary |
| Target Page Type | The page type after the potential boundary |
| Content Pattern | Text pattern to match (for CONTENT_BASED rules) |
| Priority | Evaluation order (rules with lower numbers take precedence) |
| Split Before | Whether to split before or after the target page |
| Enabled | Toggle rule on/off |
The condition column auto-generates a human-readable description of each rule.
Start with ALWAYS_SPLIT rules for clear document starters (like cover sheets), then add NEVER_SPLIT rules to prevent false positives.
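The interaction between rule types and priority can be sketched in a few lines of Python. This is an illustrative model only, not the product's rule engine: the `SplitRule` class, its field names, and the "first matching rule wins" behavior are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SplitRule:
    """Hypothetical in-memory form of one row in the Split Rules tab."""
    name: str
    rule_type: str                       # "ALWAYS_SPLIT", "NEVER_SPLIT" or "PAGE_CLASSIFICATION"
    target_page_type: str                # page type after the potential boundary
    source_page_type: Optional[str] = None  # page type before the potential boundary
    priority: int = 100                  # lower number = evaluated first
    enabled: bool = True


def decide_split(prev_type: str, curr_type: str, rules: List[SplitRule]) -> Optional[bool]:
    """Return True/False from the first matching rule, or None if no rule matches."""
    active = sorted((r for r in rules if r.enabled), key=lambda r: r.priority)
    for rule in active:
        if rule.target_page_type != curr_type:
            continue
        if rule.rule_type == "ALWAYS_SPLIT":
            return True
        if rule.rule_type == "NEVER_SPLIT":
            return False
        if rule.rule_type == "PAGE_CLASSIFICATION" and rule.source_page_type == prev_type:
            return True
    return None


def split_points(page_types: List[str], rules: List[SplitRule]) -> List[int]:
    """0-based indices of pages that start a new document (page 0 is excluded)."""
    return [
        i for i in range(1, len(page_types))
        if decide_split(page_types[i - 1], page_types[i], rules)
    ]


rules = [
    SplitRule("Split before cover sheets", "ALWAYS_SPLIT", "cover_sheet", priority=10),
    SplitRule("Keep signature pages attached", "NEVER_SPLIT", "signature_page", priority=20),
]
pages = ["cover_sheet", "application_form", "signature_page", "cover_sheet", "check_page"]
print(split_points(pages, rules))  # [3]
```

In this sketch, pages that match no rule are left unsplit. In the real splitter, such pages would presumably fall through to the trained model's prediction.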
Import Training Documents
Navigate to the Documents tab and import multi-document bundle files. Use the dedicated splitter import interface for multi-page PDFs and TIFFs.
Mark Boundaries
Click a document to open the Boundary Marking Interface — a specialized two-panel labeling tool:

Left panel — Page grid (resizable, 400-900px):
- Thumbnail grid of all pages (toggle between 4 or 6 columns)
- Pages displayed at 8.5:11 aspect ratio
- Split indicator: scissors icon on left edge when a boundary is marked
- Click the circle indicator to toggle split before a page
- Page number and assigned label displayed below each thumbnail
- Selected page highlighted with border
Right panel — Page preview:
- High-resolution view of the selected page
- Zoom controls: +/- buttons, 0.5x to 2x range, Fit Width, Fit Page, Reset
- Zoom percentage display
Labeling workflow:
- Navigate through pages using the grid or arrow keys
- Assign page types using number keys (1-9) matching your defined page types
- Toggle document boundaries by clicking the split indicator or pressing B
- Assign document types to each split segment
- Click Mark as Labeled to save and move to the next document
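To make the relationship between boundary marks and split segments concrete, here is a minimal sketch of how "split before page N" flags partition a bundle into documents. The function name and the use of 1-based page numbers are assumptions for illustration; the interface's actual storage format is not described here.

```python
from typing import List, Set


def segments_from_boundaries(num_pages: int, split_before: Set[int]) -> List[List[int]]:
    """Group 1-based page numbers into documents.

    split_before holds the page numbers that have a boundary marked
    before them (the pages showing the scissors icon).
    """
    segments: List[List[int]] = []
    current: List[int] = []
    for page in range(1, num_pages + 1):
        # A marked boundary closes the running segment, if there is one.
        if page in split_before and current:
            segments.append(current)
            current = []
        current.append(page)
    if current:
        segments.append(current)
    return segments


segments = segments_from_boundaries(6, {3, 5})
print(segments)  # [[1, 2], [3, 4], [5, 6]]
```

Each resulting segment is what receives a document type in the workflow above.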
Keyboard shortcuts:
| Key | Action |
|---|---|
| 1-9 | Label current page with page type |
| B | Toggle boundary before current page |
| Enter | Mark complete and go to next document |
| ← / → | Navigate pages |
| N / P | Next/previous page |
| ? | Show keyboard help |
Train Model
Navigate to Training Jobs and click Start Training. The splitter trains a LayoutLMv3-based model for boundary detection.
See Training for detailed configuration.
Evaluate Results
The Evaluation dashboard shows:
- Boundary detection accuracy — Precision, recall, and F1 for split point detection
- Page type classification metrics — Per-type accuracy
- Split precision — Correct splits / total predicted splits
- Split recall — Correct splits / actual boundaries
- Training progress — Epoch-by-epoch metrics
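The split precision and recall definitions above translate directly into code. The sketch below assumes boundaries are represented as the page numbers that start a new document; the function name is hypothetical, and the dashboard's own computation may differ in detail (for example, in how it treats bundles with no boundaries).

```python
from typing import Iterable, Tuple


def split_metrics(predicted: Iterable[int], actual: Iterable[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted vs. labeled split points."""
    predicted_set, actual_set = set(predicted), set(actual)
    correct = len(predicted_set & actual_set)
    # Precision: correct splits / total predicted splits
    precision = correct / len(predicted_set) if predicted_set else 0.0
    # Recall: correct splits / actual boundaries
    recall = correct / len(actual_set) if actual_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two of three predicted splits are correct; two of four true boundaries are found.
precision, recall, f1 = split_metrics(predicted=[3, 5, 8], actual=[3, 6, 8, 10])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Low precision means the model over-splits and fragments documents. Low recall means it misses boundaries and merges documents.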
Deploy
Activate the trained version for production use.
Dashboard
The splitter dashboard provides seven tabs:
| Tab | Purpose |
|---|---|
| Page Types | Define and manage page type classes |
| Document Types | Define logical document categories |
| Split Rules | Configure boundary detection logic |
| Dataset | Training data statistics and split assignment |
| Documents | Import and manage training documents |
| Training | Launch and monitor training jobs |
| Evaluate | Review split accuracy and boundary metrics |
Backend Architecture
Custom splitters use the TransformersDocumentSplitter in marie-ai:
- Model: LayoutLMv3 via HuggingFace AutoModelForSequenceClassification
- Input: Document page images + OCR text + bounding boxes
- Processing: Per-page boundary prediction with batch support
- Output: Split points between pages with confidence scores
The splitter processes both visual cues (layout changes, cover sheets) and text content (document IDs, headers) to detect boundaries.
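As a rough illustration of how per-page confidence scores could become split points, the sketch below applies a threshold to boundary probabilities and flags borderline pages for human review. It does not call TransformersDocumentSplitter: the input format, the function name, and both threshold values are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def select_splits(
    boundary_probs: Dict[int, float],
    threshold: float = 0.5,       # assumed cut-off for accepting a split
    review_margin: float = 0.15,  # assumed half-width of the "uncertain" band
) -> Tuple[List[int], List[int]]:
    """Turn per-page boundary probabilities into splits and review candidates.

    boundary_probs maps a page number to the model's confidence that a
    new document starts on that page.
    """
    splits: List[int] = []
    needs_review: List[int] = []
    for page, prob in sorted(boundary_probs.items()):
        if prob >= threshold:
            splits.append(page)
        # Scores close to the threshold are the least trustworthy.
        if abs(prob - threshold) < review_margin:
            needs_review.append(page)
    return splits, needs_review


probs = {2: 0.03, 3: 0.97, 4: 0.10, 5: 0.58, 6: 0.41}
splits, needs_review = select_splits(probs)
print(splits)        # [3, 5]
print(needs_review)  # [5, 6]
```

Raising the threshold trades recall for precision, and widening the review margin sends more pages to manual checking.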
Best Practices
- Define clear page types — Each type should be visually or textually distinct
- Start with ALWAYS_SPLIT rules — Identify reliable split points first
- Add NEVER_SPLIT to prevent false positives — Pages that should stay together
- Mark boundaries carefully — Boundary accuracy directly impacts model quality
- Include varied bundle sizes — Train on bundles with different numbers of documents
- Test on real bundles — Verify split accuracy on production-like data
- Use keyboard shortcuts — B for boundaries, 1-9 for page types, Enter to complete
Next Steps
- Combine with Custom Classifier for split-then-classify pipelines
- Learn about Training job management
- Route uncertain splits to HITL review
- Integrate splitters into Workflows