Custom Splitter

Train a custom ML model to identify document boundaries and separate pages in large bundle PDFs.

Overview

The Custom Splitter detects where one document ends and another begins inside multi-document files. Define page types, configure split rules, mark boundaries on example documents, and train a model that learns to split your specific document bundles accurately.

Use cases:

  • Batch scan separation (scanner produces one large PDF)
  • Email attachment bundles (multiple documents in one file)
  • Lending document packages (dozens of document types in one bundle)
  • Procurement document bundles
  • Medical record compilations

Creating a Custom Splitter

Create Processor

Navigate to Processors → Custom Processors and click Create processor on the Custom Splitter card. Enter a name (e.g., “Lending Doc Splitter”) and click Create.

Define Page Types

Open your splitter and navigate to the Page Types tab. Create page type classes to categorize individual pages:

Setting | Description | Example
Page Type Name | Machine-friendly identifier | check_page
Display Name | Human-readable label | “Check Page”
Description | What pages belong to this type | “Pages containing bank checks”
Color | Visual indicator | Green swatch

Examples of page types: “Cover Sheet”, “Check Page”, “Denial Letter”, “Application Form”, “Signature Page”.

Page types can be reordered by dragging. Status badges show green when a type has sufficient training samples.
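
As a rough illustration of the four settings above, a page type can be modeled as a small record. This is a sketch only: the `PageType` class and the lowercase snake_case rule for the identifier are assumptions made for illustration, not the product's actual schema or validation.

```python
import re
from dataclasses import dataclass

# Assumed convention: "machine-friendly" means lowercase snake_case.
_IDENTIFIER = re.compile(r"^[a-z][a-z0-9_]*$")


@dataclass(frozen=True)
class PageType:
    """Hypothetical record mirroring the page type settings."""
    name: str          # machine-friendly identifier, e.g. "check_page"
    display_name: str  # human-readable label
    description: str   # what pages belong to this type
    color: str         # visual indicator, e.g. "green"

    def __post_init__(self) -> None:
        # Reject names that contain spaces, capitals, or punctuation.
        if not _IDENTIFIER.match(self.name):
            raise ValueError(f"not a machine-friendly identifier: {self.name!r}")


check_page = PageType(
    name="check_page",
    display_name="Check Page",
    description="Pages containing bank checks",
    color="green",
)
```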

Define Document Types

Navigate to the Document Types tab to define categories for the logical documents that result from splitting.

Document types represent the complete documents that emerge after splitting, e.g., “Check Bundle”, “Denial Package”, “Application”. Each split segment is assigned a document type.

Configure Split Rules

Navigate to the Split Rules tab. Rules define the logic for detecting document boundaries:

[Screenshot: Split Rules tab showing rule list with type badges, conditions, priority, and enable toggles]

Rule Type | Description | Example
PAGE_CLASSIFICATION | Split when transitioning from one page type to another | Split when “Check Page” follows “Cover Sheet”
SEQUENCE_PATTERN | Split when a specific page type sequence occurs | Split when “Cover Sheet” → “Application” pattern detected
CONTENT_BASED | Split when page contains specific content pattern | Split when page contains “Document ID:”
ALWAYS_SPLIT | Always split at a specific page type | Always split before “Cover Sheet” pages
NEVER_SPLIT | Never split a specific page type from the previous | Never split “Page 2 of 3” from previous page

Rule configuration fields:

Setting | Description
Rule Name | Descriptive name
Rule Type | One of the five types above
Source Page Type | The page type before the potential boundary
Target Page Type | The page type after the potential boundary
Content Pattern | Text pattern to match (for CONTENT_BASED rules)
Priority | Execution order (lower = higher priority)
Split Before | Whether to split before or after the target page
Enabled | Toggle rule on/off

The condition column auto-generates a human-readable description of each rule.

Start with ALWAYS_SPLIT rules for clear document starters (like cover sheets), then add NEVER_SPLIT rules to prevent false positives.
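
To make the interaction between priority, ALWAYS_SPLIT, and NEVER_SPLIT concrete, here is a minimal sketch of how such rules could be evaluated for one page. It is not the product's rule engine. It assumes that enabled rules are checked in ascending priority order, that the first matching rule decides, and that a content pattern is a plain substring match. `SplitRule`, `rule_matches`, and `should_split_before` are hypothetical names, and SEQUENCE_PATTERN is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SplitRule:
    """Hypothetical rule record based on the configuration fields above."""
    name: str
    rule_type: str                          # one of the five rule types
    source_page_type: Optional[str] = None  # page type before the boundary
    target_page_type: Optional[str] = None  # page type after the boundary
    content_pattern: Optional[str] = None   # used by CONTENT_BASED rules
    priority: int = 100                     # lower = higher priority
    enabled: bool = True


def rule_matches(rule: SplitRule, prev_type: str, curr_type: str, curr_text: str) -> bool:
    if rule.rule_type in ("ALWAYS_SPLIT", "NEVER_SPLIT"):
        return curr_type == rule.target_page_type
    if rule.rule_type == "PAGE_CLASSIFICATION":
        return prev_type == rule.source_page_type and curr_type == rule.target_page_type
    if rule.rule_type == "CONTENT_BASED":
        # Assumption: substring match; the real pattern syntax is not documented here.
        return rule.content_pattern is not None and rule.content_pattern in curr_text
    return False  # SEQUENCE_PATTERN is not handled in this sketch


def should_split_before(rules: List[SplitRule], prev_type: str, curr_type: str,
                        curr_text: str = "") -> bool:
    """Assumed policy: the first matching enabled rule, by priority, decides."""
    for rule in sorted((r for r in rules if r.enabled), key=lambda r: r.priority):
        if rule_matches(rule, prev_type, curr_type, curr_text):
            return rule.rule_type != "NEVER_SPLIT"
    return False


rules = [
    SplitRule("Cover starts document", "ALWAYS_SPLIT", target_page_type="cover_sheet", priority=10),
    SplitRule("Keep continuations", "NEVER_SPLIT", target_page_type="continuation_page", priority=5),
    SplitRule("Document ID marker", "CONTENT_BASED", content_pattern="Document ID:", priority=20),
]

print(should_split_before(rules, "check_page", "cover_sheet"))                          # True
print(should_split_before(rules, "check_page", "continuation_page", "Document ID: 7"))  # False
print(should_split_before(rules, "check_page", "check_page", "Document ID: 7"))         # True
```

In the second call the NEVER_SPLIT rule has the lowest priority number, so under the assumed policy it suppresses the content-based split that would otherwise fire.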

Import Training Documents

Navigate to the Documents tab and import multi-document bundle files. Use the dedicated splitter import interface for multi-page PDFs and TIFFs.

Mark Boundaries

Click a document to open the Boundary Marking Interface — a specialized two-panel labeling tool:

[Screenshot: Boundary Marking interface with page thumbnail grid (showing scissors indicators for split points) and large page preview]

Left panel — Page grid (resizable, 400-900px):

  • Thumbnail grid of all pages (toggle between 4 or 6 columns)
  • Pages displayed at 8.5:11 aspect ratio
  • Split indicator: scissors icon on left edge when a boundary is marked
  • Click the circle indicator to toggle split before a page
  • Page number and assigned label displayed below each thumbnail
  • Selected page highlighted with border

Right panel — Page preview:

  • High-resolution view of the selected page
  • Zoom controls: +/- buttons, 0.5x to 2x range, Fit Width, Fit Page, Reset
  • Zoom percentage display

Labeling workflow:

  1. Navigate through pages using the grid or arrow keys
  2. Assign page types using number keys (1-9) matching your defined page types
  3. Toggle document boundaries by clicking the split indicator or pressing B
  4. Assign document types to each split segment
  5. Click Mark as Labeled to save and move to the next document
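
The relationship between "split before" markers and the resulting segments can be sketched as follows. This is an illustration of the idea only; `segments_from_boundaries` is a hypothetical helper and the 1-indexed, inclusive page ranges are an assumed convention.

```python
from typing import Iterable, List, Tuple


def segments_from_boundaries(num_pages: int, split_before: Iterable[int]) -> List[Tuple[int, int]]:
    """Turn 'split before page p' markers into (first_page, last_page) ranges.

    Pages are 1-indexed and ranges are inclusive. Page 1 always starts a
    segment, so a marker on page 1 (or outside the bundle) has no effect.
    """
    if num_pages < 1:
        return []
    starts = sorted({1} | {p for p in split_before if 1 < p <= num_pages})
    ends = [s - 1 for s in starts[1:]] + [num_pages]
    return list(zip(starts, ends))


# A 7-page bundle with boundaries marked before pages 3 and 6.
print(segments_from_boundaries(7, {3, 6}))  # [(1, 2), (3, 5), (6, 7)]
```

Each resulting range corresponds to one split segment that then receives a document type in step 4.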

Keyboard shortcuts:

Key | Action
1-9 | Label current page with page type
B | Toggle boundary before current page
Enter | Mark complete and go to next document
Arrow keys | Navigate pages
N / P | Next/previous page
? | Show keyboard help

Train Model

Navigate to Training Jobs and click Start Training. The splitter trains a LayoutLMv3-based model for boundary detection.

See Training for detailed configuration.

Evaluate Results

The Evaluation dashboard shows:

  • Boundary detection accuracy — Precision, recall, and F1 for split point detection
  • Page type classification metrics — Per-type accuracy
  • Split precision — Correct splits / total predicted splits
  • Split recall — Correct splits / actual boundaries
  • Training progress — Epoch-by-epoch metrics
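
The split precision and recall definitions above, together with the standard F1 formula, can be computed from two sets of boundary positions. This is a minimal sketch; `split_metrics` is a hypothetical helper, and it assumes a predicted split counts as correct only when it lands on exactly the same page as a labeled boundary.

```python
from typing import Iterable, Tuple


def split_metrics(predicted: Iterable[int], actual: Iterable[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for predicted vs. labeled split points."""
    predicted_set, actual_set = set(predicted), set(actual)
    correct = len(predicted_set & actual_set)

    # Precision: correct splits / total predicted splits.
    precision = correct / len(predicted_set) if predicted_set else 0.0
    # Recall: correct splits / actual boundaries.
    recall = correct / len(actual_set) if actual_set else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The model predicted splits before pages 3, 6, 9; the labels say 3, 6, 8, 12.
precision, recall, f1 = split_metrics({3, 6, 9}, {3, 6, 8, 12})
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.67 0.50 0.57
```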

Deploy

Activate the trained version for production use.

Dashboard

The splitter dashboard provides seven tabs:

Tab | Purpose
Page Types | Define and manage page type classes
Document Types | Define logical document categories
Split Rules | Configure boundary detection logic
Dataset | Training data statistics and split assignment
Documents | Import and manage training documents
Training | Launch and monitor training jobs
Evaluate | Review split accuracy and boundary metrics

Backend Architecture

Custom splitters use the TransformersDocumentSplitter in marie-ai:

  • Model: LayoutLMv3 via HuggingFace AutoModelForSequenceClassification
  • Input: Document page images + OCR text + bounding boxes
  • Processing: Per-page boundary prediction with batch support
  • Output: Split points between pages with confidence scores

The splitter processes both visual cues (layout changes, cover sheets) and text content (document IDs, headers) to detect boundaries.
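
As an illustration of the output stage, per-page boundary scores can be turned into split points with a confidence threshold. This sketch does not use or reproduce the TransformersDocumentSplitter API. The `split_points` function, the shape of its input, and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_points(boundary_probs: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Select split points from per-page boundary scores.

    boundary_probs[i] is the (assumed) model score that page i + 1 starts a
    new document. Page 1 is skipped because it always starts the first one.
    Returns (page_number, confidence) pairs, meaning 'split before this page'.
    """
    points = []
    for page_number, prob in enumerate(boundary_probs, start=1):
        if page_number > 1 and prob >= threshold:
            points.append((page_number, prob))
    return points


# Scores for a 5-page bundle; a stricter threshold trades recall for precision.
print(split_points([0.99, 0.02, 0.91, 0.40, 0.75], threshold=0.7))  # [(3, 0.91), (5, 0.75)]
```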

Best Practices

  1. Define clear page types — Each type should be visually or textually distinct
  2. Start with ALWAYS_SPLIT rules — Identify reliable split points first
  3. Add NEVER_SPLIT to prevent false positives — Pages that should stay together
  4. Mark boundaries carefully — Boundary accuracy directly impacts model quality
  5. Include varied bundle sizes — Train on bundles with different numbers of documents
  6. Test on real bundles — Verify split accuracy on production-like data
  7. Use keyboard shortcuts — B for boundaries, 1-9 for page types, Enter to complete
