Training
Training transforms labeled examples into production-ready machine learning models. M3 Forge automates training configuration, execution, and evaluation, delivering high-accuracy processors without requiring machine learning expertise.
Training Overview
Training involves:
- Data Preparation — Split labeled dataset into train/validation/test sets
- Model Initialization — Start from scratch or fine-tune existing model
- Training Execution — Iteratively optimize model on training data
- Evaluation — Measure performance on held-out validation set
- Deployment — Activate trained version for production use
M3 Forge handles all steps automatically with sensible defaults. Advanced users can customize hyperparameters and training strategies.
Creating Training Jobs
Navigate to Training Tab
Open your custom processor and click the Training tab.
Click Create Training Job
Configure training parameters:
| Parameter | Description | Default |
|---|---|---|
| Job Name | Descriptive name for this run | Auto-generated timestamp |
| Training Type | Full training vs incremental | Full |
| Base Model | Start from pre-trained model | Latest gallery model |
| Dataset Split | Train/validation/test ratio | 70/15/15 |
| Max Epochs | Maximum training iterations | Auto (early stopping) |
| Batch Size | Documents per training step | Auto (based on memory) |
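The parameters in the table above map naturally onto a job-configuration object. The sketch below is purely illustrative: the key names and values are assumptions for this page, not M3 Forge's actual API.

```python
# Hypothetical training-job configuration mirroring the parameter table.
# Key names are illustrative; M3 Forge's actual API may differ.
training_job = {
    "job_name": "invoice-extractor-run-1",  # defaults to an auto-generated timestamp
    "training_type": "full",                # "full" or "incremental"
    "base_model": "gallery/latest",         # pre-trained starting point
    "dataset_split": (0.70, 0.15, 0.15),    # train / validation / test ratio
    "max_epochs": None,                     # None = auto (early stopping)
    "batch_size": None,                     # None = auto (based on memory)
}
```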
Review Dataset Statistics
The preview shows:
- Total labeled documents
- Train/validation/test split sizes
- Class distribution (for classifiers)
- Field coverage (for extractors)
Ensure sufficient examples for each class or field.
Launch Training
Click Start Training to begin execution.
The training job enters the queue and typically starts within 1-2 minutes, depending on cluster load.
Monitor Progress
The dashboard displays real-time metrics:
- Current epoch and time elapsed
- Training loss and validation loss
- Accuracy, F1 score, precision, recall
- Estimated time remaining
Click a job row to view detailed metrics and live charts.
Training duration depends on dataset size and processor type. Typical ranges: 30 minutes (100 docs) to 4 hours (10,000 docs).
Training Types
Full Training
Trains model from scratch (or from base model):
- Use Case — First training run or major dataset changes
- Duration — Longest training time
- Accuracy — Best performance when sufficient data available
- Resource Cost — Highest compute cost
Incremental Training
Updates existing model with new examples:
- Use Case — Adding new classes, fields, or edge cases
- Duration — Faster than full training
- Accuracy — Preserves existing knowledge while learning new patterns
- Resource Cost — Lower compute cost
When to use incremental:
- Adding 10-20% new examples
- Introducing new classes to existing classifier
- Fine-tuning on domain-specific examples
- Correcting errors on difficult cases
When to use full training:
- Initial model creation
- Major dataset changes (>50% new examples)
- Changing schema (fields or classes)
- Model performance degraded
Training Configuration
Dataset Split
Control how data is divided:
Standard Split (70/15/15):
- 70% training — Used to optimize model
- 15% validation — Monitor overfitting during training
- 15% test — Final evaluation after training
Default for most use cases.
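The standard split amounts to a deterministic shuffle followed by slicing at the ratio boundaries. The helper below is a hypothetical sketch of that procedure, not a platform API:

```python
import random

def split_dataset(documents, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle labeled documents and slice into train/validation/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed keeps splits reproducible
    n_train = int(len(docs) * ratios[0])
    n_val = int(len(docs) * ratios[1])
    return docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:]
```

A fixed seed matters here: re-running training with a different split would make version-to-version metric comparisons unreliable.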
Hyperparameters
Advanced users can tune training hyperparameters:
| Parameter | Description | Default | Tuning Guidance |
|---|---|---|---|
| Learning Rate | Step size for optimization | 0.001 | Lower if training unstable, higher if too slow |
| Batch Size | Examples per training step | 16 | Larger for faster training (if memory allows) |
| Max Epochs | Training iterations | 50 | Increase if validation improving at end |
| Early Stopping | Stop if no improvement | 5 epochs | Patience for validation improvement |
| Dropout | Regularization strength | 0.1 | Increase if overfitting (train >> val accuracy) |
| Weight Decay | L2 regularization | 0.01 | Increase if overfitting |
Hyperparameter tuning requires ML expertise. Default settings work well for 90% of use cases. Only adjust if evaluation metrics are unsatisfactory.
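As a rough illustration of the tuning guidance in the table, the sketch below captures the defaults and one adjustment rule (strengthen regularization when overfitting). Both the defaults dict and the `adjust_for_overfitting` helper are hypothetical, not M3 Forge settings objects:

```python
# Defaults mirroring the hyperparameter table above (illustrative names).
DEFAULT_HYPERPARAMETERS = {
    "learning_rate": 1e-3,
    "batch_size": 16,
    "max_epochs": 50,
    "early_stopping_patience": 5,
    "dropout": 0.1,
    "weight_decay": 0.01,
}

def adjust_for_overfitting(hp):
    """Apply the table's overfitting guidance: increase dropout and weight decay."""
    hp = dict(hp)  # leave the input untouched
    hp["dropout"] = min(hp["dropout"] * 2, 0.5)
    hp["weight_decay"] = hp["weight_decay"] * 2
    return hp
```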
Transfer Learning
Start from pre-trained models:
Gallery Models:
- Use general-purpose processors as starting point
- Faster convergence than training from scratch
- Requires less labeled data (can work with 50-100 examples)
Public Models:
- LayoutLMv3, BERT, RoBERTa for document understanding
- Vision transformers for layout analysis
- Language models for text extraction
Your Previous Models:
- Use older version as base for incremental updates
- Preserves learned patterns while adding new capabilities
Select base model in training configuration.
Monitoring Training
Live Metrics
Training dashboard shows:
Loss Curves:
- Training loss (should decrease steadily)
- Validation loss (should decrease, then plateau)
- Divergence indicates overfitting
Accuracy Metrics:
- Overall accuracy over epochs
- Per-class/field accuracy (detailed view)
- F1 score progression
Learning Curves:
- Performance vs dataset size
- Indicates if more data would help
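The overfitting signal described above (validation loss rising while training loss keeps falling) can be sketched as a simple heuristic over recorded loss curves. The function and its patience threshold are illustrative, not M3 Forge's internal logic:

```python
def detect_overfitting(train_loss, val_loss, patience=3):
    """Flag overfitting when validation loss rises for `patience` consecutive
    epochs while training loss keeps decreasing."""
    rising = 0
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0  # streak broken; curves converging again
    return False
```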
Interpreting Metrics
Good Training:
- Training and validation loss both decreasing
- Validation accuracy improving or stable
- Minimal gap between train and validation metrics
Overfitting:
- Training loss decreasing, validation increasing
- Large gap between train accuracy (high) and validation (low)
- Solution: Add regularization, reduce model complexity, or increase dataset
Underfitting:
- Both training and validation loss high
- Poor accuracy on both sets
- Solution: Increase model capacity, train longer, or improve data quality
Data Issues:
- Validation loss fluctuating wildly
- Accuracy not improving despite low loss
- Solution: Check label quality, increase dataset, balance classes
Validation metrics are the true measure of model quality. Training metrics can be misleadingly high due to overfitting.
Training States
Jobs progress through states:
| State | Description | Duration |
|---|---|---|
| Queued | Waiting for compute resources | 1-5 minutes |
| Initializing | Loading data, preparing model | 2-5 minutes |
| Training | Active model optimization | 30 min - 4 hours |
| Evaluating | Running final test set | 5-10 minutes |
| Completed | Successfully finished | - |
| Failed | Error during training | - |
| Cancelled | Manually stopped | - |
Failed jobs show error logs. Common causes:
- Insufficient memory (reduce batch size)
- Invalid data (check labels)
- Infrastructure issues (retry)
Evaluation Dashboards
After training completes, view the comprehensive evaluation dashboard:
Classification Metrics
Overall Performance:
- Accuracy (percentage of correct predictions)
- Weighted F1 (accounts for class imbalance)
- Top-3 accuracy (correct class in top 3 predictions)
Per-Class Metrics:
- Precision (correct predictions / total predictions)
- Recall (correct predictions / actual instances)
- F1 score (harmonic mean of precision/recall)
- Support (number of examples in test set)
Confusion Matrix:
- Visual heatmap of predictions vs ground truth
- Identify commonly confused classes
- Click cells to see example misclassifications
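Per-class precision, recall, F1, and support can all be derived from paired predictions and ground-truth labels, the same counts that populate the confusion matrix. A minimal sketch (not M3 Forge's evaluation code):

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and support for each class."""
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_total = Counter(y_pred)   # denominator for precision
    true_total = Counter(y_true)   # denominator for recall (= support)
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = true_positives[c] / pred_total[c] if pred_total[c] else 0.0
        recall = true_positives[c] / true_total[c] if true_total[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": true_total[c]}
    return metrics
```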
Extraction Metrics
Field-Level Accuracy:
- Exact match accuracy (predicted == ground truth)
- Partial match accuracy (overlap > threshold)
- Character error rate (edit distance)
- Missing field rate (fields not extracted)
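Character error rate is the Levenshtein edit distance between the predicted and ground-truth strings, normalized by the reference length. A compact sketch of that computation:

```python
def character_error_rate(predicted, reference):
    """Levenshtein edit distance divided by the reference length."""
    m, n = len(predicted), len(reference)
    dp = list(range(n + 1))  # one rolling row of the distance matrix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (predicted[i - 1] != reference[j - 1]))  # substitution
            prev = cur
    return dp[n] / n if n else float(m > 0)
```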
Aggregate Metrics:
- Micro-averaged F1 (all fields weighted equally)
- Macro-averaged F1 (per-field F1 averaged)
- Entity-level precision/recall
Error Analysis:
- Common extraction errors by type
- Confidence distribution for correct vs incorrect
- Difficult examples requiring review
Splitting Metrics
Boundary Detection:
- Precision (correct splits / predicted splits)
- Recall (correct splits / actual boundaries)
- F1 score (harmonic mean)
Page Classification:
- Accuracy per page type
- Confusion matrix for page types
- Split rule effectiveness
Layout Metrics
Region Detection:
- IoU (Intersection over Union) scores
- Average precision at IoU thresholds
- Per-region-type performance
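IoU for region detection is the intersection area of the predicted and ground-truth boxes divided by their union area. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```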
Reading Order:
- Kendall’s tau correlation
- Percentage of correct orderings
Model Versioning
Each training run creates a versioned model:
Version Attributes:
- Version number (v1, v2, v3, etc.)
- Training date and duration
- Performance metrics (accuracy, F1)
- Training configuration (hyperparameters, dataset size)
- Base model (if transfer learning used)
Version Management:
- View all versions in Versions tab
- Compare metrics across versions
- Activate any version for production use
- Roll back to previous version if needed
- Delete old versions to save storage
Activating a new version replaces the production model. Test thoroughly before activating. Previous version remains available for rollback.
Production Deployment
Activating a Version
Review Evaluation
Ensure metrics meet your accuracy requirements.
Test on Real Data
Use Test tab to run model on production-like documents.
Set Confidence Thresholds
Configure minimum confidence for automatic processing vs HITL routing.
Activate Version
Click Activate on desired version. Confirm deployment.
Monitor Performance
After activation, monitor real-world accuracy in Metrics tab.
A/B Testing
Compare two versions in production:
- Deploy both versions
- Route 50% of traffic to each version
- Measure accuracy, latency, user corrections
- Promote winner to 100% traffic
Enable A/B testing in processor settings.
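A common way to implement a stable 50/50 split is to hash each document's ID into a bucket, so the same document always hits the same version and results stay comparable. The sketch below illustrates that idea; M3 Forge's actual routing mechanism is not specified here:

```python
import hashlib

def route_version(document_id, traffic_to_b=0.5):
    """Deterministically route a document to version A or B by hashing its ID."""
    digest = hashlib.sha256(document_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < traffic_to_b else "A"
```

Because routing depends only on the ID, reprocessing the same document never flips versions mid-experiment.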
Gradual Rollout
Reduce deployment risk with phased rollout:
- Activate new version at 10% traffic
- Monitor for errors or degradation
- Increase to 50%, then 100% if stable
- Automatic rollback if metrics drop
Configure rollout strategy in deployment settings.
Continuous Training
Keep models up-to-date with continuous training:
Active Learning Loop
- Model processes documents and predicts
- Low-confidence predictions routed to HITL
- Humans review and correct predictions
- Corrections added to training dataset
- Periodic retraining with expanded dataset
Automate with scheduled training jobs (weekly, monthly).
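The routing step of the loop above amounts to partitioning predictions by confidence. A hypothetical sketch (the field names and threshold are illustrative):

```python
def active_learning_step(predictions, confidence_threshold=0.8):
    """Split predictions into auto-accepted results and a HITL review queue."""
    accepted, review_queue = [], []
    for pred in predictions:
        target = accepted if pred["confidence"] >= confidence_threshold else review_queue
        target.append(pred)
    return accepted, review_queue
```

Corrections gathered from the review queue are what feed the next scheduled retraining run.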
Drift Detection
Monitor for data drift:
- Confidence score distribution changes
- Increased HITL routing rate
- User correction frequency
Trigger retraining when drift exceeds threshold.
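One simple drift proxy is a drop in mean confidence relative to a baseline window. The sketch below illustrates that check; the threshold value is an assumption, not a platform default:

```python
from statistics import mean

def confidence_drift(baseline_scores, current_scores, threshold=0.05):
    """Flag drift when mean confidence drops more than `threshold`
    below the baseline window (a crude proxy for distribution shift)."""
    return mean(baseline_scores) - mean(current_scores) > threshold
```

Production systems often pair a check like this with the HITL routing rate, since a confidence drop and a routing spike usually move together.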
Performance Tracking
Dashboard tracks model performance over time:
- Accuracy trends
- Confidence score distribution
- Processing latency
- HITL routing rate
Identify when retraining is needed.
For production systems, retrain at least quarterly to account for evolving document formats and business processes.
Troubleshooting Training Issues
Training Not Improving
Symptoms: Accuracy stuck at a low value, loss not decreasing.
Solutions:
- Increase model capacity (more layers, larger embeddings)
- Check label quality (inter-annotator agreement)
- Increase dataset size
- Try different base model
- Adjust learning rate
Overfitting
Symptoms: Training accuracy high, validation accuracy low.
Solutions:
- Increase dropout (0.1 → 0.3)
- Add weight decay regularization
- Reduce model capacity
- Increase dataset size
- Use data augmentation
Slow Training
Symptoms: Training takes excessively long.
Solutions:
- Increase batch size (if memory allows)
- Reduce dataset size for initial experiments
- Use smaller base model
- Simplify processor schema
Out of Memory
Symptoms: Training fails with OOM error.
Solutions:
- Reduce batch size (16 → 8 → 4)
- Reduce max sequence length
- Use gradient accumulation
- Request larger instance type
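Gradient accumulation simulates a large batch by summing gradients over several micro-batches and applying one optimizer update, so memory use stays at the micro-batch size. A framework-agnostic sketch, with hypothetical `compute_grad` and `apply_update` callbacks standing in for the training framework:

```python
def train_with_accumulation(batches, compute_grad, apply_update, accum_steps=4):
    """Average gradients over `accum_steps` micro-batches per optimizer update."""
    accumulated = None
    for step, batch in enumerate(batches, start=1):
        grad = compute_grad(batch)  # gradient for one small micro-batch
        accumulated = grad if accumulated is None else [
            a + g for a, g in zip(accumulated, grad)
        ]
        if step % accum_steps == 0:
            # One update with the averaged gradient = one large-batch step.
            apply_update([g / accum_steps for g in accumulated])
            accumulated = None
```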
Next Steps
- Deploy trained processors in Workflows
- Route low-confidence predictions to HITL
- Publish high-performing models to Processor Gallery
- Monitor production metrics in Monitoring