Training
Training transforms labeled examples into production-ready machine learning models. M3 Forge automates training configuration, execution, and evaluation, delivering high-accuracy processors without requiring machine learning expertise.
Training Overview
Training involves:
- Data Preparation — Split labeled dataset into train/validation/test sets
- Model Initialization — Start from scratch or fine-tune existing model
- Training Execution — Iteratively optimize model on training data
- Evaluation — Measure performance on held-out validation set
- Deployment — Activate trained version for production use
M3 Forge handles all steps automatically with sensible defaults. Advanced users can customize hyperparameters and training strategies.
Creating Training Jobs
Navigate to Training Tab
Open your custom processor and click the Training tab.
Click Create Training Job
Configure training parameters:
| Parameter | Description | Default |
|---|---|---|
| Job Name | Descriptive name for this run | Auto-generated timestamp |
| Training Type | Full training vs incremental | Full |
| Base Model | Start from pre-trained model | Latest gallery model |
| Dataset Split | Train/validation/test ratio | 70/15/15 |
| Max Epochs | Maximum training iterations | Auto (early stopping) |
| Batch Size | Documents per training step | Auto (based on memory) |
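The parameters in the table above map naturally onto a job-configuration object. The sketch below is purely illustrative: the key names and values are assumptions for this page, not M3 Forge's actual API.

```python
# Hypothetical training-job configuration mirroring the parameter table.
# Key names are illustrative; M3 Forge's actual API may differ.
training_job = {
    "job_name": "invoice-extractor-run-1",  # defaults to an auto-generated timestamp
    "training_type": "full",                # "full" or "incremental"
    "base_model": "gallery/latest",         # pre-trained starting point
    "dataset_split": (0.70, 0.15, 0.15),    # train / validation / test ratio
    "max_epochs": None,                     # None = auto (early stopping)
    "batch_size": None,                     # None = auto (based on memory)
}
```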
Review Dataset Statistics
The preview shows:
- Total labeled documents
- Train/validation/test split sizes
- Class distribution (for classifiers)
- Field coverage (for extractors)
Ensure sufficient examples for each class or field.
Launch Training
Click Start Training to begin execution.
The training job enters the queue and typically starts within 1-2 minutes, depending on cluster load.
Monitor Progress
The dashboard displays real-time metrics:
- Current epoch and time elapsed
- Training loss and validation loss
- Accuracy, F1 score, precision, recall
- Estimated time remaining
Click a job row to view detailed metrics and live charts.
Training duration depends on dataset size and processor type. Typical ranges: 30 minutes (100 docs) to 4 hours (10,000 docs).
Training Types
Full Training
Trains model from scratch (or from base model):
- Use Case — First training run or major dataset changes
- Duration — Longest training time
- Accuracy — Best performance when sufficient data available
- Resource Cost — Highest compute cost
Incremental Training
Updates existing model with new examples:
- Use Case — Adding new classes, fields, or edge cases
- Duration — Faster than full training
- Accuracy — Preserves existing knowledge while learning new patterns
- Resource Cost — Lower compute cost
When to use incremental:
- Adding 10-20% new examples
- Introducing new classes to existing classifier
- Fine-tuning on domain-specific examples
- Correcting errors on difficult cases
When to use full training:
- Initial model creation
- Major dataset changes (>50% new examples)
- Changing schema (fields or classes)
- Model performance degraded
Training Configuration
Dataset Split
Control how data is divided:
Standard Split (70/15/15):
- 70% training — Used to optimize model
- 15% validation — Monitor overfitting during training
- 15% test — Final evaluation after training
Default for most use cases.
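The standard split amounts to a deterministic shuffle followed by slicing at the ratio boundaries. The helper below is a hypothetical sketch of that procedure, not a platform API:

```python
import random

def split_dataset(documents, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle labeled documents and slice into train/validation/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed keeps splits reproducible
    n_train = int(len(docs) * ratios[0])
    n_val = int(len(docs) * ratios[1])
    return docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:]
```

A fixed seed matters here: re-running training with a different split would make version-to-version metric comparisons unreliable.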
Hyperparameters
Advanced users can tune training hyperparameters:
| Parameter | Description | Default | Tuning Guidance |
|---|---|---|---|
| Learning Rate | Step size for optimization | 0.001 | Lower if training unstable, higher if too slow |
| Batch Size | Examples per training step | 16 | Larger for faster training (if memory allows) |
| Max Epochs | Training iterations | 50 | Increase if validation improving at end |
| Early Stopping | Stop if no improvement | 5 epochs | Patience for validation improvement |
| Dropout | Regularization strength | 0.1 | Increase if overfitting (train >> val accuracy) |
| Weight Decay | L2 regularization | 0.01 | Increase if overfitting |
Hyperparameter tuning requires ML expertise. Default settings work well for 90% of use cases. Only adjust if evaluation metrics are unsatisfactory.
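As a rough illustration of the tuning guidance in the table, the sketch below captures the defaults and one adjustment rule (strengthen regularization when overfitting). Both the defaults dict and the `adjust_for_overfitting` helper are hypothetical, not M3 Forge settings objects:

```python
# Defaults mirroring the hyperparameter table above (illustrative names).
DEFAULT_HYPERPARAMETERS = {
    "learning_rate": 1e-3,
    "batch_size": 16,
    "max_epochs": 50,
    "early_stopping_patience": 5,
    "dropout": 0.1,
    "weight_decay": 0.01,
}

def adjust_for_overfitting(hp):
    """Apply the table's overfitting guidance: increase dropout and weight decay."""
    hp = dict(hp)  # leave the input untouched
    hp["dropout"] = min(hp["dropout"] * 2, 0.5)
    hp["weight_decay"] = hp["weight_decay"] * 2
    return hp
```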
Transfer Learning
Start from pre-trained models:
Gallery Models:
- Use general-purpose processors as starting point
- Faster convergence than training from scratch
- Requires less labeled data (can work with 50-100 examples)
Public Models:
- LayoutLMv3, BERT, RoBERTa for document understanding
- Vision transformers for layout analysis
- Language models for text extraction
Your Previous Models:
- Use older version as base for incremental updates
- Preserves learned patterns while adding new capabilities
Select base model in training configuration.
Monitoring Training
Live Metrics
Training dashboard shows:
Loss Curves:
- Training loss (should decrease steadily)
- Validation loss (should decrease, then plateau)
- Divergence indicates overfitting
Accuracy Metrics:
- Overall accuracy over epochs
- Per-class/field accuracy (detailed view)
- F1 score progression
Learning Curves:
- Performance vs dataset size
- Indicates if more data would help
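The overfitting signal described above (validation loss rising while training loss keeps falling) can be sketched as a simple heuristic over recorded loss curves. The function and its patience threshold are illustrative, not M3 Forge's internal logic:

```python
def detect_overfitting(train_loss, val_loss, patience=3):
    """Flag overfitting when validation loss rises for `patience` consecutive
    epochs while training loss keeps decreasing."""
    rising = 0
    for i in range(1, len(val_loss)):
        if val_loss[i] > val_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0  # streak broken; curves converging again
    return False
```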
Interpreting Metrics
Good Training:
- Training and validation loss both decreasing
- Validation accuracy improving or stable
- Minimal gap between train and validation metrics
Overfitting:
- Training loss decreasing, validation increasing
- Large gap between train accuracy (high) and validation (low)
- Solution: Add regularization, reduce model complexity, or increase dataset
Underfitting:
- Both training and validation loss high
- Poor accuracy on both sets
- Solution: Increase model capacity, train longer, or improve data quality
Data Issues:
- Validation loss fluctuating wildly
- Accuracy not improving despite low loss
- Solution: Check label quality, increase dataset, balance classes
Validation metrics are the true measure of model quality. Training metrics can be misleadingly high due to overfitting.
Training States
Jobs progress through states:
| State | Description | Duration |
|---|---|---|
| Queued | Waiting for compute resources | 1-5 minutes |
| Initializing | Loading data, preparing model | 2-5 minutes |
| Training | Active model optimization | 30 min - 4 hours |
| Evaluating | Running final test set | 5-10 minutes |
| Completed | Successfully finished | - |
| Failed | Error during training | - |
| Cancelled | Manually stopped | - |
Failed jobs show error logs. Common causes:
- Insufficient memory (reduce batch size)
- Invalid data (check labels)
- Infrastructure issues (retry)
Evaluation Dashboards
After training completes, view the comprehensive evaluation dashboard:
Classification Metrics
Overall Performance:
- Accuracy (percentage of correct predictions)
- Weighted F1 (accounts for class imbalance)
- Top-3 accuracy (correct class in top 3 predictions)
Per-Class Metrics:
- Precision (correct predictions / total predictions)
- Recall (correct predictions / actual instances)
- F1 score (harmonic mean of precision/recall)
- Support (number of examples in test set)
Confusion Matrix:
- Visual heatmap of predictions vs ground truth
- Identify commonly confused classes
- Click cells to see example misclassifications
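Per-class precision, recall, F1, and support can all be derived from paired predictions and ground-truth labels, the same counts that populate the confusion matrix. A minimal sketch (not M3 Forge's evaluation code):

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and support for each class."""
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_total = Counter(y_pred)   # denominator for precision
    true_total = Counter(y_true)   # denominator for recall (= support)
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        precision = true_positives[c] / pred_total[c] if pred_total[c] else 0.0
        recall = true_positives[c] / true_total[c] if true_total[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": true_total[c]}
    return metrics
```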
Extraction Metrics
Field-Level Accuracy:
- Exact match accuracy (predicted == ground truth)
- Partial match accuracy (overlap > threshold)
- Character error rate (edit distance)
- Missing field rate (fields not extracted)
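Character error rate is the Levenshtein edit distance between the predicted and ground-truth strings, normalized by the reference length. A compact sketch of that computation:

```python
def character_error_rate(predicted, reference):
    """Levenshtein edit distance divided by the reference length."""
    m, n = len(predicted), len(reference)
    dp = list(range(n + 1))  # one rolling row of the distance matrix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (predicted[i - 1] != reference[j - 1]))  # substitution
            prev = cur
    return dp[n] / n if n else float(m > 0)
```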
Aggregate Metrics:
- Micro-averaged F1 (all fields weighted equally)
- Macro-averaged F1 (per-field F1 averaged)
- Entity-level precision/recall
Error Analysis:
- Common extraction errors by type
- Confidence distribution for correct vs incorrect
- Difficult examples requiring review
Splitting Metrics
Boundary Detection:
- Precision (correct splits / predicted splits)
- Recall (correct splits / actual boundaries)
- F1 score (harmonic mean)
Page Classification:
- Accuracy per page type
- Confusion matrix for page types
- Split rule effectiveness
Layout Metrics
Region Detection:
- IoU (Intersection over Union) scores
- Average precision at IoU thresholds
- Per-region-type performance
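IoU for region detection is the intersection area of the predicted and ground-truth boxes divided by their union area. A minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```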
Reading Order:
- Kendall’s tau correlation
- Percentage of correct orderings
Model Versioning
Each training run creates a versioned model:
Version Attributes:
- Version number (v1, v2, v3, etc.)
- Training date and duration
- Performance metrics (accuracy, F1)
- Training configuration (hyperparameters, dataset size)
- Base model (if transfer learning used)
Version Management:
- View all versions in Versions tab
- Compare metrics across versions
- Activate any version for production use
- Roll back to previous version if needed
- Delete old versions to save storage
Activating a new version replaces the production model. Test thoroughly before activating. Previous version remains available for rollback.
Production Deployment
Activating a Version
Review Evaluation
Ensure metrics meet your accuracy requirements.
Test on Real Data
Use Test tab to run model on production-like documents.
Set Confidence Thresholds
Configure minimum confidence for automatic processing vs HITL routing.
Activate Version
Click Activate on desired version. Confirm deployment.
Monitor Performance
After activation, monitor real-world accuracy in Metrics tab.
A/B Testing
Compare two versions in production:
- Deploy both versions
- Route 50% of traffic to each version
- Measure accuracy, latency, user corrections
- Promote winner to 100% traffic
Enable A/B testing in processor settings.
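A common way to implement a stable 50/50 split is to hash each document's ID into a bucket, so the same document always hits the same version and results stay comparable. The sketch below illustrates that idea; M3 Forge's actual routing mechanism is not specified here:

```python
import hashlib

def route_version(document_id, traffic_to_b=0.5):
    """Deterministically route a document to version A or B by hashing its ID."""
    digest = hashlib.sha256(document_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < traffic_to_b else "A"
```

Because routing depends only on the ID, reprocessing the same document never flips versions mid-experiment.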
Gradual Rollout
Reduce deployment risk with phased rollout:
- Activate new version at 10% traffic
- Monitor for errors or degradation
- Increase to 50%, then 100% if stable
- Automatic rollback if metrics drop
Configure rollout strategy in deployment settings.
Continuous Training
Keep models up-to-date with continuous training:
Active Learning Loop
- Model processes documents and predicts
- Low-confidence predictions routed to HITL
- Humans review and correct predictions
- Corrections added to training dataset
- Periodic retraining with expanded dataset
Automate with scheduled training jobs (weekly, monthly).
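The routing step of the loop above amounts to partitioning predictions by confidence. A hypothetical sketch (the field names and threshold are illustrative):

```python
def active_learning_step(predictions, confidence_threshold=0.8):
    """Split predictions into auto-accepted results and a HITL review queue."""
    accepted, review_queue = [], []
    for pred in predictions:
        target = accepted if pred["confidence"] >= confidence_threshold else review_queue
        target.append(pred)
    return accepted, review_queue
```

Corrections gathered from the review queue are what feed the next scheduled retraining run.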
Drift Detection
Monitor for data drift:
- Confidence score distribution changes
- Increased HITL routing rate
- User correction frequency
Trigger retraining when drift exceeds threshold.
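One simple drift proxy is a drop in mean confidence relative to a baseline window. The sketch below illustrates that check; the threshold value is an assumption, not a platform default:

```python
from statistics import mean

def confidence_drift(baseline_scores, current_scores, threshold=0.05):
    """Flag drift when mean confidence drops more than `threshold`
    below the baseline window (a crude proxy for distribution shift)."""
    return mean(baseline_scores) - mean(current_scores) > threshold
```

Production systems often pair a check like this with the HITL routing rate, since a confidence drop and a routing spike usually move together.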
Performance Tracking
Dashboard tracks model performance over time:
- Accuracy trends
- Confidence score distribution
- Processing latency
- HITL routing rate
Identify when retraining is needed.
For production systems, retrain at least quarterly to account for evolving document formats and business processes.
Troubleshooting Training Issues
Training Not Improving
Symptoms: Accuracy stuck at a low value, loss not decreasing.
Solutions:
- Increase model capacity (more layers, larger embeddings)
- Check label quality (inter-annotator agreement)
- Increase dataset size
- Try different base model
- Adjust learning rate
Overfitting
Symptoms: Training accuracy high, validation accuracy low.
Solutions:
- Increase dropout (0.1 → 0.3)
- Add weight decay regularization
- Reduce model capacity
- Increase dataset size
- Use data augmentation
Slow Training
Symptoms: Training takes excessively long.
Solutions:
- Increase batch size (if memory allows)
- Reduce dataset size for initial experiments
- Use smaller base model
- Simplify processor schema
Out of Memory
Symptoms: Training fails with OOM error.
Solutions:
- Reduce batch size (16 → 8 → 4)
- Reduce max sequence length
- Use gradient accumulation
- Request larger instance type
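Gradient accumulation simulates a large batch by summing gradients over several micro-batches and applying one optimizer update, so memory use stays at the micro-batch size. A framework-agnostic sketch, with hypothetical `compute_grad` and `apply_update` callbacks standing in for the training framework:

```python
def train_with_accumulation(batches, compute_grad, apply_update, accum_steps=4):
    """Average gradients over `accum_steps` micro-batches per optimizer update."""
    accumulated = None
    for step, batch in enumerate(batches, start=1):
        grad = compute_grad(batch)  # gradient for one small micro-batch
        accumulated = grad if accumulated is None else [
            a + g for a, g in zip(accumulated, grad)
        ]
        if step % accum_steps == 0:
            # One update with the averaged gradient = one large-batch step.
            apply_update([g / accum_steps for g in accumulated])
            accumulated = None
```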
Next Steps
- Deploy trained processors in Workflows
- Route low-confidence predictions to HITL
- Publish high-performing models to Processor Gallery
- Monitor production metrics in Monitoring