Training

Training transforms labeled examples into production-ready machine learning models. M3 Forge automates training configuration, execution, and evaluation, delivering high-accuracy processors without requiring machine learning expertise.

Training Overview

Training involves:

  1. Data Preparation — Split labeled dataset into train/validation/test sets
  2. Model Initialization — Start from scratch or fine-tune existing model
  3. Training Execution — Iteratively optimize model on training data
  4. Evaluation — Measure performance on held-out validation set
  5. Deployment — Activate trained version for production use

M3 Forge handles all steps automatically with sensible defaults. Advanced users can customize hyperparameters and training strategies.

Creating Training Jobs

Open your custom processor and click the Training tab.

Click Create Training Job

Configure training parameters:

| Parameter | Description | Default |
|---|---|---|
| Job Name | Descriptive name for this run | Auto-generated timestamp |
| Training Type | Full training vs incremental | Full |
| Base Model | Start from pre-trained model | Latest gallery model |
| Dataset Split | Train/validation/test ratio | 70/15/15 |
| Max Epochs | Maximum training iterations | Auto (early stopping) |
| Batch Size | Documents per training step | Auto (based on memory) |

Review Dataset Statistics

Preview shows:

  • Total labeled documents
  • Train/validation/test split sizes
  • Class distribution (for classifiers)
  • Field coverage (for extractors)

Ensure sufficient examples for each class or field.
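
As a quick sanity check before launching, you can tally the class distribution yourself. This sketch assumes labels are available as a plain list of strings; the 10-example threshold is illustrative, not an M3 Forge requirement:

```python
from collections import Counter

MIN_EXAMPLES = 10  # illustrative threshold, not a platform requirement

def underrepresented_classes(labels, min_examples=MIN_EXAMPLES):
    """Return classes whose example count falls below min_examples."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_examples}

labels = ["invoice"] * 40 + ["receipt"] * 25 + ["contract"] * 3
print(underrepresented_classes(labels))  # {'contract': 3}
```

Classes flagged here are the ones most likely to drag down per-class recall after training.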

Launch Training

Click Start Training to begin execution.

The training job enters the queue and typically starts within 1-5 minutes, depending on cluster load.

Monitor Progress

Real-time metrics display:

  • Current epoch and time elapsed
  • Training loss and validation loss
  • Accuracy, F1 score, precision, recall
  • Estimated time remaining

Click a job row to view detailed metrics and live charts.

Training duration depends on dataset size and processor type. Typical ranges: 30 minutes (100 docs) to 4 hours (10,000 docs).

Training Types

Full Training

Trains model from scratch (or from base model):

  • Use Case — First training run or major dataset changes
  • Duration — Longest training time
  • Accuracy — Best performance when sufficient data available
  • Resource Cost — Highest compute cost

Incremental Training

Updates existing model with new examples:

  • Use Case — Adding new classes, fields, or edge cases
  • Duration — Faster than full training
  • Accuracy — Preserves existing knowledge while learning new patterns
  • Resource Cost — Lower compute cost

When to use incremental:

  • Adding 10-20% new examples
  • Introducing new classes to existing classifier
  • Fine-tuning on domain-specific examples
  • Correcting errors on difficult cases

When to use full training:

  • Initial model creation
  • Major dataset changes (>50% new examples)
  • Changing schema (fields or classes)
  • Model performance degraded

Training Configuration

Dataset Split

Control how data is divided:

Standard Split (70/15/15):

  • 70% training — Used to optimize model
  • 15% validation — Monitor overfitting during training
  • 15% test — Final evaluation after training

Default for most use cases.
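
The split arithmetic can be sketched in a few lines. This illustrates the ratio math, not M3 Forge's internal implementation; the fixed seed makes the partition reproducible across runs:

```python
import random

def split_dataset(docs, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle and partition documents into train/validation/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # fixed seed -> same split every run
    n = len(docs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:]

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

The test slice takes whatever remains after the train and validation cuts, so no document is dropped by rounding.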

Hyperparameters

Advanced users can tune training hyperparameters:

| Parameter | Description | Default | Tuning Guidance |
|---|---|---|---|
| Learning Rate | Step size for optimization | 0.001 | Lower if training unstable, higher if too slow |
| Batch Size | Examples per training step | 16 | Larger for faster training (if memory allows) |
| Max Epochs | Training iterations | 50 | Increase if validation improving at end |
| Early Stopping | Stop if no improvement | 5 epochs | Patience for validation improvement |
| Dropout | Regularization strength | 0.1 | Increase if overfitting (train >> val accuracy) |
| Weight Decay | L2 regularization | 0.01 | Increase if overfitting |

Hyperparameter tuning requires ML expertise. Default settings work well for 90% of use cases. Only adjust if evaluation metrics are unsatisfactory.
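
The Early Stopping parameter works roughly like the sketch below: training halts once validation loss has gone `patience` epochs without improving. This is a generic illustration of the mechanism, not the platform's actual implementation:

```python
class EarlyStopping:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta   # minimum drop that counts as improvement
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience

stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}")  # stopping at epoch 5
        break
```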

Transfer Learning

Start from pre-trained models:

Gallery Models:

  • Use general-purpose processors as starting point
  • Faster convergence than training from scratch
  • Requires less labeled data (can work with 50-100 examples)

Public Models:

  • LayoutLMv3, BERT, RoBERTa for document understanding
  • Vision transformers for layout analysis
  • Language models for text extraction

Your Previous Models:

  • Use older version as base for incremental updates
  • Preserves learned patterns while adding new capabilities

Select base model in training configuration.

Monitoring Training

Live Metrics

Training dashboard shows:

Loss Curves:

  • Training loss (should decrease steadily)
  • Validation loss (should decrease, then plateau)
  • Divergence indicates overfitting

Accuracy Metrics:

  • Overall accuracy over epochs
  • Per-class/field accuracy (detailed view)
  • F1 score progression

Learning Curves:

  • Performance vs dataset size
  • Indicates if more data would help

Interpreting Metrics

Good Training:

  • Training and validation loss both decreasing
  • Validation accuracy improving or stable
  • Minimal gap between train and validation metrics

Overfitting:

  • Training loss decreasing, validation loss increasing
  • Large gap between train accuracy (high) and validation (low)
  • Solution: Add regularization, reduce model complexity, or increase dataset

Underfitting:

  • Both training and validation loss high
  • Poor accuracy on both sets
  • Solution: Increase model capacity, train longer, or improve data quality

Data Issues:

  • Validation loss fluctuating wildly
  • Accuracy not improving despite low loss
  • Solution: Check label quality, increase dataset, balance classes

Validation metrics are the true measure of model quality. Training metrics can be misleadingly high due to overfitting.
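
The patterns above can be turned into a rough automated check on the last few epochs of the loss curves. The thresholds here (the 1.5 gap ratio and the 1.0 "high loss" cutoff) are illustrative and depend on your loss scale:

```python
def diagnose(train_losses, val_losses, gap_ratio=1.5):
    """Crude heuristic: flag overfitting when validation loss trends up while
    training loss keeps falling, or when the val/train gap is large."""
    t, v = train_losses[-1], val_losses[-1]
    val_rising = len(val_losses) >= 3 and val_losses[-1] > val_losses[-3]
    train_falling = len(train_losses) >= 3 and train_losses[-1] < train_losses[-3]
    if (val_rising and train_falling) or v > t * gap_ratio:
        return "overfitting"
    if t > 1.0 and v > 1.0:  # "high" is loss-scale dependent; tune for your setup
        return "underfitting"
    return "ok"

print(diagnose([0.8, 0.4, 0.2], [0.7, 0.6, 0.9]))  # overfitting
```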

Training States

Jobs progress through states:

| State | Description | Duration |
|---|---|---|
| Queued | Waiting for compute resources | 1-5 minutes |
| Initializing | Loading data, preparing model | 2-5 minutes |
| Training | Active model optimization | 30 min - 4 hours |
| Evaluating | Running final test set | 5-10 minutes |
| Completed | Successfully finished | - |
| Failed | Error during training | - |
| Cancelled | Manually stopped | - |

Failed jobs show error logs. Common causes:

  • Insufficient memory (reduce batch size)
  • Invalid data (check labels)
  • Infrastructure issues (retry)

Evaluation Dashboards

After training completes, view comprehensive evaluation:

Classification Metrics

Overall Performance:

  • Accuracy (percentage of correct predictions)
  • Weighted F1 (accounts for class imbalance)
  • Top-3 accuracy (correct class in top 3 predictions)

Per-Class Metrics:

  • Precision (correct predictions / total predictions)
  • Recall (correct predictions / actual instances)
  • F1 score (harmonic mean of precision/recall)
  • Support (number of examples in test set)
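
These per-class metrics follow directly from true-positive, false-positive, and false-negative counts. A self-contained sketch (not an M3 Forge API):

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1, and support for one class, from paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

y_true = ["invoice", "invoice", "receipt", "invoice"]
y_pred = ["invoice", "receipt", "receipt", "invoice"]
print(per_class_metrics(y_true, y_pred, "invoice"))
# precision 1.0, recall 0.667, f1 0.8, support 3
```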

Confusion Matrix:

  • Visual heatmap of predictions vs ground truth
  • Identify commonly confused classes
  • Click cells to see example misclassifications

Extraction Metrics

Field-Level Accuracy:

  • Exact match accuracy (predicted == ground truth)
  • Partial match accuracy (overlap > threshold)
  • Character error rate (edit distance)
  • Missing field rate (fields not extracted)
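
Character error rate is edit distance normalized by ground-truth length. A minimal Levenshtein-based sketch:

```python
def character_error_rate(predicted, truth):
    """Edit distance between strings, normalized by ground-truth length."""
    m, n = len(predicted), len(truth)
    # classic dynamic-programming Levenshtein distance, row by row
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / n if n else float(m > 0)

print(character_error_rate("INV-2O24", "INV-2024"))  # one substitution -> 0.125
```

OCR-style confusions (O vs 0, l vs 1) show up as small but nonzero CER even when the extraction is otherwise correct.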

Aggregate Metrics:

  • Micro-averaged F1 (pools counts across fields, so frequent fields weigh more)
  • Macro-averaged F1 (per-field F1 averaged, weighting all fields equally)
  • Entity-level precision/recall
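
The distinction matters when field volumes are unbalanced: micro-averaging pools raw counts, so high-volume fields dominate, while macro-averaging treats every field equally. A sketch with hypothetical per-field counts:

```python
def micro_macro_f1(field_counts):
    """field_counts: {field: (tp, fp, fn)}. Micro pools counts; macro averages per-field F1."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    tp = sum(c[0] for c in field_counts.values())
    fp = sum(c[1] for c in field_counts.values())
    fn = sum(c[2] for c in field_counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in field_counts.values()) / len(field_counts)
    return micro, macro

# hypothetical counts: one accurate high-volume field, one weak low-accuracy field
counts = {"invoice_number": (90, 5, 5), "due_date": (40, 30, 30)}
micro, macro = micro_macro_f1(counts)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro ≈ 0.788, macro ≈ 0.759
```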

Error Analysis:

  • Common extraction errors by type
  • Confidence distribution for correct vs incorrect
  • Difficult examples requiring review

Splitting Metrics

Boundary Detection:

  • Precision (correct splits / predicted splits)
  • Recall (correct splits / actual boundaries)
  • F1 score (harmonic mean)

Page Classification:

  • Accuracy per page type
  • Confusion matrix for page types
  • Split rule effectiveness

Layout Metrics

Region Detection:

  • IoU (Intersection over Union) scores
  • Average precision at IoU thresholds
  • Per-region-type performance
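
IoU measures how well a predicted region overlaps the ground-truth region. A sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # zero if boxes don't overlap
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 overlap / 150 union ≈ 0.333
```

A detection is typically counted as correct when its IoU against ground truth exceeds a threshold such as 0.5.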

Reading Order:

  • Kendall’s tau correlation
  • Percentage of correct orderings

Model Versioning

Each training run creates a versioned model:

Version Attributes:

  • Version number (v1, v2, v3, etc.)
  • Training date and duration
  • Performance metrics (accuracy, F1)
  • Training configuration (hyperparameters, dataset size)
  • Base model (if transfer learning used)

Version Management:

  • View all versions in Versions tab
  • Compare metrics across versions
  • Activate any version for production use
  • Roll back to previous version if needed
  • Delete old versions to save storage

Activating a new version replaces the production model. Test thoroughly before activating. Previous version remains available for rollback.

Production Deployment

Activating a Version

Review Evaluation

Ensure metrics meet your accuracy requirements.

Test on Real Data

Use Test tab to run model on production-like documents.

Set Confidence Thresholds

Configure minimum confidence for automatic processing vs HITL routing.
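
Threshold-based routing reduces to a single comparison. The 0.90 threshold and the prediction shape below are illustrative, not the platform's schema:

```python
AUTO_THRESHOLD = 0.90  # illustrative; tune per processor and risk tolerance

def route(prediction):
    """Send high-confidence predictions straight through; queue the rest for review."""
    if prediction["confidence"] >= AUTO_THRESHOLD:
        return "auto"
    return "hitl"

print(route({"field": "total_amount", "confidence": 0.97}))  # auto
print(route({"field": "due_date", "confidence": 0.62}))      # hitl
```

A lower threshold means more automation but more uncaught errors; a higher threshold shifts volume to human review.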

Activate Version

Click Activate on desired version. Confirm deployment.

Monitor Performance

After activation, monitor real-world accuracy in Metrics tab.

A/B Testing

Compare two versions in production:

  1. Deploy both versions
  2. Route 50% of traffic to each version
  3. Measure accuracy, latency, user corrections
  4. Promote winner to 100% traffic

Enable A/B testing in processor settings.
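
One common way to implement a deterministic 50/50 split is to hash each document ID, so a given document always lands on the same version. A generic sketch, not M3 Forge's routing mechanism:

```python
import hashlib

def assign_variant(doc_id, split=0.5):
    """Deterministically assign a document to version A or B by hashing its ID."""
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "A" if bucket < split else "B"

print(assign_variant("doc-8841"))  # same ID always lands in the same bucket
```

Hash-based assignment keeps results comparable across runs and avoids a shared counter or random state between workers.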

Gradual Rollout

Reduce deployment risk with phased rollout:

  1. Activate new version at 10% traffic
  2. Monitor for errors or degradation
  3. Increase to 50%, then 100% if stable
  4. Automatic rollback if metrics drop

Configure rollout strategy in deployment settings.

Continuous Training

Keep models up-to-date with continuous training:

Active Learning Loop

  1. Model processes documents and predicts
  2. Low-confidence predictions routed to HITL
  3. Humans review and correct predictions
  4. Corrections added to training dataset
  5. Periodic retraining with expanded dataset

Automate with scheduled training jobs (weekly, monthly).

Drift Detection

Monitor for data drift:

  • Confidence score distribution changes
  • Increased HITL routing rate
  • User correction frequency

Trigger retraining when drift exceeds threshold.
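
A simple drift signal is the HITL routing rate itself: if the fraction of low-confidence predictions climbs well above its baseline, the input distribution has likely shifted. An illustrative sketch; the threshold and tolerance values are assumptions:

```python
AUTO_THRESHOLD = 0.90  # should match the production routing threshold

def hitl_rate(confidences, threshold=AUTO_THRESHOLD):
    """Fraction of predictions that would be routed to human review."""
    return sum(c < threshold for c in confidences) / len(confidences)

def drift_detected(baseline, recent, tolerance=0.05):
    """Flag drift when the recent HITL rate exceeds the baseline by more than tolerance."""
    return hitl_rate(recent) - hitl_rate(baseline) > tolerance

baseline = [0.95] * 90 + [0.60] * 10  # 10% routed to HITL
recent = [0.95] * 70 + [0.60] * 30    # 30% routed to HITL
print(drift_detected(baseline, recent))  # True
```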

Performance Tracking

Dashboard tracks model performance over time:

  • Accuracy trends
  • Confidence score distribution
  • Processing latency
  • HITL routing rate

Identify when retraining is needed.

For production systems, retrain at least quarterly to account for evolving document formats and business processes.

Troubleshooting Training Issues

Training Not Improving

Symptoms: Accuracy stuck at low value, loss not decreasing.

Solutions:

  • Increase model capacity (more layers, larger embeddings)
  • Check label quality (inter-annotator agreement)
  • Increase dataset size
  • Try different base model
  • Adjust learning rate

Overfitting

Symptoms: Training accuracy high, validation accuracy low.

Solutions:

  • Increase dropout (0.1 → 0.3)
  • Add weight decay regularization
  • Reduce model capacity
  • Increase dataset size
  • Use data augmentation

Slow Training

Symptoms: Training takes excessive time.

Solutions:

  • Increase batch size (if memory allows)
  • Reduce dataset size for initial experiments
  • Use smaller base model
  • Simplify processor schema

Out of Memory

Symptoms: Training fails with OOM error.

Solutions:

  • Reduce batch size (16 → 8 → 4)
  • Reduce max sequence length
  • Use gradient accumulation
  • Request larger instance type
