Evaluations
Quality assurance for AI systems with evaluation metrics, guardrail gates, and continuous monitoring.
What are Evaluations?
Evaluations measure the quality and correctness of AI outputs across workflows, prompts, and models. M3 Forge provides:
- Evaluation Metrics - Built-in and custom measures of output quality
- Guardrail Nodes - Runtime quality gates in workflows
- Offline Evaluation - Test datasets and A/B comparisons
- Online Monitoring - Production quality tracking
- Model Evaluation - Processor and ML model performance dashboards
Quality assurance is critical for AI systems because:
- LLMs are non-deterministic - Same input may produce different outputs
- Hallucination - Models generate plausible but incorrect information
- Degradation - Output quality declines over time or with distribution shift
- Context sensitivity - Performance varies across domains, languages, formats
Evaluation Approaches
Offline Evaluation
Test AI components against datasets before deployment:
Use cases:
- Prompt engineering - Compare prompt variants on test cases
- Model selection - Evaluate multiple models on same task
- Regression testing - Ensure changes don’t degrade quality
- A/B testing - Measure impact of configuration changes
Workflow:
- Create test dataset (inputs + expected outputs)
- Run workflow on each test case
- Apply evaluation metrics to outputs
- Aggregate scores across dataset
- Compare to baseline or threshold
- Deploy if quality is acceptable
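The workflow above can be sketched as a small harness. This is illustrative only: `run_workflow` and the exact-match scorer are stand-ins, not M3 Forge APIs.

```python
# Minimal offline-evaluation harness (illustrative; run_workflow is a stand-in
# for invoking the workflow under test).
def run_workflow(inputs):
    # Placeholder for the real workflow call.
    return inputs["text"].strip().lower()

def exact_match(prediction, reference):
    return 1.0 if prediction == reference else 0.0

def evaluate_dataset(dataset, baseline=0.9):
    # Steps 2-5: run each case, score it, aggregate, compare to baseline.
    scores = [exact_match(run_workflow(case["inputs"]), case["expected"])
              for case in dataset]
    mean = sum(scores) / len(scores)
    # Step 6: deploy only if aggregate quality clears the baseline.
    return {"mean_score": mean, "deploy": mean >= baseline}

dataset = [
    {"inputs": {"text": " Invoice "}, "expected": "invoice"},
    {"inputs": {"text": "Receipt"}, "expected": "receipt"},
]
result = evaluate_dataset(dataset)
```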
Online Evaluation
Monitor quality in production:
Use cases:
- Drift detection - Catch quality degradation early
- SLA monitoring - Track performance against targets
- User feedback - Correlate metrics with user ratings
- Continuous improvement - Identify improvement opportunities
Implementation:
- Add Guardrail nodes to workflows for inline evaluation
- Log metrics to monitoring dashboard
- Set up alerts for quality degradation
- Review dashboards regularly
Combine offline evaluation (pre-deployment testing) with online evaluation (production monitoring) for comprehensive quality assurance.
Evaluation Metrics
M3 Forge supports multiple evaluation approaches:
Reference-Based Metrics
Compare output to ground truth reference:
| Metric | Measures | Use Case |
|---|---|---|
| Exact Match | Output == reference (binary) | Classification, structured extraction |
| F1 Score | Precision + recall balance | Entity extraction, multi-label classification |
| BLEU | N-gram overlap with reference | Translation, text generation |
| ROUGE | Recall-oriented overlap | Summarization |
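As an illustrative sketch (not the platform's implementation), exact match is a binary comparison, while token-level F1 balances precision and recall over shared tokens:

```python
# Token-level F1, as commonly used for entity extraction (illustrative).
def f1_score(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Count overlapping tokens, consuming each reference token at most once.
    common = 0
    ref_pool = list(ref_tokens)
    for tok in pred_tokens:
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = f1_score("John Smith age 42", "John Smith 42")
```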
Example:
{
"evaluator": "exact_match",
"reference": "$.expected_output",
"prediction": "$.nodes.llm_node.output.text",
"threshold": 1.0
}
Reference-Free Metrics
Evaluate output quality without ground truth:
| Metric | Measures | Use Case |
|---|---|---|
| Faithfulness | Output grounded in source | RAG, summarization |
| Relevance | Output addresses query | Question answering, search |
| Coherence | Logical flow and consistency | Long-form generation |
| Fluency | Grammatical correctness | Text generation |
Example:
{
"evaluator": "faithfulness",
"context": "$.nodes.rag_retrieval.output.chunks",
"answer": "$.nodes.llm_node.output.text",
"threshold": 0.8
}
Schema Validation
Ensure output conforms to structure:
| Validator | Checks | Use Case |
|---|---|---|
| JSON Schema | Output matches schema | Structured data extraction |
| Regex Match | Pattern matching | Email, phone, ID validation |
| Length Check | Character/token count | Summarization, constraints |
| Contains Keywords | Required terms present | Content requirements |
Example:
{
"evaluator": "json_schema",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "number"}
},
"required": ["name", "age"]
},
"output": "$.nodes.extraction_node.output.data"
}
LLM-as-Judge
Use LLM to evaluate other LLM outputs:
{
"evaluator": "llm_judge",
"criteria": "Is the response helpful, accurate, and concise?",
"input": "$.data.question",
"output": "$.nodes.llm_node.output.text",
"model": "gpt-4",
"threshold": 0.7
}
Advantages:
- Flexible criteria (natural language definition)
- Handles subjective qualities (helpfulness, tone)
- No reference needed
Disadvantages:
- Adds latency and cost
- LLM judges can be biased or inconsistent
- Requires prompt engineering for judge criteria
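A judge call can be sketched as follows. Everything here is a hedged stand-in: `call_llm` represents any chat-completion client (not an M3 Forge API), and the prompt wording is one possible formulation.

```python
# Sketch of an LLM-as-judge evaluator. call_llm is a hypothetical stand-in
# for a chat-completion client; it is NOT an M3 Forge or vendor API.
JUDGE_PROMPT = """Rate the response on a 0-1 scale against the criteria.
Criteria: {criteria}
Question: {question}
Response: {response}
Reply with only a number between 0 and 1."""

def llm_judge(call_llm, criteria, question, response, threshold=0.7):
    prompt = JUDGE_PROMPT.format(criteria=criteria, question=question,
                                 response=response)
    score = float(call_llm(prompt))  # parse the judge's numeric rating
    return {"score": score, "passed": score >= threshold}

# A canned model stub so the sketch runs without an API key.
result = llm_judge(lambda prompt: "0.85",
                   criteria="Is the response helpful, accurate, and concise?",
                   question="What is the capital of France?",
                   response="Paris.")
```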
Custom Executors
Run arbitrary Python/JavaScript code for evaluation:
{
"evaluator": "executor",
"language": "python",
"code": "score = 1.0 if len(output) < 100 else 0.5\nreturn {'score': score}",
"threshold": 0.7
}
Use cases:
- Domain-specific validation (medical, legal)
- Complex scoring logic (weighted multi-metric)
- External API calls (toxicity detection, fact-checking)
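As one example of the kind of code a custom executor might run, here is a weighted multi-metric scorer combining a length constraint with required-keyword coverage. The weights and rules are illustrative, not a built-in evaluator.

```python
# Example custom-executor logic: weighted multi-metric scoring (illustrative).
def custom_score(output, required_keywords, max_chars=100):
    # Length constraint: full credit under the limit, half credit over it.
    length_score = 1.0 if len(output) <= max_chars else 0.5
    # Keyword coverage: fraction of required terms present (case-insensitive).
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    keyword_score = hits / len(required_keywords)
    # Weight length 40%, keyword coverage 60%.
    return {"score": 0.4 * length_score + 0.6 * keyword_score}

result = custom_score("Refund approved per policy.", ["refund", "policy"])
```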
Guardrails
Quality gates within workflows that route execution based on evaluation results.
Configuration
{
"type": "guardrail",
"config": {
"evaluators": [
{
"type": "faithfulness",
"threshold": 0.8,
"weight": 0.5
},
{
"type": "json_schema",
"schema": {...},
"weight": 0.5
}
],
"aggregation": "weighted_average",
"pass_threshold": 0.7,
"paths": [
{"path_id": "pass", "target_node_ids": ["format_output"]},
{"path_id": "fail", "target_node_ids": ["retry_llm"]}
]
}
}
Behavior:
- Evaluate output against all configured evaluators
- Compute aggregate score (weighted average)
- If score >= pass_threshold, route to “pass” path
- If score < pass_threshold, route to “fail” path
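The guardrail's aggregation and routing behavior can be sketched as a few lines of Python (a conceptual model, not the engine's internals):

```python
# Sketch of guardrail scoring: weighted-average aggregation over evaluator
# scores, then threshold-based path routing (conceptual, not engine code).
def route(evaluator_scores, pass_threshold=0.7):
    # evaluator_scores: list of (score, weight) pairs, one per evaluator.
    total_weight = sum(w for _, w in evaluator_scores)
    aggregate = sum(s * w for s, w in evaluator_scores) / total_weight
    path = "pass" if aggregate >= pass_threshold else "fail"
    return {"score": aggregate, "path": path}

# e.g. faithfulness scored 0.9, schema check scored 0.6, equal weights:
decision = route([(0.9, 0.5), (0.6, 0.5)])
```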
Retry Patterns
Use guardrails to implement retry logic:
{
"type": "guardrail",
"config": {
"evaluators": [...],
"paths": [
{"path_id": "pass", "target_node_ids": ["next_step"]},
{"path_id": "fail", "target_node_ids": ["retry_with_better_prompt"]}
],
"max_retries": 3,
"fallback_path": "human_review"
}
}
Retry strategies:
- Retry same prompt - Rely on LLM non-determinism
- Retry with better prompt - Add instructions based on failure mode
- Retry with different model - Fallback to more capable model
- Human review - Escalate after max retries
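The retry-then-escalate pattern above can be modeled as a loop (a conceptual sketch, not the workflow engine's implementation):

```python
# Conceptual retry loop around a guardrail check (illustrative only).
def run_with_retries(generate, evaluate, max_retries=3,
                     threshold=0.7, fallback="human_review"):
    for attempt in range(max_retries + 1):
        # The attempt index lets callers vary the prompt or model per retry.
        output = generate(attempt)
        if evaluate(output) >= threshold:
            return {"output": output, "path": "pass", "attempts": attempt + 1}
    # Escalate after exhausting retries.
    return {"output": None, "path": fallback, "attempts": max_retries + 1}

# Stub generator that only succeeds on its second try.
result = run_with_retries(
    generate=lambda attempt: "good" if attempt == 1 else "bad",
    evaluate=lambda out: 1.0 if out == "good" else 0.0,
)
```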
See Guardrails for detailed configuration.
Evaluation Dashboards
Workflow Quality Dashboard
Track metrics across workflow runs:
- Success Rate - Percentage of runs passing guardrails
- Average Scores - Mean metric scores over time
- P95 Latency - Evaluation overhead (ms)
- Failure Breakdown - Which evaluators fail most often
Filtering:
- Time range (last 24h, 7d, 30d)
- Workflow ID
- Evaluator type
- Input characteristics (language, document type)
Processor Evaluation Dashboard
For ML models (classifiers, extractors):
- Accuracy - Overall correctness
- Precision/Recall - Per-class performance
- Confusion Matrix - Error patterns
- Calibration - Confidence vs accuracy
Use cases:
- Monitor model drift
- Identify weak classes
- Decide when to retrain
A/B Test Results
Compare two configurations side-by-side:
| Metric | Variant A | Variant B | Winner |
|---|---|---|---|
| Faithfulness | 0.82 | 0.89 | B |
| Relevance | 0.91 | 0.87 | A |
| Latency (p95) | 1200ms | 2100ms | A |
| Cost per 1k | $0.05 | $0.12 | A |
Results include statistical significance testing to distinguish real differences from noise.
When to Use Evaluations
| Scenario | Offline | Online | Both |
|---|---|---|---|
| Prompt engineering | Test variants on dataset | - | - |
| Model selection | Compare on test set | - | - |
| Regression testing | Pre-deployment checks | - | - |
| Production monitoring | - | Track drift, SLA | - |
| Critical workflows | Pre-deploy validation | Runtime guardrails | Both |
| Continuous improvement | Benchmark against new data | Collect real-world metrics | Both |
Not all workflows need evaluations. Add guardrails to critical paths (financial decisions, user-facing content) but skip for low-stakes operations (logging, notifications).
Integration Points
Workflow Nodes
Add Guardrail nodes between processing steps:
[LLM Node] → [Guardrail Node] → [Format Output]
↓
[Retry Logic]
Configure in workflow canvas editor.
Prompt Testing
Evaluate prompts in the Prompts & Testing UI:
- Create test dataset (inputs + expected outputs)
- Run prompt variants
- View evaluation scores side-by-side
- Select best-performing prompt
API Evaluation
Evaluate via tRPC or REST API:
const result = await trpc.evaluations.evaluate.mutate({
evaluators: [
{ type: 'faithfulness', threshold: 0.8 }
],
context: chunks,
answer: llmOutput,
});
if (result.score >= result.threshold) {
// Pass
} else {
// Fail
}
Getting Started
Evaluators
Configure built-in and custom evaluation metrics for quality measurement.
Guardrails
Add runtime quality gates to workflows with pass/fail routing.
Best Practices
Metric Selection
- Use multiple metrics - Single metric can be gamed or misleading
- Balance speed and quality - Complex evaluators add latency
- Reference-free when possible - Ground truth is expensive to maintain
- Domain-specific - Generic metrics may not capture your quality definition
Threshold Tuning
- Start conservative - High thresholds (0.8-0.9) for critical paths
- Tune on data - Evaluate threshold on validation set, not test set
- Monitor false positives - Thresholds too high reject good outputs
- Monitor false negatives - Thresholds too low allow bad outputs
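One way to tune a threshold on a validation set is to sweep candidate values and inspect both error directions at once; the helper below is an illustrative sketch, not a platform feature.

```python
# Sketch of threshold tuning: sweep candidate thresholds over validation-set
# scores for known-good and known-bad outputs (illustrative only).
def sweep(good_scores, bad_scores, thresholds):
    rows = []
    for t in thresholds:
        # False positives of the gate: good outputs rejected (threshold too high).
        good_rejected = sum(1 for s in good_scores if s < t) / len(good_scores)
        # False negatives of the gate: bad outputs accepted (threshold too low).
        bad_accepted = sum(1 for s in bad_scores if s >= t) / len(bad_scores)
        rows.append({"threshold": t, "good_rejected": good_rejected,
                     "bad_accepted": bad_accepted})
    return rows

rows = sweep(good_scores=[0.92, 0.85, 0.78],
             bad_scores=[0.55, 0.72],
             thresholds=[0.7, 0.8, 0.9])
```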
Performance
- Evaluate selectively - Not every output needs all evaluators
- Cache results - Same input/output pair = same score
- Batch when possible - Some evaluators support batch evaluation
- Async evaluation - Don’t block workflow on non-critical checks
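The caching point above follows from determinism of the evaluators: the same input/output pair always yields the same score, so results can be memoized. A minimal sketch (the key scheme and counter are illustrative):

```python
# Sketch of evaluation-result caching keyed on a hash of the (input, output)
# pair (illustrative; not an M3 Forge API).
import hashlib

_cache = {}
calls = 0  # counts real evaluator invocations, for demonstration

def cached_evaluate(input_text, output_text, evaluator):
    key = hashlib.sha256(f"{input_text}\x00{output_text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = evaluator(input_text, output_text)
    return _cache[key]

def length_eval(inp, out):
    global calls
    calls += 1
    return 1.0 if len(out) < 100 else 0.5

a = cached_evaluate("q", "short answer", length_eval)
b = cached_evaluate("q", "short answer", length_eval)  # served from cache
```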
Continuous Improvement
- Log all evaluations - Build dataset for analysis
- Review failures - Understand why outputs fail
- Update metrics - Quality definition evolves
- Retrain judges - LLM-as-judge prompts need iteration
Next Steps
- Configure evaluation metrics for your use case
- Add guardrail nodes to critical workflows
- Monitor quality in the evaluation dashboard
- Compare prompts in the Prompts & Testing UI