
Evaluations

Quality assurance for AI systems with evaluation metrics, guardrail gates, and continuous monitoring.

What are Evaluations?

Evaluations measure the quality and correctness of AI outputs across workflows, prompts, and models. M3 Forge provides:

  • Evaluation Metrics - Built-in and custom measures of output quality
  • Guardrail Nodes - Runtime quality gates in workflows
  • Offline Evaluation - Test datasets and A/B comparisons
  • Online Monitoring - Production quality tracking
  • Model Evaluation - Processor and ML model performance dashboards

Quality assurance is critical for AI systems because:

  • LLMs are non-deterministic - Same input may produce different outputs
  • Hallucination - Models generate plausible but incorrect information
  • Degradation - Output quality declines over time or with distribution shift
  • Context sensitivity - Performance varies across domains, languages, formats

Evaluation Approaches

Offline Evaluation

Test AI components against datasets before deployment:

Use cases:

  • Prompt engineering - Compare prompt variants on test cases
  • Model selection - Evaluate multiple models on same task
  • Regression testing - Ensure changes don’t degrade quality
  • A/B testing - Measure impact of configuration changes

Workflow:

  1. Create test dataset (inputs + expected outputs)
  2. Run workflow on each test case
  3. Apply evaluation metrics to outputs
  4. Aggregate scores across dataset
  5. Compare to baseline or threshold
  6. Deploy if quality is acceptable
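The loop above can be sketched in a few lines of Python. This is a minimal sketch, assuming a hypothetical `run_workflow` callable and an exact-match metric; real pipelines would plug in the platform's evaluators:

```python
# Minimal offline-evaluation loop (sketch; run_workflow and exact_match
# stand in for your workflow runner and configured metric).
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def evaluate_offline(dataset, run_workflow, metric=exact_match, threshold=0.9):
    # Run every test case, score it, aggregate, and compare to the threshold.
    scores = [metric(run_workflow(case["input"]), case["expected"]) for case in dataset]
    mean_score = sum(scores) / len(scores)
    return {"mean_score": mean_score, "deploy": mean_score >= threshold}

dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
# A fake workflow runner for illustration only.
result = evaluate_offline(
    dataset,
    run_workflow=lambda x: {"2+2": "4", "capital of France": "Paris"}[x],
)
```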

Online Evaluation

Monitor quality in production:

Use cases:

  • Drift detection - Catch quality degradation early
  • SLA monitoring - Track performance against targets
  • User feedback - Correlate metrics with user ratings
  • Continuous improvement - Identify improvement opportunities

Implementation:

  • Add Guardrail nodes to workflows for inline evaluation
  • Log metrics to monitoring dashboard
  • Set up alerts for quality degradation
  • Review dashboards regularly

Combine offline evaluation (pre-deployment testing) with online evaluation (production monitoring) for comprehensive quality assurance.

Evaluation Metrics

M3 Forge supports multiple evaluation approaches:

Reference-Based Metrics

Compare output to ground truth reference:

| Metric | Measures | Use Case |
| --- | --- | --- |
| Exact Match | Output == reference (binary) | Classification, structured extraction |
| F1 Score | Precision + recall balance | Entity extraction, multi-label classification |
| BLEU | N-gram overlap with reference | Translation, text generation |
| ROUGE | Recall-oriented overlap | Summarization |

Example:

{
  "evaluator": "exact_match",
  "reference": "$.expected_output",
  "prediction": "$.nodes.llm_node.output.text",
  "threshold": 1.0
}
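For illustration, here is one standard token-level F1 implementation of the kind used for extraction tasks. This is a sketch of the common definition; M3 Forge's built-in F1 evaluator may tokenize or normalize differently:

```python
# Token-level F1 between a prediction and a reference (common definition:
# harmonic mean of token precision and recall over the overlap).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```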

Reference-Free Metrics

Evaluate output quality without ground truth:

| Metric | Measures | Use Case |
| --- | --- | --- |
| Faithfulness | Output grounded in source | RAG, summarization |
| Relevance | Output addresses query | Question answering, search |
| Coherence | Logical flow and consistency | Long-form generation |
| Fluency | Grammatical correctness | Text generation |

Example:

{
  "evaluator": "faithfulness",
  "context": "$.nodes.rag_retrieval.output.chunks",
  "answer": "$.nodes.llm_node.output.text",
  "threshold": 0.8
}

Schema Validation

Ensure output conforms to structure:

| Validator | Checks | Use Case |
| --- | --- | --- |
| JSON Schema | Output matches schema | Structured data extraction |
| Regex Match | Pattern matching | Email, phone, ID validation |
| Length Check | Character/token count | Summarization, constraints |
| Contains Keywords | Required terms present | Content requirements |

Example:

{
  "evaluator": "json_schema",
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "number" }
    },
    "required": ["name", "age"]
  },
  "output": "$.nodes.extraction_node.output.data"
}
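A simplified validator mirroring this check might look like the sketch below. It covers only `required` keys and the `type` keywords used in the example; a real deployment would use a full JSON Schema library:

```python
# Simplified required/type check for flat object schemas (sketch; not a
# complete JSON Schema implementation).
def validate_simple(output, schema: dict) -> bool:
    type_map = {"string": str, "number": (int, float), "object": dict}
    if not isinstance(output, dict):
        return False
    # All required keys must be present.
    for key in schema.get("required", []):
        if key not in output:
            return False
    # Present keys must match the declared primitive type.
    for key, spec in schema.get("properties", {}).items():
        if key in output and not isinstance(output[key], type_map[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
    "required": ["name", "age"],
}
```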

LLM-as-Judge

Use LLM to evaluate other LLM outputs:

{
  "evaluator": "llm_judge",
  "criteria": "Is the response helpful, accurate, and concise?",
  "input": "$.data.question",
  "output": "$.nodes.llm_node.output.text",
  "model": "gpt-4",
  "threshold": 0.7
}

Advantages:

  • Flexible criteria (natural language definition)
  • Handles subjective qualities (helpfulness, tone)
  • No reference needed

Disadvantages:

  • Adds latency and cost
  • LLM judges can be biased or inconsistent
  • Requires prompt engineering for judge criteria
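A minimal sketch of judge-prompt construction and score parsing, assuming the judge is asked to reply with a score between 0 and 1. The prompt wording and scoring convention are illustrative, not M3 Forge's built-in judge prompt:

```python
import re

def build_judge_prompt(criteria: str, question: str, answer: str) -> str:
    # Assemble an evaluation prompt from the configured criteria (sketch).
    return (
        "You are an impartial evaluator.\n"
        f"Criteria: {criteria}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond with a single score between 0.0 and 1.0."
    )

def parse_judge_score(response: str) -> float:
    # Take the first number in the judge's response; clamp to [0, 1].
    match = re.search(r"\d+(?:\.\d+)?", response)
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```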

Custom Executors

Run arbitrary Python/JavaScript code for evaluation:

{
  "evaluator": "executor",
  "language": "python",
  "code": "score = 1.0 if len(output) < 100 else 0.5\nreturn {'score': score}",
  "threshold": 0.7
}

Use cases:

  • Domain-specific validation (medical, legal)
  • Complex scoring logic (weighted multi-metric)
  • External API calls (toxicity detection, fact-checking)

Guardrails

Quality gates within workflows that route execution based on evaluation results.

Configuration

{
  "type": "guardrail",
  "config": {
    "evaluators": [
      { "type": "faithfulness", "threshold": 0.8, "weight": 0.5 },
      { "type": "json_schema", "schema": {...}, "weight": 0.5 }
    ],
    "aggregation": "weighted_average",
    "pass_threshold": 0.7,
    "paths": [
      { "path_id": "pass", "target_node_ids": ["format_output"] },
      { "path_id": "fail", "target_node_ids": ["retry_llm"] }
    ]
  }
}

Behavior:

  1. Evaluate output against all configured evaluators
  2. Compute aggregate score (weighted average)
  3. If score >= pass_threshold, route to “pass” path
  4. If score < pass_threshold, route to “fail” path
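The aggregation and routing steps above can be sketched as follows. The evaluator callables and weights are illustrative stand-ins for the configured evaluators:

```python
# Weighted-average guardrail aggregation and pass/fail routing (sketch).
def run_guardrail(output, evaluators, pass_threshold=0.7):
    # evaluators: list of (score_fn, weight) pairs.
    total_weight = sum(w for _, w in evaluators)
    score = sum(fn(output) * w for fn, w in evaluators) / total_weight
    return {"score": score, "path": "pass" if score >= pass_threshold else "fail"}

# Two hypothetical evaluators with fixed scores for illustration.
result = run_guardrail(
    "some output",
    evaluators=[(lambda o: 0.9, 0.5), (lambda o: 0.6, 0.5)],
)
```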

Retry Patterns

Use guardrails to implement retry logic:

{
  "type": "guardrail",
  "config": {
    "evaluators": [...],
    "paths": [
      { "path_id": "pass", "target_node_ids": ["next_step"] },
      { "path_id": "fail", "target_node_ids": ["retry_with_better_prompt"] }
    ],
    "max_retries": 3,
    "fallback_path": "human_review"
  }
}

Retry strategies:

  • Retry same prompt - Rely on LLM non-determinism
  • Retry with better prompt - Add instructions based on failure mode
  • Retry with different model - Fallback to more capable model
  • Human review - Escalate after max retries
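A retry loop combining guardrail checks with a fallback escalation might look like the sketch below, where `generate` and `evaluate` are hypothetical stand-ins for the LLM call and the guardrail score:

```python
# Retry-with-guardrail loop: regenerate until the score passes, then
# escalate to human review after max_retries (sketch).
def run_with_retries(generate, evaluate, max_retries=3, pass_threshold=0.7):
    for attempt in range(max_retries + 1):
        output = generate(attempt)
        if evaluate(output) >= pass_threshold:
            return {"output": output, "path": "pass", "attempts": attempt + 1}
    return {"output": output, "path": "human_review", "attempts": max_retries + 1}

# Simulate: the first two attempts fail the guardrail, the third passes.
scores = {"draft-0": 0.4, "draft-1": 0.6, "draft-2": 0.9}
result = run_with_retries(lambda n: f"draft-{n}", lambda o: scores[o])
```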

See Guardrails for detailed configuration.

Evaluation Dashboards

Workflow Quality Dashboard

Track metrics across workflow runs:

  • Success Rate - Percentage of runs passing guardrails
  • Average Scores - Mean metric scores over time
  • P95 Latency - Evaluation overhead (ms)
  • Failure Breakdown - Which evaluators fail most often

Filtering:

  • Time range (last 24h, 7d, 30d)
  • Workflow ID
  • Evaluator type
  • Input characteristics (language, document type)

Processor Evaluation Dashboard

For ML models (classifiers, extractors):

  • Accuracy - Overall correctness
  • Precision/Recall - Per-class performance
  • Confusion Matrix - Error patterns
  • Calibration - Confidence vs accuracy

Use cases:

  • Monitor model drift
  • Identify weak classes
  • Decide when to retrain

A/B Test Results

Compare two configurations side-by-side:

| Metric | Variant A | Variant B | Winner |
| --- | --- | --- | --- |
| Faithfulness | 0.82 | 0.89 | B |
| Relevance | 0.91 | 0.87 | A |
| Latency (p95) | 1200ms | 2100ms | A |
| Cost per 1k | $0.05 | $0.12 | A |

Statistical significance testing is included, so small score differences are not mistaken for real improvements.
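One common way to test significance on per-example scores is a permutation test. The sketch below is stdlib-only and illustrative; it is not necessarily the exact test the dashboard runs:

```python
# Two-sided permutation test on the difference of mean scores (sketch).
import random

def permutation_p_value(scores_a, scores_b, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle labels and recompute the mean difference under the null.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations
```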

When to Use Evaluations

| Scenario | Offline | Online | Both |
| --- | --- | --- | --- |
| Prompt engineering | Test variants on dataset | - | - |
| Model selection | Compare on test set | - | - |
| Regression testing | Pre-deployment checks | - | - |
| Production monitoring | - | Track drift, SLA | - |
| Critical workflows | Pre-deploy validation | Runtime guardrails | Both |
| Continuous improvement | Benchmark against new data | Collect real-world metrics | Both |

Not all workflows need evaluations. Add guardrails to critical paths (financial decisions, user-facing content) but skip for low-stakes operations (logging, notifications).

Integration Points

Workflow Nodes

Add Guardrail nodes between processing steps:

[LLM Node] → [Guardrail Node] → (pass) → [Format Output]
                              → (fail) → [Retry Logic]

Configure in workflow canvas editor.

Prompt Testing

Evaluate prompts in the Prompts & Testing UI:

  1. Create test dataset (inputs + expected outputs)
  2. Run prompt variants
  3. View evaluation scores side-by-side
  4. Select best-performing prompt

API Evaluation

Evaluate via tRPC or REST API:

const result = await trpc.evaluations.evaluate.mutate({
  evaluators: [{ type: 'faithfulness', threshold: 0.8 }],
  context: chunks,
  answer: llmOutput,
});

if (result.score >= result.threshold) {
  // Pass
} else {
  // Fail
}

Best Practices

Metric Selection

  • Use multiple metrics - Single metric can be gamed or misleading
  • Balance speed and quality - Complex evaluators add latency
  • Reference-free when possible - Ground truth is expensive to maintain
  • Domain-specific - Generic metrics may not capture your quality definition

Threshold Tuning

  • Start conservative - High thresholds (0.8-0.9) for critical paths
  • Tune on data - Evaluate threshold on validation set, not test set
  • Monitor false positives - Thresholds too high reject good outputs
  • Monitor false negatives - Thresholds too low allow bad outputs
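Tuning a threshold on a validation set can be sketched as a simple sweep. The accuracy objective below is one possible choice; F1 or a cost-weighted objective would work the same way:

```python
# Sweep candidate thresholds and keep the one that best separates good
# from bad outputs on a labeled validation set (sketch).
def tune_threshold(scored_examples, candidates=None):
    # scored_examples: list of (evaluator_score, is_actually_good) pairs.
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0

    def accuracy(t):
        # An example is classified correctly when the accept/reject
        # decision (score >= t) matches its true label.
        return sum((s >= t) == good for s, good in scored_examples) / len(scored_examples)

    return max(candidates, key=accuracy)

val = [(0.9, True), (0.85, True), (0.75, True), (0.6, False), (0.4, False)]
best = tune_threshold(val)
```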

Performance

  • Evaluate selectively - Not every output needs all evaluators
  • Cache results - Same input/output pair = same score
  • Batch when possible - Some evaluators support batch evaluation
  • Async evaluation - Don’t block workflow on non-critical checks

Continuous Improvement

  • Log all evaluations - Build dataset for analysis
  • Review failures - Understand why outputs fail
  • Update metrics - Quality definition evolves
  • Retrain judges - LLM-as-judge prompts need iteration
