
Evaluations

Quality assurance for AI systems with evaluation metrics, guardrail gates, and continuous monitoring.

What are Evaluations?

Evaluations measure the quality and correctness of AI outputs across workflows, prompts, and models. M3 Forge provides:

  • Evaluation Metrics - Built-in and custom measures of output quality
  • Guardrail Nodes - Runtime quality gates in workflows
  • Offline Evaluation - Test datasets and A/B comparisons
  • Online Monitoring - Production quality tracking
  • Model Evaluation - Processor and ML model performance dashboards

Quality assurance is critical for AI systems because:

  • LLMs are non-deterministic - Same input may produce different outputs
  • Hallucination - Models generate plausible but incorrect information
  • Degradation - Output quality declines over time or with distribution shift
  • Context sensitivity - Performance varies across domains, languages, formats

Evaluation Approaches

Offline Evaluation

Test AI components against datasets before deployment:

Use cases:

  • Prompt engineering - Compare prompt variants on test cases
  • Model selection - Evaluate multiple models on same task
  • Regression testing - Ensure changes don’t degrade quality
  • A/B testing - Measure impact of configuration changes

Workflow:

  1. Create test dataset (inputs + expected outputs)
  2. Run workflow on each test case
  3. Apply evaluation metrics to outputs
  4. Aggregate scores across dataset
  5. Compare to baseline or threshold
  6. Deploy if quality is acceptable
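The loop above can be sketched in a few lines of Python. This is a minimal sketch, assuming a hypothetical `run_workflow` callable and an exact-match metric; real pipelines would plug in the platform's evaluators:

```python
# Minimal offline-evaluation loop (sketch; run_workflow and exact_match
# stand in for your workflow runner and configured metric).
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def evaluate_offline(dataset, run_workflow, metric=exact_match, threshold=0.9):
    # Run every test case, score it, aggregate, and compare to the threshold.
    scores = [metric(run_workflow(case["input"]), case["expected"]) for case in dataset]
    mean_score = sum(scores) / len(scores)
    return {"mean_score": mean_score, "deploy": mean_score >= threshold}

dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
# A fake workflow runner for illustration only.
result = evaluate_offline(
    dataset,
    run_workflow=lambda x: {"2+2": "4", "capital of France": "Paris"}[x],
)
```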

Online Evaluation

Monitor quality in production:

Use cases:

  • Drift detection - Catch quality degradation early
  • SLA monitoring - Track performance against targets
  • User feedback - Correlate metrics with user ratings
  • Continuous improvement - Identify improvement opportunities

Implementation:

  • Add Guardrail nodes to workflows for inline evaluation
  • Log metrics to monitoring dashboard
  • Set up alerts for quality degradation
  • Review dashboards regularly

Combine offline evaluation (pre-deployment testing) with online evaluation (production monitoring) for comprehensive quality assurance.

Evaluation Metrics

M3 Forge supports multiple evaluation approaches:

Reference-Based Metrics

Compare output to ground truth reference:

| Metric | Measures | Use Case |
| --- | --- | --- |
| Exact Match | Output == reference (binary) | Classification, structured extraction |
| F1 Score | Precision + recall balance | Entity extraction, multi-label classification |
| BLEU | N-gram overlap with reference | Translation, text generation |
| ROUGE | Recall-oriented overlap | Summarization |

Example:

{
  "evaluator": "exact_match",
  "reference": "$.expected_output",
  "prediction": "$.nodes.llm_node.output.text",
  "threshold": 1.0
}
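For illustration, here is one standard token-level F1 implementation of the kind used for extraction tasks. This is a sketch of the common definition; M3 Forge's built-in F1 evaluator may tokenize or normalize differently:

```python
# Token-level F1 between a prediction and a reference (common definition:
# harmonic mean of token precision and recall over the overlap).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```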

Reference-Free Metrics

Evaluate output quality without ground truth:

| Metric | Measures | Use Case |
| --- | --- | --- |
| Faithfulness | Output grounded in source | RAG, summarization |
| Relevance | Output addresses query | Question answering, search |
| Coherence | Logical flow and consistency | Long-form generation |
| Fluency | Grammatical correctness | Text generation |

Example:

{
  "evaluator": "faithfulness",
  "context": "$.nodes.rag_retrieval.output.chunks",
  "answer": "$.nodes.llm_node.output.text",
  "threshold": 0.8
}

Schema Validation

Ensure output conforms to structure:

| Validator | Checks | Use Case |
| --- | --- | --- |
| JSON Schema | Output matches schema | Structured data extraction |
| Regex Match | Pattern matching | Email, phone, ID validation |
| Length Check | Character/token count | Summarization, constraints |
| Contains Keywords | Required terms present | Content requirements |

Example:

{
  "evaluator": "json_schema",
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "age": { "type": "number" }
    },
    "required": ["name", "age"]
  },
  "output": "$.nodes.extraction_node.output.data"
}
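A simplified validator mirroring this check might look like the sketch below. It covers only `required` keys and the `type` keywords used in the example; a real deployment would use a full JSON Schema library:

```python
# Simplified required/type check for flat object schemas (sketch; not a
# complete JSON Schema implementation).
def validate_simple(output, schema: dict) -> bool:
    type_map = {"string": str, "number": (int, float), "object": dict}
    if not isinstance(output, dict):
        return False
    # All required keys must be present.
    for key in schema.get("required", []):
        if key not in output:
            return False
    # Present keys must match the declared primitive type.
    for key, spec in schema.get("properties", {}).items():
        if key in output and not isinstance(output[key], type_map[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
    "required": ["name", "age"],
}
```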

LLM-as-Judge

Use LLM to evaluate other LLM outputs:

{
  "evaluator": "llm_judge",
  "criteria": "Is the response helpful, accurate, and concise?",
  "input": "$.data.question",
  "output": "$.nodes.llm_node.output.text",
  "model": "gpt-4",
  "threshold": 0.7
}

Advantages:

  • Flexible criteria (natural language definition)
  • Handles subjective qualities (helpfulness, tone)
  • No reference needed

Disadvantages:

  • Adds latency and cost
  • LLM judges can be biased or inconsistent
  • Requires prompt engineering for judge criteria
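A minimal sketch of judge-prompt construction and score parsing, assuming the judge is asked to reply with a score between 0 and 1. The prompt wording and scoring convention are illustrative, not M3 Forge's built-in judge prompt:

```python
import re

def build_judge_prompt(criteria: str, question: str, answer: str) -> str:
    # Assemble an evaluation prompt from the configured criteria (sketch).
    return (
        "You are an impartial evaluator.\n"
        f"Criteria: {criteria}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond with a single score between 0.0 and 1.0."
    )

def parse_judge_score(response: str) -> float:
    # Take the first number in the judge's response; clamp to [0, 1].
    match = re.search(r"\d+(?:\.\d+)?", response)
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```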

Custom Executors

Run arbitrary Python/JavaScript code for evaluation:

{
  "evaluator": "executor",
  "language": "python",
  "code": "score = 1.0 if len(output) < 100 else 0.5\nreturn {'score': score}",
  "threshold": 0.7
}

Use cases:

  • Domain-specific validation (medical, legal)
  • Complex scoring logic (weighted multi-metric)
  • External API calls (toxicity detection, fact-checking)

Guardrails

Quality gates within workflows that route execution based on evaluation results.

Configuration

{
  "type": "guardrail",
  "config": {
    "evaluators": [
      { "type": "faithfulness", "threshold": 0.8, "weight": 0.5 },
      { "type": "json_schema", "schema": {...}, "weight": 0.5 }
    ],
    "aggregation": "weighted_average",
    "pass_threshold": 0.7,
    "paths": [
      { "path_id": "pass", "target_node_ids": ["format_output"] },
      { "path_id": "fail", "target_node_ids": ["retry_llm"] }
    ]
  }
}

Behavior:

  1. Evaluate output against all configured evaluators
  2. Compute aggregate score (weighted average)
  3. If score >= pass_threshold, route to “pass” path
  4. If score < pass_threshold, route to “fail” path
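The aggregation and routing steps above can be sketched as follows. The evaluator callables and weights are illustrative stand-ins for the configured evaluators:

```python
# Weighted-average guardrail aggregation and pass/fail routing (sketch).
def run_guardrail(output, evaluators, pass_threshold=0.7):
    # evaluators: list of (score_fn, weight) pairs.
    total_weight = sum(w for _, w in evaluators)
    score = sum(fn(output) * w for fn, w in evaluators) / total_weight
    return {"score": score, "path": "pass" if score >= pass_threshold else "fail"}

# Two hypothetical evaluators with fixed scores for illustration.
result = run_guardrail(
    "some output",
    evaluators=[(lambda o: 0.9, 0.5), (lambda o: 0.6, 0.5)],
)
```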

Retry Patterns

Use guardrails to implement retry logic:

{
  "type": "guardrail",
  "config": {
    "evaluators": [...],
    "paths": [
      { "path_id": "pass", "target_node_ids": ["next_step"] },
      { "path_id": "fail", "target_node_ids": ["retry_with_better_prompt"] }
    ],
    "max_retries": 3,
    "fallback_path": "human_review"
  }
}

Retry strategies:

  • Retry same prompt - Rely on LLM non-determinism
  • Retry with better prompt - Add instructions based on failure mode
  • Retry with different model - Fallback to more capable model
  • Human review - Escalate after max retries
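A retry loop combining guardrail checks with a fallback escalation might look like the sketch below, where `generate` and `evaluate` are hypothetical stand-ins for the LLM call and the guardrail score:

```python
# Retry-with-guardrail loop: regenerate until the score passes, then
# escalate to human review after max_retries (sketch).
def run_with_retries(generate, evaluate, max_retries=3, pass_threshold=0.7):
    for attempt in range(max_retries + 1):
        output = generate(attempt)
        if evaluate(output) >= pass_threshold:
            return {"output": output, "path": "pass", "attempts": attempt + 1}
    return {"output": output, "path": "human_review", "attempts": max_retries + 1}

# Simulate: the first two attempts fail the guardrail, the third passes.
scores = {"draft-0": 0.4, "draft-1": 0.6, "draft-2": 0.9}
result = run_with_retries(lambda n: f"draft-{n}", lambda o: scores[o])
```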

See Guardrails for detailed configuration.

Evaluation Dashboards

Workflow Quality Dashboard

Track metrics across workflow runs:

  • Success Rate - Percentage of runs passing guardrails
  • Average Scores - Mean metric scores over time
  • P95 Latency - Evaluation overhead (ms)
  • Failure Breakdown - Which evaluators fail most often

Filtering:

  • Time range (last 24h, 7d, 30d)
  • Workflow ID
  • Evaluator type
  • Input characteristics (language, document type)

Processor Evaluation Dashboard

For ML models (classifiers, extractors):

  • Accuracy - Overall correctness
  • Precision/Recall - Per-class performance
  • Confusion Matrix - Error patterns
  • Calibration - Confidence vs accuracy

Use cases:

  • Monitor model drift
  • Identify weak classes
  • Decide when to retrain

A/B Test Results

Compare two configurations side-by-side:

| Metric | Variant A | Variant B | Winner |
| --- | --- | --- | --- |
| Faithfulness | 0.82 | 0.89 | B |
| Relevance | 0.91 | 0.87 | A |
| Latency (p95) | 1200ms | 2100ms | A |
| Cost per 1k | $0.05 | $0.12 | A |

Statistical significance testing is included, so small score differences are not mistaken for real improvements.
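One common way to test significance on per-example scores is a permutation test. The sketch below is stdlib-only and illustrative; it is not necessarily the exact test the dashboard runs:

```python
# Two-sided permutation test on the difference of mean scores (sketch).
import random

def permutation_p_value(scores_a, scores_b, n_permutations=2000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle labels and recompute the mean difference under the null.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations
```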

When to Use Evaluations

| Scenario | Offline | Online | Both |
| --- | --- | --- | --- |
| Prompt engineering | Test variants on dataset | - | - |
| Model selection | Compare on test set | - | - |
| Regression testing | Pre-deployment checks | - | - |
| Production monitoring | - | Track drift, SLA | - |
| Critical workflows | Pre-deploy validation | Runtime guardrails | Both |
| Continuous improvement | Benchmark against new data | Collect real-world metrics | Both |

Not all workflows need evaluations. Add guardrails to critical paths (financial decisions, user-facing content) but skip for low-stakes operations (logging, notifications).

Integration Points

Workflow Nodes

Add Guardrail nodes between processing steps:

[LLM Node] → [Guardrail Node] → (pass) → [Format Output]
                              → (fail) → [Retry Logic]

Configure in workflow canvas editor.

Prompt Testing

Evaluate prompts in the Prompts & Testing UI:

  1. Create test dataset (inputs + expected outputs)
  2. Run prompt variants
  3. View evaluation scores side-by-side
  4. Select best-performing prompt

API Evaluation

Evaluate via tRPC or REST API:

const result = await trpc.evaluations.evaluate.mutate({
  evaluators: [{ type: 'faithfulness', threshold: 0.8 }],
  context: chunks,
  answer: llmOutput,
});

if (result.score >= result.threshold) {
  // Pass
} else {
  // Fail
}

Best Practices

Metric Selection

  • Use multiple metrics - Single metric can be gamed or misleading
  • Balance speed and quality - Complex evaluators add latency
  • Reference-free when possible - Ground truth is expensive to maintain
  • Domain-specific - Generic metrics may not capture your quality definition

Threshold Tuning

  • Start conservative - High thresholds (0.8-0.9) for critical paths
  • Tune on data - Evaluate threshold on validation set, not test set
  • Monitor false positives - Thresholds too high reject good outputs
  • Monitor false negatives - Thresholds too low allow bad outputs
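Tuning a threshold on a validation set can be sketched as a simple sweep. The accuracy objective below is one possible choice; F1 or a cost-weighted objective would work the same way:

```python
# Sweep candidate thresholds and keep the one that best separates good
# from bad outputs on a labeled validation set (sketch).
def tune_threshold(scored_examples, candidates=None):
    # scored_examples: list of (evaluator_score, is_actually_good) pairs.
    if candidates is None:
        candidates = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0

    def accuracy(t):
        # An example is classified correctly when the accept/reject
        # decision (score >= t) matches its true label.
        return sum((s >= t) == good for s, good in scored_examples) / len(scored_examples)

    return max(candidates, key=accuracy)

val = [(0.9, True), (0.85, True), (0.75, True), (0.6, False), (0.4, False)]
best = tune_threshold(val)
```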

Performance

  • Evaluate selectively - Not every output needs all evaluators
  • Cache results - Same input/output pair = same score
  • Batch when possible - Some evaluators support batch evaluation
  • Async evaluation - Don’t block workflow on non-critical checks

Continuous Improvement

  • Log all evaluations - Build dataset for analysis
  • Review failures - Understand why outputs fail
  • Update metrics - Quality definition evolves
  • Retrain judges - LLM-as-judge prompts need iteration
