Evaluators
Built-in and custom evaluation metrics for measuring AI output quality.
Evaluator Types
M3 Forge provides multiple evaluation strategies, each optimized for different quality dimensions.
Faithfulness
Measures whether generated text is grounded in source documents (no hallucination).
Configuration
```json
{
  "type": "faithfulness",
  "context": "$.nodes.rag_retrieval.output.chunks",
  "answer": "$.nodes.llm_node.output.text",
  "threshold": 0.8,
  "model": "gpt-4-turbo"
}
```

Parameters:
- `context` - Source documents (array of text chunks)
- `answer` - Generated text to evaluate
- `threshold` - Minimum score to pass (0.0-1.0)
- `model` - LLM to use for evaluation (default: `gpt-3.5-turbo`)
How It Works
1. Extract claims from the generated answer
2. For each claim, check whether it is supported by the context
3. Score = (supported claims) / (total claims)
Example:

Context:
```
The Eiffel Tower was completed in 1889 and is 330 meters tall.
```

Answer:
```
The Eiffel Tower, completed in 1889, stands at 330 meters tall
and is located in Paris, France.
```

Claims:
- “completed in 1889” → Supported ✓
- “330 meters tall” → Supported ✓
- “located in Paris, France” → Not in context ✗

Score: 2/3 = 0.67 (below threshold 0.8, fails)
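The scoring arithmetic above can be sketched in a few lines; claim extraction and support checking are performed by the evaluation LLM, so this sketch only covers the final step, starting from already-labeled claims:

```python
def faithfulness_score(claims: dict[str, bool]) -> float:
    """Score = supported claims / total claims."""
    return sum(claims.values()) / len(claims)

# Labeled claims from the Eiffel Tower example above
claims = {
    "completed in 1889": True,
    "330 meters tall": True,
    "located in Paris, France": False,
}
print(round(faithfulness_score(claims), 2))  # 0.67 — below 0.8, fails
```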
Use Cases
- RAG pipelines - Ensure LLM doesn’t hallucinate beyond retrieved context
- Summarization - Verify summary is faithful to source document
- Question answering - Check answer is grounded in provided passages
Faithfulness evaluates groundedness, not factual correctness: an answer can be faithful to incorrect context. Combine it with relevance and fact-checking evaluators for complete quality assessment.
Relevance
Measures whether output addresses the input query or task.
Configuration
```json
{
  "type": "relevance",
  "query": "$.data.question",
  "answer": "$.nodes.llm_node.output.text",
  "threshold": 0.7,
  "model": "gpt-4-turbo"
}
```

Parameters:
- `query` - User question or task description
- `answer` - Generated output to evaluate
- `threshold` - Minimum score to pass (0.0-1.0)
- `model` - LLM for evaluation (default: `gpt-3.5-turbo`)
How It Works
An LLM judges whether the answer is relevant to the query on a scale of 0-1:
- 1.0 - Completely relevant, directly answers query
- 0.7 - Mostly relevant, some tangential content
- 0.5 - Partially relevant, misses key aspects
- 0.0 - Irrelevant, doesn’t address query
Example:

Query:
```
How do I configure SSL certificates in nginx?
```

Answer:
```
To configure SSL in nginx, add the following to your server block:
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
```
Score: 1.0 (directly answers the question)

Answer (poor):
```
Nginx is a popular web server known for high performance...
```
Score: 0.2 (mentions nginx but doesn’t answer the question)
Use Cases
- Question answering - Ensure response addresses the question
- Search results - Verify retrieved documents match query intent
- Conversational AI - Check response is on-topic
JSON Schema
Validates output conforms to expected JSON structure.
Configuration
```json
{
  "type": "json_schema",
  "schema": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "age": {"type": "number", "minimum": 0},
      "email": {"type": "string", "format": "email"}
    },
    "required": ["name", "age"]
  },
  "output": "$.nodes.extraction_node.output.data",
  "threshold": 1.0
}
```

Parameters:
- `schema` - JSON Schema definition (draft v7)
- `output` - JSON object to validate
- `threshold` - Must be 1.0 (binary pass/fail)
Validation Rules
Types
```json
{
  "properties": {
    "string_field": {"type": "string"},
    "number_field": {"type": "number"},
    "integer_field": {"type": "integer"},
    "boolean_field": {"type": "boolean"},
    "array_field": {"type": "array"},
    "object_field": {"type": "object"},
    "null_field": {"type": "null"}
  }
}
```

Use Cases
- Structured extraction - Validate LLM extracted correct fields
- API responses - Ensure output matches downstream API schema
- Database inserts - Verify data conforms before insertion
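To make the binary pass/fail behavior concrete, here is a deliberately minimal sketch in plain Python. It checks only `required`, `type`, and numeric `minimum`; a real JSON Schema v7 validator (such as the `jsonschema` package) covers the full vocabulary and should be used in practice:

```python
def validate_types(obj: dict, schema: dict) -> float:
    """Return 1.0 if obj passes the (simplified) schema, else 0.0."""
    py_types = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "array": list, "object": dict}
    for key in schema.get("required", []):
        if key not in obj:
            return 0.0  # missing required field
    for key, rules in schema.get("properties", {}).items():
        if key not in obj:
            continue  # optional field absent
        value = obj[key]
        if not isinstance(value, py_types[rules["type"]]):
            return 0.0  # wrong type
        if "minimum" in rules and value < rules["minimum"]:
            return 0.0  # below numeric bound
    return 1.0

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"},
                   "age": {"type": "number", "minimum": 0}},
    "required": ["name", "age"],
}
print(validate_types({"name": "Ada", "age": 36}, schema))  # 1.0
print(validate_types({"name": "Ada", "age": -1}, schema))  # 0.0
```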
Example: Invoice extraction
```json
{
  "schema": {
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string", "pattern": "^INV-[0-9]{6}$"},
      "date": {"type": "string", "format": "date"},
      "total": {"type": "number", "minimum": 0},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
            "price": {"type": "number", "minimum": 0}
          },
          "required": ["description", "quantity", "price"]
        }
      }
    },
    "required": ["invoice_number", "date", "total", "line_items"]
  }
}
```

Regex Match
Validates output matches a regular expression pattern.
Configuration
```json
{
  "type": "regex_match",
  "pattern": "^[A-Z]{2}[0-9]{6}$",
  "text": "$.nodes.extraction_node.output.document_id",
  "threshold": 1.0
}
```

Parameters:
- `pattern` - Regular expression (Python/JavaScript syntax)
- `text` - String to validate
- `threshold` - Must be 1.0 (binary pass/fail)
- `flags` - Optional regex flags (`i` = case-insensitive, `m` = multiline)
Common Patterns
Identifiers
```
# US Social Security Number
^\d{3}-\d{2}-\d{4}$

# UUID
^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$

# Product SKU
^[A-Z]{3}-[0-9]{6}$

# License plate (US)
^[A-Z0-9]{2,7}$
```

Use Cases
- Format validation - Ensure extracted values match expected format
- Input sanitization - Verify user input is safe
- Data quality - Check data conforms to standards
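The evaluator's behavior maps directly onto a full-string match; a sketch of the pass/fail logic (the function name and signature here are illustrative, not the platform's API):

```python
import re

def regex_match_score(pattern: str, text: str, flags: str = "") -> float:
    """Binary pass/fail: 1.0 if the whole string matches, else 0.0."""
    re_flags = 0
    if "i" in flags:
        re_flags |= re.IGNORECASE  # case-insensitive
    if "m" in flags:
        re_flags |= re.MULTILINE   # ^/$ match per line
    return 1.0 if re.fullmatch(pattern, text, re_flags) else 0.0

print(regex_match_score(r"^[A-Z]{2}[0-9]{6}$", "AB123456"))              # 1.0
print(regex_match_score(r"^[A-Z]{2}[0-9]{6}$", "ab123456"))              # 0.0
print(regex_match_score(r"^[A-Z]{2}[0-9]{6}$", "ab123456", flags="i"))   # 1.0
```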
Length Check
Validates text length is within bounds.
Configuration
```json
{
  "type": "length_check",
  "text": "$.nodes.summarization_node.output.summary",
  "min_length": 50,
  "max_length": 200,
  "unit": "characters",
  "threshold": 1.0
}
```

Parameters:
- `text` - String to measure
- `min_length` - Minimum length (inclusive)
- `max_length` - Maximum length (inclusive)
- `unit` - `characters`, `words`, or `tokens`
- `threshold` - Must be 1.0 (binary pass/fail)
Units
| Unit | Count Method | Use Case |
|---|---|---|
| characters | UTF-8 characters | Tweet length, SMS, UI constraints |
| words | Space-separated tokens | Summaries, abstracts |
| tokens | LLM tokenizer (GPT-4) | LLM context limits |
Example:
Text: "The quick brown fox jumps over the lazy dog."
- Characters: 44
- Words: 9
- Tokens: 10 (depends on tokenizer)
Use Cases
- Summarization - Ensure summary is concise (50-200 words)
- Tweet generation - Limit to 280 characters
- Input validation - Reject overly long user input
- Token budgets - Stay within LLM context window
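A sketch of the bounds check for the `characters` and `words` units; `tokens` would require a real tokenizer (e.g. tiktoken for GPT-4), so it is omitted here:

```python
def length_score(text: str, min_length: int, max_length: int,
                 unit: str = "characters") -> float:
    """Binary pass/fail against inclusive length bounds."""
    if unit == "characters":
        n = len(text)
    elif unit == "words":
        n = len(text.split())  # space-separated tokens
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return 1.0 if min_length <= n <= max_length else 0.0

sentence = "The quick brown fox jumps over the lazy dog."
print(length_score(sentence, 10, 100))                # 1.0 (44 characters)
print(length_score(sentence, 10, 100, unit="words"))  # 0.0 (9 words, below 10)
```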
Contains Keywords
Checks if text contains required keywords or phrases.
Configuration
```json
{
  "type": "contains_keywords",
  "text": "$.nodes.llm_node.output.text",
  "keywords": ["certificate", "SSL", "nginx"],
  "match_all": true,
  "case_sensitive": false,
  "threshold": 1.0
}
```

Parameters:
- `text` - Text to search
- `keywords` - Array of required keywords/phrases
- `match_all` - If true, all keywords are required. If false, any keyword is sufficient
- `case_sensitive` - Case-sensitive matching (default: false)
- `threshold` - Must be 1.0 (binary pass/fail)
Use Cases
- Content requirements - Ensure response covers required topics
- SEO - Verify target keywords are present
- Compliance - Check legal disclaimers are included
- Quality control - Ensure instructions followed
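The `match_all` / `case_sensitive` semantics reduce to a simple substring check; a sketch (function name illustrative):

```python
def keywords_score(text: str, keywords: list[str],
                   match_all: bool = True,
                   case_sensitive: bool = False) -> float:
    """Binary pass/fail: all (or any) keywords must appear as substrings."""
    haystack = text if case_sensitive else text.lower()
    hits = [(kw if case_sensitive else kw.lower()) in haystack
            for kw in keywords]
    passed = all(hits) if match_all else any(hits)
    return 1.0 if passed else 0.0

answer = "Place the SSL certificate paths in your nginx server block."
print(keywords_score(answer, ["certificate", "SSL", "nginx"]))            # 1.0
print(keywords_score(answer, ["certificate", "TLS"]))                     # 0.0
print(keywords_score(answer, ["certificate", "TLS"], match_all=False))    # 1.0
```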
Example: Legal disclaimer
```json
{
  "type": "contains_keywords",
  "text": "$.nodes.contract_generation.output.text",
  "keywords": [
    "This contract is governed by",
    "dispute resolution",
    "effective date"
  ],
  "match_all": true
}
```

LLM Judge
Use LLM to evaluate output based on natural language criteria.
Configuration
```json
{
  "type": "llm_judge",
  "criteria": "Is the response helpful, accurate, and written in a professional tone?",
  "input": "$.data.question",
  "output": "$.nodes.llm_node.output.text",
  "context": "$.nodes.rag_retrieval.output.chunks",
  "model": "gpt-4",
  "threshold": 0.7,
  "scale": "0-1"
}
```

Parameters:
- `criteria` - Natural language evaluation criteria
- `input` - Original input/query (optional)
- `output` - Text to evaluate (required)
- `context` - Additional context for evaluation (optional)
- `model` - LLM to use as judge (default: `gpt-4-turbo`)
- `threshold` - Minimum score to pass (0.0-1.0)
- `scale` - Scoring scale (`0-1`, `1-5`, or `1-10`)
Criteria Design
Good criteria are:
- Specific - “Is the tone professional?” not “Is this good?”
- Objective - “Contains all required fields” not “Seems correct”
- Evaluable - Judge can determine yes/no from provided context
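Conceptually, the judge receives the criteria, the output, and any optional input/context in a single prompt and returns a score. The platform's actual prompt template is internal; this is only a hypothetical sketch of how such a prompt might be assembled:

```python
from typing import Optional

def build_judge_prompt(criteria: str, output: str,
                       input_: Optional[str] = None,
                       context: Optional[str] = None,
                       scale: str = "0-1") -> str:
    """Hypothetical judge-prompt assembly; names and wording are
    illustrative, not the platform's template."""
    parts = [f"You are an evaluator. Criteria: {criteria}"]
    if input_:
        parts.append(f"User input:\n{input_}")
    if context:
        parts.append(f"Context:\n{context}")
    parts.append(f"Response to evaluate:\n{output}")
    parts.append(f"Reply with only a score on the {scale} scale.")
    return "\n\n".join(parts)

print(build_judge_prompt("Is the tone professional?", "Hello there."))
```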
Examples:
```
# Helpfulness
Is the response helpful, directly answering the user's question
without including irrelevant information?

# Accuracy (with context)
Based on the provided context, is the response factually accurate
and free from hallucinations?

# Tone
Is the response written in a professional, respectful tone appropriate
for customer service?

# Completeness
Does the response address all aspects of the multi-part question?

# Conciseness
Is the response concise, providing a complete answer in minimal words
without excessive elaboration?
```

Multi-Criteria Evaluation
Evaluate multiple criteria simultaneously:
```json
{
  "type": "llm_judge",
  "criteria": {
    "helpfulness": "Does the response help the user accomplish their goal?",
    "accuracy": "Is the information factually correct?",
    "clarity": "Is the response easy to understand?"
  },
  "aggregation": "average",
  "threshold": 0.7
}
```

Returns individual scores and an aggregate score.
Use Cases
- Subjective quality - Tone, style, helpfulness
- Complex criteria - Multi-dimensional quality assessment
- Rapid prototyping - Define criteria in natural language without coding
- Human-LLM agreement - LLM judges often correlate with human ratings
LLM judges add latency (200-1000ms) and cost ($0.001-0.01 per evaluation). Use for offline evaluation or non-latency-sensitive workflows. For production, consider caching or using simpler evaluators.
Custom Executor
Run arbitrary Python or JavaScript code for evaluation.
Configuration
```json
{
  "type": "executor",
  "language": "python",
  "code": "score = 1.0 if 'ERROR' not in output else 0.0\nreturn {'score': score, 'reason': 'No errors found' if score == 1.0 else 'Errors detected'}",
  "threshold": 1.0,
  "inputs": {
    "output": "$.nodes.llm_node.output.text",
    "expected": "$.data.expected_format"
  }
}
```

Parameters:
- `language` - `python` or `javascript`
- `code` - Evaluation code (must return `{score: number}`)
- `threshold` - Minimum score to pass (0.0-1.0)
- `inputs` - JSONPath mappings to variables available in the code
Python Example
```python
import re

# Domain-specific validation
def evaluate(output, context):
    # Check medical terminology
    required_terms = ['diagnosis', 'treatment', 'prognosis']
    found = sum(1 for term in required_terms if term in output.lower())

    # Check citation format, e.g. "[1]"
    has_citations = bool(re.search(r'\[\d+\]', output))

    # Weighted score: 70% terminology coverage, 30% citations
    score = (found / len(required_terms)) * 0.7 + (0.3 if has_citations else 0)

    return {
        'score': score,
        'details': {
            'terms_found': found,
            'has_citations': has_citations
        }
    }
```

JavaScript Example
```javascript
// External API call for toxicity detection
async function evaluate(output) {
  const response = await fetch('https://api.example.com/toxicity', {
    method: 'POST',
    body: JSON.stringify({ text: output }),
    headers: { 'Content-Type': 'application/json' }
  });
  const result = await response.json();
  const score = 1.0 - result.toxicity_score;
  return {
    score: score,
    details: result
  };
}
```

Use Cases
- Domain-specific logic - Medical, legal, financial validation
- External services - Call toxicity, fact-checking, PII detection APIs
- Complex scoring - Weighted multi-metric aggregation
- Custom formats - Validate proprietary data structures
Combining Evaluators
Use multiple evaluators with weighted aggregation:
```json
{
  "evaluators": [
    {
      "type": "faithfulness",
      "threshold": 0.8,
      "weight": 0.4
    },
    {
      "type": "relevance",
      "threshold": 0.7,
      "weight": 0.3
    },
    {
      "type": "json_schema",
      "schema": {...},
      "weight": 0.3
    }
  ],
  "aggregation": "weighted_average",
  "pass_threshold": 0.75
}
```

Aggregation methods:
- `weighted_average` - `sum(score_i * weight_i) / sum(weight_i)`
- `min` - `min(score_1, score_2, ...)`
- `max` - `max(score_1, score_2, ...)`
- `all` - Pass only if all evaluators pass
- `any` - Pass if any evaluator passes
Example:
- Faithfulness: 0.9 * 0.4 = 0.36
- Relevance: 0.8 * 0.3 = 0.24
- JSON Schema: 1.0 * 0.3 = 0.30
- Total: 0.90 (passes threshold 0.75)
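The weighted-average arithmetic above, as a sketch:

```python
def weighted_average(scores: dict[str, float],
                     weights: dict[str, float]) -> float:
    """sum(score_i * weight_i) / sum(weight_i)"""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight

scores = {"faithfulness": 0.9, "relevance": 0.8, "json_schema": 1.0}
weights = {"faithfulness": 0.4, "relevance": 0.3, "json_schema": 0.3}
print(round(weighted_average(scores, weights), 2))  # 0.9 — above the 0.75 pass threshold
```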
Performance Considerations
Latency
| Evaluator | Typical Latency | Notes |
|---|---|---|
| Regex Match | < 1ms | Very fast, local |
| JSON Schema | 1-5ms | Fast, local validation |
| Length Check | < 1ms | Very fast |
| Contains Keywords | 1-10ms | Fast, depends on text size |
| Faithfulness | 500-2000ms | LLM call required |
| Relevance | 500-2000ms | LLM call required |
| LLM Judge | 200-1000ms | LLM call, varies by criteria complexity |
| Custom Executor | Varies | Depends on code complexity |
Optimization strategies:
- Use fast evaluators (regex, schema) for initial filtering
- Reserve slow evaluators (LLM-based) for final validation
- Cache evaluation results for identical inputs
- Run evaluators in parallel when possible
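For deterministic evaluators, result caching is straightforward; a sketch using the standard library (the stand-in `expensive_evaluate` function is illustrative):

```python
import functools

def expensive_evaluate(text: str) -> float:
    """Stand-in for a slow or costly evaluator call."""
    return 1.0 if "error" not in text.lower() else 0.0

@functools.lru_cache(maxsize=4096)
def cached_evaluate(text: str) -> float:
    """Identical inputs are served from the cache on repeat calls."""
    return expensive_evaluate(text)

cached_evaluate("All checks passed.")     # first call: evaluator runs
cached_evaluate("All checks passed.")     # second call: cache hit
print(cached_evaluate.cache_info().hits)  # 1
```

Note this only helps when inputs repeat exactly; for LLM outputs with high variance, sampling is usually the more effective lever.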
Cost
LLM-based evaluators incur API costs:
| Evaluator | Cost per Evaluation | Notes |
|---|---|---|
| Faithfulness | $0.001-0.005 | Depends on context length |
| Relevance | $0.0005-0.002 | Small prompt, fast |
| LLM Judge | $0.001-0.01 | Varies by criteria complexity |
Cost control:
- Use LLM evaluators selectively (not on every run)
- Choose cheaper models for judges (gpt-3.5-turbo vs gpt-4)
- Sample evaluations (evaluate 10% of production traffic)
- Cache results for common inputs
Best Practices
Evaluator Selection
- Start simple - Use regex/schema before LLM judges
- Reference-free preferred - Ground truth is expensive to maintain
- Multiple dimensions - Combine evaluators for comprehensive quality
- Domain-specific - Generic metrics may miss important qualities
Threshold Tuning
- Collect validation dataset with human labels
- Run evaluator on dataset
- Plot precision-recall curve vs threshold
- Choose threshold balancing false positives and false negatives
- Monitor production metrics and adjust
Example: Faithfulness threshold
| Threshold | Precision | Recall | F1 |
|---|---|---|---|
| 0.9 | 0.95 | 0.72 | 0.82 |
| 0.8 | 0.89 | 0.85 | 0.87 |
| 0.7 | 0.81 | 0.93 | 0.86 |
Choose 0.8 for best F1 score.
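The F1 values in the table follow from the standard harmonic-mean formula; checking the chosen row:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Row for threshold 0.8 in the table above
print(round(f1(0.89, 0.85), 2))  # 0.87 — the best F1 of the three thresholds
```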
Evaluation Strategy
Offline (pre-deployment):
- Test on diverse dataset (edge cases, failure modes)
- Compare multiple configurations
- Iterate until quality targets met
Online (production):
- Sample evaluations (reduce cost)
- Log all results for analysis
- Alert on quality degradation
- Review failures regularly
Next Steps
- Add guardrail nodes to workflows
- Monitor evaluation metrics in the dashboard
- Compare prompts with evaluators in Prompts & Testing
- Build custom evaluators with executor code