
Evaluators

Built-in and custom evaluation metrics for measuring AI output quality.

Evaluator Types

M3 Forge provides multiple evaluation strategies, each optimized for different quality dimensions.

Faithfulness

Measures whether generated text is grounded in source documents (no hallucination).

Configuration

{ "type": "faithfulness", "context": "$.nodes.rag_retrieval.output.chunks", "answer": "$.nodes.llm_node.output.text", "threshold": 0.8, "model": "gpt-4-turbo" }

Parameters:

  • context - Source documents (array of text chunks)
  • answer - Generated text to evaluate
  • threshold - Minimum score to pass (0.0-1.0)
  • model - LLM to use for evaluation (default: gpt-3.5-turbo)

How It Works

  1. Extract claims from generated answer
  2. For each claim, check if supported by context
  3. Score = (supported claims) / (total claims)
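The scoring step can be sketched as follows. This is a minimal sketch: in the real evaluator, steps 1-2 (claim extraction and per-claim support checking) are performed by the configured LLM; here they are assumed to have already produced a list of per-claim booleans.

```python
def faithfulness_score(claim_support: list[bool], threshold: float = 0.8) -> dict:
    """Step 3: score = supported claims / total claims, pass if score >= threshold."""
    if not claim_support:
        # No extractable claims: treat as failing rather than vacuously passing.
        return {"score": 0.0, "passed": False}
    score = sum(claim_support) / len(claim_support)
    return {"score": round(score, 2), "passed": score >= threshold}

# Three claims, two supported by the context:
result = faithfulness_score([True, True, False])
```

With two of three claims supported, the score is 2/3 ≈ 0.67, which falls below the default 0.8 threshold.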

Example:

Context:

The Eiffel Tower was completed in 1889 and is 330 meters tall.

Answer:

The Eiffel Tower, completed in 1889, stands at 330 meters tall and is located in Paris, France.

Claims:

  • “completed in 1889” → Supported ✓
  • “330 meters tall” → Supported ✓
  • “located in Paris, France” → Not in context ✗

Score: 2/3 = 0.67 (below threshold 0.8, fails)

Use Cases

  • RAG pipelines - Ensure LLM doesn’t hallucinate beyond retrieved context
  • Summarization - Verify summary is faithful to source document
  • Question answering - Check answer is grounded in provided passages

Faithfulness evaluates groundedness, not factual correctness: an answer can be perfectly faithful to incorrect context. Combine it with relevance and fact-checking evaluators for a complete quality assessment.

Relevance

Measures whether output addresses the input query or task.

Configuration

{ "type": "relevance", "query": "$.data.question", "answer": "$.nodes.llm_node.output.text", "threshold": 0.7, "model": "gpt-4-turbo" }

Parameters:

  • query - User question or task description
  • answer - Generated output to evaluate
  • threshold - Minimum score to pass (0.0-1.0)
  • model - LLM for evaluation (default: gpt-3.5-turbo)

How It Works

The LLM judges whether the answer is relevant to the query on a 0-1 scale:

  • 1.0 - Completely relevant, directly answers query
  • 0.7 - Mostly relevant, some tangential content
  • 0.5 - Partially relevant, misses key aspects
  • 0.0 - Irrelevant, doesn’t address query
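The judge prompt built from this rubric might look like the sketch below. The exact prompt M3 Forge sends is internal; `build_relevance_prompt` is a hypothetical illustration of how the rubric and the two JSONPath-resolved values are combined.

```python
def build_relevance_prompt(query: str, answer: str) -> str:
    """Assemble an illustrative relevance-judge prompt from the rubric above."""
    return (
        "Rate how relevant the answer is to the query on a 0-1 scale:\n"
        "1.0 = directly answers the query, 0.7 = mostly relevant,\n"
        "0.5 = partially relevant, 0.0 = does not address the query.\n\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n\n"
        "Respond with a single number."
    )
```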

Example:

Query:

How do I configure SSL certificates in nginx?

Answer:

To configure SSL in nginx, add the following to your server block: ssl_certificate /path/to/cert.pem; ssl_certificate_key /path/to/key.pem;

Score: 1.0 (directly answers the question)

Answer (poor):

Nginx is a popular web server known for high performance...

Score: 0.2 (mentions nginx but doesn’t answer the question)

Use Cases

  • Question answering - Ensure response addresses the question
  • Search results - Verify retrieved documents match query intent
  • Conversational AI - Check response is on-topic

JSON Schema

Validates output conforms to expected JSON structure.

Configuration

{ "type": "json_schema", "schema": { "type": "object", "properties": { "name": {"type": "string"}, "age": {"type": "number", "minimum": 0}, "email": {"type": "string", "format": "email"} }, "required": ["name", "age"] }, "output": "$.nodes.extraction_node.output.data", "threshold": 1.0 }

Parameters:

  • schema - JSON Schema definition (v7)
  • output - JSON object to validate
  • threshold - Must be 1.0 (binary pass/fail)

Validation Rules

{ "properties": { "string_field": {"type": "string"}, "number_field": {"type": "number"}, "integer_field": {"type": "integer"}, "boolean_field": {"type": "boolean"}, "array_field": {"type": "array"}, "object_field": {"type": "object"}, "null_field": {"type": "null"} } }

Use Cases

  • Structured extraction - Validate LLM extracted correct fields
  • API responses - Ensure output matches downstream API schema
  • Database inserts - Verify data conforms before insertion

Example: Invoice extraction

{ "schema": { "type": "object", "properties": { "invoice_number": {"type": "string", "pattern": "^INV-[0-9]{6}$"}, "date": {"type": "string", "format": "date"}, "total": {"type": "number", "minimum": 0}, "line_items": { "type": "array", "items": { "type": "object", "properties": { "description": {"type": "string"}, "quantity": {"type": "integer", "minimum": 1}, "price": {"type": "number", "minimum": 0} }, "required": ["description", "quantity", "price"] } } }, "required": ["invoice_number", "date", "total", "line_items"] } }

Regex Match

Validates output matches a regular expression pattern.

Configuration

{ "type": "regex_match", "pattern": "^[A-Z]{2}[0-9]{6}$", "text": "$.nodes.extraction_node.output.document_id", "threshold": 1.0 }

Parameters:

  • pattern - Regular expression (Python/JavaScript syntax)
  • text - String to validate
  • threshold - Must be 1.0 (binary pass/fail)
  • flags - Optional regex flags (i=case-insensitive, m=multiline)

Common Patterns

# US Social Security Number
^\d{3}-\d{2}-\d{4}$

# UUID
^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$

# Product SKU
^[A-Z]{3}-[0-9]{6}$

# License plate (US)
^[A-Z0-9]{2,7}$
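In Python, the evaluator's behavior can be sketched with `re.fullmatch` (assuming full-string matching, as the anchored patterns above suggest), including the optional `i` and `m` flags:

```python
import re

def regex_match_score(pattern: str, text: str, flags: str = "") -> float:
    """Binary score: 1.0 if the whole string matches the pattern, else 0.0."""
    f = 0
    if "i" in flags:
        f |= re.IGNORECASE   # flag "i": case-insensitive
    if "m" in flags:
        f |= re.MULTILINE    # flag "m": ^/$ match at line boundaries
    return 1.0 if re.fullmatch(pattern, text, f) else 0.0
```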

Use Cases

  • Format validation - Ensure extracted values match expected format
  • Input sanitization - Verify user input is safe
  • Data quality - Check data conforms to standards

Length Check

Validates text length is within bounds.

Configuration

{ "type": "length_check", "text": "$.nodes.summarization_node.output.summary", "min_length": 50, "max_length": 200, "unit": "characters", "threshold": 1.0 }

Parameters:

  • text - String to measure
  • min_length - Minimum length (inclusive)
  • max_length - Maximum length (inclusive)
  • unit - characters, words, or tokens
  • threshold - Must be 1.0 (binary pass/fail)

Units

Unit       | Count Method           | Use Case
characters | UTF-8 characters       | Tweet length, SMS, UI constraints
words      | Space-separated tokens | Summaries, abstracts
tokens     | LLM tokenizer (GPT-4)  | LLM context limits

Example:

Text: "The quick brown fox jumps over the lazy dog."

  • Characters: 44
  • Words: 9
  • Tokens: 10 (depends on tokenizer)
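The character and word units are easy to sketch in Python; the `tokens` unit requires the model's tokenizer (e.g. a library such as tiktoken) and is omitted here:

```python
def length_check(text: str, min_length: int, max_length: int,
                 unit: str = "characters") -> float:
    """Binary pass/fail: 1.0 if min_length <= count <= max_length (inclusive)."""
    if unit == "characters":
        n = len(text)
    elif unit == "words":
        n = len(text.split())  # space-separated tokens
    else:
        raise ValueError(f"unsupported unit in this sketch: {unit}")
    return 1.0 if min_length <= n <= max_length else 0.0

text = "The quick brown fox jumps over the lazy dog."  # 44 characters, 9 words
```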

Use Cases

  • Summarization - Ensure summary is concise (50-200 words)
  • Tweet generation - Limit to 280 characters
  • Input validation - Reject overly long user input
  • Token budgets - Stay within LLM context window

Contains Keywords

Checks if text contains required keywords or phrases.

Configuration

{ "type": "contains_keywords", "text": "$.nodes.llm_node.output.text", "keywords": ["certificate", "SSL", "nginx"], "match_all": true, "case_sensitive": false, "threshold": 1.0 }

Parameters:

  • text - Text to search
  • keywords - Array of required keywords/phrases
  • match_all - If true, all keywords required. If false, any keyword sufficient
  • case_sensitive - Case-sensitive matching (default: false)
  • threshold - Must be 1.0 (binary pass/fail)
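The parameter semantics can be sketched in a few lines of Python (a sketch of the documented behavior, not the shipped implementation):

```python
def contains_keywords(text: str, keywords: list[str], match_all: bool = True,
                      case_sensitive: bool = False) -> float:
    """Binary score: all keywords required if match_all, else any one suffices."""
    haystack = text if case_sensitive else text.lower()
    hits = [(kw if case_sensitive else kw.lower()) in haystack for kw in keywords]
    passed = all(hits) if match_all else any(hits)
    return 1.0 if passed else 0.0
```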

Use Cases

  • Content requirements - Ensure response covers required topics
  • SEO - Verify target keywords are present
  • Compliance - Check legal disclaimers are included
  • Quality control - Ensure instructions followed

Example: Legal disclaimer

{ "type": "contains_keywords", "text": "$.nodes.contract_generation.output.text", "keywords": [ "This contract is governed by", "dispute resolution", "effective date" ], "match_all": true }

LLM Judge

Use LLM to evaluate output based on natural language criteria.

Configuration

{ "type": "llm_judge", "criteria": "Is the response helpful, accurate, and written in a professional tone?", "input": "$.data.question", "output": "$.nodes.llm_node.output.text", "context": "$.nodes.rag_retrieval.output.chunks", "model": "gpt-4", "threshold": 0.7, "scale": "0-1" }

Parameters:

  • criteria - Natural language evaluation criteria
  • input - Original input/query (optional)
  • output - Text to evaluate (required)
  • context - Additional context for evaluation (optional)
  • model - LLM to use as judge (default: gpt-4-turbo)
  • threshold - Minimum score to pass (0.0-1.0)
  • scale - Scoring scale (0-1, 1-5, 1-10)

Criteria Design

Good criteria are:

  • Specific - “Is the tone professional?” not “Is this good?”
  • Objective - “Contains all required fields” not “Seems correct”
  • Evaluable - Judge can determine yes/no from provided context

Examples:

# Helpfulness
Is the response helpful, directly answering the user's question without including irrelevant information?

# Accuracy (with context)
Based on the provided context, is the response factually accurate and free from hallucinations?

# Tone
Is the response written in a professional, respectful tone appropriate for customer service?

# Completeness
Does the response address all aspects of the multi-part question?

# Conciseness
Is the response concise, providing a complete answer in minimal words without excessive elaboration?

Multi-Criteria Evaluation

Evaluate multiple criteria simultaneously:

{ "type": "llm_judge", "criteria": { "helpfulness": "Does the response help the user accomplish their goal?", "accuracy": "Is the information factually correct?", "clarity": "Is the response easy to understand?" }, "aggregation": "average", "threshold": 0.7 }

Returns individual scores and aggregate.

Use Cases

  • Subjective quality - Tone, style, helpfulness
  • Complex criteria - Multi-dimensional quality assessment
  • Rapid prototyping - Define criteria in natural language without coding
  • Human-LLM agreement - LLM judges often correlate with human ratings

LLM judges add latency (200-1000ms) and cost ($0.001-0.01 per evaluation). Use for offline evaluation or non-latency-sensitive workflows. For production, consider caching or using simpler evaluators.

Custom Executor

Run arbitrary Python or JavaScript code for evaluation.

Configuration

{ "type": "executor", "language": "python", "code": "score = 1.0 if 'ERROR' not in output else 0.0\nreturn {'score': score, 'reason': 'No errors found' if score == 1.0 else 'Errors detected'}", "threshold": 1.0, "inputs": { "output": "$.nodes.llm_node.output.text", "expected": "$.data.expected_format" } }

Parameters:

  • language - python or javascript
  • code - Evaluation code (must return {score: number})
  • threshold - Minimum score to pass (0.0-1.0)
  • inputs - JSONPath mappings to variables in code

Python Example

# Domain-specific validation
import re

def evaluate(output, context):
    # Check medical terminology
    required_terms = ['diagnosis', 'treatment', 'prognosis']
    found = sum(1 for term in required_terms if term in output.lower())

    # Check citation format
    has_citations = bool(re.search(r'\[\d+\]', output))

    score = (found / len(required_terms)) * 0.7 + (0.3 if has_citations else 0)
    return {
        'score': score,
        'details': {
            'terms_found': found,
            'has_citations': has_citations
        }
    }

JavaScript Example

// External API call for toxicity detection
async function evaluate(output) {
  const response = await fetch('https://api.example.com/toxicity', {
    method: 'POST',
    body: JSON.stringify({ text: output }),
    headers: { 'Content-Type': 'application/json' }
  });
  const result = await response.json();
  const score = 1.0 - result.toxicity_score;
  return { score: score, details: result };
}

Use Cases

  • Domain-specific logic - Medical, legal, financial validation
  • External services - Call toxicity, fact-checking, PII detection APIs
  • Complex scoring - Weighted multi-metric aggregation
  • Custom formats - Validate proprietary data structures

Combining Evaluators

Use multiple evaluators with weighted aggregation:

{ "evaluators": [ { "type": "faithfulness", "threshold": 0.8, "weight": 0.4 }, { "type": "relevance", "threshold": 0.7, "weight": 0.3 }, { "type": "json_schema", "schema": {...}, "weight": 0.3 } ], "aggregation": "weighted_average", "pass_threshold": 0.75 }

Aggregation methods:

  • weighted_average - sum(score_i * weight_i) / sum(weight_i)
  • min - min(score_1, score_2, ...)
  • max - max(score_1, score_2, ...)
  • all - Pass only if all evaluators pass
  • any - Pass if any evaluator passes

Example:

  • Faithfulness: 0.9 * 0.4 = 0.36
  • Relevance: 0.8 * 0.3 = 0.24
  • JSON Schema: 1.0 * 0.3 = 0.30
  • Total: 0.90 (passes threshold 0.75)
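The score-combining aggregation methods can be sketched as follows, reproducing the worked example above (`all`/`any` operate on per-evaluator pass flags and are omitted):

```python
def aggregate(scores: dict[str, float], weights: dict[str, float],
              method: str = "weighted_average") -> float:
    """Combine per-evaluator scores into one aggregate score."""
    if method == "weighted_average":
        # sum(score_i * weight_i) / sum(weight_i)
        return sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
    if method == "min":
        return min(scores.values())
    if method == "max":
        return max(scores.values())
    raise ValueError(f"unknown method: {method}")

scores = {"faithfulness": 0.9, "relevance": 0.8, "json_schema": 1.0}
weights = {"faithfulness": 0.4, "relevance": 0.3, "json_schema": 0.3}
```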

Performance Considerations

Latency

Evaluator         | Typical Latency | Notes
Regex Match       | < 1ms           | Very fast, local
JSON Schema       | 1-5ms           | Fast, local validation
Length Check      | < 1ms           | Very fast
Contains Keywords | 1-10ms          | Fast, depends on text size
Faithfulness      | 500-2000ms      | LLM call required
Relevance         | 500-2000ms      | LLM call required
LLM Judge         | 200-1000ms      | LLM call, varies by criteria complexity
Custom Executor   | Varies          | Depends on code complexity

Optimization strategies:

  • Use fast evaluators (regex, schema) for initial filtering
  • Reserve slow evaluators (LLM-based) for final validation
  • Cache evaluation results for identical inputs
  • Run evaluators in parallel when possible
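The first two strategies amount to a cascade: run cheap local checks first and short-circuit on failure so that expensive LLM-based evaluators only see outputs that already pass. A minimal sketch (the lambdas stand in for real evaluators; the second is a stub for an LLM call):

```python
def cascade(output: str, fast_checks, slow_checks) -> float:
    """Short-circuit on any failing fast check; otherwise take the worst slow score."""
    for check in fast_checks:
        if check(output) < 1.0:
            return 0.0  # cheap check failed, skip the expensive evaluators
    return min(check(output) for check in slow_checks) if slow_checks else 1.0

# Example: a length gate in front of a (stubbed) LLM judge
fast = [lambda s: 1.0 if len(s) >= 10 else 0.0]
slow = [lambda s: 0.9]  # stand-in for an LLM-based evaluator
```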

Cost

LLM-based evaluators incur API costs:

Evaluator    | Cost per Evaluation | Notes
Faithfulness | $0.001-0.005        | Depends on context length
Relevance    | $0.0005-0.002       | Small prompt, fast
LLM Judge    | $0.001-0.01         | Varies by criteria complexity

Cost control:

  • Use LLM evaluators selectively (not on every run)
  • Choose cheaper models for judges (gpt-3.5-turbo vs gpt-4)
  • Sample evaluations (evaluate 10% of production traffic)
  • Cache results for common inputs

Best Practices

Evaluator Selection

  • Start simple - Use regex/schema before LLM judges
  • Reference-free preferred - Ground truth is expensive to maintain
  • Multiple dimensions - Combine evaluators for comprehensive quality
  • Domain-specific - Generic metrics may miss important qualities

Threshold Tuning

  1. Collect validation dataset with human labels
  2. Run evaluator on dataset
  3. Plot precision-recall curve vs threshold
  4. Choose threshold balancing false positives and false negatives
  5. Monitor production metrics and adjust

Example: Faithfulness threshold

Threshold | Precision | Recall | F1
0.9       | 0.95      | 0.72   | 0.82
0.8       | 0.89      | 0.85   | 0.87
0.7       | 0.81      | 0.93   | 0.86

Choose 0.8 for best F1 score.
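The F1 column follows (up to rounding) from precision and recall, and the threshold choice is just the argmax:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per candidate threshold, from the table above
candidates = {0.9: (0.95, 0.72), 0.8: (0.89, 0.85), 0.7: (0.81, 0.93)}
best = max(candidates, key=lambda t: f1(*candidates[t]))  # 0.8
```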

Evaluation Strategy

Offline (pre-deployment):

  • Test on diverse dataset (edge cases, failure modes)
  • Compare multiple configurations
  • Iterate until quality targets met

Online (production):

  • Sample evaluations (reduce cost)
  • Log all results for analysis
  • Alert on quality degradation
  • Review failures regularly
