
Evaluators

Built-in and custom evaluation metrics for measuring AI output quality.

Evaluator Types

M3 Forge provides multiple evaluation strategies, each optimized for different quality dimensions.

Faithfulness

Measures whether generated text is grounded in source documents (no hallucination).

Configuration

{ "type": "faithfulness", "context": "$.nodes.rag_retrieval.output.chunks", "answer": "$.nodes.llm_node.output.text", "threshold": 0.8, "model": "gpt-4-turbo" }

Parameters:

  • context - Source documents (array of text chunks)
  • answer - Generated text to evaluate
  • threshold - Minimum score to pass (0.0-1.0)
  • model - LLM to use for evaluation (default: gpt-3.5-turbo)

How It Works

  1. Extract claims from generated answer
  2. For each claim, check if supported by context
  3. Score = (supported claims) / (total claims)
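The scoring step can be sketched as follows. This is a minimal sketch: in the real evaluator, steps 1-2 (claim extraction and per-claim support checking) are performed by the configured LLM; here they are assumed to have already produced a list of per-claim booleans.

```python
def faithfulness_score(claim_support: list[bool], threshold: float = 0.8) -> dict:
    """Step 3: score = supported claims / total claims, pass if score >= threshold."""
    if not claim_support:
        # No extractable claims: treat as failing rather than vacuously passing.
        return {"score": 0.0, "passed": False}
    score = sum(claim_support) / len(claim_support)
    return {"score": round(score, 2), "passed": score >= threshold}

# Three claims, two supported by the context:
result = faithfulness_score([True, True, False])
```

With two of three claims supported, the score is 2/3 ≈ 0.67, which falls below the default 0.8 threshold.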

Example:

Context:

The Eiffel Tower was completed in 1889 and is 330 meters tall.

Answer:

The Eiffel Tower, completed in 1889, stands at 330 meters tall and is located in Paris, France.

Claims:

  • “completed in 1889” → Supported ✓
  • “330 meters tall” → Supported ✓
  • “located in Paris, France” → Not in context ✗

Score: 2/3 = 0.67 (below threshold 0.8, fails)

Use Cases

  • RAG pipelines - Ensure LLM doesn’t hallucinate beyond retrieved context
  • Summarization - Verify summary is faithful to source document
  • Question answering - Check answer is grounded in provided passages

Faithfulness evaluates groundedness, not factual correctness: an answer can be perfectly faithful to incorrect context. Combine it with relevance and fact-checking evaluators for a complete quality assessment.

Relevance

Measures whether output addresses the input query or task.

Configuration

{ "type": "relevance", "query": "$.data.question", "answer": "$.nodes.llm_node.output.text", "threshold": 0.7, "model": "gpt-4-turbo" }

Parameters:

  • query - User question or task description
  • answer - Generated output to evaluate
  • threshold - Minimum score to pass (0.0-1.0)
  • model - LLM for evaluation (default: gpt-3.5-turbo)

How It Works

The LLM judges whether the answer is relevant to the query on a 0-1 scale:

  • 1.0 - Completely relevant, directly answers query
  • 0.7 - Mostly relevant, some tangential content
  • 0.5 - Partially relevant, misses key aspects
  • 0.0 - Irrelevant, doesn’t address query
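The judge prompt built from this rubric might look like the sketch below. The exact prompt M3 Forge sends is internal; `build_relevance_prompt` is a hypothetical illustration of how the rubric and the two JSONPath-resolved values are combined.

```python
def build_relevance_prompt(query: str, answer: str) -> str:
    """Assemble an illustrative relevance-judge prompt from the rubric above."""
    return (
        "Rate how relevant the answer is to the query on a 0-1 scale:\n"
        "1.0 = directly answers the query, 0.7 = mostly relevant,\n"
        "0.5 = partially relevant, 0.0 = does not address the query.\n\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n\n"
        "Respond with a single number."
    )
```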

Example:

Query:

How do I configure SSL certificates in nginx?

Answer:

To configure SSL in nginx, add the following to your server block: ssl_certificate /path/to/cert.pem; ssl_certificate_key /path/to/key.pem;

Score: 1.0 (directly answers the question)

Answer (poor):

Nginx is a popular web server known for high performance...

Score: 0.2 (mentions nginx but doesn’t answer the question)

Use Cases

  • Question answering - Ensure response addresses the question
  • Search results - Verify retrieved documents match query intent
  • Conversational AI - Check response is on-topic

JSON Schema

Validates output conforms to expected JSON structure.

Configuration

{ "type": "json_schema", "schema": { "type": "object", "properties": { "name": {"type": "string"}, "age": {"type": "number", "minimum": 0}, "email": {"type": "string", "format": "email"} }, "required": ["name", "age"] }, "output": "$.nodes.extraction_node.output.data", "threshold": 1.0 }

Parameters:

  • schema - JSON Schema definition (v7)
  • output - JSON object to validate
  • threshold - Must be 1.0 (binary pass/fail)

Validation Rules

{ "properties": { "string_field": {"type": "string"}, "number_field": {"type": "number"}, "integer_field": {"type": "integer"}, "boolean_field": {"type": "boolean"}, "array_field": {"type": "array"}, "object_field": {"type": "object"}, "null_field": {"type": "null"} } }

Use Cases

  • Structured extraction - Validate LLM extracted correct fields
  • API responses - Ensure output matches downstream API schema
  • Database inserts - Verify data conforms before insertion

Example: Invoice extraction

{ "schema": { "type": "object", "properties": { "invoice_number": {"type": "string", "pattern": "^INV-[0-9]{6}$"}, "date": {"type": "string", "format": "date"}, "total": {"type": "number", "minimum": 0}, "line_items": { "type": "array", "items": { "type": "object", "properties": { "description": {"type": "string"}, "quantity": {"type": "integer", "minimum": 1}, "price": {"type": "number", "minimum": 0} }, "required": ["description", "quantity", "price"] } } }, "required": ["invoice_number", "date", "total", "line_items"] } }

Regex Match

Validates output matches a regular expression pattern.

Configuration

{ "type": "regex_match", "pattern": "^[A-Z]{2}[0-9]{6}$", "text": "$.nodes.extraction_node.output.document_id", "threshold": 1.0 }

Parameters:

  • pattern - Regular expression (Python/JavaScript syntax)
  • text - String to validate
  • threshold - Must be 1.0 (binary pass/fail)
  • flags - Optional regex flags (i=case-insensitive, m=multiline)

Common Patterns

# US Social Security Number
^\d{3}-\d{2}-\d{4}$

# UUID
^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$

# Product SKU
^[A-Z]{3}-[0-9]{6}$

# License plate (US)
^[A-Z0-9]{2,7}$
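In Python, the evaluator's behavior can be sketched with `re.fullmatch` (assuming full-string matching, as the anchored patterns above suggest), including the optional `i` and `m` flags:

```python
import re

def regex_match_score(pattern: str, text: str, flags: str = "") -> float:
    """Binary score: 1.0 if the whole string matches the pattern, else 0.0."""
    f = 0
    if "i" in flags:
        f |= re.IGNORECASE   # flag "i": case-insensitive
    if "m" in flags:
        f |= re.MULTILINE    # flag "m": ^/$ match at line boundaries
    return 1.0 if re.fullmatch(pattern, text, f) else 0.0
```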

Use Cases

  • Format validation - Ensure extracted values match expected format
  • Input sanitization - Verify user input is safe
  • Data quality - Check data conforms to standards

Length Check

Validates text length is within bounds.

Configuration

{ "type": "length_check", "text": "$.nodes.summarization_node.output.summary", "min_length": 50, "max_length": 200, "unit": "characters", "threshold": 1.0 }

Parameters:

  • text - String to measure
  • min_length - Minimum length (inclusive)
  • max_length - Maximum length (inclusive)
  • unit - characters, words, or tokens
  • threshold - Must be 1.0 (binary pass/fail)

Units

Unit       | Count Method           | Use Case
characters | UTF-8 characters       | Tweet length, SMS, UI constraints
words      | Space-separated tokens | Summaries, abstracts
tokens     | LLM tokenizer (GPT-4)  | LLM context limits

Example:

Text: "The quick brown fox jumps over the lazy dog."

  • Characters: 44
  • Words: 9
  • Tokens: 10 (depends on tokenizer)
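The character and word units are easy to sketch in Python; the `tokens` unit requires the model's tokenizer (e.g. a library such as tiktoken) and is omitted here:

```python
def length_check(text: str, min_length: int, max_length: int,
                 unit: str = "characters") -> float:
    """Binary pass/fail: 1.0 if min_length <= count <= max_length (inclusive)."""
    if unit == "characters":
        n = len(text)
    elif unit == "words":
        n = len(text.split())  # space-separated tokens
    else:
        raise ValueError(f"unsupported unit in this sketch: {unit}")
    return 1.0 if min_length <= n <= max_length else 0.0

text = "The quick brown fox jumps over the lazy dog."  # 44 characters, 9 words
```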

Use Cases

  • Summarization - Ensure summary is concise (50-200 words)
  • Tweet generation - Limit to 280 characters
  • Input validation - Reject overly long user input
  • Token budgets - Stay within LLM context window

Contains Keywords

Checks if text contains required keywords or phrases.

Configuration

{ "type": "contains_keywords", "text": "$.nodes.llm_node.output.text", "keywords": ["certificate", "SSL", "nginx"], "match_all": true, "case_sensitive": false, "threshold": 1.0 }

Parameters:

  • text - Text to search
  • keywords - Array of required keywords/phrases
  • match_all - If true, all keywords required. If false, any keyword sufficient
  • case_sensitive - Case-sensitive matching (default: false)
  • threshold - Must be 1.0 (binary pass/fail)
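The parameter semantics can be sketched in a few lines of Python (a sketch of the documented behavior, not the shipped implementation):

```python
def contains_keywords(text: str, keywords: list[str], match_all: bool = True,
                      case_sensitive: bool = False) -> float:
    """Binary score: all keywords required if match_all, else any one suffices."""
    haystack = text if case_sensitive else text.lower()
    hits = [(kw if case_sensitive else kw.lower()) in haystack for kw in keywords]
    passed = all(hits) if match_all else any(hits)
    return 1.0 if passed else 0.0
```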

Use Cases

  • Content requirements - Ensure response covers required topics
  • SEO - Verify target keywords are present
  • Compliance - Check legal disclaimers are included
  • Quality control - Ensure instructions followed

Example: Legal disclaimer

{ "type": "contains_keywords", "text": "$.nodes.contract_generation.output.text", "keywords": [ "This contract is governed by", "dispute resolution", "effective date" ], "match_all": true }

LLM Judge

Use LLM to evaluate output based on natural language criteria.

Configuration

{ "type": "llm_judge", "criteria": "Is the response helpful, accurate, and written in a professional tone?", "input": "$.data.question", "output": "$.nodes.llm_node.output.text", "context": "$.nodes.rag_retrieval.output.chunks", "model": "gpt-4", "threshold": 0.7, "scale": "0-1" }

Parameters:

  • criteria - Natural language evaluation criteria
  • input - Original input/query (optional)
  • output - Text to evaluate (required)
  • context - Additional context for evaluation (optional)
  • model - LLM to use as judge (default: gpt-4-turbo)
  • threshold - Minimum score to pass (0.0-1.0)
  • scale - Scoring scale (0-1, 1-5, 1-10)

Criteria Design

Good criteria are:

  • Specific - “Is the tone professional?” not “Is this good?”
  • Objective - “Contains all required fields” not “Seems correct”
  • Evaluable - Judge can determine yes/no from provided context

Examples:

# Helpfulness
Is the response helpful, directly answering the user's question without including irrelevant information?

# Accuracy (with context)
Based on the provided context, is the response factually accurate and free from hallucinations?

# Tone
Is the response written in a professional, respectful tone appropriate for customer service?

# Completeness
Does the response address all aspects of the multi-part question?

# Conciseness
Is the response concise, providing a complete answer in minimal words without excessive elaboration?

Multi-Criteria Evaluation

Evaluate multiple criteria simultaneously:

{ "type": "llm_judge", "criteria": { "helpfulness": "Does the response help the user accomplish their goal?", "accuracy": "Is the information factually correct?", "clarity": "Is the response easy to understand?" }, "aggregation": "average", "threshold": 0.7 }

Returns individual scores and aggregate.

Use Cases

  • Subjective quality - Tone, style, helpfulness
  • Complex criteria - Multi-dimensional quality assessment
  • Rapid prototyping - Define criteria in natural language without coding
  • Human-LLM agreement - LLM judges often correlate with human ratings

LLM judges add latency (200-1000ms) and cost ($0.001-0.01 per evaluation). Use for offline evaluation or non-latency-sensitive workflows. For production, consider caching or using simpler evaluators.

Custom Executor

Run arbitrary Python or JavaScript code for evaluation.

Configuration

{ "type": "executor", "language": "python", "code": "score = 1.0 if 'ERROR' not in output else 0.0\nreturn {'score': score, 'reason': 'No errors found' if score == 1.0 else 'Errors detected'}", "threshold": 1.0, "inputs": { "output": "$.nodes.llm_node.output.text", "expected": "$.data.expected_format" } }

Parameters:

  • language - python or javascript
  • code - Evaluation code (must return {score: number})
  • threshold - Minimum score to pass (0.0-1.0)
  • inputs - JSONPath mappings to variables in code

Python Example

# Domain-specific validation
import re

def evaluate(output, context):
    # Check medical terminology
    required_terms = ['diagnosis', 'treatment', 'prognosis']
    found = sum(1 for term in required_terms if term in output.lower())

    # Check citation format
    has_citations = bool(re.search(r'\[\d+\]', output))

    score = (found / len(required_terms)) * 0.7 + (0.3 if has_citations else 0)
    return {
        'score': score,
        'details': {
            'terms_found': found,
            'has_citations': has_citations
        }
    }

JavaScript Example

// External API call for toxicity detection
async function evaluate(output) {
  const response = await fetch('https://api.example.com/toxicity', {
    method: 'POST',
    body: JSON.stringify({ text: output }),
    headers: { 'Content-Type': 'application/json' }
  });
  const result = await response.json();
  const score = 1.0 - result.toxicity_score;
  return { score: score, details: result };
}

Use Cases

  • Domain-specific logic - Medical, legal, financial validation
  • External services - Call toxicity, fact-checking, PII detection APIs
  • Complex scoring - Weighted multi-metric aggregation
  • Custom formats - Validate proprietary data structures

Combining Evaluators

Use multiple evaluators with weighted aggregation:

{ "evaluators": [ { "type": "faithfulness", "threshold": 0.8, "weight": 0.4 }, { "type": "relevance", "threshold": 0.7, "weight": 0.3 }, { "type": "json_schema", "schema": {...}, "weight": 0.3 } ], "aggregation": "weighted_average", "pass_threshold": 0.75 }

Aggregation methods:

  • weighted_average - sum(score_i * weight_i) / sum(weight_i)
  • min - min(score_1, score_2, ...)
  • max - max(score_1, score_2, ...)
  • all - Pass only if all evaluators pass
  • any - Pass if any evaluator passes

Example:

  • Faithfulness: 0.9 * 0.4 = 0.36
  • Relevance: 0.8 * 0.3 = 0.24
  • JSON Schema: 1.0 * 0.3 = 0.30
  • Total: 0.90 (passes threshold 0.75)
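The score-combining aggregation methods can be sketched as follows, reproducing the worked example above (`all`/`any` operate on per-evaluator pass flags and are omitted):

```python
def aggregate(scores: dict[str, float], weights: dict[str, float],
              method: str = "weighted_average") -> float:
    """Combine per-evaluator scores into one aggregate score."""
    if method == "weighted_average":
        # sum(score_i * weight_i) / sum(weight_i)
        return sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
    if method == "min":
        return min(scores.values())
    if method == "max":
        return max(scores.values())
    raise ValueError(f"unknown method: {method}")

scores = {"faithfulness": 0.9, "relevance": 0.8, "json_schema": 1.0}
weights = {"faithfulness": 0.4, "relevance": 0.3, "json_schema": 0.3}
```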

Performance Considerations

Latency

Evaluator         | Typical Latency | Notes
Regex Match       | < 1ms           | Very fast, local
JSON Schema       | 1-5ms           | Fast, local validation
Length Check      | < 1ms           | Very fast
Contains Keywords | 1-10ms          | Fast, depends on text size
Faithfulness      | 500-2000ms      | LLM call required
Relevance         | 500-2000ms      | LLM call required
LLM Judge         | 200-1000ms      | LLM call, varies by criteria complexity
Custom Executor   | Varies          | Depends on code complexity

Optimization strategies:

  • Use fast evaluators (regex, schema) for initial filtering
  • Reserve slow evaluators (LLM-based) for final validation
  • Cache evaluation results for identical inputs
  • Run evaluators in parallel when possible
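The first two strategies amount to a cascade: run cheap local checks first and short-circuit on failure so that expensive LLM-based evaluators only see outputs that already pass. A minimal sketch (the lambdas stand in for real evaluators; the second is a stub for an LLM call):

```python
def cascade(output: str, fast_checks, slow_checks) -> float:
    """Short-circuit on any failing fast check; otherwise take the worst slow score."""
    for check in fast_checks:
        if check(output) < 1.0:
            return 0.0  # cheap check failed, skip the expensive evaluators
    return min(check(output) for check in slow_checks) if slow_checks else 1.0

# Example: a length gate in front of a (stubbed) LLM judge
fast = [lambda s: 1.0 if len(s) >= 10 else 0.0]
slow = [lambda s: 0.9]  # stand-in for an LLM-based evaluator
```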

Cost

LLM-based evaluators incur API costs:

Evaluator    | Cost per Evaluation | Notes
Faithfulness | $0.001-0.005        | Depends on context length
Relevance    | $0.0005-0.002       | Small prompt, fast
LLM Judge    | $0.001-0.01         | Varies by criteria complexity

Cost control:

  • Use LLM evaluators selectively (not on every run)
  • Choose cheaper models for judges (gpt-3.5-turbo vs gpt-4)
  • Sample evaluations (evaluate 10% of production traffic)
  • Cache results for common inputs

Best Practices

Evaluator Selection

  • Start simple - Use regex/schema before LLM judges
  • Reference-free preferred - Ground truth is expensive to maintain
  • Multiple dimensions - Combine evaluators for comprehensive quality
  • Domain-specific - Generic metrics may miss important qualities

Threshold Tuning

  1. Collect validation dataset with human labels
  2. Run evaluator on dataset
  3. Plot precision-recall curve vs threshold
  4. Choose threshold balancing false positives and false negatives
  5. Monitor production metrics and adjust

Example: Faithfulness threshold

Threshold | Precision | Recall | F1
0.9       | 0.95      | 0.72   | 0.82
0.8       | 0.89      | 0.85   | 0.87
0.7       | 0.81      | 0.93   | 0.86

Choose 0.8 for best F1 score.
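The F1 column follows (up to rounding) from precision and recall, and the threshold choice is just the argmax:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per candidate threshold, from the table above
candidates = {0.9: (0.95, 0.72), 0.8: (0.89, 0.85), 0.7: (0.81, 0.93)}
best = max(candidates, key=lambda t: f1(*candidates[t]))  # 0.8
```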

Evaluation Strategy

Offline (pre-deployment):

  • Test on diverse dataset (edge cases, failure modes)
  • Compare multiple configurations
  • Iterate until quality targets met

Online (production):

  • Sample evaluations (reduce cost)
  • Log all results for analysis
  • Alert on quality degradation
  • Review failures regularly
