Prompts & Testing
M3 Forge provides a comprehensive prompt management system with version control, interactive testing, and A/B experimentation. Manage all your LLM prompts in a centralized repository with Git integration, test them in real time against multiple models, and optimize performance through systematic experiments.
Why Prompt Management Matters
LLM prompts are critical business logic. As your AI applications scale, you need:
- Version control for prompt changes
- Centralized repository for team collaboration
- Safe testing before production deployment
- Data-driven optimization through A/B testing
- Comparison tools to evaluate prompt variants
M3 Forge treats prompts as first-class code artifacts with full lifecycle management.

Key Features
Repository-Based Management
Store prompts in Git repositories with full version control. Every change is tracked, branches enable parallel development, and rollbacks are instant. The built-in file browser provides VS Code-like editing with Monaco editor integration.
Supported workflows:
- Create and edit prompts with syntax highlighting
- Branch management for isolated development
- Commit history with detailed diffs
- Pull changes from workspace files
- Export prompts to deployment environments
Prompts sync bidirectionally. Edit in the UI or commit from your IDE — both workflows stay synchronized through Git.
Multi-Provider Testing
Test prompts against multiple LLM providers simultaneously:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude 3 Opus, Sonnet, Haiku)
- Qwen (Qwen2.5)
- Hugging Face models
- Custom endpoints
Variable substitution enables reusable templates. Define variables like {customer_name} or {product_id} and fill them at test time.
Statistical Experimentation
Run controlled A/B tests with traffic splitting, metric tracking, and statistical significance analysis. Compare multiple prompt variants under identical conditions to identify the highest-performing version.
Core Concepts
Prompt Templates
Templates use variable syntax for dynamic content:
```
You are an expert customer service agent.
Customer: {customer_name}
Issue: {issue_description}
Priority: {priority_level}
Provide a professional response.
```
Variables are auto-detected from the curly-brace syntax and presented as fillable fields in the Playground.
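Detection and substitution can be sketched in a few lines. This is a minimal illustration of the behavior described above, assuming simple `{name}` placeholders; it is not M3 Forge's actual parser.

```python
import re

TEMPLATE = """You are an expert customer service agent.
Customer: {customer_name}
Issue: {issue_description}
Priority: {priority_level}
Provide a professional response."""

def detect_variables(template: str) -> list[str]:
    # Every {word} placeholder becomes a fillable field.
    return re.findall(r"\{(\w+)\}", template)

def render(template: str, **values: str) -> str:
    # Fill the placeholders with the values supplied at test time.
    return template.format(**values)

print(detect_variables(TEMPLATE))
# ['customer_name', 'issue_description', 'priority_level']
```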
System vs. Message Prompts
- System prompts define agent behavior and constraints
- Message prompts contain the user input or task description
Both support variables and can be tested independently or together.
Experiments and Variants
Experiments define:
- Variants — Different versions of the same prompt (A, B, C)
- Traffic split — Percentage allocation across variants
- Metrics — Quality scores, latency, cost per request
- Sample size — Number of test runs required for statistical significance
M3 Forge automatically calculates confidence intervals and recommends winning variants.
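The two core mechanics, deterministic traffic splitting and a significance check, can be sketched as below. This is a generic illustration of how such systems commonly work (a per-user hash bucket plus a two-proportion z-test), not M3 Forge's internal algorithm.

```python
import math
import random

def assign_variant(user_id: str, split: dict[str, float]) -> str:
    """Deterministically bucket a user into a variant by traffic split."""
    r = random.Random(user_id).random()  # stable per user across requests
    cumulative = 0.0
    for variant, share in split.items():
        cumulative += share
        if r < cumulative:
            return variant
    return variant  # last variant absorbs floating-point rounding error

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-statistic comparing success rates of variants A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

Seeding on the user ID keeps each user in the same variant for the life of the experiment; a |z| above roughly 1.96 corresponds to significance at the 95% level.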
Quick Start
Common Workflows
Testing a New Prompt
- Navigate to your prompt repository
- Select a prompt file from the file tree
- Click “Test in Playground”
- Choose a model and fill variable values
- Run and review the response
Comparing Branches
- Open a prompt in the editor
- Click “Compare” in the toolbar
- Select a branch to compare against
- View side-by-side diff with highlighted changes
- Copy content from either version
Creating an Experiment
- Navigate to Experiments section
- Click “New A/B Test”
- Select the prompt to test
- Define variants (different versions or parameters)
- Set traffic split and run size
- Launch experiment and monitor results
Integration with Workflows
Prompts tested in M3 Forge can be deployed directly to workflows:
- Select a prompt from the repository
- Reference it in a workflow node by path
- Variables map automatically to workflow inputs
- Changes to the prompt propagate to all workflows using it
This enables prompt engineering as a separate discipline from workflow design.
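As a sketch, referencing a repository prompt from a workflow node might look like the following. This YAML fragment is purely illustrative: the field names and structure are assumptions for the sake of example, not M3 Forge's actual node schema.

```yaml
# Hypothetical workflow-node definition (illustrative field names)
- id: support_reply
  type: llm
  prompt_ref: prompts/support/professional-response.txt  # path into the prompt repository
  inputs:
    customer_name: "{{ trigger.customer.name }}"         # workflow inputs map onto
    issue_description: "{{ trigger.ticket.body }}"       # the prompt's variables
    priority_level: "{{ trigger.ticket.priority }}"
```

Because the node references the prompt by path rather than embedding its text, committing a new prompt version updates every workflow that uses it.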