
Monitoring

Track AI workload performance, cost, and system health with real-time observability dashboards.

What is Monitoring?

M3 Forge provides production-grade observability for AI workflows, LLM invocations, and system performance. Monitoring enables you to:

  • Track LLM costs across models and workflows to optimize spending
  • Analyze latency to identify bottlenecks and improve user experience
  • Debug failures with detailed trace inspection and replay
  • Measure SLA compliance to ensure service level commitments
  • Understand usage patterns to inform capacity planning

All monitoring data is stored in ClickHouse for high-performance analytics on large-scale event streams.

[Screenshot: Monitoring section landing showing links to LLM Observability, Traces, Insights, and SLA Monitoring]

Observability Philosophy

M3 Forge follows the three pillars of observability:

  1. Metrics - Aggregated measurements (token counts, costs, latency percentiles, failure rates)
  2. Logs - Detailed event streams from workflow executions and LLM calls
  3. Traces - End-to-end request flows with timing and context for each step

These combine to give you full visibility into production AI operations without manual instrumentation.

What is Tracked

LLM Calls

Every LLM invocation is automatically instrumented with:

  • Request metadata - Model, temperature, max tokens, system prompt
  • Response data - Token counts (input, output, total), finish reason
  • Cost calculation - Pricing per model from provider rate cards
  • Latency breakdown - Time to first token, total generation time
  • Context - Workflow ID, node ID, user ID, run ID
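The cost calculation above can be sketched as a per-call lookup against a rate card. The model names and per-token rates below are illustrative assumptions, not actual provider pricing:

```python
# Illustrative rate card: USD per 1M tokens (values are assumptions, not real pricing)
RATE_CARD = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the USD cost of a single LLM call from its token counts."""
    rates = RATE_CARD[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# Example: 1,200 input tokens and 300 output tokens on gpt-4o
print(call_cost("gpt-4o", 1200, 300))  # → 0.006
```

In practice the rate card is refreshed from the provider, since per-token pricing changes over time.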

Workflow Executions

Workflow runs capture:

  • Execution metrics - Total runtime, success/failure status, retry counts
  • Node-level timing - Duration for each node in the DAG
  • Data flow - Inputs and outputs at each step (with PII masking)
  • Error details - Stack traces, validation failures, timeout events

System Health

Infrastructure metrics include:

  • API server health - Request throughput, error rates, response times
  • Database performance - Query latency, connection pool usage
  • Gateway status - Connectivity to Marie-AI backend instances
  • Job queue depth - Pending workflow executions, backlog trends

Dashboards

M3 Forge provides specialized dashboards for different observability needs: LLM Observability, Traces, Insights, and SLA Monitoring.

Data Retention

Monitoring data is retained according to these policies:

| Data Type | Retention | Storage |
| --- | --- | --- |
| LLM traces | 90 days | ClickHouse (hot), S3 (cold archive) |
| Workflow logs | 30 days | PostgreSQL |
| Aggregated metrics | 1 year | ClickHouse materialized views |
| SLA violation events | 2 years | PostgreSQL |

Retention policies are configurable via environment variables. See Configuration for details.

Privacy and Security

PII Masking

M3 Forge automatically masks personally identifiable information in logs and traces:

  • Email addresses - Replaced with ***@***.***
  • Phone numbers - Replaced with ***-***-****
  • API keys - Only first 4 and last 4 characters shown
  • Custom patterns - Configure additional regex-based masking rules
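A minimal sketch of regex-based masking along these lines (the patterns, key prefix, and helper name are illustrative assumptions, not M3 Forge's actual implementation):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
# Assumed key shape for illustration: an "sk-" prefix followed by 20+ characters
API_KEY = re.compile(r"\b(sk-[A-Za-z0-9]{20,})\b")

def mask_pii(text: str) -> str:
    """Mask emails, phone numbers, and API keys before writing to logs."""
    text = EMAIL.sub("***@***.***", text)
    text = PHONE.sub("***-***-****", text)
    # Keep only the first 4 and last 4 characters of a matched key
    text = API_KEY.sub(lambda m: m.group(1)[:4] + "..." + m.group(1)[-4:], text)
    return text

print(mask_pii("Contact alice@example.com, key sk-abcdefghij0123456789"))
# → Contact ***@***.***, key sk-a...6789
```

Custom masking rules would slot in as additional pattern/replacement pairs applied in the same pass.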

Access Control

Monitoring data access is role-based:

  • Admins - Full access to all traces, logs, and cost data
  • Developers - Access to traces for workflows they own
  • Viewers - Read-only access to aggregated metrics only

Audit Trail

All monitoring data access is logged to the audit trail, including:

  • Who viewed specific traces
  • When cost reports were exported
  • Filter criteria used in dashboard queries

Cost Optimization

Use LLM Observability to reduce AI infrastructure costs:

  1. Identify expensive workflows - Sort by total cost to find optimization opportunities
  2. Compare models - Analyze cost vs. quality tradeoffs between GPT-4, Claude, and smaller models
  3. Detect token waste - Find prompts with excessive input tokens or unnecessary context
  4. Track prompt engineering - Measure cost impact of system prompt changes
  5. Set budgets - Configure alerts when daily/monthly spending exceeds thresholds
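Step 1 above reduces to a simple aggregation over call records sorted by total cost. The record shape and workflow names here are assumptions for illustration; real records come from the trace store:

```python
from collections import defaultdict

# Illustrative call records; in practice these are queried from the trace store
calls = [
    {"workflow": "summarize-docs", "cost_usd": 0.012},
    {"workflow": "extract-entities", "cost_usd": 0.004},
    {"workflow": "summarize-docs", "cost_usd": 0.020},
]

# Total cost per workflow
totals = defaultdict(float)
for call in calls:
    totals[call["workflow"]] += call["cost_usd"]

# Most expensive workflows first: these are the optimization candidates
for workflow, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{workflow}: ${cost:.3f}")
```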

Performance Debugging

Use Traces and Insights to diagnose slow workflows:

  1. Find slow nodes - Identify DAG nodes with high p95 latency
  2. Analyze LLM calls - Check if time-to-first-token indicates gateway issues
  3. Inspect retries - See how many workflow runs failed and retried
  4. Compare runs - Diff two executions to understand variability
  5. Profile execution - Export trace data for external analysis
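Step 1, finding slow nodes by p95 latency, can be sketched with a nearest-rank percentile over per-node samples (node names and timings below are illustrative):

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Illustrative per-node latencies collected from trace data
node_latencies = {
    "fetch-context": [120, 130, 125, 900, 128],
    "llm-generate": [800, 820, 790, 810, 805],
}

# Rank nodes by p95, slowest first: the tail, not the average, hurts users
for node, samples in sorted(node_latencies.items(), key=lambda kv: -p95(kv[1])):
    print(f"{node}: p95={p95(samples)}ms")
```

Note that fetch-context ranks first here despite a low median: a single 900 ms outlier dominates its tail, which is exactly what p95 surfaces.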
