SLA Monitoring

Monitor service level agreement compliance and threshold violations for critical workflows.

Overview

SLA (Service Level Agreement) monitoring ensures that your AI workflows meet defined performance and reliability commitments. It provides:

Real-time compliance tracking against SLA targets
Automated alerting when thresholds are breached
Historical compliance reports for auditing and accountability
Multi-dimensional SLAs (latency, availability, error rate, throughput)

SLA monitoring bridges operational metrics (Insights) with business commitments, enabling you to proactively manage service quality.

Figure: SLA Monitoring dashboard showing compliance gauges, violation alerts, and threshold configuration.

What is an SLA?

An SLA defines measurable performance targets for a workflow or service:

SLA Type	Metric	Example Target
Availability	Uptime percentage	99.9% (43 min downtime/month)
Latency	Response time percentile	p95 latency < 5 seconds
Error Rate	Failed executions	Failure rate < 2%
Throughput	Runs per minute	Handle 100 runs/minute peak load

M3 Forge monitors these targets in real-time and alerts when violations occur.

SLA Configuration

Navigate to Monitoring → SLA Monitoring and click “Create SLA” to define a new agreement.

Basic Configuration


{
  "name": "Invoice Extraction SLA",
  "workflow_id": "invoice-extraction-v2",
  "enabled": true,
  "description": "Production SLA for invoice processing pipeline"
}

Targets

Define one or more performance targets:

Latency


{
  "type": "latency",
  "metric": "p95_runtime",
  "threshold": 5000,
  "unit": "ms",
  "evaluation_window": "5m"
}

Explanation:

95th percentile runtime must be under 5 seconds
Evaluated over a rolling 5-minute window
Violation if the p95 computed over the window exceeds the threshold
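As an illustration of the latency rule above, here is a minimal sketch of a rolling-window p95 check. It is not M3 Forge's implementation: the function names, the `(timestamp, runtime_ms)` sample shape, and the choice of the nearest-rank percentile method are all assumptions made for the example.

```python
import math
from typing import Sequence, Tuple


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_violated(samples: Sequence[Tuple[float, float]], now: float,
                     window_seconds: float = 300, threshold_ms: float = 5000) -> bool:
    """samples holds (timestamp_seconds, runtime_ms) pairs (hypothetical shape)."""
    # Keep only runs that finished inside the rolling window ending at `now`.
    in_window = [ms for ts, ms in samples if now - window_seconds < ts <= now]
    if not in_window:
        return False  # assumption: no traffic means no latency violation
    return percentile_nearest_rank(in_window, 95) > threshold_ms
```

With 20 runs in the window, a single slow run does not move the p95 past the threshold, but two slow runs do, which is why percentile targets tolerate isolated outliers.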

Availability


{
  "type": "availability",
  "metric": "uptime_percentage",
  "threshold": 99.9,
  "unit": "%",
  "evaluation_window": "1h"
}

Explanation:

Workflow must be available 99.9% of the time
Evaluated hourly (max 3.6 seconds downtime per hour)
Downtime = periods when all executions fail or time out
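The downtime allowance follows directly from the target percentage and the window length. The sketch below works through that arithmetic; the function name is hypothetical and the 30-day month is an assumption used only to reproduce the monthly figure.

```python
def downtime_budget_seconds(target_percent: float, window_seconds: float) -> float:
    """Maximum downtime a window may contain while still meeting the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return window_seconds * (100.0 - target_percent) / 100.0


HOUR = 3600
MONTH_30_DAYS = 30 * 24 * 3600  # assumption: a 30-day month

print(round(downtime_budget_seconds(99.9, HOUR), 3))                # 3.6 (seconds)
print(round(downtime_budget_seconds(99.9, MONTH_30_DAYS) / 60, 1))  # 43.2 (minutes)
```

A 99.9% target over a one-hour window therefore leaves only a few seconds of slack, so a single failed evaluation period can breach it.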

Error Rate


{
  "type": "error_rate",
  "metric": "failure_percentage",
  "threshold": 2.0,
  "unit": "%",
  "evaluation_window": "15m"
}

Explanation:

Failure rate must be under 2%
Evaluated over a rolling 15-minute window
Violation if the failure rate computed over the window exceeds 2%
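A minimal sketch of the error-rate check, assuming each run in the window is reduced to a boolean (`True` for a failed run). The function names and the handling of an empty window are illustrative choices, not documented behavior.

```python
from typing import Iterable


def failure_percentage(outcomes: Iterable[bool]) -> float:
    """outcomes: True for each failed run in the window, False for each success."""
    results = list(outcomes)
    if not results:
        return 0.0  # assumption: an empty window counts as 0% failures
    return 100.0 * sum(results) / len(results)


def error_rate_violated(outcomes: Iterable[bool], threshold_percent: float = 2.0) -> bool:
    # "Exceeds" is read as strictly greater than the threshold.
    return failure_percentage(outcomes) > threshold_percent
```

Under this reading, 2 failures in 100 runs sits exactly on the threshold and is not a violation, while 3 failures in 100 is.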

Throughput


{
  "type": "throughput",
  "metric": "runs_per_minute",
  "threshold": 100,
  "unit": "rpm",
  "evaluation_window": "1m"
}

Explanation:

Workflow must handle at least 100 runs per minute
Evaluated every minute
Violation if throughput drops below threshold

Alerting

Configure how and when to notify on SLA violations:


{
  "alert": {
    "channels": ["slack", "email", "pagerduty"],
    "severity": "high",
    "notification_delay": "5m",
    "escalation_policy": "on-call-rotation"
  }
}

Options:

channels - Where to send alerts (Slack, email, PagerDuty, webhooks)
severity - low, medium, high, critical (affects urgency)
notification_delay - Grace period before alerting (reduces noise from transient issues)
escalation_policy - Who to notify if violation persists (defined in Settings → On-Call)

Time-Based Rules

Define SLA windows that vary by time of day or day of week:


{
  "schedules": [
    {
      "name": "Business hours",
      "cron": "0 8-18 * * 1-5",
      "timezone": "America/New_York",
      "targets": [
        {"type": "latency", "threshold": 3000}
      ]
    },
    {
      "name": "Off-hours",
      "cron": "0 0-7,19-23 * * *",
      "timezone": "America/New_York",
      "targets": [
        {"type": "latency", "threshold": 10000}
      ]
    }
  ]
}

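Use case: Stricter latency requirements during business hours when users are actively waiting for results. Note that in this example the business-hours schedule covers weekdays only, so weekend hours from 08:00 through 18:59 match neither schedule; add a schedule for them if they need a target.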
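To check which schedule applies at a given moment, the hour and day-of-week fields of each cron expression can be matched against the local time. The sketch below is one possible interpretation, not M3 Forge's scheduler: it assumes the cron expression selects whole hours, ignores the minute, day-of-month, and month fields, and supports only `*`, single values, ranges, and comma lists. All function names are hypothetical.

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo  # Python 3.9+


def _expand(field: str, lo: int, hi: int) -> set:
    """Expand a cron field made of '*', single values, ranges (a-b), and comma lists."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def schedule_matches(cron: str, local_time: datetime) -> bool:
    """True if local_time falls in an hour and weekday selected by the expression."""
    _minute, hour, _dom, _month, dow = cron.split()
    hours = _expand(hour, 0, 23)
    weekdays = {d % 7 for d in _expand(dow, 0, 6)}  # cron treats 0 and 7 as Sunday
    return local_time.hour in hours and local_time.isoweekday() % 7 in weekdays


def active_schedule(schedules: list, when: datetime) -> Optional[dict]:
    """Return the first schedule matching `when` (a timezone-aware datetime), or None."""
    for schedule in schedules:
        local = when.astimezone(ZoneInfo(schedule["timezone"]))
        if schedule_matches(schedule["cron"], local):
            return schedule
    return None
```

Running `schedule_matches` against a Saturday at 10:00 returns `False` for both expressions in the example configuration, which is a quick way to spot uncovered periods before enabling time-based rules.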

SLA Dashboard

The SLA Monitoring dashboard displays real-time compliance status:

Summary Cards

Active SLAs - Count of enabled SLAs
Violations (Last 24h) - Number of threshold breaches
Compliance Score - Percentage of time SLAs were met
MTTD (Mean Time to Detect) - Average delay before violation detection

SLA Table

SLA	Workflow	Status	Current Value	Target	Last Violation
Invoice latency	invoice-extraction-v2	Healthy	3.2s	< 5s	2 days ago
Customer onboarding error rate	customer-onboarding	Warning	1.8%	< 2%	3 hours ago
Document classification availability	doc-classify	Violated	98.5%	≥ 99.9%	12 min ago

Status indicators:

Healthy (green) - Currently meeting SLA targets
Warning (yellow) - Approaching threshold (within 10%)
Violated (red) - Currently breaching SLA
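The three status levels can be expressed as a small classification function. This is a sketch under assumptions: the `higher_is_worse` flag and the function name are invented for the example, and "within 10%" is read as 10% of the target value. For availability targets near 100%, that reading puts almost every passing value in the warning band, so a margin based on the remaining error budget may be more useful there.

```python
def sla_status(value: float, target: float, higher_is_worse: bool = True,
               warn_fraction: float = 0.10) -> str:
    """Classify a measurement as 'Healthy', 'Warning', or 'Violated'."""
    if higher_is_worse:  # latency, error rate: the target is an upper bound
        if value > target:
            return "Violated"
        if value >= target * (1 - warn_fraction):
            return "Warning"
        return "Healthy"
    # availability, throughput: the target is a lower bound
    if value < target:
        return "Violated"
    if value <= target * (1 + warn_fraction):
        return "Warning"
    return "Healthy"
```

Applied to the table rows above, 3.2 s against a 5 s target is Healthy, 1.8% against a 2% target is Warning, and 98.5% availability against 99.9% (with `higher_is_worse=False`) is Violated.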

Click any row to see detailed compliance history and violation timeline.

Compliance Chart

Time-series visualization showing:

Target threshold - Horizontal line showing SLA limit
Actual metric - Line chart of measured performance
Violation periods - Red shaded regions where SLA was breached
Warning zone - Yellow shaded region near threshold

Zoom into specific time ranges to analyze violation patterns.

Violation Detection

M3 Forge uses a time-window-based detection algorithm:

Collect metric samples

Every 30 seconds, the monitoring system queries the database for workflow execution metrics.

Calculate aggregate

Compute the metric value over the evaluation window (e.g., p95 latency for last 5 minutes).

Compare to threshold

If the aggregate value exceeds the SLA target (or falls below it, for availability and throughput), mark it as a potential violation.

Apply grace period

If the violation persists for the configured notification_delay, trigger an alert.

Send notifications

Dispatch alerts to all configured channels (Slack, email, PagerDuty, webhooks).

Track violation

Record violation event in database with start time, end time, duration, and max deviation from target.
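The compare, grace-period, and tracking steps above can be sketched as a small state machine that is fed one aggregated value per collection cycle. The class and field names are hypothetical, and the real detector may differ; the point is only to show how `notification_delay` suppresses alerts for breaches that clear quickly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreachTracker:
    threshold: float
    higher_is_worse: bool = True        # False for availability and throughput
    notification_delay: float = 300.0   # grace period in seconds ("5m")
    started_at: Optional[float] = None  # start time of the current breach
    max_deviation: float = 0.0
    alerted: bool = False

    def observe(self, now: float, value: float) -> Optional[str]:
        """Feed one aggregated value; return 'alert', 'resolved', or None."""
        if self.higher_is_worse:
            deviation = value - self.threshold
        else:
            deviation = self.threshold - value
        if deviation > 0:
            if self.started_at is None:
                # A new potential violation begins.
                self.started_at = now
                self.max_deviation = 0.0
            self.max_deviation = max(self.max_deviation, deviation)
            if not self.alerted and now - self.started_at >= self.notification_delay:
                self.alerted = True
                return "alert"
            return None
        # Back within target: close the breach, reporting only if it was alerted.
        was_alerted = self.alerted
        self.started_at = None
        self.alerted = False
        return "resolved" if was_alerted else None
```

After a `'resolved'` result, `max_deviation` still holds the peak deviation of the breach that just ended, so the caller can store it alongside the start and end times.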

Violation Events

Each SLA violation is logged as an event with full context:


{
  "violation_id": "v_abc123",
  "sla_id": "sla_invoice_latency",
  "workflow_id": "invoice-extraction-v2",
  "started_at": "2024-03-19T14:23:15Z",
  "ended_at": "2024-03-19T14:31:42Z",
  "duration_seconds": 507,
  "metric": "p95_runtime",
  "target": 5000,
  "measured_value": 8234,
  "max_deviation": 3234,
  "severity": "high",
  "root_cause": "Gateway timeout spike",
  "resolution": "Increased gateway concurrency limit"
}
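The derived fields in the sample event can be recomputed from its timestamps and values, which is handy when validating exported data. This sketch assumes, based only on the sample above, that `max_deviation` equals `measured_value` minus `target` for an upper-bound metric; the helper names are hypothetical.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def derived_fields(event: dict) -> dict:
    """Recompute duration and deviation from the raw fields of a violation event."""
    started = parse_utc(event["started_at"])
    ended = parse_utc(event["ended_at"])
    return {
        "duration_seconds": int((ended - started).total_seconds()),
        "max_deviation": event["measured_value"] - event["target"],
    }
```

For the event shown above, this yields a duration of 507 seconds and a deviation of 3234 ms, matching the logged values.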

Violation Details

View full details for any violation:

Timeline - Minute-by-minute metric values during violation
Impacted runs - List of workflow executions affected
Correlated events - Other system events at the same time (deployments, alerts)
Root cause analysis - AI-suggested probable causes based on patterns

Acknowledging Violations

Mark violations as acknowledged to track incident response:

Acknowledge - Team is aware and investigating
Assign - Assign to specific team member
Resolve - Mark as fixed with resolution notes

This creates an audit trail for post-incident review.

Compliance Reports

Generate compliance reports for stakeholder communication:

Monthly SLA Report

Automatically generated report showing:

SLA compliance percentage for each workflow
Number of violations and total downtime
MTTR (Mean Time to Recovery) - How quickly violations were resolved
Trend analysis - Whether compliance is improving or degrading over time
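For readers who want to reproduce the headline numbers from exported violation data, the sketch below shows one plausible way to compute them. It assumes that time to recovery equals each violation's duration and that violations for one SLA do not overlap; the exact formulas used in the generated report are not documented here.

```python
from typing import Sequence


def mttr_seconds(durations: Sequence[float]) -> float:
    """Mean time to recovery: the average violation duration in seconds."""
    if not durations:
        return 0.0
    return sum(durations) / len(durations)


def compliance_percentage(durations: Sequence[float], period_seconds: float) -> float:
    """Share of the reporting period during which the SLA was met."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    violated = min(sum(durations), period_seconds)
    return 100.0 * (1 - violated / period_seconds)
```

Two violations lasting 507 and 93 seconds give an MTTR of 300 seconds; over a one-hour period they correspond to roughly 83.3% compliance.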

Custom Reports

Create ad-hoc reports for specific date ranges:

Select date range

Choose start and end dates for the report period.

Select SLAs

Include all SLAs or filter to specific workflows.

Choose format

Download as PDF, CSV, or JSON.

Add annotations

Include notes about known incidents, planned maintenance, or process changes.

Exporting Data


# Export SLA compliance data as JSON
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:3500/api/sla/report?start=2024-03-01&end=2024-03-31&format=json" \
  -o sla-report.json
 
# Export violation events as CSV
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:3500/api/sla/violations?start=2024-03-01&end=2024-03-31&format=csv" \
  -o violations.csv
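The same exports can be scripted with Python's standard library. This sketch mirrors the curl commands above, reusing their host, paths, and query parameters; the helper names and the `TOKEN` environment variable are assumptions.

```python
import os
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:3500"  # host and port taken from the curl examples


def build_export_request(path: str, start: str, end: str, fmt: str,
                         token: str) -> urllib.request.Request:
    """Build an authenticated GET request for an SLA export endpoint."""
    query = urllib.parse.urlencode({"start": start, "end": end, "format": fmt})
    return urllib.request.Request(
        f"{BASE_URL}{path}?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def download(request: urllib.request.Request, dest: str) -> None:
    """Send the request and write the response body to dest."""
    with urllib.request.urlopen(request, timeout=30) as response, open(dest, "wb") as out:
        out.write(response.read())


def export_march_report() -> None:
    # Assumes the bearer token is available in the TOKEN environment variable.
    request = build_export_request("/api/sla/report", "2024-03-01", "2024-03-31",
                                   "json", os.environ["TOKEN"])
    download(request, "sla-report.json")
```

Separating request construction from the download makes it easy to inspect the URL and headers before any network call is made.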

Alerting Integrations

M3 Forge integrates with popular incident management platforms:

Slack

Post SLA violation alerts to Slack channels:


{
  "channel": "#production-alerts",
  "webhook_url": "https://hooks.slack.com/services/...",
  "message_template": "⚠️ SLA Violation: {{sla.name}}\nWorkflow: {{workflow.name}}\nMetric: {{metric}} = {{value}} (target: {{target}})\nDuration: {{duration}}"
}
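To preview how a message template will read, the `{{...}}` placeholders can be filled in locally. The renderer below is a hypothetical stand-in, not the template engine M3 Forge uses: it assumes placeholders are plain dotted paths into a context dictionary and supports nothing else.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} with the matching value from context."""
    def lookup(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError for unknown placeholders
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)
```

For example, `render("{{sla.name}}: {{metric}} = {{value}}", {"sla": {"name": "Invoice Extraction SLA"}, "metric": "p95_runtime", "value": 8234})` produces a one-line summary you can compare against what arrives in Slack.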

PagerDuty

Trigger incidents with automatic escalation:


{
  "integration_key": "R_KPMLK...",
  "service_id": "PSERVICE123",
  "severity": "high",
  "escalation_policy": "engineering-on-call"
}

Email

Send detailed violation emails to team distribution lists:


{
  "recipients": ["team@example.com", "oncall@example.com"],
  "subject_template": "[SLA VIOLATION] {{sla.name}} - {{workflow.name}}",
  "include_attachments": true
}

Webhooks

POST violation events to custom endpoints:


POST https://your-api.com/sla-webhook
Content-Type: application/json
 
{
  "event_type": "sla.violation.started",
  "sla_id": "sla_invoice_latency",
  "workflow_id": "invoice-extraction-v2",
  "metric": "p95_runtime",
  "target": 5000,
  "measured_value": 8234,
  "timestamp": "2024-03-19T14:23:15Z"
}
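On the receiving side, it is worth validating the payload before acting on it. The sketch below checks for the fields shown in the sample payload above; treating all of them as required is an assumption, and the function names are hypothetical.

```python
import json

# Fields present in the sample payload; assumed here to always be sent.
REQUIRED_FIELDS = ("event_type", "sla_id", "workflow_id", "metric",
                   "target", "measured_value", "timestamp")


def parse_violation_event(body: bytes) -> dict:
    """Decode a webhook body and confirm the expected fields are present."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return payload


def summarize(payload: dict) -> str:
    """One-line description suitable for a log entry."""
    deviation = payload["measured_value"] - payload["target"]
    return (f"{payload['event_type']} {payload['sla_id']}: "
            f"{payload['metric']} off target by {deviation}")
```

A handler built on this can reject malformed requests with a 400 response and pass valid events to whatever ticketing or automation system sits behind the endpoint.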

Best Practices

Setting Realistic Targets

Avoid setting SLA targets that are too strict. Use historical performance data to establish achievable thresholds.

Recommended approach:

Baseline measurement - Run workflow for 1 week without SLAs to collect data
Analyze percentiles - Use p95 or p99 latency as baseline, not average
Add buffer - Set SLA threshold 20-30% looser than observed baseline
Iterate - Tighten thresholds gradually as performance improves
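The baseline-plus-buffer approach can be sketched with the standard library. The function names and the 25% default buffer (the midpoint of the 20-30% range) are illustrative, and the runtimes are assumed to come from the week of baseline data.

```python
import statistics
from typing import Sequence


def baseline_p95(runtimes_ms: Sequence[float]) -> float:
    """95th percentile of observed runtimes (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(runtimes_ms, n=100, method="inclusive")[94]


def suggested_threshold(baseline: float, buffer_fraction: float = 0.25) -> float:
    """Loosen the observed baseline by the given fraction."""
    if buffer_fraction < 0:
        raise ValueError("buffer_fraction must not be negative")
    return baseline * (1 + buffer_fraction)
```

A workflow whose observed p95 is 4000 ms would get a starting threshold of 5000 ms with the default buffer, which can then be tightened as performance data accumulates.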

Evaluation Windows

Choose evaluation windows that balance responsiveness and noise reduction:

Window	Use Case
1-5 minutes	Critical user-facing workflows requiring fast detection
15-30 minutes	Background batch processing workflows
1 hour	Availability SLAs with expected transient failures

Shorter windows detect issues faster but may trigger false alarms from momentary spikes.

Notification Delay

Add a grace period to reduce alert fatigue:

No delay - For critical SLAs where every second of violation matters
5 minutes - Filters transient spikes from brief gateway slowdowns
15 minutes - For less critical workflows or known-flaky dependencies

Multi-Tier SLAs

Define different SLAs for different workflow priority tiers:


{
  "critical_workflows": {
    "latency_target": 3000,
    "error_rate_target": 1.0,
    "availability_target": 99.95
  },
  "standard_workflows": {
    "latency_target": 10000,
    "error_rate_target": 5.0,
    "availability_target": 99.5
  }
}

Tag each workflow with a priority level and assign the matching SLA tier.

Troubleshooting Violations

When an SLA violation occurs:

Check the violation timeline

View the compliance chart to see when the violation started and how severe it was.

Correlate with other metrics

Cross-reference with:

LLM Observability for cost or latency spikes
Insights for failure rate changes
Traces for specific failed calls

Inspect impacted runs

Navigate to the Runs view filtered to the violation time window.

Identify root cause

Common causes:

Gateway timeout - Marie-AI gateway overloaded or network issue
External dependency - Database slow query, S3 unavailability
Data quality - Input data triggered edge case or validation failure
Resource contention - Too many concurrent executions

Implement fix

Address root cause with code changes, configuration tuning, or infrastructure scaling.

Verify resolution

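Monitor the SLA dashboard to confirm the metric returns to within its target (below the threshold for latency and error rate, above it for availability and throughput).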

Next Steps

Set up LLM Observability to track cost drivers
Configure Insights alerts for failure rate thresholds
Review Traces when investigating SLA violations
Define escalation policies for on-call rotation