Skip to Content
MonitoringSLA Monitoring

SLA Monitoring

Monitor service level agreement compliance and threshold violations for critical workflows.

Overview

SLA (Service Level Agreement) monitoring ensures that your AI workflows meet defined performance and reliability commitments. It provides:

  • Real-time compliance tracking against SLA targets
  • Automated alerting when thresholds are breached
  • Historical compliance reports for auditing and accountability
  • Multi-dimensional SLAs (latency, availability, error rate, throughput)

SLA monitoring bridges operational metrics (Insights) with business commitments, enabling you to proactively manage service quality.

SLA Monitoring dashboard showing compliance gauges, violation alerts, and threshold configuration

What is an SLA?

An SLA defines measurable performance targets for a workflow or service:

SLA TypeMetricExample Target
AvailabilityUptime percentage99.9% (43 min downtime/month)
LatencyResponse time percentilep95 latency < 5 seconds
Error RateFailed executionsFailure rate < 2%
ThroughputRequests per secondHandle 100 runs/minute peak load

M3 Forge monitors these targets in real-time and alerts when violations occur.

SLA Configuration

Navigate to Monitoring → SLA Monitoring and click “Create SLA” to define a new agreement.

Basic Configuration

{ "name": "Invoice Extraction SLA", "workflow_id": "invoice-extraction-v2", "enabled": true, "description": "Production SLA for invoice processing pipeline" }

Targets

Define one or more performance targets:

{ "type": "latency", "metric": "p95_runtime", "threshold": 5000, "unit": "ms", "evaluation_window": "5m" }

Explanation:

  • 95th percentile runtime must be under 5 seconds
  • Evaluated over a rolling 5-minute window
  • Violation if p95 exceeds threshold for entire window

Alerting

Configure how and when to notify on SLA violations:

{ "alert": { "channels": ["slack", "email", "pagerduty"], "severity": "high", "notification_delay": "5m", "escalation_policy": "on-call-rotation" } }

Options:

  • channels - Where to send alerts (Slack, email, PagerDuty, webhooks)
  • severity - low, medium, high, critical (affects urgency)
  • notification_delay - Grace period before alerting (reduces noise from transient issues)
  • escalation_policy - Who to notify if violation persists (defined in Settings → On-Call)

Time-Based Rules

Define SLA windows that vary by time of day or day of week:

{ "schedules": [ { "name": "Business hours", "cron": "0 8-18 * * 1-5", "timezone": "America/New_York", "targets": [ {"type": "latency", "threshold": 3000} ] }, { "name": "Off-hours", "cron": "0 0-7,19-23 * * *", "timezone": "America/New_York", "targets": [ {"type": "latency", "threshold": 10000} ] } ] }

Use case: Stricter latency requirements during business hours when users are actively waiting for results.

SLA Dashboard

The SLA Monitoring dashboard displays real-time compliance status:

Summary Cards

  • Active SLAs - Count of enabled SLAs
  • Violations (Last 24h) - Number of threshold breaches
  • Compliance Score - Percentage of time SLAs were met
  • MTTD (Mean Time to Detect) - Average delay before violation detection

SLA Table

SLAWorkflowStatusCurrent ValueTargetLast Violation
Invoice latencyinvoice-extraction-v2Healthy3.2s< 5s2 days ago
Customer onboarding error ratecustomer-onboardingWarning1.8%< 2%3 hours ago
Document classification availabilitydoc-classifyViolated98.5%99.9%12 min ago

Status indicators:

  • Healthy (green) - Currently meeting SLA targets
  • Warning (yellow) - Approaching threshold (within 10%)
  • Violated (red) - Currently breaching SLA

Click any row to see detailed compliance history and violation timeline.

Compliance Chart

Time-series visualization showing:

  • Target threshold - Horizontal line showing SLA limit
  • Actual metric - Line chart of measured performance
  • Violation periods - Red shaded regions where SLA was breached
  • Warning zone - Yellow shaded region near threshold

Zoom into specific time ranges to analyze violation patterns.

Violation Detection

M3 Forge uses a time-window-based detection algorithm:

Collect metric samples

Every 30 seconds, the monitoring system queries the database for workflow execution metrics.

Calculate aggregate

Compute the metric value over the evaluation window (e.g., p95 latency for last 5 minutes).

Compare to threshold

If aggregate value exceeds (or falls below for throughput) the SLA target, mark as potential violation.

Apply grace period

If the violation persists for the configured notification_delay, trigger an alert.

Send notifications

Dispatch alerts to all configured channels (Slack, email, PagerDuty).

Track violation

Record violation event in database with start time, end time, duration, and max deviation from target.

Violation Events

Each SLA violation is logged as an event with full context:

{ "violation_id": "v_abc123", "sla_id": "sla_invoice_latency", "workflow_id": "invoice-extraction-v2", "started_at": "2024-03-19T14:23:15Z", "ended_at": "2024-03-19T14:31:42Z", "duration_seconds": 507, "metric": "p95_runtime", "target": 5000, "measured_value": 8234, "max_deviation": 3234, "severity": "high", "root_cause": "Gateway timeout spike", "resolution": "Increased gateway concurrency limit" }

Violation Details

View full details for any violation:

  • Timeline - Minute-by-minute metric values during violation
  • Impacted runs - List of workflow executions affected
  • Correlated events - Other system events at the same time (deployments, alerts)
  • Root cause analysis - AI-suggested probable causes based on patterns

Acknowledging Violations

Mark violations as acknowledged to track incident response:

  • Acknowledge - Team is aware and investigating
  • Assign - Assign to specific team member
  • Resolve - Mark as fixed with resolution notes

This creates an audit trail for post-incident review.

Compliance Reports

Generate compliance reports for stakeholder communication:

Monthly SLA Report

Automatically generated report showing:

  • SLA compliance percentage for each workflow
  • Number of violations and total downtime
  • MTTR (Mean Time to Recovery) - How quickly violations were resolved
  • Trend analysis - Compliance improving or degrading?

Custom Reports

Create ad-hoc reports for specific date ranges:

Select date range

Choose start and end dates for the report period.

Select SLAs

Include all SLAs or filter to specific workflows.

Choose format

Download as PDF, CSV, or JSON.

Add annotations

Include notes about known incidents, planned maintenance, or process changes.

Exporting Data

# Export SLA compliance data as JSON curl -H "Authorization: Bearer $TOKEN" \ "http://localhost:3500/api/sla/report?start=2024-03-01&end=2024-03-31&format=json" \ -o sla-report.json # Export violation events as CSV curl -H "Authorization: Bearer $TOKEN" \ "http://localhost:3500/api/sla/violations?start=2024-03-01&end=2024-03-31&format=csv" \ -o violations.csv

Alerting Integrations

M3 Forge integrates with popular incident management platforms:

Slack

Post SLA violation alerts to Slack channels:

{ "channel": "#production-alerts", "webhook_url": "https://hooks.slack.com/services/...", "message_template": "⚠️ SLA Violation: {{sla.name}}\nWorkflow: {{workflow.name}}\nMetric: {{metric}} = {{value}} (target: {{target}})\nDuration: {{duration}}" }

PagerDuty

Trigger incidents with automatic escalation:

{ "integration_key": "R_KPMLK...", "service_id": "PSERVICE123", "severity": "high", "escalation_policy": "engineering-on-call" }

Email

Send detailed violation emails to team distribution lists:

{ "recipients": ["team@example.com", "oncall@example.com"], "subject_template": "[SLA VIOLATION] {{sla.name}} - {{workflow.name}}", "include_attachments": true }

Webhooks

POST violation events to custom endpoints:

POST https://your-api.com/sla-webhook Content-Type: application/json { "event_type": "sla.violation.started", "sla_id": "sla_invoice_latency", "workflow_id": "invoice-extraction-v2", "metric": "p95_runtime", "target": 5000, "measured_value": 8234, "timestamp": "2024-03-19T14:23:15Z" }

Best Practices

Setting Realistic Targets

Avoid setting SLA targets that are too strict. Use historical performance data to establish achievable thresholds.

Recommended approach:

  1. Baseline measurement - Run workflow for 1 week without SLAs to collect data
  2. Analyze percentiles - Use p95 or p99 latency as baseline, not average
  3. Add buffer - Set SLA threshold 20-30% looser than observed baseline
  4. Iterate - Tighten thresholds gradually as performance improves

Evaluation Windows

Choose evaluation windows that balance responsiveness and noise reduction:

WindowUse Case
1-5 minutesCritical user-facing workflows requiring fast detection
15-30 minutesBackground batch processing workflows
1 hourAvailability SLAs with expected transient failures

Shorter windows detect issues faster but may trigger false alarms from momentary spikes.

Notification Delay

Add a grace period to reduce alert fatigue:

  • No delay - For critical SLAs where every second of violation matters
  • 5 minutes - Filters transient spikes from brief gateway slowdowns
  • 15 minutes - For less critical workflows or known-flaky dependencies

Multi-Tier SLAs

Define different SLAs for different workflow priority tiers:

{ "critical_workflows": { "latency_target": 3000, "error_rate_target": 1.0, "availability_target": 99.95 }, "standard_workflows": { "latency_target": 10000, "error_rate_target": 5.0, "availability_target": 99.5 } }

Tag workflows with priority level and assign appropriate SLA tier.

Troubleshooting Violations

When an SLA violation occurs:

Check the violation timeline

View the compliance chart to see when the violation started and how severe it was.

Correlate with other metrics

Cross-reference with:

Inspect impacted runs

Navigate to the Runs view filtered to the violation time window.

Identify root cause

Common causes:

  • Gateway timeout - Marie-AI gateway overloaded or network issue
  • External dependency - Database slow query, S3 unavailability
  • Data quality - Input data triggered edge case or validation failure
  • Resource contention - Too many concurrent executions

Implement fix

Address root cause with code changes, configuration tuning, or infrastructure scaling.

Verify resolution

Monitor SLA dashboard to ensure metric returns below threshold.

Next Steps

Last updated on