Insights

Analyze workflow performance with metrics, trends, and failure rate tracking.

Overview

The Insights dashboard provides high-level analytics for workflow execution performance. It aggregates data across all workflow runs to help you:

  • Track success rates to ensure reliability
  • Measure time saved through automation
  • Identify problematic workflows with high failure rates
  • Analyze trends to detect degradation or improvement
  • Plan capacity based on execution volume

Unlike LLM Observability (which focuses on AI costs) and Traces (which focus on individual calls), Insights focuses on workflow-level operational metrics.

[Figure: Insights dashboard showing total runs, failure rate, time saved metrics, and workflow breakdown table]

Accessing the Dashboard

Navigate to Monitoring → Insights in the sidebar or visit /insights directly.

The dashboard displays:

  • Summary metrics - Total runs, failed count, failure rate, time saved, average runtime
  • Date range selector - Last 7, 30, or 90 days, or a custom range
  • Workflow breakdown table - Metrics per workflow with sorting and filtering
  • Trend charts - Time-series visualization of runs, failures, and latency

Key Metrics

Total Runs

The total number of workflow executions in the selected date range.

Includes:

  • Manual trigger executions
  • Scheduled (cron) executions
  • Event-triggered executions
  • API-initiated executions

Excludes:

  • Test runs in the canvas editor
  • Dry-run validations
  • Executions aborted before the first node runs

Use this metric to:

  • Understand workload - Track execution volume over time
  • Validate triggers - Ensure scheduled workflows are running as expected
  • Measure adoption - See which workflows are actively used

Failed Count

The number of workflow runs that ended in failed status.

A run is considered failed if:

  • Any node throws an unhandled exception
  • A Guardrail node routes to a terminal failure path
  • Execution times out (exceeds the configured maximum duration)
  • An external dependency is unavailable (database, API, gateway)

Runs that are manually stopped or cancelled are not counted as failures.

Use this metric to:

  • Identify unstable workflows - High failure counts indicate reliability issues
  • Prioritize debugging - Focus on the workflows that fail most often
  • Track improvement - Measure impact of bug fixes and error handling

Failure Rate

The percentage of runs that failed: (failed_count / total_runs) * 100.

Thresholds:

Rate     Status     Action
< 1%     Healthy    No action needed
1-5%     Warning    Monitor for trends
5-10%    Degraded   Investigate root causes
> 10%    Critical   Immediate debugging required
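
The formula and thresholds can be sketched in a few lines of Python. This is an illustration, not the dashboard's implementation: the function names are hypothetical, the handling of a zero-run range is an assumption, and because the table's bands share the boundary values 5% and 10%, the sketch assumes that exactly 5% and exactly 10% both count as Degraded.

```python
def failure_rate(failed_count: int, total_runs: int) -> float:
    """Return the failure rate as a percentage: (failed_count / total_runs) * 100."""
    if total_runs == 0:
        # Assumption: a range with no runs reports 0% instead of raising an error.
        return 0.0
    return failed_count / total_runs * 100


def status_for(rate: float) -> str:
    """Map a failure-rate percentage to a status from the threshold table."""
    if rate < 1:
        return "Healthy"
    if rate < 5:
        return "Warning"
    if rate <= 10:  # assumption: the shared boundaries 5% and 10% are Degraded
        return "Degraded"
    return "Critical"


# Sample counts: 89 failures out of 1,567 runs.
rate = failure_rate(89, 1567)
print(f"{rate:.2f}% -> {status_for(rate)}")  # 5.68% -> Degraded
```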

Use this metric to:

  • Compare workflows - Which pipelines are most reliable?
  • Set SLA targets - Define acceptable failure rates for critical workflows
  • Detect regressions - Catch new bugs introduced by changes

Time Saved

The total time saved through automation, calculated as:

time_saved = total_runs * estimated_manual_time_per_run

Where estimated_manual_time_per_run is configured per workflow in Settings (default: 5 minutes).

Example:

  • 1,000 runs of invoice processing workflow
  • Manual processing time: 10 minutes per invoice
  • Time saved: 10,000 minutes (166.7 hours)
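
The same arithmetic as a minimal Python sketch. The function name is hypothetical; the 5-minute default mirrors the per-workflow default described above.

```python
def time_saved_minutes(total_runs: int, manual_minutes_per_run: float = 5.0) -> float:
    """time_saved = total_runs * estimated_manual_time_per_run, in minutes."""
    return total_runs * manual_minutes_per_run


# 1,000 invoice runs at an estimated 10 manual minutes each.
minutes = time_saved_minutes(1_000, manual_minutes_per_run=10)
print(f"{minutes:,.0f} minutes ({minutes / 60:.1f} hours)")  # 10,000 minutes (166.7 hours)
```

Note that the formula multiplies by total runs, so the estimate is only as good as the configured manual time per run.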

Use this metric to:

  • Justify automation - Quantify ROI for stakeholders
  • Prioritize optimization - Focus on high-volume workflows
  • Track efficiency - Measure cumulative time saved over months

Average Runtime

The mean duration of all successful workflow runs in the selected date range.

Calculated as:

average_runtime = SUM(duration of successful runs) / COUNT(successful runs)
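
As a rough Python sketch of this calculation: the run records below and their `status` and `duration_s` keys are made up for illustration and are not the product's data model. The point is that failed runs are filtered out before averaging.

```python
from statistics import mean
from typing import Optional


def average_runtime(runs: list) -> Optional[float]:
    """Mean duration in seconds over successful runs; None if there are none."""
    durations = [r["duration_s"] for r in runs if r["status"] == "success"]
    return mean(durations) if durations else None


runs = [
    {"status": "success", "duration_s": 4.0},
    {"status": "success", "duration_s": 5.0},
    {"status": "failed", "duration_s": 60.0},  # excluded: not a successful run
]
print(average_runtime(runs))  # 4.5
```

Because failed runs are excluded, a workflow whose slow runs time out can show a stable average runtime while its failure rate climbs.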

Interpretation:

  • Increasing over time - Possible performance regression or data volume growth
  • High variance - Workflow performance is inconsistent (investigate node-level latency)
  • Correlation with failures - Slow runs may be timing out

Use this metric to:

  • Benchmark performance - Establish baseline for expected runtime
  • Detect bottlenecks - Compare with node-level timing from Traces
  • Optimize workflows - Identify candidates for parallelization or caching

Workflow Breakdown Table

The breakdown table lists all workflows with per-workflow metrics:

Workflow                 Runs    Failed  Failure Rate  Avg Runtime  Time Saved
Invoice extraction       2,341   23      0.98%         4.2s         195 hrs
Customer onboarding      1,567   89      5.68%         12.1s        130 hrs
Document classification  987     3       0.30%         1.8s         82 hrs

Sorting

Click column headers to sort by:

  • Runs - Find the most frequently executed workflows
  • Failed - Identify the workflows with the most failures
  • Failure Rate - Prioritize unreliable workflows
  • Avg Runtime - Find slow workflows
  • Time Saved - See which workflows provide the most value

Filtering

Filter the table by:

  • Workflow name - Text search (autocomplete)
  • Status - Show only healthy, degraded, or critical workflows
  • Tags - Custom tags assigned to workflows (e.g., “production”, “experimental”)

Drill-Down

Click any workflow row to:

  • View DAG - Open the workflow canvas editor
  • See all runs - Navigate to filtered Runs view
  • Inspect traces - Jump to LLM Observability for cost analysis

Trend Analysis

The trend charts visualize metrics over time to detect patterns and anomalies.

Runs Over Time

Line chart showing daily execution volume:

  • Peaks - Identify days with unusually high activity
  • Valleys - Detect missing scheduled executions
  • Trend line - See if usage is growing or declining

Use this chart to:

  • Validate scheduled trigger configuration
  • Correlate spikes with external events (marketing campaigns, product launches)
  • Plan infrastructure scaling based on growth trends

[Figure: Trend charts showing runs over time, failure rate over time, and average runtime over time]

Failure Rate Over Time

Line chart with failure rate percentage:

  • Spikes - Indicate new bugs or external dependency outages
  • Gradual increase - May signal data quality degradation
  • Step changes - Correlate with deployments or config changes

Hover over any point to see the exact date, failure rate, and raw counts.

Use this chart to:

  • Detect regressions immediately after releases
  • Correlate failures with external events (API downtime, gateway issues)
  • Measure impact of bug fixes (failure rate should decrease)

Average Runtime Over Time

Line chart showing mean execution duration:

  • Increasing - Possible performance regression
  • Decreasing - Optimization improvements
  • Spikes - Transient latency issues (gateway slow, database contention)

Use this chart to:

  • Benchmark before/after optimization efforts
  • Detect performance degradation early
  • Correlate with LLM latency from Traces

Date Range Filtering

Select different time windows to analyze metrics:

Predefined Ranges

  • Last 7 days - For daily operational monitoring
  • Last 30 days - For monthly performance reviews
  • Last 90 days - For quarterly trend analysis

Custom Range

Pick arbitrary start and end dates for:

  • Comparing specific periods - Week before vs. after a release
  • Isolating incidents - Narrow to the exact time window of an outage
  • Quarterly reporting - Match fiscal calendar

Comparison Mode

Toggle “Compare with previous period” to overlay:

  • Current vs. previous week
  • Current vs. previous month
  • Current vs. same period last year

This highlights:

  • Growth - Is execution volume increasing?
  • Regression - Did failure rate get worse?
  • Seasonality - Are there predictable patterns?
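
To reproduce a period-over-period comparison outside the dashboard, one possible approach is sketched below. It assumes that "previous period" means the equally long window ending the day before the current range starts; the dashboard's exact definition is not documented here, and both function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def previous_period(start: date, end: date) -> Tuple[date, date]:
    """Return the same-length window that ends the day before `start` (inclusive dates)."""
    length = end - start + timedelta(days=1)
    return start - length, start - timedelta(days=1)


def percent_change(current: float, previous: float) -> Optional[float]:
    """Relative change in percent; None when the previous value is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100


prev_start, prev_end = previous_period(date(2024, 3, 13), date(2024, 3, 19))
print(prev_start, prev_end)        # 2024-03-06 2024-03-12
print(percent_change(1200, 1000))  # 20.0
```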

Alerts and Thresholds

Configure automatic alerts based on Insights metrics:

Go to Settings → Monitoring → Insights Alerts, then:

  • Create alert rule - Define a condition (e.g., “Failure rate > 5% for 1 hour”).
  • Choose notification channel - Select Slack, email, PagerDuty, or webhook.
  • Set severity - Choose warning, error, or critical based on business impact.

Example Alert Rules

{
  "alerts": [
    {
      "name": "High failure rate - Invoice extraction",
      "condition": "failure_rate > 5%",
      "workflow": "invoice-extraction",
      "duration": "1h",
      "notification": "slack",
      "severity": "error"
    },
    {
      "name": "Slow execution - Customer onboarding",
      "condition": "avg_runtime > 30s",
      "workflow": "customer-onboarding",
      "duration": "30m",
      "notification": "email",
      "severity": "warning"
    }
  ]
}
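
To make the `condition` strings concrete, here is a hypothetical evaluator for conditions of the form `metric operator threshold`. It is only a sketch of how such a rule could be checked against a metrics snapshot. How the platform actually parses conditions is not documented here, the sample metric values are invented, and the sketch ignores the `duration` field (how long the condition must hold).

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
# Matches e.g. "failure_rate > 5%" or "avg_runtime > 30s"; the unit suffix is optional.
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*([\d.]+)\s*(%|s)?\s*$")


def condition_met(condition: str, metrics: dict) -> bool:
    """Check one condition string against a dict of current metric values."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    metric, op, threshold, _unit = match.groups()
    return _OPS[op](metrics[metric], float(threshold))


# Invented snapshot: failure rate in percent, average runtime in seconds.
metrics = {"failure_rate": 5.68, "avg_runtime": 12.1}
print(condition_met("failure_rate > 5%", metrics))  # True
print(condition_met("avg_runtime > 30s", metrics))  # False
```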

Exporting Data

Export Insights metrics for external analysis or reporting:

CSV Export

Download workflow breakdown table:

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:3500/api/insights/export?format=csv&start=2024-03-01&end=2024-03-19" \
  -o insights.csv

JSON Export

Programmatic access to metrics:

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:3500/api/insights/export?format=json&start=2024-03-01&end=2024-03-19" \
  | jq '.workflows[] | {name, failure_rate, avg_runtime}'
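
The same export can be consumed from a script. The sketch below uses only the Python standard library and the endpoint, query parameters, and field names (`workflows`, `name`, `failure_rate`, `avg_runtime`) that appear in the curl command above. It assumes `failure_rate` is a numeric percentage, which the export format may or may not guarantee; the helper names and the sample payload are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3500/api/insights/export"


def fetch_export(start: str, end: str, token: str) -> dict:
    """Download the JSON export for an inclusive YYYY-MM-DD date range."""
    url = f"{BASE_URL}?format=json&start={start}&end={end}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unreliable_workflows(payload: dict, max_failure_rate: float = 5.0) -> list:
    """Return workflows above the failure-rate limit, worst first."""
    flagged = [w for w in payload["workflows"] if w["failure_rate"] > max_failure_rate]
    return sorted(flagged, key=lambda w: w["failure_rate"], reverse=True)


# Sample payload shaped like the jq filter's input; a real one would come from
# fetch_export("2024-03-01", "2024-03-19", token).
sample = {
    "workflows": [
        {"name": "invoice-extraction", "failure_rate": 0.98, "avg_runtime": 4.2},
        {"name": "customer-onboarding", "failure_rate": 5.68, "avg_runtime": 12.1},
    ]
}
for workflow in unreliable_workflows(sample):
    print(workflow["name"], workflow["failure_rate"])  # customer-onboarding 5.68
```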

Dashboard Snapshots

Save a point-in-time snapshot for quarterly reviews:

  • Click “Export Snapshot” in the dashboard header
  • Select date range and workflows to include
  • Download as PDF or share link with stakeholders

Performance Optimization

Use Insights to identify optimization opportunities:

Finding Slow Workflows

  • Sort by Average Runtime - Click the “Avg Runtime” column header (descending).
  • Identify outliers - Look for workflows that average more than 10s when most others stay under 5s.
  • Drill into Traces - Click the workflow row and select “View Traces” to see the LLM latency breakdown.
  • Optimize bottlenecks - Reduce prompt size, enable caching, or parallelize independent nodes.
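
When working from exported data instead of the table, the outlier check can be approximated with a simple heuristic. The sketch below flags workflows whose average runtime exceeds twice the median across workflows; the factor of 2 and the function name are assumptions chosen to mirror the ">10s versus <5s" rule of thumb, not a rule the dashboard applies.

```python
from statistics import median


def slow_outliers(avg_runtimes: dict, factor: float = 2.0) -> list:
    """Return names of workflows whose average runtime exceeds factor * median."""
    if not avg_runtimes:
        return []
    cutoff = factor * median(avg_runtimes.values())
    return sorted(name for name, seconds in avg_runtimes.items() if seconds > cutoff)


# Average runtimes in seconds, taken from the breakdown table example.
runtimes = {
    "Invoice extraction": 4.2,
    "Customer onboarding": 12.1,
    "Document classification": 1.8,
}
print(slow_outliers(runtimes))  # ['Customer onboarding']
```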

Reducing Failures

  • Sort by Failure Rate - Click the “Failure Rate” column header (descending).
  • Examine failure logs - Navigate to Runs view and filter by failed status.
  • Identify root causes - Common issues: invalid data, timeout, external API errors.
  • Add error handling - Implement retry logic, input validation, or graceful degradation.

Best Practices

Daily Monitoring

Check the Insights dashboard daily to:

  • Verify overnight scheduled workflows completed successfully
  • Catch failure rate spikes early
  • Identify unexpected execution volume changes

Weekly Reviews

Run weekly performance reviews:

  • Compare metrics week-over-week
  • Investigate any workflows with degraded performance
  • Celebrate improvements (lower failure rates, faster runtimes)

Monthly Reporting

Generate monthly reports for stakeholders:

  • Total time saved through automation
  • Reliability improvements (failure rate trends)
  • Top-performing workflows (high runs, low failures)

Correlation with Deployments

After any release or configuration change:

  • Check Insights for failure rate changes
  • Compare average runtime before and after
  • Roll back if metrics degrade significantly
