Infrastructure

Runtime operations layer for monitoring deployed services, managing gateway connections, and debugging live systems.

What is Infrastructure?

The Infrastructure section provides real-time visibility into running M3 Forge services and their connections to Marie-AI backend gateways. It focuses on operational monitoring: which services are deployed, how they’re performing, and what’s connected right now.

Infrastructure enables you to:

  • Monitor executor deployments across multiple gateways with health status and capacity tracking
  • Manage webapp runners that host Gradio, Streamlit, and custom web applications
  • Track gateway connections with health checks and bearer token authentication
  • Register MCP servers to extend agent capabilities with external tools and services
  • View live events from workflow executions and system operations
  • Inspect capacity utilization to plan scaling and resource allocation
  • Debug production issues with log aggregation and trace inspection
  • Query the service registry to discover available endpoints and capabilities

Figure: Infrastructure dashboard showing deployments, gateways, capacity metrics, and system health status.

Infrastructure vs Administration vs Self-Hosting

These three sections differ in focus:

| Section | Focus | Examples |
|---|---|---|
| Infrastructure | Runtime operations: monitoring what’s running now | Deployment status, executor health, capacity utilization, live logs |
| Administration | Settings and configuration: managing users, roles, API keys | User management, RBAC, LLM provider connections, workspace settings |
| Self-Hosting | Installation and setup: how to deploy M3 Forge | Docker Compose, environment variables, database migrations, gateway config |

If you’re asking “What’s running and how is it performing?” — that’s Infrastructure.

If you’re asking “Who has access and what are my settings?” — that’s Administration.

If you’re asking “How do I install and configure M3 Forge?” — that’s Self-Hosting.

Key Capabilities

Deployment Monitoring

Track executor deployments across all configured gateways. Each deployment shows:

  • Status metrics - Total executors, nodes, active deployments, healthy percentage
  • Executor details - Name, status (SERVING/STOPPED/ERROR), node count, heartbeat timestamps
  • Capacity tracking - Total/used/available slots, utilization percentages
  • Desired states - Expected executor configuration grouped by epoch
  • Multi-gateway support - Switch between gateways with health testing before data fetch

See Deployments for comprehensive monitoring features.
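
The status metrics listed above can be derived from a list of executor records. The following is a minimal sketch, not the product's implementation: the `Executor` record and its field names are hypothetical, and it assumes an executor counts as healthy when its status is SERVING.

```python
from dataclasses import dataclass


# Hypothetical record; field names are illustrative, not the gateway's schema.
@dataclass
class Executor:
    name: str
    status: str       # "SERVING", "STOPPED", or "ERROR"
    node_count: int


def summarize(executors: list[Executor]) -> dict:
    """Compute total executors, total nodes, and the healthy percentage."""
    total = len(executors)
    serving = sum(1 for e in executors if e.status == "SERVING")
    return {
        "total_executors": total,
        "total_nodes": sum(e.node_count for e in executors),
        # Assumption: "healthy" means the executor is SERVING.
        "healthy_pct": round(100.0 * serving / total, 1) if total else 0.0,
    }


sample = [
    Executor("ocr", "SERVING", 2),
    Executor("classifier", "SERVING", 1),
    Executor("extractor", "ERROR", 1),
    Executor("splitter", "STOPPED", 0),
]
summary = summarize(sample)
```

With this sample, two of four executors are SERVING, so `summary["healthy_pct"]` is 50.0.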

Webapp Management

Manage webapp runners and deployed web applications with full lifecycle control:

  • Webapp lifecycle - Create, start, stop, redeploy web applications
  • App types - Gradio, Streamlit, FastAPI, static sites, custom Docker containers
  • Git integration - Deploy from Git repositories with branch tracking and status monitoring
  • Log viewing - Real-time logs streamed via SSE for each webapp
  • Runner health - Monitor webapp runner status, capabilities, and active webapp count
  • Deployment history - View past deployments with timestamps and outcomes

See Webapps for webapp deployment and management.

Gateway Connections

Monitor and manage connections to Marie-AI backend gateways:

  • Health checks - Automatic health tests every 5 seconds, with status indicators
  • Gateway configuration - Read from environment variables, conventional paths, or config files
  • Bearer token authentication - Secure gateway access with GATEWAY_TOKEN_<ID> env vars
  • Multi-gateway switching - Select active gateway from dropdown with health validation
  • Connection diagnostics - Test connectivity before attempting data operations

See Gateways for gateway configuration and health monitoring.
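
As an illustration of per-gateway bearer tokens, the sketch below looks up a `GATEWAY_TOKEN_<ID>` variable and builds an `Authorization: Bearer <token>` header. The helper names are hypothetical, and the rule for turning a gateway ID into the `<ID>` suffix (upper-casing, replacing other characters with underscores) is an assumption; check the Gateways page for the actual convention.

```python
import os
import re


def gateway_token_var(gateway_id: str) -> str:
    """Build the env var name that holds a gateway's bearer token.

    Assumption: the ID is upper-cased and non-alphanumeric characters
    become underscores; the real normalization rule may differ.
    """
    return "GATEWAY_TOKEN_" + re.sub(r"[^A-Za-z0-9]", "_", gateway_id).upper()


def auth_headers(gateway_id: str, env=os.environ) -> dict:
    """Return request headers carrying the gateway's bearer token."""
    var = gateway_token_var(gateway_id)
    token = env.get(var)
    if not token:
        raise KeyError(f"{var} is not set")
    return {"Authorization": f"Bearer {token}"}


# Usage with an explicit mapping instead of the real environment.
headers = auth_headers("us-east", env={"GATEWAY_TOKEN_US_EAST": "example-token"})
```

Passing the environment as a parameter keeps the lookup testable without touching real secrets.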

Event Tracking

View and analyze events from workflow executions and system operations:

  • Event streams - Real-time display of workflow events, LLM calls, validation results
  • Event details - Timestamps, event types, payloads, associated workflow/node IDs
  • Filtering - By workflow, event type, time range, severity
  • Vector embeddings - Event payloads stored with embeddings for semantic search
  • Audit trail - Complete history of system events for debugging and compliance

See Events for event monitoring and inspection.
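
The filter dimensions above (workflow, event type, time range, severity) combine as a logical AND. A minimal sketch of that behavior, assuming a hypothetical `Event` shape with illustrative field names and values:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


# Hypothetical event shape; real events carry payloads and node IDs as well.
@dataclass
class Event:
    timestamp: datetime
    event_type: str
    severity: str
    workflow_id: str


def filter_events(
    events: Iterable[Event],
    workflow_id: Optional[str] = None,
    event_type: Optional[str] = None,
    severity: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> list[Event]:
    """Keep events that match every criterion that is not None."""
    def keep(e: Event) -> bool:
        return (
            (workflow_id is None or e.workflow_id == workflow_id)
            and (event_type is None or e.event_type == event_type)
            and (severity is None or e.severity == severity)
            and (start is None or e.timestamp >= start)
            and (end is None or e.timestamp <= end)
        )
    return [e for e in events if keep(e)]


events = [
    Event(datetime(2024, 1, 1, 9, 0), "llm_call", "info", "wf-1"),
    Event(datetime(2024, 1, 1, 9, 5), "validation", "error", "wf-1"),
    Event(datetime(2024, 1, 1, 9, 10), "validation", "error", "wf-2"),
]
errors_wf1 = filter_events(events, workflow_id="wf-1", severity="error")
```

Here `errors_wf1` contains only the second event, the one error belonging to `wf-1`.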

Capacity Management

Track resource utilization across executor deployments:

  • Capacity overview - Total, used, and available capacity across all gateways
  • Slot tracking - Individual capacity holders with utilization percentages
  • Resource planning - Identify over/under-provisioned executors
  • Historical trends - Capacity usage over time to inform scaling decisions

See Capacity for resource tracking and planning.
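
Utilization follows directly from total and used slots. The sketch below computes it and flags over- or under-provisioned executors. The `SlotCapacity` record is hypothetical; the 80% upper threshold mirrors the alert level suggested under Best Practices, while the 20% lower threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


# Hypothetical record; the gateway's capacity payload may look different.
@dataclass
class SlotCapacity:
    name: str
    total: int
    used: int

    @property
    def available(self) -> int:
        return self.total - self.used

    @property
    def utilization_pct(self) -> float:
        return 100.0 * self.used / self.total if self.total else 0.0


def classify(cap: SlotCapacity, high: float = 80.0, low: float = 20.0) -> str:
    """Label an executor by utilization; both thresholds are assumptions."""
    pct = cap.utilization_pct
    if pct > high:
        return "under-provisioned"   # nearly full: needs more slots
    if pct < low:
        return "over-provisioned"    # mostly idle: slots could be reclaimed
    return "ok"


caps = [
    SlotCapacity("ocr", total=10, used=9),
    SlotCapacity("classifier", total=8, used=1),
    SlotCapacity("extractor", total=4, used=2),
]
labels = {c.name: classify(c) for c in caps}
```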

Debugging Tools

Diagnose production issues with comprehensive debugging capabilities:

  • Log aggregation - Centralized logs from all services and executors
  • Trace inspection - Follow request flows across distributed services
  • Real-time updates - SSE-based log streaming with auto-scroll
  • Error analysis - Stack traces, validation failures, timeout events
  • Context reconstruction - Full execution context for workflow runs

See Debug for troubleshooting and diagnostics.

Service Registry

Discover available services, endpoints, and capabilities:

  • Service catalog - All registered services with health status and uptime
  • Endpoint discovery - REST and tRPC endpoints with schemas
  • Capability listings - What each executor/runner supports
  • Version tracking - Service versions and compatibility matrix

See Registry for service discovery and documentation.

Infrastructure data is fetched in real time from Marie-AI backend gateways. Make sure gateways are configured and healthy before using Infrastructure features.

Real-Time Updates

Infrastructure views use Server-Sent Events (SSE) and auto-refresh intervals for real-time monitoring:

  • Health checks - Every 5 seconds for gateway connectivity
  • Deployment status - Every 10 seconds for executor heartbeats
  • Log streams - Real-time SSE for webapp and workflow logs
  • Event updates - Live event tracking with auto-scroll to latest
  • Capacity metrics - Refreshed on demand and every 30 seconds

These intervals keep each view current without a manual refresh.
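
For readers unfamiliar with SSE, a stream is plain text: `event:` and `data:` fields, one per line, with a blank line ending each event. The sketch below is a minimal parser for that wire format, covering only those two fields and comment lines. It is not M3 Forge's client code, and the `log` event name is illustrative.

```python
def parse_sse(lines):
    """Yield (event_type, data) tuples from an iterable of SSE text lines."""
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data:
                yield event_type, "\n".join(data)
            event_type, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is not part of the value
            if field == "event":
                event_type = value
            elif field == "data":
                data.append(value)


stream = [
    ": keep-alive\n",
    "event: log\n",
    "data: webapp started\n",
    "\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
parsed = list(parse_sse(stream))
```

Consecutive `data:` lines are joined with newlines, and an event without an `event:` field defaults to the type `message`.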

Multi-Gateway Architecture

M3 Forge supports connections to multiple Marie-AI backend gateways simultaneously. Each gateway represents a separate cluster or environment.

Gateway selection is available in all Infrastructure views via a dropdown selector. The system tests gateway health before fetching data to avoid timeouts.
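
The health-before-fetch behavior can be pictured as a small guard around the data request. This is a sketch under assumptions: `fetch_if_healthy`, the probe and fetch callables, and the example URLs are all hypothetical stand-ins for real network calls.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_if_healthy(
    gateway_url: str,
    probe: Callable[[str], bool],
    fetch: Callable[[str], T],
) -> Optional[T]:
    """Call fetch only when probe reports the gateway healthy.

    Returns None for an unhealthy gateway so the caller can show an
    error state immediately instead of waiting on a request timeout.
    """
    if not probe(gateway_url):
        return None
    return fetch(gateway_url)


# Stand-ins for real network calls; the URLs are placeholders.
healthy = {"https://gateway-a.example": True, "https://gateway-b.example": False}
calls = []


def probe(url: str) -> bool:
    return healthy.get(url, False)


def fetch(url: str) -> dict:
    calls.append(url)
    return {"gateway": url, "deployments": []}


result_a = fetch_if_healthy("https://gateway-a.example", probe, fetch)
result_b = fetch_if_healthy("https://gateway-b.example", probe, fetch)
```

Only the healthy gateway is ever fetched; `result_b` is `None` and no request is issued for it.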

Best Practices

Gateway Configuration

For production deployments:

  • Use environment variables for bearer tokens - Store GATEWAY_TOKEN_<ID> in secure secrets management
  • Configure multiple gateways - Redundancy and geographic distribution
  • Test connectivity before use - Health checks prevent timeout errors
  • Monitor gateway health - Set up alerts for gateway downtime

Resource Monitoring

Track capacity proactively:

  • Set capacity alerts - Notify when utilization exceeds 80%
  • Review capacity trends - Identify growth patterns for scaling decisions
  • Monitor executor health - Unhealthy nodes reduce available capacity
  • Balance load across gateways - Distribute workloads geographically

Debugging Workflow

When investigating production issues:

  1. Check gateway health - Ensure connectivity to backend
  2. Review recent events - Look for errors or unusual patterns
  3. Inspect deployment status - Verify executors are SERVING and healthy
  4. Examine logs - Use Debug view for centralized log aggregation
  5. Check capacity - Confirm there are enough free slots for the workload
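
The ordering matters: each step rules out a lower-level cause before the next one is examined. One way to express that is a list of checks run in sequence, stopping at the first failure. The check functions below are placeholders, not real M3 Forge calls.

```python
from typing import Callable, Optional


def first_failing_step(steps: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order; return the name of the first that fails, or None."""
    for name, check in steps:
        if not check():
            return name
    return None


# Placeholder checks; real ones would query the gateway, events, and logs.
steps = [
    ("gateway health", lambda: True),
    ("recent events", lambda: True),
    ("deployment status", lambda: False),  # e.g. an executor is not SERVING
    ("logs", lambda: True),
    ("capacity", lambda: True),
]
failed = first_failing_step(steps)
```

In this example `failed` is `"deployment status"`, so the investigation would start there rather than in the logs.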

Security Considerations

Gateway Authentication

  • Bearer tokens required - All gateway API calls use Authorization: Bearer <token> headers
  • Token storage - Tokens stored in environment variables, never in frontend code
  • Token rotation - Update the GATEWAY_TOKEN_* env vars and restart the API server
  • Per-gateway tokens - Each gateway has independent authentication

Access Control

Infrastructure data access is role-based:

  • Admins - Full access to all Infrastructure views and operations
  • Operators - View-only access to deployments, events, capacity
  • Developers - Access to logs and debugging tools for their workflows
  • Viewers - No access to Infrastructure section
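
The role rules above amount to a lookup from role to permitted views. A minimal sketch, assuming hypothetical role and view identifiers; it models view access only and does not capture the restriction of Developers to their own workflows.

```python
# Hypothetical identifiers; the actual role and view names may differ.
ALL_VIEWS = {"deployments", "webapps", "gateways", "events", "capacity", "debug", "registry"}

VIEW_PERMISSIONS = {
    "admin": set(ALL_VIEWS),                           # full access
    "operator": {"deployments", "events", "capacity"}, # view-only
    "developer": {"debug"},                            # logs and debugging tools
    "viewer": set(),                                   # no Infrastructure access
}


def can_view(role: str, view: str) -> bool:
    """Return True if the role may open the given Infrastructure view."""
    return view in VIEW_PERMISSIONS.get(role, set())


operator_sees_capacity = can_view("operator", "capacity")
viewer_sees_events = can_view("viewer", "events")
```

Unknown roles fall through to an empty set, so access is denied by default.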

Audit Logging

All Infrastructure operations are logged:

  • Gateway configuration changes
  • Webapp deployments and lifecycle operations
  • Debug log access with filter criteria
  • Capacity threshold violations
