Scaling
The Agent Server supports progressive scaling from a single development machine to a fully distributed production deployment. Choose a deployment mode based on your workload requirements and operational complexity budget.
Deployment Modes
Single Host
All components run on one machine. The gateway, workers, PostgreSQL, and Redis share the same host. This is the simplest configuration and is suitable for development, testing, and small production workloads.
```
┌──────────────────────────────────────┐
│             Single Host              │
│  ┌────────────┐  ┌────────────────┐  │
│  │  Gateway   │  │    Workers     │  │
│  └────────────┘  └────────────────┘  │
│  ┌────────────┐  ┌────────────────┐  │
│  │ PostgreSQL │  │     Redis      │  │
│  └────────────┘  └────────────────┘  │
└──────────────────────────────────────┘
```

- All components co-located on a single machine
- No network overhead between services
- Limited by the resources of a single host
Split API / Queue
Separate the gateway from the worker processes. The gateway handles HTTP and gRPC traffic while dedicated workers execute agent runs. This is the first step toward production scaling.
- Gateway servers handle client requests, authentication, and routing
- Worker servers pull jobs from the PostgreSQL queue and execute agent runs
- PostgreSQL and Redis can remain on a shared database host or move to managed services
Distributed
Full horizontal scaling with multiple gateways and worker pools distributed across machines. Consul-based service discovery enables automatic failover and load balancing.
- Multiple stateless gateways behind a load balancer
- Worker pools distributed across machines, optionally grouped by capability (CPU vs GPU)
- Shared PostgreSQL cluster for state and job queue
- Consul for service discovery and health checking
Concurrency
Slot Capacity Manager
Each worker has a configurable number of execution slots that determine how many agent runs it can process concurrently. The slot capacity manager tracks available slots and rejects new work when all slots are occupied.
- Default slot count: 4 per worker
- Configuration: Set via the `WORKER_SLOTS` environment variable or the worker configuration file
- Slot types: Slots can be typed (CPU, GPU) to route work to appropriate workers
Concurrent Runs per Worker
The number of concurrent runs a worker can handle depends on:
- Available slots — The configured slot count sets the upper bound
- Memory — Each agent run consumes memory for its execution context, tools, and model state
- GPU resources — Workers performing local model inference are typically limited to fewer concurrent runs
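As a rough illustration of the memory bound (a sizing heuristic, not a documented formula), a Linux worker could derive its slot count from host memory, assuming roughly 2 GiB per concurrent run in line with the CPU-worker guideline in the resource planning table below:

```shell
# Heuristic only: derive a slot count from total host memory (Linux /proc/meminfo),
# assuming ~2 GiB per concurrent run. Adjust the divisor for your agents' footprint.
mem_gib=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
slots=$(( mem_gib / 2 ))
if [ "$slots" -lt 1 ]; then slots=1; fi
export WORKER_SLOTS=$slots
echo "derived WORKER_SLOTS=$WORKER_SLOTS"
```

GPU-bound workers should ignore this heuristic and size slots to the number of runs the GPU can serve at once.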
Executor Replicas
For flows that require parallel processing, configure executor replicas in the flow YAML:
```yaml
executors:
  - name: document-processor
    replicas: 3
    resources:
      memory: 2Gi
```

Each replica runs as a separate process within the worker, enabling parallel execution of a single executor type.
Queue Backpressure
When all worker slots across the cluster are occupied, new jobs remain in the PostgreSQL queue until capacity becomes available. The queue provides natural backpressure without dropping requests.
- Jobs are ordered by priority and submission time
- No jobs are lost when workers are at capacity
- Clients receive a queued status and can poll for completion
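From the client's side, this backpressure reduces to a poll loop. The sketch below simulates the status transition locally, because the gateway API shape (endpoint path, response fields) is not specified on this page; a real client would replace the inline transition with a gateway call:

```shell
# Simulated client poll loop. The commented line shows where a real call would go;
# the endpoint path and "status" field there are assumptions, not documented API.
status="queued"
attempt=0
while [ "$status" = "queued" ] && [ "$attempt" -lt 10 ]; do
  attempt=$((attempt + 1))
  # real client: status=$(curl -s "$GATEWAY/runs/$RUN_ID" | jq -r .status)
  if [ "$attempt" -ge 3 ]; then status="running"; fi   # simulate a slot freeing up
done
echo "status=$status after $attempt polls"
```

In practice the loop should also sleep between polls and give up after a client-side deadline.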
Slot capacity is the primary scaling lever. Start with 4 slots per worker and adjust based on agent memory footprint and execution time.
Horizontal Scaling
Gateway Replicas
Gateways are stateless. Add more gateways behind a load balancer to increase request throughput. All gateways read from and write to the same PostgreSQL database.
- Any standard HTTP load balancer works (NGINX, HAProxy, cloud ALB)
- Health check endpoint: `GET /healthz`
- No session affinity required
Worker Replicas
Add more workers to increase total execution capacity. Each worker connects to the shared PostgreSQL queue and pulls jobs independently.
- Workers are self-registering — start a new worker and it begins pulling jobs
- No coordination required between workers
- Scale workers independently from gateways
Session Affinity Not Required
All execution state lives in PostgreSQL. Any worker can pick up any run, and any gateway can serve any request. This simplifies load balancing and failover because there is no sticky session requirement.
Auto-Scaling Triggers
Configure auto-scaling rules based on the following metrics:
- Queue depth — Scale up workers when pending jobs exceed a threshold
- Slot utilization — Scale up when average slot usage exceeds 80%
- Response latency — Scale up gateways when p95 latency exceeds target
- Scale down — Remove workers when slot utilization drops below 20% for a sustained period
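These triggers can be combined into a single scale decision. The sketch below uses the 80% / 20% utilization bounds from the list above; the queue-depth threshold (100) is an assumption, and a real controller would also require low utilization to persist for a sustained period before scaling down:

```shell
# Hedged sketch of a worker auto-scaling decision.
# queue_depth: pending jobs; slot_util: average slot utilization as an integer percent.
decide_scale() {
  queue_depth=$1
  slot_util=$2
  if [ "$queue_depth" -gt 100 ] || [ "$slot_util" -gt 80 ]; then
    echo "scale-up"
  elif [ "$slot_util" -lt 20 ]; then
    echo "scale-down"
  else
    echo "hold"
  fi
}

decide_scale 250 50   # deep queue -> scale-up
decide_scale 0 85     # hot workers -> scale-up
decide_scale 0 10     # idle workers -> scale-down
```

The actual scale action would then drive whatever runs your workers (e.g. a Compose `--scale` invocation or an orchestrator API).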
Resource Planning
Use the following guidelines as a starting point. Actual requirements depend on agent complexity, model sizes, and concurrency targets.
| Component | CPU | Memory | Storage | Notes |
|---|---|---|---|---|
| Gateway | 2 cores | 2 GB | Minimal | Scales with request throughput |
| Worker (CPU) | 4 cores | 8 GB | Minimal | Per 4 concurrent runs |
| Worker (GPU) | 4 cores + 1 GPU | 16 GB | Minimal | For local model inference |
| PostgreSQL | 4 cores | 8 GB | 100 GB+ | Scales with checkpoint volume |
| Redis | 2 cores | 4 GB | — | Cache layer, optional |
GPU workers require NVIDIA drivers and the CUDA toolkit installed on the host. Verify GPU availability with `nvidia-smi` before starting GPU-enabled workers.
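A startup guard along these lines (the function name and messages are illustrative) keeps a GPU-typed worker from launching on a host without a visible GPU:

```shell
# Probe for a usable NVIDIA GPU before starting a GPU-typed worker.
check_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "gpu-ok"
  else
    echo "gpu-missing"
  fi
}
result=$(check_gpu)
echo "$result"
# a launcher script would start the worker only when result is "gpu-ok"
```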
Multi-Gateway
Running multiple gateways provides high availability and geographic distribution.
Service Discovery
Gateways register with Consul on startup and deregister on shutdown. Other gateways and workers discover peers through Consul’s service catalog.
- Automatic registration with configurable health check intervals
- TTL-based deregistration for crashed processes
- DNS and HTTP discovery interfaces
Failover
When a gateway becomes unhealthy, Consul removes it from the service catalog. Load balancers that query Consul automatically route traffic to remaining healthy gateways.
- Health checks run every 10 seconds by default
- Deregistration after 3 consecutive failed checks
- No manual intervention required for failover
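The behavior above maps onto a Consul service definition like the following sketch. The service name, port, and the 30-second deregistration window (roughly three failed 10-second checks) are illustrative values, not defaults shipped with the Agent Server:

```json
{
  "service": {
    "name": "gateway",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/healthz",
      "interval": "10s",
      "timeout": "2s",
      "deregister_critical_service_after": "30s"
    }
  }
}
```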
Geographic Distribution
Deploy gateways in multiple regions to reduce client latency. Each regional gateway connects to the same central PostgreSQL cluster, or to a read replica for query-heavy workloads.
Consistent Hashing
For workloads that benefit from request locality (e.g., caching assistant state in memory), configure optional consistent-hashing routing on the load balancer. This routes requests for the same assistant or session to the same gateway when possible, while still allowing failover.
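As one concrete (illustrative) way to do this with open-source NGINX, the `hash … consistent` upstream directive keys routing on a request attribute; here an assumed `session_id` query parameter stands in for whatever identifier your clients send:

```nginx
upstream agent_gateways {
    # ketama-style consistent hashing: losing a gateway remaps only its share of keys
    hash $arg_session_id consistent;
    server gateway-1.internal:8080;
    server gateway-2.internal:8080;
    server gateway-3.internal:8080;
}
```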
Docker Deployment
The Agent Server provides Docker Compose configurations for common deployment scenarios.
Available Configurations
| File | Purpose |
|---|---|
| `docker-compose.yml` | Standard deployment with separate gateway and worker services |
| `docker-compose.storage.yml` | Storage services: PostgreSQL, Redis, S3-compatible object store (MinIO) |
| `docker-compose.allinone.yml` | All-in-one single-host deployment for development and testing |
Starting the All-in-One Deployment
The all-in-one configuration runs the gateway, worker, PostgreSQL, and Redis in a single Compose stack:
```shell
docker compose -f docker-compose.allinone.yml up -d
```

For production, use the standard Compose file with separate service scaling:
```shell
# Start storage services
docker compose -f docker-compose.storage.yml up -d

# Start gateway and workers (scale workers independently)
docker compose -f docker-compose.yml up -d --scale worker=3
```

Scaling Workers with Docker Compose
```shell
# Scale to 5 workers
docker compose -f docker-compose.yml up -d --scale worker=5

# Check running services
docker compose -f docker-compose.yml ps
```

Best Practices
- Start with single-host deployment and validate agent behavior before scaling out
- Monitor queue depth and slot utilization before adding workers
- Set per-assistant concurrency limits to prevent a single assistant from consuming all worker capacity
- Use connection pooling for PostgreSQL (PgBouncer recommended for more than 50 concurrent connections)
- Configure rate limits on gateway endpoints to protect against request floods
- Enable health checks for all services registered in Consul
- Use separate worker pools for CPU and GPU workloads to avoid resource contention
- Set execution timeouts to prevent runaway agents from holding slots indefinitely
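For the PgBouncer recommendation above, a minimal pooling fragment might look like this sketch. The host, database name, and pool sizes are placeholders; `transaction` pool mode suits short queue operations but is incompatible with session-level features such as prepared statements in older PgBouncer versions:

```ini
; Illustrative PgBouncer configuration (all values are placeholders)
[databases]
agentserver = host=postgres.internal port=5432 dbname=agentserver

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 200
default_pool_size = 50
```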