Scaling
The Agent Server supports progressive scaling from a single development machine to a fully distributed production deployment. Choose a deployment mode based on your workload requirements and operational complexity budget.
Deployment Modes
Single Host
All components run on one machine. The gateway, workers, PostgreSQL, and Redis share the same host. This is the simplest configuration and is suitable for development, testing, and small production workloads.
```
┌──────────────────────────────────────┐
│             Single Host              │
│  ┌────────────┐  ┌────────────────┐  │
│  │  Gateway   │  │    Workers     │  │
│  └────────────┘  └────────────────┘  │
│  ┌────────────┐  ┌────────────────┐  │
│  │ PostgreSQL │  │     Redis      │  │
│  └────────────┘  └────────────────┘  │
└──────────────────────────────────────┘
```

- All components co-located on a single machine
- No network overhead between services
- Limited by the resources of a single host
Split API / Queue
Separate the gateway from the worker processes. The gateway handles HTTP and gRPC traffic while dedicated workers execute agent runs. This is the first step toward production scaling.
- Gateway servers handle client requests, authentication, and routing
- Worker servers pull jobs from the PostgreSQL queue and execute agent runs
- PostgreSQL and Redis can remain on a shared database host or move to managed services
Distributed
Full horizontal scaling with multiple gateways and worker pools distributed across machines. Consul-based service discovery enables automatic failover and load balancing.
- Multiple stateless gateways behind a load balancer
- Worker pools distributed across machines, optionally grouped by capability (CPU vs GPU)
- Shared PostgreSQL cluster for state and job queue
- Consul for service discovery and health checking
Concurrency
Slot Capacity Manager
Each worker has a configurable number of execution slots that determine how many agent runs it can process concurrently. The slot capacity manager tracks available slots and rejects new work when all slots are occupied.
- Default slot count: 4 per worker
- Configuration: Set via the `WORKER_SLOTS` environment variable or the worker configuration file
- Slot types: Slots can be typed (CPU, GPU) to route work to appropriate workers
Concurrent Runs per Worker
The number of concurrent runs a worker can handle depends on:
- Available slots — The configured slot count sets the upper bound
- Memory — Each agent run consumes memory for its execution context, tools, and model state
- GPU resources — Workers performing local model inference are typically limited to fewer concurrent runs
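As a rough illustration of the memory bound (a sizing heuristic, not a documented formula), a Linux worker could derive its slot count from host memory, assuming roughly 2 GiB per concurrent run in line with the CPU-worker guideline in the resource planning table below:

```shell
# Heuristic only: derive a slot count from total host memory (Linux /proc/meminfo),
# assuming ~2 GiB per concurrent run. Adjust the divisor for your agents' footprint.
mem_gib=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
slots=$(( mem_gib / 2 ))
if [ "$slots" -lt 1 ]; then slots=1; fi
export WORKER_SLOTS=$slots
echo "derived WORKER_SLOTS=$WORKER_SLOTS"
```

GPU-bound workers should ignore this heuristic and size slots to the number of runs the GPU can serve at once.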
Executor Replicas
For flows that require parallel processing, configure executor replicas in the flow YAML:
```yaml
executors:
  - name: document-processor
    replicas: 3
    resources:
      memory: 2Gi
```

Each replica runs as a separate process within the worker, enabling parallel execution of a single executor type.
Queue Backpressure
When all worker slots across the cluster are occupied, new jobs remain in the PostgreSQL queue until capacity becomes available. The queue provides natural backpressure without dropping requests.
- Jobs are ordered by priority and submission time
- No jobs are lost when workers are at capacity
- Clients receive a queued status and can poll for completion
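From the client's side, this backpressure reduces to a poll loop. The sketch below simulates the status transition locally, because the gateway API shape (endpoint path, response fields) is not specified on this page; a real client would replace the inline transition with a gateway call:

```shell
# Simulated client poll loop. The commented line shows where a real call would go;
# the endpoint path and "status" field there are assumptions, not documented API.
status="queued"
attempt=0
while [ "$status" = "queued" ] && [ "$attempt" -lt 10 ]; do
  attempt=$((attempt + 1))
  # real client: status=$(curl -s "$GATEWAY/runs/$RUN_ID" | jq -r .status)
  if [ "$attempt" -ge 3 ]; then status="running"; fi   # simulate a slot freeing up
done
echo "status=$status after $attempt polls"
```

In practice the loop should also sleep between polls and give up after a client-side deadline.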
Slot capacity is the primary scaling lever. Start with 4 slots per worker and adjust based on agent memory footprint and execution time.
Horizontal Scaling
Gateway Replicas
Gateways are stateless. Add more gateways behind a load balancer to increase request throughput. All gateways read from and write to the same PostgreSQL database.
- Any standard HTTP load balancer works (NGINX, HAProxy, cloud ALB)
- Health check endpoint: `GET /healthz`
- No session affinity required
Worker Replicas
Add more workers to increase total execution capacity. Each worker connects to the shared PostgreSQL queue and pulls jobs independently.
- Workers are self-registering — start a new worker and it begins pulling jobs
- No coordination required between workers
- Scale workers independently from gateways
Session Affinity Not Required
All execution state lives in PostgreSQL. Any worker can pick up any run, and any gateway can serve any request. This simplifies load balancing and failover because there is no sticky session requirement.
Auto-Scaling Triggers
Configure auto-scaling rules based on the following metrics:
- Queue depth — Scale up workers when pending jobs exceed a threshold
- Slot utilization — Scale up when average slot usage exceeds 80%
- Response latency — Scale up gateways when p95 latency exceeds target
- Scale down — Remove workers when slot utilization drops below 20% for a sustained period
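These triggers can be combined into a single scale decision. The sketch below uses the 80% / 20% utilization bounds from the list above; the queue-depth threshold (100) is an assumption, and a real controller would also require low utilization to persist for a sustained period before scaling down:

```shell
# Hedged sketch of a worker auto-scaling decision.
# queue_depth: pending jobs; slot_util: average slot utilization as an integer percent.
decide_scale() {
  queue_depth=$1
  slot_util=$2
  if [ "$queue_depth" -gt 100 ] || [ "$slot_util" -gt 80 ]; then
    echo "scale-up"
  elif [ "$slot_util" -lt 20 ]; then
    echo "scale-down"
  else
    echo "hold"
  fi
}

decide_scale 250 50   # deep queue -> scale-up
decide_scale 0 85     # hot workers -> scale-up
decide_scale 0 10     # idle workers -> scale-down
```

The actual scale action would then drive whatever runs your workers (e.g. a Compose `--scale` invocation or an orchestrator API).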
Resource Planning
Use the following guidelines as a starting point. Actual requirements depend on agent complexity, model sizes, and concurrency targets.
| Component | CPU | Memory | Storage | Notes |
|---|---|---|---|---|
| Gateway | 2 cores | 2 GB | Minimal | Scales with request throughput |
| Worker (CPU) | 4 cores | 8 GB | Minimal | Per 4 concurrent runs |
| Worker (GPU) | 4 cores + 1 GPU | 16 GB | Minimal | For local model inference |
| PostgreSQL | 4 cores | 8 GB | 100 GB+ | Scales with checkpoint volume |
| Redis | 2 cores | 4 GB | — | Cache layer, optional |
GPU workers require NVIDIA drivers and the CUDA toolkit installed on the host. Verify GPU availability with `nvidia-smi` before starting GPU-enabled workers.
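A startup guard along these lines (the function name and messages are illustrative) keeps a GPU-typed worker from launching on a host without a visible GPU:

```shell
# Probe for a usable NVIDIA GPU before starting a GPU-typed worker.
check_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "gpu-ok"
  else
    echo "gpu-missing"
  fi
}
result=$(check_gpu)
echo "$result"
# a launcher script would start the worker only when result is "gpu-ok"
```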
Multi-Gateway
Running multiple gateways provides high availability and geographic distribution.
Service Discovery
Gateways register with Consul on startup and deregister on shutdown. Other gateways and workers discover peers through Consul’s service catalog.
- Automatic registration with configurable health check intervals
- TTL-based deregistration for crashed processes
- DNS and HTTP discovery interfaces
Failover
When a gateway becomes unhealthy, Consul removes it from the service catalog. Load balancers that query Consul automatically route traffic to remaining healthy gateways.
- Health checks run every 10 seconds by default
- Deregistration after 3 consecutive failed checks
- No manual intervention required for failover
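The behavior above maps onto a Consul service definition like the following sketch. The service name, port, and the 30-second deregistration window (roughly three failed 10-second checks) are illustrative values, not defaults shipped with the Agent Server:

```json
{
  "service": {
    "name": "gateway",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/healthz",
      "interval": "10s",
      "timeout": "2s",
      "deregister_critical_service_after": "30s"
    }
  }
}
```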
Geographic Distribution
Deploy gateways in multiple regions to reduce client latency. Each regional gateway connects to the same central PostgreSQL cluster, or to a read replica for query-heavy workloads.
Consistent Hashing
For workloads that benefit from request locality (e.g., caching assistant state in memory), configure optional consistent-hashing routing on the load balancer. This routes requests for the same assistant or session to the same gateway when possible, while still allowing failover.
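As one concrete (illustrative) way to do this with open-source NGINX, the `hash … consistent` upstream directive keys routing on a request attribute; here an assumed `session_id` query parameter stands in for whatever identifier your clients send:

```nginx
upstream agent_gateways {
    # ketama-style consistent hashing: losing a gateway remaps only its share of keys
    hash $arg_session_id consistent;
    server gateway-1.internal:8080;
    server gateway-2.internal:8080;
    server gateway-3.internal:8080;
}
```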
Docker Deployment
The Agent Server provides Docker Compose configurations for common deployment scenarios.
Available Configurations
| File | Purpose |
|---|---|
| `docker-compose.yml` | Standard deployment with separate gateway and worker services |
| `docker-compose.storage.yml` | Storage services: PostgreSQL, Redis, S3-compatible object store (MinIO) |
| `docker-compose.allinone.yml` | All-in-one single-host deployment for development and testing |
Starting the All-in-One Deployment
The all-in-one configuration runs the gateway, worker, PostgreSQL, and Redis in a single Compose stack:
```shell
docker compose -f docker-compose.allinone.yml up -d
```

For production, use the standard Compose file with separate service scaling:
```shell
# Start storage services
docker compose -f docker-compose.storage.yml up -d

# Start gateway and workers (scale workers independently)
docker compose -f docker-compose.yml up -d --scale worker=3
```

Scaling Workers with Docker Compose
```shell
# Scale to 5 workers
docker compose -f docker-compose.yml up -d --scale worker=5

# Check running services
docker compose -f docker-compose.yml ps
```

Best Practices
- Start with single-host deployment and validate agent behavior before scaling out
- Monitor queue depth and slot utilization before adding workers
- Set per-assistant concurrency limits to prevent a single assistant from consuming all worker capacity
- Use connection pooling for PostgreSQL (PgBouncer recommended for more than 50 concurrent connections)
- Configure rate limits on gateway endpoints to protect against request floods
- Enable health checks for all services registered in Consul
- Use separate worker pools for CPU and GPU workloads to avoid resource contention
- Set execution timeouts to prevent runaway agents from holding slots indefinitely
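For the PgBouncer recommendation above, a minimal pooling fragment might look like this sketch. The host, database name, and pool sizes are placeholders; `transaction` pool mode suits short queue operations but is incompatible with session-level features such as prepared statements in older PgBouncer versions:

```ini
; Illustrative PgBouncer configuration (all values are placeholders)
[databases]
agentserver = host=postgres.internal port=5432 dbname=agentserver

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 200
default_pool_size = 50
```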