Data Privacy & Storage

M3 Forge processes documents, executes workflows, and interacts with LLM providers on your behalf. This page explains how your data is stored, transmitted, and protected, and how to configure retention policies to control the data lifecycle.

How M3 Forge Stores Your Data

Interaction Storage

Every interaction with M3 Forge generates data that is stored for monitoring, debugging, quality assessment, and historical tracking:

| Data Type | Examples | Storage |
| --- | --- | --- |
| LLM Observability | Prompt executions, model responses, token usage, latency metrics | PostgreSQL + S3 |
| Job Execution | Workflow runs, processor results, completion status | PostgreSQL |
| Conversations | Chat threads with AI assistants, HITL discussions | PostgreSQL |
| HITL Requests | Human review requests, approvals, corrections | PostgreSQL |
| Audit Trails | Configuration changes, user actions, RAG operations | PostgreSQL |
| Training Data | Training jobs, model metrics, dataset annotations | PostgreSQL + S3 |

Storage Architecture

  • PostgreSQL — Structured data including policies, metadata, audit logs, and job state. Uses two schemas: marie_studio for application data and marie_scheduler for job execution.
  • S3-compatible object storage — Documents, raw LLM event payloads, model artifacts, and training data. Supports MinIO, AWS S3, or any S3-compatible provider.
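To illustrate the split between the two stores, a single LLM observability record might keep its structured metadata in PostgreSQL while the verbose payload lives in S3, referenced by key. The field names below are a hypothetical sketch, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LlmRawEventRecord:
    """Hypothetical shape of one observability record (illustrative only).

    Structured fields would live in PostgreSQL; the verbose request/response
    payload would live in S3-compatible object storage, referenced by key.
    """
    event_id: str
    model: str
    provider: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    created_at: datetime
    payload_s3_key: str  # pointer into object storage, e.g. "llm-events/<event_id>.json"

record = LlmRawEventRecord(
    event_id="evt-0001",
    model="gpt-4o",
    provider="openai",
    prompt_tokens=1200,
    completion_tokens=340,
    latency_ms=812.5,
    created_at=datetime.now(timezone.utc),
    payload_s3_key="llm-events/evt-0001.json",
)
print(record.prompt_tokens + record.completion_tokens)  # total tokens: 1540
```

This split keeps the database lean for queries and dashboards while the bulky payloads stay in cheap object storage.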

Data Retention Policies

M3 Forge provides configurable per-category retention policies that automatically delete data older than a specified period. This helps manage storage costs, maintain query performance, and meet compliance requirements.

All retention policies are disabled by default. Data is retained indefinitely until an administrator configures a retention period.

Retention Categories

| Category | Data Covered | Tables Affected |
| --- | --- | --- |
| Monitoring | LLM raw events, failed events, processor executions | LlmRawEvent, LlmFailedEvent, ProcessorExecution |
| Execution History | Completed job runs, job history, archives | Job, JobHistory, Archive |
| Conversations | Chat threads and all associated messages | ChatThread, ChatMessage |
| Human-in-the-Loop | HITL requests, responses, and notifications | HitlRequest, HitlResponse, HitlNotification |
| Audit Trail | Audit logs, configuration changes, RAG audit logs | AuditLog, ConfigurationAudit, RagAuditLog |
| Training | Training jobs and extractor training jobs | TrainingJob, ExtractorTrainingJob |

Available Retention Periods

  • 30 days
  • 60 days
  • 90 days
  • 180 days
  • 365 days
  • Indefinite (no automatic deletion)
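A retention period translates into a simple cutoff: rows whose timestamp falls before "now minus the period" become eligible for deletion. A minimal sketch of that calculation, mirroring the periods above (the function name and period keys are illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Periods offered in the UI; None represents "Indefinite" (no automatic deletion).
RETENTION_DAYS = {"30d": 30, "60d": 60, "90d": 90, "180d": 180, "365d": 365, "indefinite": None}

def retention_cutoff(period: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the timestamp before which data is eligible for deletion,
    or None when retention is indefinite."""
    days = RETENTION_DAYS[period]
    if days is None:
        return None
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
print(retention_cutoff("90d", now))    # 2025-04-01 00:00:00+00:00
print(retention_cutoff("indefinite"))  # None
```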

Configuring Retention Policies

Go to Settings → Data Retention from the main navigation.

Select a Category

Each data category has its own card with an enable/disable toggle.

Choose Retention Period

Select the desired retention period from the dropdown. Only data older than this period will be affected.

Optional: Monitoring Settings

For the Monitoring category, two additional options are available:

  • Aggregate usage stats before deletion — Rolls up raw LLM events into daily usage statistics (grouped by model and provider) before deleting the raw data. This preserves trend data for dashboards while freeing storage.
  • Clean up S3 storage objects — Deletes referenced S3 objects alongside database records.

Save

Click Save to apply the policy. Changes take effect at the next scheduled enforcement run (03:00 UTC daily).

How Enforcement Works

Retention policies are enforced automatically:

  • Schedule — Daily at 03:00 UTC via an internal cron job
  • Batch processing — Data is deleted in batches of 1,000 rows with short delays between batches to minimize database load
  • Safety caps — A maximum of 500,000 rows per table per enforcement run prevents runaway deletions
  • Manual trigger — Administrators can trigger enforcement runs from the Settings UI
  • Dry-run mode — Preview what would be deleted without actually removing any data

Data deleted by retention policies cannot be recovered. Always use dry-run mode to preview deletions before enabling a new policy.
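The batch-and-cap behavior described above can be sketched as a loop; `delete_batch` is a hypothetical stand-in for a parameterized bounded DELETE against one table, and the constants mirror the documented defaults (1,000-row batches, 500,000-row cap per table per run):

```python
import time
from typing import Callable

BATCH_SIZE = 1_000          # rows per DELETE statement
MAX_ROWS_PER_RUN = 500_000  # safety cap per table per enforcement run
BATCH_DELAY_S = 0.05        # short pause between batches to limit DB load (illustrative value)

def enforce_retention(delete_batch: Callable[[int], int], delay_s: float = BATCH_DELAY_S) -> int:
    """Delete eligible rows in batches until the table is drained or the
    per-run safety cap is hit. `delete_batch(limit)` must delete up to
    `limit` eligible rows and return how many it actually removed."""
    deleted = 0
    while deleted < MAX_ROWS_PER_RUN:
        limit = min(BATCH_SIZE, MAX_ROWS_PER_RUN - deleted)
        removed = delete_batch(limit)
        if removed == 0:
            break  # nothing left older than the cutoff
        deleted += removed
        time.sleep(delay_s)
    return deleted

# Simulated table with 2,500 eligible rows.
remaining = [2_500]
def fake_delete(limit: int) -> int:
    n = min(limit, remaining[0])
    remaining[0] -= n
    return n

print(enforce_retention(fake_delete, delay_s=0))  # 2500
```

Dry-run mode would follow the same loop but count matching rows instead of deleting them.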

Execution History: Special Handling

Jobs in the Execution History category respect the keepUntil field. A job will not be deleted until both conditions are met:

  1. The job has reached a terminal state (completed, failed, cancelled, expired)
  2. The job’s keepUntil timestamp has passed AND the retention period has elapsed
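Both conditions amount to a conjunction over the job's state and timestamps. A hypothetical eligibility check (field and function names are illustrative, not the actual implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}

@dataclass
class Job:
    state: str
    finished_at: datetime  # when the job reached its terminal state
    keep_until: datetime   # the job's keepUntil timestamp

def is_deletable(job: Job, cutoff: datetime, now: datetime) -> bool:
    """True only when the job is terminal, its keepUntil has passed,
    and the retention period has elapsed (it finished before the cutoff)."""
    return (job.state in TERMINAL_STATES
            and job.keep_until <= now
            and job.finished_at <= cutoff)

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
cutoff = datetime(2025, 4, 1, tzinfo=timezone.utc)  # e.g. a 90-day policy

old_job = Job("completed", finished_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
              keep_until=datetime(2025, 3, 15, tzinfo=timezone.utc))
pinned = Job("completed", finished_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
             keep_until=datetime(2026, 1, 1, tzinfo=timezone.utc))
print(is_deletable(old_job, cutoff, now), is_deletable(pinned, cutoff, now))  # True False
```

Note how the `pinned` job survives the run despite exceeding the retention period, because its keepUntil lies in the future.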

Monitoring: Aggregation

When Aggregate usage stats before deletion is enabled:

  1. Raw LLM events are grouped by model, provider, and day
  2. Aggregated counts and token usage are upserted into the LlmUsageStats table
  3. The raw events are then deleted

This preserves long-term usage trends for cost analysis and capacity planning while removing verbose per-request data.
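The roll-up in steps 1–3 is essentially a group-by over (model, provider, day). A minimal sketch, with illustrative field names rather than the actual LlmUsageStats columns:

```python
from collections import defaultdict
from datetime import date

# Raw events: (model, provider, event_date, prompt_tokens, completion_tokens)
raw_events = [
    ("gpt-4o", "openai", date(2025, 6, 1), 1200, 300),
    ("gpt-4o", "openai", date(2025, 6, 1), 800, 200),
    ("claude-3-5-sonnet", "anthropic", date(2025, 6, 1), 500, 150),
]

def aggregate(events):
    """Group raw events by (model, provider, day) and sum counts and tokens,
    as the 'aggregate before deletion' option does before purging raw rows."""
    stats = defaultdict(lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0})
    for model, provider, day, p, c in events:
        key = (model, provider, day)
        stats[key]["requests"] += 1
        stats[key]["prompt_tokens"] += p
        stats[key]["completion_tokens"] += c
    return dict(stats)

stats = aggregate(raw_events)
print(stats[("gpt-4o", "openai", date(2025, 6, 1))])
# {'requests': 2, 'prompt_tokens': 2000, 'completion_tokens': 500}
raw_events.clear()  # step 3: raw rows are removed only after the upsert succeeds
```

Ordering matters here: aggregating before deleting ensures a failed upsert never costs you the underlying data.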

Data Transmission

LLM Provider Communication

When M3 Forge sends data to external LLM providers:

  • All data is transmitted via encrypted channels (TLS/HTTPS)
  • Only the data required for the specific execution is sent — no additional context or metadata
  • M3 Forge does not send your data to LLM providers for training or any purpose beyond generating the requested response
  • Provider API keys are stored encrypted in the database and never included in logs

Internal Communication

  • All internal service communication uses TLS when deployed with recommended configuration
  • API requests are authenticated via Bearer tokens or HMAC signatures
  • WebSocket connections for terminal access use the same authentication layer

Data Encryption

At Rest

  • Database — PostgreSQL supports transparent data encryption (TDE). Enable at the infrastructure level based on your compliance requirements.
  • Object storage — S3 objects are encrypted using server-side encryption (SSE-S3 or SSE-KMS depending on your storage provider configuration)
  • Secrets — API keys, provider credentials, and MFA secrets are encrypted with AES-256 before storage

In Transit

  • All external connections require TLS 1.2 or higher
  • Internal service mesh communication is encrypted when running behind a reverse proxy or service mesh with mTLS
  • WebSocket connections use WSS (WebSocket Secure)

Compliance Considerations

Self-Hosted Advantage

Because M3 Forge is self-hosted, your organization maintains full control over:

| Concern | Your Control |
| --- | --- |
| Data residency | Deploy in any region, any cloud, or on-premises |
| Network isolation | Run entirely within your VPC — no external callbacks |
| Encryption keys | Manage your own KMS keys and TLS certificates |
| Retention policies | Configure per-category retention to match your compliance framework |
| Access controls | RBAC with custom roles, MFA, session management |
| Audit trails | All audit data stays in your database — export to your SIEM |

Framework Alignment

| Framework | Relevant Controls |
| --- | --- |
| GDPR | Data retention policies (right to erasure, data minimization), self-hosted data residency, audit trails |
| HIPAA | Encryption at rest and in transit, access controls, audit logging, configurable PHI retention |
| SOC 2 | Data lifecycle management, role-based access, monitoring and alerting, change audit trail |
| PCI DSS | Encryption standards, access controls, retention limits, audit logging |

Next Steps

  • Configure data retention in Settings → Data Retention
  • Set up API Keys for secure programmatic access
  • Configure Users & Roles for access control
  • Monitor system activity in Monitoring