Data Privacy & Storage

M3 Forge processes documents, executes workflows, and interacts with LLM providers on your behalf. This page explains how your data is stored, transmitted, and protected, and how to configure retention policies to control the data lifecycle.

How M3 Forge Stores Your Data

Interaction Storage

Every interaction with M3 Forge generates data that is stored for monitoring, debugging, quality assessment, and historical tracking:

| Data Type | Examples | Storage |
| --- | --- | --- |
| LLM Observability | Prompt executions, model responses, token usage, latency metrics | PostgreSQL + S3 |
| Job Execution | Workflow runs, processor results, completion status | PostgreSQL |
| Conversations | Chat threads with AI assistants, HITL discussions | PostgreSQL |
| HITL Requests | Human review requests, approvals, corrections | PostgreSQL |
| Audit Trails | Configuration changes, user actions, RAG operations | PostgreSQL |
| Training Data | Training jobs, model metrics, dataset annotations | PostgreSQL + S3 |

Storage Architecture

  • PostgreSQL — Structured data including policies, metadata, audit logs, and job state. Uses two schemas: marie_studio for application data and marie_scheduler for job execution.
  • S3-compatible object storage — Documents, raw LLM event payloads, model artifacts, and training data. Supports MinIO, AWS S3, or any S3-compatible provider.
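To illustrate the split between the two stores, a single LLM observability record might keep its structured metadata in PostgreSQL while the verbose payload lives in S3, referenced by key. The field names below are a hypothetical sketch, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LlmRawEventRecord:
    """Hypothetical shape of one observability record (illustrative only).

    Structured fields would live in PostgreSQL; the verbose request/response
    payload would live in S3-compatible object storage, referenced by key.
    """
    event_id: str
    model: str
    provider: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    created_at: datetime
    payload_s3_key: str  # pointer into object storage, e.g. "llm-events/<event_id>.json"

record = LlmRawEventRecord(
    event_id="evt-0001",
    model="gpt-4o",
    provider="openai",
    prompt_tokens=1200,
    completion_tokens=340,
    latency_ms=812.5,
    created_at=datetime.now(timezone.utc),
    payload_s3_key="llm-events/evt-0001.json",
)
print(record.prompt_tokens + record.completion_tokens)  # total tokens: 1540
```

This split keeps the database lean for queries and dashboards while the bulky payloads stay in cheap object storage.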

Data Retention Policies

M3 Forge provides configurable per-category retention policies that automatically delete data older than a specified period. This helps manage storage costs, maintain query performance, and meet compliance requirements.

All retention policies are disabled by default. Data is retained indefinitely until an administrator configures a retention period.

Retention Categories

| Category | Data Covered | Tables Affected |
| --- | --- | --- |
| Monitoring | LLM raw events, failed events, processor executions | LlmRawEvent, LlmFailedEvent, ProcessorExecution |
| Execution History | Completed job runs, job history, archives | Job, JobHistory, Archive |
| Conversations | Chat threads and all associated messages | ChatThread, ChatMessage |
| Human-in-the-Loop | HITL requests, responses, and notifications | HitlRequest, HitlResponse, HitlNotification |
| Audit Trail | Audit logs, configuration changes, RAG audit logs | AuditLog, ConfigurationAudit, RagAuditLog |
| Training | Training jobs and extractor training jobs | TrainingJob, ExtractorTrainingJob |

Available Retention Periods

  • 30 days
  • 60 days
  • 90 days
  • 180 days
  • 365 days
  • Indefinite (no automatic deletion)
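A retention period translates into a simple cutoff: rows whose timestamp falls before "now minus the period" become eligible for deletion. A minimal sketch of that calculation, mirroring the periods above (the function name and period keys are illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Periods offered in the UI; None represents "Indefinite" (no automatic deletion).
RETENTION_DAYS = {"30d": 30, "60d": 60, "90d": 90, "180d": 180, "365d": 365, "indefinite": None}

def retention_cutoff(period: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Return the timestamp before which data is eligible for deletion,
    or None when retention is indefinite."""
    days = RETENTION_DAYS[period]
    if days is None:
        return None
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
print(retention_cutoff("90d", now))    # 2025-04-01 00:00:00+00:00
print(retention_cutoff("indefinite"))  # None
```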

Configuring Retention Policies

Go to Settings → Data Retention from the main navigation.

Select a Category

Each data category has its own card with an enable/disable toggle.

Choose Retention Period

Select the desired retention period from the dropdown. Only data older than this period will be affected.

Optional: Monitoring Settings

For the Monitoring category, two additional options are available:

  • Aggregate usage stats before deletion — Rolls up raw LLM events into daily usage statistics (grouped by model and provider) before deleting the raw data. This preserves trend data for dashboards while freeing storage.
  • Clean up S3 storage objects — Deletes referenced S3 objects alongside database records.

Save

Click Save to apply the policy. Changes take effect at the next scheduled enforcement run (03:00 UTC daily).

How Enforcement Works

Retention policies are enforced automatically:

  • Schedule — Daily at 03:00 UTC via an internal cron job
  • Batch processing — Data is deleted in batches of 1,000 rows with short delays between batches to minimize database load
  • Safety caps — A maximum of 500,000 rows per table per enforcement run prevents runaway deletions
  • Manual trigger — Administrators can trigger enforcement runs from the Settings UI
  • Dry-run mode — Preview what would be deleted without actually removing any data

Data deleted by retention policies cannot be recovered. Always use dry-run mode to preview deletions before enabling a new policy.
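The batch-and-cap behavior described above can be sketched as a loop; `delete_batch` is a hypothetical stand-in for a parameterized bounded DELETE against one table, and the constants mirror the documented defaults (1,000-row batches, 500,000-row cap per table per run):

```python
import time
from typing import Callable

BATCH_SIZE = 1_000          # rows per DELETE statement
MAX_ROWS_PER_RUN = 500_000  # safety cap per table per enforcement run
BATCH_DELAY_S = 0.05        # short pause between batches to limit DB load (illustrative value)

def enforce_retention(delete_batch: Callable[[int], int], delay_s: float = BATCH_DELAY_S) -> int:
    """Delete eligible rows in batches until the table is drained or the
    per-run safety cap is hit. `delete_batch(limit)` must delete up to
    `limit` eligible rows and return how many it actually removed."""
    deleted = 0
    while deleted < MAX_ROWS_PER_RUN:
        limit = min(BATCH_SIZE, MAX_ROWS_PER_RUN - deleted)
        removed = delete_batch(limit)
        if removed == 0:
            break  # nothing left older than the cutoff
        deleted += removed
        time.sleep(delay_s)
    return deleted

# Simulated table with 2,500 eligible rows.
remaining = [2_500]
def fake_delete(limit: int) -> int:
    n = min(limit, remaining[0])
    remaining[0] -= n
    return n

print(enforce_retention(fake_delete, delay_s=0))  # 2500
```

Dry-run mode would follow the same loop but count matching rows instead of deleting them.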

Execution History: Special Handling

Jobs in the Execution History category respect the keepUntil field. A job will not be deleted until both conditions are met:

  1. The job has reached a terminal state (completed, failed, cancelled, expired)
  2. The job’s keepUntil timestamp has passed AND the retention period has elapsed
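Both conditions amount to a conjunction over the job's state and timestamps. A hypothetical eligibility check (field and function names are illustrative, not the actual implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}

@dataclass
class Job:
    state: str
    finished_at: datetime  # when the job reached its terminal state
    keep_until: datetime   # the job's keepUntil timestamp

def is_deletable(job: Job, cutoff: datetime, now: datetime) -> bool:
    """True only when the job is terminal, its keepUntil has passed,
    and the retention period has elapsed (it finished before the cutoff)."""
    return (job.state in TERMINAL_STATES
            and job.keep_until <= now
            and job.finished_at <= cutoff)

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
cutoff = datetime(2025, 4, 1, tzinfo=timezone.utc)  # e.g. a 90-day policy

old_job = Job("completed", finished_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
              keep_until=datetime(2025, 3, 15, tzinfo=timezone.utc))
pinned = Job("completed", finished_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
             keep_until=datetime(2026, 1, 1, tzinfo=timezone.utc))
print(is_deletable(old_job, cutoff, now), is_deletable(pinned, cutoff, now))  # True False
```

Note how the `pinned` job survives the run despite exceeding the retention period, because its keepUntil lies in the future.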

Monitoring: Aggregation

When Aggregate usage stats before deletion is enabled:

  1. Raw LLM events are grouped by model, provider, and day
  2. Aggregated counts and token usage are upserted into the LlmUsageStats table
  3. The raw events are then deleted

This preserves long-term usage trends for cost analysis and capacity planning while removing verbose per-request data.
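The roll-up in steps 1–3 is essentially a group-by over (model, provider, day). A minimal sketch, with illustrative field names rather than the actual LlmUsageStats columns:

```python
from collections import defaultdict
from datetime import date

# Raw events: (model, provider, event_date, prompt_tokens, completion_tokens)
raw_events = [
    ("gpt-4o", "openai", date(2025, 6, 1), 1200, 300),
    ("gpt-4o", "openai", date(2025, 6, 1), 800, 200),
    ("claude-3-5-sonnet", "anthropic", date(2025, 6, 1), 500, 150),
]

def aggregate(events):
    """Group raw events by (model, provider, day) and sum counts and tokens,
    as the 'aggregate before deletion' option does before purging raw rows."""
    stats = defaultdict(lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0})
    for model, provider, day, p, c in events:
        key = (model, provider, day)
        stats[key]["requests"] += 1
        stats[key]["prompt_tokens"] += p
        stats[key]["completion_tokens"] += c
    return dict(stats)

stats = aggregate(raw_events)
print(stats[("gpt-4o", "openai", date(2025, 6, 1))])
# {'requests': 2, 'prompt_tokens': 2000, 'completion_tokens': 500}
raw_events.clear()  # step 3: raw rows are removed only after the upsert succeeds
```

Ordering matters here: aggregating before deleting ensures a failed upsert never costs you the underlying data.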

Data Transmission

LLM Provider Communication

When M3 Forge sends data to external LLM providers:

  • All data is transmitted via encrypted channels (TLS/HTTPS)
  • Only the data required for the specific execution is sent — no additional context or metadata
  • M3 Forge does not send your data to LLM providers for training or any purpose beyond generating the requested response
  • Provider API keys are stored encrypted in the database and never included in logs

Internal Communication

  • All internal service communication uses TLS when deployed with recommended configuration
  • API requests are authenticated via Bearer tokens or HMAC signatures
  • WebSocket connections for terminal access use the same authentication layer

Data Encryption

At Rest

  • Database — PostgreSQL supports transparent data encryption (TDE). Enable at the infrastructure level based on your compliance requirements.
  • Object storage — S3 objects are encrypted using server-side encryption (SSE-S3 or SSE-KMS depending on your storage provider configuration)
  • Secrets — API keys, provider credentials, and MFA secrets are encrypted with AES-256 before storage

In Transit

  • All external connections require TLS 1.2 or higher
  • Internal service mesh communication is encrypted when running behind a reverse proxy or service mesh with mTLS
  • WebSocket connections use WSS (WebSocket Secure)

Compliance Considerations

Self-Hosted Advantage

Because M3 Forge is self-hosted, your organization maintains full control over:

| Concern | Your Control |
| --- | --- |
| Data residency | Deploy in any region, any cloud, or on-premises |
| Network isolation | Run entirely within your VPC — no external callbacks |
| Encryption keys | Manage your own KMS keys and TLS certificates |
| Retention policies | Configure per-category retention to match your compliance framework |
| Access controls | RBAC with custom roles, MFA, session management |
| Audit trails | All audit data stays in your database — export to your SIEM |

Framework Alignment

| Framework | Relevant Controls |
| --- | --- |
| GDPR | Data retention policies (right to erasure, data minimization), self-hosted data residency, audit trails |
| HIPAA | Encryption at rest and in transit, access controls, audit logging, configurable PHI retention |
| SOC 2 | Data lifecycle management, role-based access, monitoring and alerting, change audit trail |
| PCI DSS | Encryption standards, access controls, retention limits, audit logging |

Next Steps

  • Configure data retention in Settings → Data Retention
  • Set up API Keys for secure programmatic access
  • Configure Users & Roles for access control
  • Monitor system activity in Monitoring