Is Synthetic Data Regulated?


This is one of the most frequently asked questions in data governance today: is synthetic data subject to regulation? The answer matters for every organization evaluating synthetic data for AI training, software testing, analytics, or compliance.

The short answer: synthetic data output is generally not personal data and is not directly regulated as such. But the generation process may trigger significant regulatory obligations depending on how the data is created.

This distinction between process and output is the key to understanding the regulatory landscape for synthetic data. This guide examines the question across major regulatory frameworks and jurisdictions.

The Fundamental Distinction: Process vs. Output

To understand synthetic data regulation, you must separate two questions:

Question 1: Is the synthetic data output regulated?

In most cases, no. If synthetic data cannot be linked back to any identifiable individual, it falls outside the scope of data protection regulations. The output is not personal data.

Question 2: Is the synthetic data generation process regulated?

This depends entirely on the method used. If the generation process involves real personal data as input, data protection laws apply to that process, even though the output may be unregulated.

This creates two fundamentally different regulatory postures:

| Generation Method | Process Regulated | Output Regulated |
| --- | --- | --- |
| Learn-from-real (statistical models trained on real data) | Yes (GDPR, CCPA, etc.) | Generally no |
| Born Synthetic (generated from mathematical models, zero real data) | No | Generally no |

The practical implication is significant. Learn-from-real generators must comply with data protection law during the generation phase. Born Synthetic generators operate entirely outside the scope of data protection regulation.

GDPR Perspective

The General Data Protection Regulation is the most comprehensive and influential data protection framework globally. Its treatment of synthetic data turns on a precise legal analysis.

When GDPR Applies to Synthetic Data Generation

GDPR applies to the processing of personal data (Article 2(1)). Processing includes collection, recording, organization, structuring, adaptation, alteration, and use (Article 4(2)). If a synthetic data generator:

  • Ingests real personal data as training input
  • Analyzes patterns in real personal data to build statistical models
  • Uses real records as seeds, templates, or reference points

Then the generation process constitutes processing of personal data under GDPR. The organization must:

  • Establish a legal basis under Article 6 (typically legitimate interest, requiring a balancing test)
  • Conduct a Data Protection Impact Assessment under Article 35 if the processing involves large-scale profiling or systematic evaluation
  • Respect data subject rights including access (Art. 15), erasure (Art. 17), and objection (Art. 21)
  • Comply with purpose limitation under Article 5(1)(b) — using customer data to generate synthetic training sets is a different purpose than the original collection purpose
  • Ensure compliance with cross-border transfer rules under Chapter V if the real data crosses jurisdictions during processing

When GDPR Does Not Apply

GDPR does not apply to data that is not personal data. Under Recital 26, data is only personal data if it relates to an identified or identifiable natural person, considering all means reasonably likely to be used for identification.

Born Synthetic data is generated from:

  • Mathematical distributions (Pareto, log-normal, constrained algebraic models)
  • Cultural and demographic archetypes (31 profiles across 6 geographic niches)
  • Algorithmic enrichment via local LLMs that have never been trained on the specific real individuals being modeled

No real individual’s data enters the pipeline. The output cannot be linked to any identifiable person because no identifiable person was involved in creation. GDPR does not apply to the generation process or the output.
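To make the pipeline concrete, the kind of generation described above can be sketched in a few lines. This is an illustrative toy, not the actual generator: the distribution parameters, archetype list, and field names are assumptions chosen for the example; only the use of Pareto and log-normal distributions comes from the text.

```python
import random

def generate_profile(rng: random.Random) -> dict:
    """Draw one Born Synthetic record purely from parametric models.

    No real individual's data is consulted at any point; every value
    comes from a distribution whose parameters are design choices.
    """
    # Heavy-tailed net worth via a Pareto distribution (alpha is hypothetical)
    net_worth = 30_000_000 * rng.paretovariate(1.16)
    # Log-normal annual income (mu/sigma are hypothetical design parameters)
    income = rng.lognormvariate(13.0, 0.8)
    # Archetype chosen from a fixed design list, not inferred from real people
    archetype = rng.choice(["entrepreneur", "inheritor", "investor"])
    return {
        "net_worth": round(net_worth, 2),
        "annual_income": round(income, 2),
        "archetype": archetype,
    }

rng = random.Random(42)        # seeded, so generation is reproducible
profile = generate_profile(rng)
```

Because the seed and parameters fully determine the output, the entire dataset can be re-derived from its documentation, which is what makes the provenance argument auditable.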

The GDPR Article 25 Advantage

Article 25 requires data protection by design and by default. For organizations that use Born Synthetic data for development, testing, or AI training, the choice of Born Synthetic data is itself an implementation of data protection by design. It demonstrates to regulators that the organization considered privacy at the architectural level rather than adding protections after the fact.

EU AI Act Perspective

The EU AI Act regulates AI systems rather than data directly. However, it imposes specific requirements on the data used to train, validate, and test AI systems.

Article 10: Training Data Governance

For high-risk AI systems (which, under Annex III, include financial use cases such as creditworthiness assessment and insurance risk pricing), Article 10 requires:

  • Documented data governance practices
  • Quality criteria for training datasets
  • Bias examination
  • Provenance documentation

These requirements apply regardless of whether the training data is real, anonymized, or synthetic. Synthetic data used for AI training is within the scope of Article 10 documentation requirements.

However, the compliance burden differs dramatically by generation method:

| Requirement | Real Data | Learn-from-Real Synthetic | Born Synthetic |
| --- | --- | --- | --- |
| Document data origin | Complex (collection, consent, transfers) | Complex (source data + model + generation) | Simple (mathematical model + parameters) |
| Bias examination | Must audit historical biases | Must audit inherited biases | Biases are explicit design choices |
| Quality criteria | Depends on source data quality | Depends on source + model quality | Defined by generation parameters |
| Provenance chain | Long (collection → storage → transformation → training) | Medium (collection → model training → generation → AI training) | Short (model → generation → AI training) |

Born Synthetic data satisfies Article 10 with the least documentation burden because the provenance chain is short and every parameter is explicitly documented in the Certificate of Sovereign Origin.
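As a hedged illustration of why that provenance chain is short, an Article 10 provenance record for a Born Synthetic dataset might look like the following. The field names and values are hypothetical assumptions for this example; they are not the actual schema of the Certificate of Sovereign Origin.

```python
# Hypothetical Article 10 provenance record for a Born Synthetic dataset.
# All field names and values here are illustrative, not a mandated schema.
provenance = {
    "dataset_id": "uhnwi-sample-001",
    "generation_method": "born-synthetic",
    "real_data_inputs": [],          # the central claim: no personal data used
    "models": [
        {"variable": "net_worth", "distribution": "pareto", "alpha": 1.16},
        {"variable": "annual_income", "distribution": "log-normal",
         "mu": 13.0, "sigma": 0.8},
    ],
    "random_seed": 42,               # makes every record reproducible
    "bias_notes": "Archetype weights are explicit design choices.",
}
```

A reviewer can verify the short chain directly: the record names the mathematical models, their parameters, and an empty list of real-data inputs, with nothing upstream to audit.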

Article 10 Does Not Exempt Synthetic Data

It is important to note that using synthetic data does not exempt an organization from Article 10 compliance. The training data must still be documented, bias-examined, and governed. Synthetic data simplifies this compliance but does not eliminate it.

DORA Perspective

The Digital Operational Resilience Act applies to financial entities and their ICT resilience testing. DORA does not regulate data types directly but creates practical requirements for test data.

Articles 24-25: Resilience Testing

DORA requires financial entities to test their ICT systems comprehensively. Using production data in test environments creates operational risk (a breach during testing becomes an ICT incident under Article 17) and regulatory complexity (GDPR applies to test environments containing real data).

Synthetic data is not merely permitted under DORA; it is implicitly encouraged. Using synthetic data for resilience testing:

  • Eliminates the risk that testing creates a data breach
  • Removes GDPR governance requirements from test environments
  • Enables sharing test data with external penetration testers without Data Processing Agreements
  • Allows testing at production scale without production data exposure

DORA treats synthetic data as part of the solution, not as a regulated problem.

PCI DSS 4.0 Perspective

The Payment Card Industry Data Security Standard version 4.0 takes the most explicit position on synthetic data of any major framework.

Requirement 6.5.5

PCI DSS 4.0 Requirement 6.5.5 explicitly prohibits the use of live Primary Account Numbers (PANs) in pre-production (test and development) environments, except where those environments sit inside the cardholder data environment and are protected to the full standard. Since 31 March 2024, when PCI DSS 3.2.1 was retired, this has effectively mandated synthetic or test card data for any organization handling payment card information.

This is not a case where synthetic data is merely a compliance option. For PCI DSS 4.0, synthetic payment data is effectively a regulatory requirement.

Implications for Born Synthetic

Born Synthetic datasets that include financial identifiers contain synthetic account structures that have never been associated with any real payment instrument. This satisfies Requirement 6.5.5 by design: there are no live PANs that could accidentally migrate into test environments.
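To make the distinction concrete, test PANs can be generated so they pass checksum validation without ever corresponding to an issued card. The sketch below is illustrative: the `999999` prefix is a placeholder, and a real test suite should use the card brands' designated test number ranges instead.

```python
import random

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a partial account number."""
    total = 0
    # Double every second digit, starting from the rightmost digit of the
    # partial number (the one that will sit next to the check digit).
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def synthetic_pan(rng: random.Random, prefix: str = "999999") -> str:
    """Build a 16-digit, Luhn-valid PAN from a placeholder prefix."""
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(9))
    return body + luhn_check_digit(body)

rng = random.Random(7)
pan = synthetic_pan(rng)   # 16 digits, Luhn-valid, never issued to anyone
```

Because the number is structurally valid, it exercises the same validation code paths as a live PAN while carrying zero breach or compliance risk.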

CCPA and US State Privacy Laws

The California Consumer Privacy Act (CCPA), as amended by the CPRA, and similar state privacy laws in Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), and others follow a similar logic to GDPR.

Process vs. Output Analysis

  • If synthetic data is generated from California residents’ personal information, the generation process is subject to CCPA. Consumers retain rights to know, delete, and opt out of the sale/sharing of their data, including its use as input for synthetic data generation.
  • If synthetic data is Born Synthetic, no personal information is processed, and CCPA does not apply to either the process or the output.

The “Derived Data” Question

Some US state laws include concepts of derived data or inferred data. An argument could be made that learn-from-real synthetic data is derived from personal information. Born Synthetic data cannot be characterized as derived from personal information because no personal information was involved in its creation.

Sector-Specific Regulations

Banking (OCC, Fed, FDIC Guidelines)

US banking regulators have issued guidance on model risk management (SR 11-7, OCC 2011-12) that requires documentation of training data for models used in lending, credit, and risk decisions. Synthetic data used for model development must be documented regardless of generation method. Born Synthetic data simplifies this documentation.

Insurance (Solvency II, IAIS)

Insurance regulators increasingly scrutinize the data used for actuarial models and AI-driven underwriting. The same process-vs-output distinction applies: using real policyholder data to generate synthetic training data triggers regulatory oversight. Born Synthetic data does not.

Healthcare (HIPAA)

In healthcare contexts, HIPAA’s Safe Harbor de-identification standard requires removal of 18 identifiers. Synthetic data generated without reference to Protected Health Information (PHI) is not subject to HIPAA. However, learn-from-real synthetic data generated from PHI requires HIPAA compliance during the generation process.

Regulatory Status by Jurisdiction and Method

| Jurisdiction / Framework | Real Data | Learn-from-Real Synthetic (Process) | Learn-from-Real Synthetic (Output) | Born Synthetic (Process) | Born Synthetic (Output) |
| --- | --- | --- | --- | --- | --- |
| GDPR (EU/EEA) | Fully regulated | Regulated | Generally unregulated | Not regulated | Not regulated |
| EU AI Act (Art. 10) | Documentation required | Documentation required | Documentation required | Documentation required | Documentation required |
| DORA (Art. 24-25) | Risk during testing | N/A (not a test data source) | Acceptable for testing | N/A | Acceptable for testing |
| PCI DSS 4.0 (Req. 6.5.5) | Prohibited in test env | N/A | Compliant | N/A | Compliant |
| CCPA/CPRA (US-CA) | Regulated | Regulated | Generally unregulated | Not regulated | Not regulated |
| VCDPA (US-VA) | Regulated | Regulated | Generally unregulated | Not regulated | Not regulated |
| HIPAA (US) | Regulated (if PHI) | Regulated (if from PHI) | Generally unregulated | Not regulated | Not regulated |
| LGPD (Brazil) | Regulated | Regulated | Generally unregulated | Not regulated | Not regulated |
| PDPA (Singapore) | Regulated | Regulated | Generally unregulated | Not regulated | Not regulated |

Note: “Documentation required” under EU AI Act means that even Born Synthetic data used for AI training must be documented under Article 10. This is a governance obligation, not a data protection obligation.

Key Takeaway: Born Synthetic Has the Cleanest Regulatory Posture

Across every framework examined, Born Synthetic data occupies the least regulated position because:

  1. No personal data is processed at any stage, removing GDPR, CCPA, and equivalent obligations from the generation process
  2. The output is not personal data, placing it outside data protection regulation
  3. Documentation requirements (EU AI Act Article 10) are satisfied with minimal burden through the Certificate of Sovereign Origin
  4. Sector-specific prohibitions (PCI DSS 4.0) are inherently satisfied because no real data is involved
  5. Cross-border transfer restrictions do not apply because there is no personal data to restrict

For organizations seeking the simplest, cleanest regulatory posture for their non-production data, Born Synthetic data minimizes legal exposure across all major jurisdictions.

Assess Your Regulatory Exposure

Use the GDPR Risk Assessment tool to evaluate your current data practices against regulatory requirements. The assessment covers training data, test data, and operational data governance.

To evaluate Born Synthetic data quality firsthand, download a free sample of 100 UHNWI profiles with complete documentation, including the Certificate of Sovereign Origin that documents the generation methodology and the absence of real data inputs.

Frequently Asked Questions

If synthetic data is not personal data, why does the EU AI Act still regulate it?

The EU AI Act regulates AI systems, not data types. Article 10 requires that all training data, whether real or synthetic, be governed, documented, and examined for bias. The regulation targets the use of data in AI systems rather than the data protection status of the data itself. Synthetic data simplifies compliance with these requirements but is not exempt from them.

Can a regulator challenge my claim that Born Synthetic data is not personal data?

A regulator could theoretically argue that synthetic data is personal data if it relates to an identifiable individual. For Born Synthetic data, this argument fails because no real individual’s data was used as input. The generated profiles are fictional constructs based on mathematical models. There is no identifiable natural person to whom the data relates under GDPR Recital 26 or equivalent provisions.

Does using synthetic data for AI training create any intellectual property issues?

Born Synthetic data is generated from proprietary mathematical models and does not derive from copyrighted databases or proprietary real-world datasets. Purchasers receive a commercial license for the dataset. Intellectual property concerns are more relevant for learn-from-real generators that may inadvertently memorize and reproduce patterns from copyrighted source databases.

If I switch from masked data to Born Synthetic, do I still need to maintain DPIAs for my old processes?

If you retain historical masked datasets, the DPIAs covering their creation remain relevant documentation. If you decommission all masked datasets and transition fully to Born Synthetic, you may be able to close historical DPIAs for those specific processing activities, subject to your Data Protection Officer’s assessment and any retention obligations.

How should I document Born Synthetic data procurement for internal compliance?

Document the procurement as you would any vendor relationship: vendor assessment, data processing terms (noting that no personal data is processed), the Certificate of Sovereign Origin as provenance evidence, and internal classification of the dataset as non-personal data. This documentation package is typically simpler than the equivalent for masked or learn-from-real synthetic data.


Last updated: March 2026

Learn more about whether synthetic data is regulated, and how Born Synthetic data addresses this, in our glossary and comparison guides.

