Skip to content

AmirhosseinHonardoust/Synthetic-Data-Artist

Repository files navigation

Synthetic Data Artist

Python PyTorch scikit-learn Status CI

A professional research-style Python project for generating and evaluating synthetic tabular data. The project compares a Gaussian Copula generator with a lightweight Variational Autoencoder (VAE) and evaluates the generated data using distribution, correlation, categorical similarity, boundary validity, privacy-proxy, and optional downstream machine-learning utility checks.

Important: This project is a research and portfolio demo, not a certified privacy-preserving synthetic data product.

It can help analyze synthetic data quality, but it does not provide formal differential privacy or guarantee that generated records are safe to release.


Table of Contents


Project Overview

Synthetic data generation is useful when teams want to experiment, prototype, share examples, or test workflows without exposing raw sensitive datasets. However, synthetic data is often misunderstood. Generating fake-looking rows does not automatically make a dataset private, useful, or statistically realistic.

This project takes a more careful approach. It does not only generate synthetic rows; it also evaluates how closely the synthetic data preserves important properties of the original table.

The goal of this project is to demonstrate:

  • A clean synthetic-data generation workflow
  • Statistical and neural synthetic-data methods
  • Honest quality evaluation
  • Privacy-risk proxy diagnostics
  • Optional train-on-synthetic, test-on-real utility evaluation
  • Visual reporting
  • Configurable CLI execution
  • Tests and CI for reproducibility
  • Clear limitations and responsible-use documentation

What This Project Does

This project can:

  • Load a real tabular CSV dataset
  • Generate synthetic rows using a Gaussian Copula method
  • Generate synthetic rows using a lightweight VAE method
  • Detect numeric and categorical columns automatically
  • Preserve the original column structure in synthetic outputs
  • Validate input data and configuration before generation
  • Generate quality metrics in JSON format
  • Generate a one-row quality_summary.csv for quick comparison
  • Create visual diagnostics for distributions, PCA, correlations, and pairplots
  • Produce lightweight HTML reports
  • Compute privacy proxy diagnostics such as exact duplicate rate and nearest-neighbor distances
  • Optionally evaluate downstream ML utility when a target column is configured
  • Run automated tests and CI smoke workflows

What This Project Does Not Do

This project does not:

  • Provide formal differential privacy
  • Certify that synthetic data is safe to publish
  • Guarantee that generated records cannot leak information
  • Replace domain-specific privacy review
  • Replace mature libraries such as SDV, CTGAN, or commercial privacy platforms
  • Guarantee strong performance on every dataset
  • Prove that a synthetic dataset is suitable for high-stakes use

A production-grade synthetic-data system would require stronger schema constraints, formal privacy evaluation, domain validation, monitoring, and security review.


Features

  • Gaussian Copula generator for statistical synthetic tabular data
  • Variational Autoencoder generator for neural synthetic tabular data
  • Automatic schema detection for numeric and categorical columns
  • Configurable VAE hyperparameters
  • Configurable output directories
  • Input dataframe validation
  • Config validation with clear error messages
  • Distribution overlap metrics
  • Correlation difference metrics
  • Categorical distribution similarity
  • Numeric summary-statistic differences
  • Boundary violation checks
  • Exact duplicate-rate check
  • Nearest-neighbor privacy proxy metrics
  • Optional ML utility evaluation
  • PCA projection chart
  • Correlation heatmap
  • Distribution comparison chart
  • Pairplot comparison chart
  • HTML report generation
  • Unit test suite
  • GitHub Actions CI support

Methods

The project currently supports two synthetic-data generation methods.

Gaussian Copula

The Gaussian Copula method models feature distributions and dependency patterns, then samples new rows from the fitted statistical structure.

It is often useful for small or medium-sized tabular datasets where statistical relationships are relatively stable.

Variational Autoencoder

The VAE method learns a compressed latent representation of the dataset and decodes synthetic rows from that latent space.

In this repository, the VAE is intentionally lightweight and configurable. It should be treated as a baseline neural generator, not a fully tuned production VAE.


Charts and Visual Analysis

The project automatically generates charts to make synthetic-data quality easier to inspect.

Generated charts are saved in:

outputs/<run_name>/plots/

Main charts include:

Chart Purpose
Distribution overlap Compares numeric feature distributions between real and synthetic data
PCA projection Shows whether real and synthetic rows occupy similar low-dimensional space
Correlation heatmap Compares correlation structure between real and synthetic data
Pairplot comparison Provides visual pairwise comparisons for sampled rows

These charts are diagnostic tools. They help identify obvious quality problems, but they should not be treated as proof that synthetic data is private or production-ready.

Distribution Overlap Comparison

Copula VAE
Copula distribution overlap VAE distribution overlap
Analysis: The Copula generator preserves the real numeric distributions much better on this demo dataset. It keeps the feature shapes closer to the original data and achieves a mean distribution-overlap score of approximately 0.943. Analysis: The lightweight VAE baseline produces more compressed distributions and loses more tail behavior. Its mean distribution-overlap score is approximately 0.596, which indicates weaker distribution preservation.

PCA Projection Comparison

Copula VAE
Copula PCA projection VAE PCA projection
Analysis: Copula samples cover a similar region of the real-data space, suggesting better diversity and coverage of the original feature space. Analysis: VAE samples are more concentrated around the center, suggesting weaker coverage and less diversity in this lightweight baseline configuration.

Metric Summary

Method Distribution overlap ↑ Correlation diff ↓ Categorical similarity ↑ Exact duplicate rate ↓
Copula 0.943 0.026 0.957 0.000
VAE 0.596 0.151 0.591 0.000

Interpretation: Higher distribution overlap and categorical similarity are better. Lower correlation difference and duplicate rate are better. On this demo dataset, Copula is the stronger generator, while the VAE is useful as a neural baseline but currently underfits the real data distribution.

Additional Pairplot Diagnostics
Copula VAE
Copula pairplot VAE pairplot
Analysis: Copula better preserves the overall spread and pairwise relationships between numeric variables. Analysis: VAE samples are visibly more concentrated and do not preserve the full spread of the real data as well.

How the Evaluation Works

The evaluation workflow compares real and synthetic datasets across multiple dimensions:

Real dataset
     ↓
Synthetic generator
     ↓
Synthetic dataset
     ↓
Quality metrics + visual diagnostics + optional ML utility evaluation

The evaluation includes:

  • Numeric distribution similarity
  • Numeric correlation preservation
  • Categorical distribution similarity
  • Summary-statistic differences
  • Boundary validity checks
  • Privacy proxy diagnostics
  • Optional train-synthetic-test-real utility checks

This makes the project more useful than a generator-only demo, because it asks whether the generated data actually behaves like the original data.


Project Structure

Synthetic-Data-Artist/
│
├── .github/
│   └── workflows/
│       └── ci.yml
│
├── data/
│   ├── real_data.csv
│   └── synthetic_data_*.csv
│
├── outputs/
│   └── <run_name>/
│       ├── metrics.json
│       ├── quality_summary.csv
│       └── plots/
│           ├── distribution_overlap.png
│           ├── pca_projection.png
│           ├── correlation_heatmap.png
│           └── pairplot_comparison.png
│
├── reports/
│   └── <run_name>_report.html
│
├── synthetic_data_artist/
│   ├── main.py
│   ├── config.py
│   ├── data.py
│   ├── schema.py
│   ├── models/
│   │   ├── copula.py
│   │   └── vae.py
│   ├── evaluation/
│   │   ├── metrics/
│   │   │   ├── distribution.py
│   │   │   ├── privacy.py
│   │   │   └── utility.py
│   │   └── plots.py
│   └── reporting/
│       └── html_report.py
│
├── tests/
│   ├── test_core_contracts.py
│   ├── test_project_integrity.py
│   ├── test_enhanced_evaluation_metrics.py
│   └── test_cli_and_validation.py
│
├── config.yaml
├── requirements.txt
├── requirements-vae.txt
├── README.md
└── LICENSE

Installation

1. Clone the Repository

git clone https://github.com/AmirhosseinHonardoust/Synthetic-Data-Artist.git
cd Synthetic-Data-Artist

2. Create a Virtual Environment

On Windows CMD:

python -m venv .venv
.venv\Scripts\activate

On macOS/Linux:

python -m venv .venv
source .venv/bin/activate

3. Install

Install the project (and its dependencies) in editable mode:

pip install -e .

This also provides a synthetic-data-artist command. The Copula generator and all evaluation run on the core dependencies. The VAE generator (--method vae) additionally needs PyTorch, kept as an optional extra so the default install stays light:

pip install -e .[vae]

Prefer plain requirements files? pip install -r requirements.txt (and -r requirements-vae.txt for the VAE extra) install the same dependency sets.


Running the Generator

All commands below use python -m synthetic_data_artist.main; after pip install -e . the equivalent synthetic-data-artist command is also available.

Run the Copula workflow:

python -m synthetic_data_artist.main --method copula --run_name copula_run

Run the VAE workflow:

python -m synthetic_data_artist.main --method vae --run_name vae_run

Validate configuration and input data without generating synthetic data:

python -m synthetic_data_artist.main --validate-only

Run a faster workflow without the pairplot:

python -m synthetic_data_artist.main --method copula --run_name fast_copula --skip-pairplot

Command-Line Usage

Basic CLI example:

python -m synthetic_data_artist.main \
  --config config.yaml \
  --data data/real_data.csv \
  --method copula \
  --run_name copula_experiment

Available options:

Option Description
--config Path to YAML configuration file
--data Path to real input CSV file
--method Generation method: copula or vae
--run_name Name used for output folders and files
--rows Override the number of synthetic rows
--outdir Override the root output directory
--data-outdir Override the synthetic CSV output directory
--report-dir Override the HTML report directory
--skip-pairplot Skip pairplot generation for faster runs
--validate-only Validate config/data/schema and exit

Example with custom directories:

python -m synthetic_data_artist.main \
  --method vae \
  --run_name experiment_vae \
  --rows 500 \
  --outdir experiment_outputs \
  --data-outdir experiment_data \
  --report-dir experiment_reports \
  --skip-pairplot

Configuration

Main configuration is stored in:

config.yaml

Example configuration:

rows: 1000
categorical_threshold: 20
seed: 42
pca_components: 2
hist_bins: 30
pairplot_sample: 500

paths:
  data_dir: data
  output_dir: outputs
  report_dir: reports

plots:
  pairplot: true

vae:
  epochs: 30
  batch_size: 128
  latent_dim: 8
  hidden_dim: 64
  learning_rate: 0.001
  kl_weight: 0.001

evaluation:
  privacy_max_rows: 500
  ml_utility:
    target: null
    test_size: 0.25

To enable ML utility evaluation, set a target column:

evaluation:
  ml_utility:
    target: target
    test_size: 0.25

Generated Outputs

Each run creates a synthetic CSV, metrics, plots, and an HTML report.

data/synthetic_data_<run_name>.csv
outputs/<run_name>/metrics.json
outputs/<run_name>/quality_summary.csv
outputs/<run_name>/plots/distribution_overlap.png
outputs/<run_name>/plots/pca_projection.png
outputs/<run_name>/plots/correlation_heatmap.png
outputs/<run_name>/plots/pairplot_comparison.png
reports/<run_name>_report.html

Output Files

File Purpose
synthetic_data_<run_name>.csv Generated synthetic dataset
metrics.json Full structured evaluation metrics
quality_summary.csv Compact one-row summary for comparing runs
plots/ Visual diagnostics
<run_name>_report.html Lightweight HTML report

Evaluation

The project uses a multi-part evaluation workflow.

Evaluation includes:

  • Distribution overlap
  • Correlation difference
  • Categorical similarity
  • Numeric summary-statistic differences
  • Boundary violation rate
  • Privacy proxy diagnostics
  • Optional ML utility evaluation

Distribution Overlap

Measures how close numeric feature distributions are between real and synthetic data using Jensen-Shannon distance transformed into an overlap-style score.

Higher is better.

Correlation Difference

Compares numeric correlation matrices between real and synthetic data.

Lower is better.

Categorical Similarity

Compares category proportions between real and synthetic data using total-variation similarity.

Higher is better.

Numeric Summary Difference

Compares scaled differences in numeric summary statistics such as mean, standard deviation, minimum, and maximum.

Lower is better.

Boundary Violation Rate

Checks whether synthetic values fall outside observed real-data numeric ranges or create invalid categorical values.

Lower is better.

ML Utility Evaluation

When a target column is configured, the project compares:

train on real data      → test on held-out real data
train on synthetic data → test on held-out real data

This helps estimate whether synthetic data preserves downstream predictive utility.


Privacy Proxy Analysis

The project includes lightweight privacy-risk proxy diagnostics.

These include:

  • Exact duplicate rate
  • Mean nearest-neighbor distance
  • 5th percentile nearest-neighbor distance
  • Minimum nearest-neighbor distance

These metrics help identify potential memorization or overly close synthetic records.

Important: These are proxy diagnostics only. They do not prove privacy and should not be treated as a formal privacy guarantee.


Testing

Run the test suite:

python -m unittest discover -s tests -v

Compile source and test files:

python -m compileall synthetic_data_artist tests

The tests check important project behavior, including:

  • Schema detection
  • Copula output contracts
  • VAE output contracts
  • Enhanced evaluation metrics
  • Config validation
  • Input dataframe validation
  • CLI argument parsing
  • Existing metrics JSON validity
  • Requirements formatting
  • Source compilation

Code Quality

The project includes automated workflow checks through:

.github/workflows/ci.yml

The CI workflow checks:

  • Dependency installation
  • Source compilation
  • Unit tests
  • Config and input validation
  • Copula smoke workflow
  • VAE smoke workflow
  • Expected output files
  • Required metrics in generated JSON files

This provides a basic reproducibility and regression safety net for future changes.


Limitations

This project has important limitations.

The project:

  • Uses demo data by default
  • Does not provide formal differential privacy
  • Does not certify that synthetic data is safe to publish
  • Uses a lightweight baseline VAE
  • May not preserve complex real-world relationships
  • Uses proxy privacy diagnostics, not formal privacy proofs
  • Requires domain-specific validation for real datasets
  • May generate poor synthetic data if the input data is small, noisy, or highly constrained
  • May be slow on larger datasets when pairplot generation is enabled

High quality scores on one dataset do not guarantee that the method will work well on another dataset.


Responsible Use

This project is intended for:

  • Synthetic data education
  • Research-style experimentation
  • Portfolio demonstration
  • Data quality diagnostics
  • Comparing simple synthetic-data generation methods
  • Learning about synthetic-data evaluation workflows

It should not be used as-is for:

  • Publishing synthetic data derived from sensitive records
  • Healthcare, financial, legal, or high-stakes data release
  • Replacing formal privacy review
  • Claiming differential privacy
  • Production synthetic-data deployment without additional safeguards

Before using synthetic data in sensitive contexts, evaluate duplicate rates, nearest-neighbor distances, domain constraints, utility metrics, and privacy risks with expert review.


Future Improvements

Possible future improvements include:

  • Add CTGAN or TVAE-style generators
  • Add formal privacy evaluation methods
  • Add richer schema metadata and constraints
  • Add per-column quality cards
  • Add train-synthetic-test-real benchmark reports
  • Add support for larger benchmark datasets
  • Add a Streamlit dashboard for visual comparison
  • Add experiment tracking across multiple runs
  • Add Docker support
  • Add model artifact saving and loading
  • Add more advanced missing-data handling
  • Add configurable plot generation levels

Tech Stack

  • Python
  • pandas
  • NumPy
  • SciPy
  • scikit-learn
  • PyTorch
  • matplotlib
  • seaborn
  • PyYAML
  • HTML reports
  • unittest
  • GitHub Actions

Author

Amir Honardoust

GitHub: @AmirhosseinHonardoust


License

This project is intended for educational, research, and portfolio purposes.

If you use or modify this project, please keep the responsible-use notes and limitations clear.

About

A professional, research-grade comparison of Gaussian Copula and Variational Autoencoder (VAE) methods for synthetic tabular data generation. Includes full evaluation pipeline with distribution overlap, correlation analysis, PCA projections, pairplots, metrics, and automated visual reports.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors