A practical framework to evaluate the privacy-utility tradeoff of synthetic data publishing
Based on "Synthetic Data - Anonymisation Groundhog Day, Theresa Stadler, Bristena Oprisanu, and Carmela Troncoso, arXiv, 2020"
The module attack_models so far includes:

MIAAttackClassifier: A privacy adversary to test for privacy gain with respect to linkage attacks, modelled as a membership inference attack (an illustrative sketch of this setup follows the list).
AttributeInferenceAttack: A simple attribute inference attack that aims to infer a target's sensitive value given partial knowledge about the target record.
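To make the linkage attack model concrete, here is a minimal, self-contained sketch of membership inference framed as binary classification. It is purely illustrative and does not use the attack_models API; the shadow datasets, the naive feature extraction, and the Gaussian data-generating process are simplified assumptions.

# Illustrative sketch only: not the API of attack_models.
# A membership inference (linkage) attack learns to tell apart synthetic
# datasets generated from training data that did or did not contain the target.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def naive_features(synthetic_dataset):
    # Crude summary statistics of a synthetic dataset; a real attack
    # would typically use richer features (e.g. histograms, correlations).
    return np.concatenate([synthetic_dataset.mean(axis=0),
                           synthetic_dataset.std(axis=0)])

# Hypothetical shadow datasets: 50 generated without the target (label 0)
# and 50 with the target (label 1), faked here as a small mean shift.
shadow_sets = [rng.normal(loc=label * 0.1, size=(1000, 5))
               for label in (0, 1) for _ in range(50)]
labels = [0] * 50 + [1] * 50

X = np.stack([naive_features(s) for s in shadow_sets])
attack_model = RandomForestClassifier(n_estimators=100, random_state=0)
attack_model.fit(X, labels)

# The adversary's membership guess for a freshly published synthetic dataset.
test_set = rng.normal(loc=0.1, size=(1000, 5))
print("Membership guess:", attack_model.predict([naive_features(test_set)])[0])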
The module generative_models so far includes:

IndependentHistogram: An independent histogram model adapted from Data Responsibly's DataSynthesiser (see the sketch after this list for the core idea).
BayesianNet: A generative model based on a Bayesian Network adapted from Data Responsibly's DataSynthesiser.
PrivBayes: A differentially private version of the BayesianNet model adapted from Data Responsibly's DataSynthesiser.
CTGAN: A conditional tabular generative adversarial network that integrates the CTGAN model from CTGAN.
PATE-GAN: A differentially private generative adversarial network adapted from its original implementation by the MLforHealth Lab.
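To give a flavour of the simplest entry in this list, the toy class below captures the core idea of an independent histogram model: learn one marginal histogram per column and sample each column independently, ignoring correlations. It is a simplified illustration, not the DataSynthesiser-based implementation shipped in generative_models.

# Toy version of the independent-histogram idea (illustration only).
import numpy as np
import pandas as pd

class ToyIndependentHistogram:
    def __init__(self, bins=20):
        self.bins = bins
        self.marginals_ = {}

    def fit(self, df):
        # Learn one marginal histogram per column.
        for col in df.columns:
            counts, edges = np.histogram(df[col], bins=self.bins)
            self.marginals_[col] = (counts / counts.sum(), edges)
        return self

    def sample(self, n, seed=0):
        rng = np.random.default_rng(seed)
        out = {}
        for col, (probs, edges) in self.marginals_.items():
            # Pick a bin according to the marginal, then draw uniformly inside it.
            idx = rng.choice(len(probs), size=n, p=probs)
            out[col] = rng.uniform(edges[idx], edges[idx + 1])
        return pd.DataFrame(out)

rng = np.random.default_rng(1)
raw = pd.DataFrame({"age": rng.integers(18, 90, 1000),
                    "income": rng.normal(50_000, 15_000, 1000)})
synthetic = ToyIndependentHistogram().fit(raw).sample(1000)
print(synthetic.describe())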
For your convenience, Synthetic Data is also distributed as a ready-to-use Docker image containing Python 3.9 and CUDA 11.4.2, along with all of its dependencies, including Jupyter Notebook to visualise and analyse the results.
Note: this distribution includes CUDA binaries; before downloading the image, make sure you have read the CUDA EULA and agree to its terms.
Pull the image and run a container (and bind a volume where you want to save the data):
docker pull springepfl/synthetic-data:latest
docker run -it --rm -v "$(pwd)/output:/output" -p 8888:8888 springepfl/synthetic-data
The Synthetic Data directory is located at the root of the container's filesystem:
cd /synthetic_data_release
You should now be able to run the examples without encountering any problems, and to visualise the results with Jupyter by running
jupyter notebook --allow-root --ip=0.0.0.0
and opening the notebook in your favourite web browser at http://127.0.0.1:8888/?token=<authentication token>.
The framework and its building blocks have been developed and tested under Python 3.9.
To mimic our environment exactly, we recommend using poetry. To install poetry (system-wide), follow the official installation instructions.
Then run
poetry install
from inside the project directory. This will create a virtual environment (by default .venv), which can be activated by running poetry shell, or in the usual way (with source .venv/bin/activate).
For a pip installation, we recommend creating a virtual environment and installing all dependencies by running
python3 -m venv pyvenv3
source pyvenv3/bin/activate
pip install -r requirements.txt
To run a privacy evaluation with respect to the privacy concern of linkability, you can run
python3 linkage_cli.py -D data/texas -RC tests/linkage/runconfig.json -O tests/linkage
The results file produced after successfully running the script will be written to tests/linkage and can be parsed with the function load_results_linkage provided in utils/analyse_results.py.
A Jupyter notebook to visualise and analyse the results is included at notebooks/Analyse Results.ipynb.
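If you prefer working outside the notebook, the loading step could look roughly like the sketch below. It assumes load_results_linkage accepts the output directory path and returns a pandas DataFrame; check utils/analyse_results.py for the actual parameters and return type.

# Hedged sketch: the exact interface of load_results_linkage may differ.
from utils.analyse_results import load_results_linkage

linkage_results = load_results_linkage("tests/linkage")
print(linkage_results.head())      # assuming a pandas DataFrame is returned
print(linkage_results.describe())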
To run a privacy evaluation with respect to the privacy concern of inference, you can run
python3 inference_cli.py -D data/texas -RC tests/inference/runconfig.json -O tests/inference
The results file produced after successfully running the script can be parsed with the function load_results_inference provided in utils/analyse_results.py.
A Jupyter notebook to visualise and analyse the results is included at notebooks/Analyse Results.ipynb.
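Loading works analogously to the linkage case; again a hedged sketch, assuming the function accepts the output directory and returns a pandas DataFrame (see utils/analyse_results.py for the actual interface).

# Hedged sketch: check utils/analyse_results.py for the real parameters.
from utils.analyse_results import load_results_inference

inference_results = load_results_inference("tests/inference")
print(inference_results.head())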
To run a utility evaluation with a simple classification task as the utility function, run
python3 utility_cli.py -D data/texas -RC tests/utility/runconfig.json -O tests/utility
The results file produced after successfully running the script can be parsed with the function load_results_utility provided in utils/analyse_results.py.
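The utility notion used here can also be illustrated independently of the framework: train a classifier on synthetic data, score it on held-out real data, and compare against a classifier trained on the real data itself. The sketch below uses made-up data and scikit-learn and is not the code behind utility_cli.py.

# Generic "train on synthetic, test on real" illustration (not utility_cli.py).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up "real" data: 5 features, binary label driven by the first feature.
X_real = rng.normal(size=(2000, 5))
y_real = (X_real[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X_real, y_real, random_state=0)

# Stand-in for a released synthetic dataset (noisy copies of the training records).
X_syn = X_train + rng.normal(scale=0.3, size=X_train.shape)
y_syn = y_train

real_model = LogisticRegression().fit(X_train, y_train)
syn_model = LogisticRegression().fit(X_syn, y_syn)

print("Accuracy, trained on real data:     ", accuracy_score(y_test, real_model.predict(X_test)))
print("Accuracy, trained on synthetic data:", accuracy_score(y_test, syn_model.predict(X_test)))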