Synthetic_Medical_Tabular_Data

Generate synthetic medical data from a patient population dataset.

SCENARIO

Dr. AA is conducting a study to better understand a rare disease. Because the disease is so rare, he could not enroll enough patients to adequately power statistical analyses.

At an international conference, he met Dr. BB, who was presenting case reports on the disease and who also acknowledged the difficulty of recruiting patients for a study.

They agreed to collaborate. However, their individual hospitals would not give them permission to share patient data due to privacy constraints.

How can they conduct robust research on this challenging patient population while maintaining strict privacy standards?

Primary Objective:

To generate a synthetic medical dataset for a patient population

Secondary Objectives:

Demonstrate the transformation of human-interpretable categorical and continuous data to a tensor input, and the re-transformation of the tensor output to human-interpretable data
Explore the use of tabular data (continuous and categorical variables) in deep learning generative models
Evaluate the performance of generative models in terms of their ability to create synthetic data that closely approximates the real data:

Variational AutoEncoder
Generative Adversarial Network

Methodology:

DATA LOADING and PREPARATION

Using medical patient data from https://synthetichealth.github.io/synthea/

Get continuous and categorical variables
Merge continuous and categorical variables
Split data to Train / Test sets
Data Transformation
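
The transformation step above, and the reverse step back to human-interpretable values, can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the toy columns (`ages`, `genders`), the helper names `fit_transform` and `inverse_transform`, and the choice of min-max scaling with one-hot encoding are all assumptions. In practice the scaling parameters would be fitted on the Train set only and reused for the Test set.

```python
import numpy as np

# Hypothetical toy table: column names and values are illustrative only.
ages = np.array([34.0, 51.0, 67.0, 45.0])   # continuous variable
genders = np.array(["F", "M", "F", "M"])     # categorical variable

def fit_transform(cont, cat):
    """Min-max scale the continuous column and one-hot encode the categorical one."""
    lo, hi = cont.min(), cont.max()
    scaled = (cont - lo) / (hi - lo)
    levels = np.unique(cat)  # sorted category labels
    one_hot = (cat[:, None] == levels[None, :]).astype(np.float32)
    x = np.column_stack([scaled, one_hot]).astype(np.float32)
    return x, {"lo": lo, "hi": hi, "levels": levels}

def inverse_transform(x, params):
    """Map a model-ready matrix back to original units and category labels."""
    cont = x[:, 0] * (params["hi"] - params["lo"]) + params["lo"]
    cat = params["levels"][np.argmax(x[:, 1:], axis=1)]
    return cont, cat

x, params = fit_transform(ages, genders)          # matrix fed to the model
ages_back, genders_back = inverse_transform(x, params)
```

Taking the `argmax` over the one-hot block is what lets a model's soft outputs be read back as a single category per row.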

CREATE AND TRAIN VARIATIONAL AUTOENCODER (VAE) MODEL

Define sampling batch
Create Encoder
Create Decoder
Define losses
Compile and train
Visualize results
Evaluate model on Test set
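
As a rough sketch of the sampling and loss steps listed above, the following NumPy code shows the standard reparameterization trick and a VAE loss made of a reconstruction term plus a KL divergence term. It is not taken from the notebook: the function names are hypothetical, and the use of squared error for reconstruction is an assumption, since the source does not state which reconstruction loss was used.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

def kl_divergence(z_mean, z_log_var):
    """KL(N(mean, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)
    return float(kl.mean())

def reconstruction_loss(x, x_hat):
    """Squared error summed over features, averaged over the batch (an assumed choice)."""
    return float(np.sum((x - x_hat) ** 2, axis=1).mean())

# Hypothetical encoder outputs for a batch of 4 rows and a latent size of 2.
z_mean = np.zeros((4, 2))
z_log_var = np.zeros((4, 2))
z = sample_latent(z_mean, z_log_var, rng)

# Stand-in for a decoder output; a perfect reconstruction gives zero loss.
x = rng.random((4, 3))
x_hat = x.copy()
total_loss = reconstruction_loss(x, x_hat) + kl_divergence(z_mean, z_log_var)
```

The KL term is zero when the encoder outputs a standard normal and grows as the latent distribution drifts away from it, which is what regularizes the latent space.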

CREATE AND TRAIN A GENERATIVE ADVERSARIAL NETWORK (GAN) MODEL

Create Generator
Create Discriminator
Define Losses
Compile and train
Visualize results
Evaluate model on Test set
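
The loss definitions for the two networks can be illustrated with the common binary cross-entropy formulation below. This is a sketch under assumptions, not the notebook's code: the function names are hypothetical, and the non-saturating generator loss shown here is one standard choice among several.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real rows scored 1 and generated rows scored 0."""
    real_term = binary_cross_entropy(np.ones_like(d_real), d_real)
    fake_term = binary_cross_entropy(np.zeros_like(d_fake), d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating form: the generator wants its rows scored 1."""
    return binary_cross_entropy(np.ones_like(d_fake), d_fake)

# Hypothetical discriminator scores for three real and three generated rows.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.3, 0.2])
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

Because each network's loss falls when the other's rises, the two training curves tend to move against each other rather than decrease together.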

RESULTS

VAE

GAN

Findings:

The GAN model used a single hidden layer (100 nodes for the generator and 150 nodes for the discriminator) and a latent dimension of 10. Training showed a torsade (twisting) pattern, illustrating the adversarial interplay between the generator and the discriminator. 'Convergence' can be considered to have been reached at around epochs 100 to 200.
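
To make the layer sizes concrete, here is a minimal NumPy forward pass with the dimensions described above (latent size 10, 100 hidden nodes in the generator, 150 in the discriminator). Everything else is an assumption for illustration: the number of output features (`N_FEATURES`), the ReLU and sigmoid activations, and the random untrained weights. It shows only how the shapes flow through the two networks, not how the notebook builds or trains them.

```python
import numpy as np

LATENT_DIM = 10     # from the text
GEN_HIDDEN = 100    # from the text
DISC_HIDDEN = 150   # from the text
N_FEATURES = 12     # hypothetical width of the transformed table

rng = np.random.default_rng(42)

def dense(n_in, n_out):
    """Random weights and zero biases for one fully connected layer (untrained)."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

g_w1, g_b1 = dense(LATENT_DIM, GEN_HIDDEN)
g_w2, g_b2 = dense(GEN_HIDDEN, N_FEATURES)
d_w1, d_b1 = dense(N_FEATURES, DISC_HIDDEN)
d_w2, d_b2 = dense(DISC_HIDDEN, 1)

def generator(z):
    """Latent noise -> one hidden layer -> synthetic rows scaled to (0, 1)."""
    return sigmoid(relu(z @ g_w1 + g_b1) @ g_w2 + g_b2)

def discriminator(x):
    """Table rows -> one hidden layer -> probability that each row is real."""
    return sigmoid(relu(x @ d_w1 + d_b1) @ d_w2 + d_b2)

z = rng.standard_normal((5, LATENT_DIM))   # five latent samples
fake_rows = generator(z)                   # shape (5, N_FEATURES)
scores = discriminator(fake_rows)          # shape (5, 1)
```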

The synthetic data generated from the Test set showed a distribution nearly equivalent to that of the real Test set. (Note: the distributions matched more closely at fewer epochs, but a higher epoch count was used here to demonstrate the adversarial training.) This suggests that the model training was reasonable and that the model is able to generalize to new data.
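
One way to put a number on "nearly equivalent" distributions is sketched below. The source does not state how the comparison was made, so this is only an assumed approach: a two-sample Kolmogorov-Smirnov statistic for continuous columns and the largest gap in category proportions for categorical columns. The function names and the sample data are hypothetical.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def category_gap(real, synth):
    """Largest absolute difference in category proportions between two columns."""
    levels = np.union1d(real, synth)
    p_real = np.array([(real == lv).mean() for lv in levels])
    p_synth = np.array([(synth == lv).mean() for lv in levels])
    return float(np.max(np.abs(p_real - p_synth)))

# Hypothetical real and synthetic columns, for illustration only.
rng = np.random.default_rng(1)
real_age = rng.normal(50.0, 10.0, 500)
synth_age = rng.normal(50.0, 10.0, 500)
real_sex = np.array(["F", "M", "F", "M"])
synth_sex = np.array(["F", "F", "F", "M"])

age_gap = ks_statistic(real_age, synth_age)   # 0 = identical, 1 = fully separated
sex_gap = category_gap(real_sex, synth_sex)
```

Both measures lie between 0 and 1, with smaller values indicating that the synthetic column follows the real one more closely.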

This shows that GAN can be applied to generate synthetic tabular data containing both continuous and categorical variables.

COMPARING RESULTS BETWEEN VAE AND GAN:

The VAE reached convergence faster. With this small dataset, both the VAE and the GAN produced synthetic data whose distributions were similar to those of the real data.

Because the GAN continued to improve over the course of training, it can be hypothesized that GANs may be more useful for larger and more complex datasets.

TECHNICAL CONCLUSION:

VAE and GAN generative models can generate reasonable synthetic tabular data containing heterogeneous variables.

REAL-WORLD APPLICATION:

This approach can be useful in medical research as a way to ease privacy restrictions. It could enable data sharing between institutions, leading to data that is more representative of the whole population, particularly for rare diseases.
