Generate synthetic medical data from a patient population dataset.
Dr. AA is conducting a study to better understand a rare disease. Because the disease is so rare, however, he could not recruit enough patients to adequately power statistical analyses.
At an international conference, he met Dr. BB, who was presenting case reports on the disease and who also acknowledged the difficulty of recruiting patients for a study.
They agreed to collaborate. However, their individual hospitals would not give them permission to share patient data due to privacy constraints.
How can they conduct robust research on this challenging patient population while maintaining strict privacy standards?
To generate a synthetic medical dataset for a patient population
- Demonstrate the transformation of human-interpretable categorical and continuous data to a tensor input, and the re-transformation of the tensor output to human-interpretable data
- Explore the use of tabular data (continuous and categorical variables) in deep learning generative models
- Evaluate the performance of generative models with regard to their ability to create synthetic data that closely approximates the real data:
  - Variational AutoEncoder
  - Generative Adversarial Network
- Using medical patient data from https://synthetichealth.github.io/synthea/
- Get continuous and categorical variables
- Merge continuous and categorical variables
- Split data into Train / Test sets
- Data Transformation
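The data preparation steps above can be sketched as follows. This is a minimal illustration, assuming min-max scaling for continuous variables and one-hot encoding for categorical variables; the project's actual transformation is not specified here. The toy records, column meanings, and the helper names `fit_transform` and `inverse_transform` are all hypothetical.

```python
import numpy as np

# Hypothetical toy records (not taken from the Synthea data).
# Continuous columns: age, systolic blood pressure.
continuous = np.array([[34.0, 118.0], [61.0, 142.0], [47.0, 130.0], [29.0, 110.0]])
# Categorical columns: gender, smoker.
categorical = [["F", "no"], ["M", "yes"], ["F", "yes"], ["M", "no"]]


def fit_transform(continuous, categorical):
    """Scale continuous columns to [0, 1] and one-hot encode categorical columns."""
    lo, hi = continuous.min(axis=0), continuous.max(axis=0)
    scaled = (continuous - lo) / (hi - lo)  # assumes no constant column
    columns = list(zip(*categorical))
    levels = [sorted(set(col)) for col in columns]
    one_hot = [
        np.array([[float(value == level) for level in lv] for value in col])
        for col, lv in zip(columns, levels)
    ]
    encoded = np.hstack([scaled] + one_hot)
    return encoded, {"lo": lo, "hi": hi, "levels": levels}


def inverse_transform(encoded, meta):
    """Map a numeric array (e.g. model output) back to readable values."""
    n_cont = len(meta["lo"])
    restored = encoded[:, :n_cont] * (meta["hi"] - meta["lo"]) + meta["lo"]
    labels, start = [], n_cont
    for lv in meta["levels"]:
        block = encoded[:, start:start + len(lv)]
        # The most activated position decides the category.
        labels.append([lv[i] for i in block.argmax(axis=1)])
        start += len(lv)
    return restored, [list(row) for row in zip(*labels)]


encoded, meta = fit_transform(continuous, categorical)
restored_cont, restored_cat = inverse_transform(encoded, meta)

# Simple shuffled Train / Test split (the 3:1 ratio is an arbitrary choice).
rng = np.random.default_rng(0)
order = rng.permutation(len(encoded))
test, train = encoded[order[:1]], encoded[order[1:]]
```

Taking the `argmax` over each one-hot block on the way back means the same inverse step also works on a model's soft outputs, which are rarely exact zeros and ones.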
- Define sampling batch
- Create Encoder
- Create Decoder
- Define losses
- Compile and train
- Visualize results
- Evaluate model on Test set
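The "Define sampling batch" and "Define losses" steps of the VAE can be illustrated with a small NumPy sketch. It assumes the standard reparameterization trick and a loss made of a reconstruction term plus a KL divergence term. The squared-error reconstruction term and the latent size of 3 are assumptions made for illustration, not details taken from the project.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_divergence(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, I), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)


def reconstruction_loss(x, x_hat):
    """Squared error per row (assumed here; cross-entropy is a common alternative)."""
    return np.sum((x - x_hat) ** 2, axis=1)


def vae_loss(x, x_hat, mu, log_var):
    """Batch mean of reconstruction loss plus KL divergence."""
    return float(np.mean(reconstruction_loss(x, x_hat) + kl_divergence(mu, log_var)))


rng = np.random.default_rng(0)
mu = np.zeros((4, 3))       # hypothetical encoder output: latent means
log_var = np.zeros((4, 3))  # hypothetical encoder output: latent log-variances
z = sample_latent(mu, log_var, rng)
```

Sampling through `mu` and `log_var` rather than drawing `z` directly is what keeps the sampling step differentiable when the same logic is written in a deep learning framework.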
- Create Generator
- Create Discriminator
- Define Losses
- Compile and train
- Visualize results
- Evaluate model on Test set
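The "Define Losses" step of the GAN can be sketched in NumPy as below. This assumes the common binary cross-entropy formulation with a non-saturating generator loss; the exact losses used in the project are not stated here, and the function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))


def discriminator_loss(d_real, d_fake):
    """The discriminator should score real rows as 1 and synthetic rows as 0."""
    return (binary_cross_entropy(d_real, np.ones_like(d_real))
            + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))


def generator_loss(d_fake):
    """The generator is rewarded when the discriminator scores its rows as real."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))


# Hypothetical discriminator scores for a batch of real and synthetic rows.
d_real = np.array([0.9, 0.8, 0.7])
d_fake = np.array([0.2, 0.1, 0.3])
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

Because the two losses pull the discriminator's scores on synthetic rows in opposite directions, an improvement in one network tends to raise the loss of the other.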
The GAN model used a single hidden layer (100 nodes for the generator and 150 nodes for the discriminator) and a latent variable size of 10. Model training showed a torsade pattern, illustrating the adversarial interplay between the training of the generator and discriminator components. 'Convergence' can be considered to have been reached at around epochs 100 to 200.
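To make the layer sizes concrete, here is a minimal NumPy sketch of a forward pass through a generator and discriminator with those dimensions. Only the latent size (10) and the hidden layer widths (100 and 150) come from the description above. The number of features, the ReLU and sigmoid activations, and the weight initialization are assumptions, and the weights are untrained.

```python
import numpy as np

LATENT_DIM = 10    # latent variable size, as described above
GEN_HIDDEN = 100   # generator hidden nodes, as described above
DISC_HIDDEN = 150  # discriminator hidden nodes, as described above
N_FEATURES = 6     # hypothetical width of one transformed patient row


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (an arbitrary initialization)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
g_w1, g_b1 = init_layer(rng, LATENT_DIM, GEN_HIDDEN)
g_w2, g_b2 = init_layer(rng, GEN_HIDDEN, N_FEATURES)
d_w1, d_b1 = init_layer(rng, N_FEATURES, DISC_HIDDEN)
d_w2, d_b2 = init_layer(rng, DISC_HIDDEN, 1)


def generator(z):
    """Latent noise -> one hidden layer -> synthetic row with values in (0, 1)."""
    return sigmoid(relu(z @ g_w1 + g_b1) @ g_w2 + g_b2)


def discriminator(x):
    """Data row -> one hidden layer -> probability that the row is real."""
    return sigmoid(relu(x @ d_w1 + d_b1) @ d_w2 + d_b2)


noise = rng.standard_normal((8, LATENT_DIM))
fake_rows = generator(noise)       # shape (8, N_FEATURES)
scores = discriminator(fake_rows)  # shape (8, 1)
```

A sigmoid output on the generator pairs naturally with inputs scaled to [0, 1], so synthetic rows land in the same range as the transformed real data.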
The synthetic data generated from the Test set showed a distribution nearly equivalent to that of the real Test set. (Note: the distributions matched more closely at fewer epochs, but a higher epoch count was used here to demonstrate the adversarial training.) This shows that the model training was reasonable and that the model is able to generalize to new data.
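How closely the real and synthetic distributions match can also be expressed numerically. The sketch below is one possible approach, not necessarily the one used in this project: a two-sample Kolmogorov-Smirnov statistic for a continuous variable and a total variation distance for a categorical variable. The sample values and function names are hypothetical.

```python
import numpy as np


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def total_variation(a, b):
    """Half the summed absolute difference in category frequencies (0 to 1)."""
    a, b = list(a), list(b)
    levels = sorted(set(a) | set(b))
    p = np.array([a.count(level) / len(a) for level in levels])
    q = np.array([b.count(level) / len(b) for level in levels])
    return float(0.5 * np.sum(np.abs(p - q)))


# Hypothetical values for one continuous and one categorical variable.
real_age = np.array([34.0, 61.0, 47.0, 29.0, 52.0])
synthetic_age = np.array([36.0, 58.0, 45.0, 31.0, 50.0])
real_gender = ["F", "M", "F", "M", "F"]
synthetic_gender = ["F", "M", "M", "M", "F"]

age_gap = ks_statistic(real_age, synthetic_age)
gender_gap = total_variation(real_gender, synthetic_gender)
```

Both measures are 0 when the two samples have identical distributions and 1 when they do not overlap at all, so they can be tracked across epochs alongside the visual comparison.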
This shows that GAN can be applied to generate synthetic tabular data containing both continuous and categorical variables.
VAE reached convergence faster. Both VAE and GAN had similar distributions between real and synthetic data with this small dataset.
Because the GAN continued to improve over the course of training, it can be hypothesized that GAN may be more useful for larger and more complex datasets.
VAE and GAN generative models can generate reasonable synthetic tabular data containing heterogeneous variables.
This will be useful in medical research to alleviate privacy restrictions. It will enable data sharing between different institutions which will lead to data that is more representative of the whole population, particularly for rare diseases.