Our DL 289G Project
Deep learning for computer vision plays an important role in medicine. This project focuses on disease prediction from patients' chest X-ray images using CNNs and RNNs. We present a deep learning model that takes in a sequence of a patient's consecutive previous chest X-rays and analyzes the variation across that sequence. For feature extraction, the model uses convolutional neural networks (CNNs) such as DenseNet, MobileNet, and ResNet. We also compare and analyze the impact of LSTMs applied to the feature maps extracted by these CNN models. Overall, we aim to present a single deep learning framework that takes in more than one X-ray per patient, treats these X-rays as an image sequence, and predicts the disease label from the differences observed across the regions of each follow-up X-ray. Our goal is to determine how much follow-up X-ray images contribute to predicting disease labels.
The dataset we used is found here: https://www.kaggle.com/nih-chest-xrays/data
Scripts used for data preprocessing and data cleaning
- Filter_and_Create_Sample_Sets.ipynb
  - Pick patients who have at least 3 follow-ups (indexing from 0)
  - Create two sample datasets based on view position and store them as CSVs
  - Relevant CSV files for this script: Data_Entry_2017.csv, df_updated_view_postion.csv, df_updated_finding_labels.csv, df_PA.csv, df_AP.csv
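The filtering and splitting steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's actual code. It assumes the column names `Patient ID`, `Follow-up #`, and `View Position` match the headers of Data_Entry_2017.csv, and that "at least 3 follow-ups" means at least three recorded X-rays per patient. The helper names and the sample rows are hypothetical.

```python
from collections import Counter


def filter_by_followups(rows, min_images=3):
    """Keep only rows of patients that have at least `min_images` X-rays."""
    counts = Counter(row["Patient ID"] for row in rows)
    return [row for row in rows if counts[row["Patient ID"]] >= min_images]


def split_by_view(rows):
    """Group rows into one list per view position (for example 'PA' or 'AP')."""
    by_view = {}
    for row in rows:
        by_view.setdefault(row["View Position"], []).append(row)
    return by_view


# Hypothetical sample rows; real rows could come from csv.DictReader.
sample_rows = [
    {"Patient ID": "1", "Follow-up #": "0", "View Position": "PA"},
    {"Patient ID": "1", "Follow-up #": "1", "View Position": "PA"},
    {"Patient ID": "1", "Follow-up #": "2", "View Position": "AP"},
    {"Patient ID": "2", "Follow-up #": "0", "View Position": "PA"},
]

kept = filter_by_followups(sample_rows)  # patient 2 has a single X-ray and is dropped
views = split_by_view(kept)              # one list of rows per view position
```

Each list in `views` could then be written to its own CSV, in the spirit of df_PA.csv and df_AP.csv.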
- Preprocess_Analyze_Image_Datasets.ipynb
  - PA and AP image dataset processing
  - Adding full paths and some basic preprocessing
  - Train, test, and validation dataset creation
  - Analyzing the samples for label distributions
  - Saving the preprocessed images as arrays and storing them in pickle files
  - Relevant files for this script: df_PA.csv, df_AP.csv, added_paths_PA.csv, added_paths_AP.csv, PA_train.csv, PA_test.csv, PA_val.csv, AP_train.csv, AP_test.csv, AP_val.csv, AP_images.pkl, PA_images.pkl
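One way to create the train, validation, and test sets is to split by patient, so that all X-rays of one patient land in the same set. The sketch below assumes such a patient-level split, a 70/15/15 ratio, and a fixed seed. None of these choices is confirmed by the notebook, and the function name `split_patients` is hypothetical.

```python
import random


def split_patients(patient_ids, val_pct=15, test_pct=15, seed=0):
    """Split unique patient IDs into disjoint train, validation, and test sets."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # seeded shuffle keeps the split reproducible
    n_test = len(ids) * test_pct // 100
    n_val = len(ids) * val_pct // 100
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


# Hypothetical patient IDs 0..99
train_ids, val_ids, test_ids = split_patients(range(100))
```

Rows would then be assigned to a set according to the set their patient ID belongs to.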
- Process_NIH_Dataset_Details.ipynb
  - Process NIH dataset details
  - Data analysis using data visualization
  - Relevant files for this script: BBox_List_2017.csv, Data_Entry_2017.csv
- Sample_Set_Images.ipynb
  - Manual feature extraction for the PA and AP view positions
  - Relevant files for this script: df_AP.csv
- verify_files.py
  - Check whether the files are correctly merged
Scripts used for single-image input models
- AP_X_ray_images_baseline_dataprocessing_v2.ipynb and PA_X_ray_images_baseline_dataprocessing_v2.ipynb
  - For single-image preprocessing, we loaded the AP or PA dataframes (from df_pa.csv and df_ap.csv), linked the images from Google Drive, and saved the results to added_paths_ap.csv and added_paths_pa.csv. We then split each dataset into train, validation, and test sets, resized the images, and saved them as pickle files.
  - Relevant files for this script: df_AP.csv, added_paths_AP.csv, train_AP.pkl, val_AP.pkl, test_AP.pkl, df_PA.csv, added_paths_PA.csv, train_PA.pkl, val_PA.pkl, test_PA.pkl
- Single_Xray_AP_results.ipynb and Single_Xray_PA_results.ipynb
  - Storing and analyzing results for single AP and PA X-ray images
  - Relevant files for this script: added_paths_PA.csv, added_paths_AP.csv, train_df_DenseNet.csv, valid_df_DenseNet.csv, test_df_DenseNet.csv
- APmodelling.py and PAmodelling.py
  - As a point of comparison for DenseNet, ResNet, and MobileNet, we tested our datasets on a simple CNN model with 5 layers, 1000 units, and a kernel size of 7. The dropout rate was 40%, and the model used the softmax activation function and the Adam optimizer. The CNN model has 15 outputs. The loss function was categorical cross-entropy, and accuracy was the evaluation metric. After running the CNN, we saved our results to pickle files.
  - Relevant files for this script: train.pkl, val.pkl, test.pkl
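To make the loss function concrete, the sketch below computes softmax followed by categorical cross-entropy for a single sample with 15 outputs, using plain Python rather than Keras. It only illustrates the formula and is not the code from APmodelling.py or PAmodelling.py. The function names and the example logits are hypothetical.

```python
import math

NUM_CLASSES = 15  # number of model outputs stated in the description above


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Clip predictions away from 0 so that log() stays finite.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Hypothetical example: the model favours class 3, which is also the true class.
logits = [0.0] * NUM_CLASSES
logits[3] = 2.0
y_true = [0.0] * NUM_CLASSES
y_true[3] = 1.0

probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)
```

A model that spreads its probability evenly over all 15 classes would score a loss of log(15), so values below that indicate the prediction leans toward the correct class.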
Scripts used for three-image input models
- BaseModelScript.ipynb
  - Load images and get the outputs: X, y creation
  - For both PA and AP
  - Train, test, and validation X, y sets
  - DenseNet modeling experiment with LSTM/without LSTM
  - Relevant files for this script: PA_images.pkl, AP_images.pkl, PA_train.csv, PA_test.csv, PA_val.csv, AP_train.csv, AP_test.csv, AP_val.csv
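The X, y creation step can be pictured as grouping each patient's X-rays in follow-up order into a sequence of three images plus a label vector. The sketch below is an illustration under several assumptions, not the notebook's code: it assumes the column names `Image Index`, `Patient ID`, `Follow-up #`, and `Finding Labels` from the NIH CSVs, takes the first three X-rays per patient, and uses the labels of the last X-ray in the sequence as the target. The 15-entry label list (14 findings plus `No Finding`), the helper names, and the sample rows are also assumptions.

```python
LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax", "No Finding",
]


def encode_labels(finding_labels):
    """Turn a '|'-separated label string into a multi-hot vector over LABELS."""
    present = set(finding_labels.split("|"))
    return [1 if name in present else 0 for name in LABELS]


def build_sequences(rows, images, seq_len=3):
    """Return (X, y): per-patient image sequences and their label vectors."""
    by_patient = {}
    for row in rows:
        by_patient.setdefault(row["Patient ID"], []).append(row)

    X, y = [], []
    for patient_rows in by_patient.values():
        ordered = sorted(patient_rows, key=lambda r: int(r["Follow-up #"]))
        if len(ordered) < seq_len:
            continue  # not enough follow-ups to form a sequence
        seq = ordered[:seq_len]
        X.append([images[r["Image Index"]] for r in seq])
        # Assumption: the target is the label set of the last X-ray in the sequence.
        y.append(encode_labels(seq[-1]["Finding Labels"]))
    return X, y


# Hypothetical rows and placeholder "images" (strings instead of pixel arrays).
rows = [
    {"Image Index": "a_001.png", "Patient ID": "1", "Follow-up #": "1", "Finding Labels": "Effusion"},
    {"Image Index": "a_000.png", "Patient ID": "1", "Follow-up #": "0", "Finding Labels": "No Finding"},
    {"Image Index": "a_002.png", "Patient ID": "1", "Follow-up #": "2", "Finding Labels": "Effusion|Mass"},
    {"Image Index": "b_000.png", "Patient ID": "2", "Follow-up #": "0", "Finding Labels": "No Finding"},
]
images = {row["Image Index"]: "pixels of " + row["Image Index"] for row in rows}

X, y = build_sequences(rows, images)
```

With real data, the values in `images` would be the image arrays loaded from PA_images.pkl or AP_images.pkl rather than placeholder strings.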
- DenseNetPAModellingFinal.ipynb and DenseNet_AP_Modeling.ipynb
  - DenseNet169 in-depth modeling experiment with LSTM/without LSTM on PA and AP
  - DenseNet169 with LSTM/without LSTM result ROC analysis
  - DenseNet169 with LSTM/without LSTM result Loss analysis
  - DenseNet169 with LSTM/without LSTM result Accuracy analysis
  - Relevant files for this script: PA_train.csv, PA_test.csv, PA_val.csv, AP_train.csv, AP_test.csv, AP_val.csv, PA_images.pkl, AP_images.pkl
- Modeling_MobileNetV2_AP_.ipynb and Modeling_MobileNetV2_PA_.ipynb
  - MobileNetV2 in-depth modeling experiment with LSTM/without LSTM on PA and AP
  - MobileNetV2 with LSTM/without LSTM result ROC analysis
  - MobileNetV2 with LSTM/without LSTM result Loss analysis
  - MobileNetV2 with LSTM/without LSTM result Accuracy analysis
  - Relevant files for this script: PA_train.csv, PA_test.csv, PA_val.csv, AP_train.csv, AP_test.csv, AP_val.csv, PA_images.pkl, AP_images.pkl
- Modeling_ResNetV2_AP_.ipynb and Modeling_ResNetV2_PA_.ipynb
  - ResNet50V2 in-depth modeling experiment with LSTM/without LSTM on PA and AP
  - ResNet50V2 with LSTM/without LSTM result ROC analysis
  - ResNet50V2 with LSTM/without LSTM result Loss analysis
  - ResNet50V2 with LSTM/without LSTM result Accuracy analysis
  - Relevant files for this script: PA_train.csv, PA_test.csv, PA_val.csv, AP_train.csv, AP_test.csv, AP_val.csv, PA_images.pkl, AP_images.pkl
- Loss_Acc_Plots.ipynb
  - A summary of the loss and accuracy plots for the DenseNet, MobileNetV2, and ResNetV2 experiments on the architecture with/without LSTM
Libraries used
- Pandas
- Numpy
- Keras
- Tensorflow
- OS
- CSV
- Pickle
- tqdm
- Sklearn
- Collections
- PIL
- Matplotlib
- Seaborn
- glob
- CV2
- Time
- Google.colab
Files stored in the data_csv_files directory
- added_paths_AP.csv: contains the full Google Drive file path of the X-ray image for each AP data point
- added_paths_PA.csv: contains the full Google Drive file path of the X-ray image for each PA data point
- AP_test.csv: contains the test set for AP
- PA_test.csv: contains the test set for PA
- AP_val.csv: contains the validation set for AP
- PA_val.csv: contains the validation set for PA
- AP_train.csv: contains the training set for AP
- PA_train.csv: contains the training set for PA
Files available on Google Drive for the modelling work: https://drive.google.com/drive/folders/1SezfLewxe0jiSGxc2m1yLnNzMFrwHotQ?usp=sharing
- single_image_files: The files required for the single-image baseline modelling on both the PA and AP datasets. This directory contains 2 sub-directories:
  - data: PA datasets stored as pickle files (train.pkl, test.pkl, val.pkl) and AP datasets stored as pickle files (train_AP.pkl, test_AP.pkl, val_AP.pkl)
  - pretrained-models: The pretrained model files taken from the Coursera model: pretrained_model.h5, densenet.hdf5
- images_3_followup: The files required for the three-follow-up-image models for both PA and AP images. The CSVs are in the data_csv_files directory, while the images are saved as a dictionary mapping each image filename to its image array, a 2D array of size (128,128).
  - PA_images.pkl: PA images stored as a dictionary mapping the image filename to the image array of size (128,128).
  - AP_images.pkl: AP images stored as a dictionary mapping the image filename to the image array of size (128,128).
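Reading such a pickle back needs only the standard library. The sketch below writes a tiny stand-in dictionary to a temporary file and loads it again. The filename key and the nested-list "image" are hypothetical placeholders for the real filename-to-array mapping, and a temporary path is used instead of the actual PA_images.pkl or AP_images.pkl.

```python
import os
import pickle
import tempfile

SIZE = 128  # image side length stated above

# Hypothetical stand-in: a 128 x 128 grid of zeros instead of real pixel data.
fake_image = [[0] * SIZE for _ in range(SIZE)]
images = {"example_000.png": fake_image}

# Write the dictionary to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), "images_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(images, f)

# Load it back and look up one image by filename.
with open(path, "rb") as f:
    loaded = pickle.load(f)

height = len(loaded["example_000.png"])
width = len(loaded["example_000.png"][0])
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.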