Chest-X-ray-DL

Our DL 289G Project

Project Overview

Deep learning for computer vision plays an important role in medicine. Our project focuses on disease prediction from patients' chest X-ray images using CNNs and RNNs. We present a deep learning model that takes in a sequence of a patient's consecutive previous chest X-rays and analyzes the variation and differences across that sequence. For the feature extraction phase, the model uses convolutional neural networks (CNNs) such as DenseNet, MobileNet, and ResNet. We also compare and analyze the impact of LSTMs applied to the feature maps extracted by these CNN models. Overall, we intend to present a single deep learning framework that takes in more than one X-ray per patient, treats these X-rays as an image sequence, and predicts the disease label based on the differences observed across the follow-up X-rays. Our goal is to identify how follow-up X-ray images contribute to predicting the disease labels.

The dataset we used is found here: https://www.kaggle.com/nih-chest-xrays/data

Brief Script Description and Usage

Preprocessing Scripts

Scripts used for data preprocessing and data cleaning

  1. Filter_and_Create_Sample_Sets.ipynb

    1. Select patients who have at least 3 follow-ups (follow-up indexing starts at 0)
    2. Create two sample datasets based on view position (PA and AP) and store them as CSV files
    3. Relevant CSV files for this script:
      1. Data_Entry_2017.csv
      2. df_updated_view_postion.csv
      3. df_updated_finding_labels.csv
      4. df_PA.csv
      5. df_AP.csv
  2. Preprocess_Analyze_Image_Datasets.ipynb

    1. PA, AP images dataset processing
      1. Adding full paths and some basic preprocessing
      2. Train, test, and validation dataset creation
      3. Analyzing the samples for label distributions
    2. Saving the preprocessed images as arrays and storing them in pickle files
    3. Relevant files for this script:
      1. df_PA.csv
      2. df_AP.csv
      3. added_paths_PA.csv
      4. added_paths_AP.csv
      5. PA_train.csv
      6. PA_test.csv
      7. PA_val.csv
      8. AP_train.csv
      9. AP_test.csv
      10. AP_val.csv
      11. AP_images.pkl
      12. PA_images.pkl
  3. Process_NIH_Dataset_Details.ipynb

    1. Process NIH dataset details
    2. Data analysis using data visualization
    3. Relevant files for this script:
      1. BBox_List_2017.csv
      2. Data_Entry_2017.csv
  4. Sample_Set_Images.ipynb

    1. Manual feature extraction for the PA and AP view positions
    2. Relevant files for this script:
      1. df_AP.csv
  5. verify_files.py

    1. Check whether the files are correctly merged
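
The filtering and view-position split described for Filter_and_Create_Sample_Sets.ipynb could look roughly like the sketch below. It is not the notebook's code. The column names (`Patient ID`, `Follow-up #`, `View Position`, `Image Index`) follow the public Data_Entry_2017.csv header, the function names are hypothetical, and the toy rows are made up so the example runs without the dataset. It also assumes images are counted per patient across both views, which the notebook may handle differently.

```python
import pandas as pd


def keep_patients_with_followups(df, min_images=3):
    """Keep rows of patients that have at least `min_images` X-rays."""
    # Number of images per patient, broadcast back onto every row.
    counts = df.groupby("Patient ID")["Image Index"].transform("count")
    kept = df[counts >= min_images]
    return kept.sort_values(["Patient ID", "Follow-up #"]).reset_index(drop=True)


def split_by_view(df):
    """Return one DataFrame per view position, e.g. {"PA": ..., "AP": ...}."""
    return {view: part.reset_index(drop=True)
            for view, part in df.groupby("View Position")}


# Toy stand-in for Data_Entry_2017.csv (illustrative values only).
toy = pd.DataFrame({
    "Image Index": ["00000001_000.png", "00000001_001.png", "00000001_002.png",
                    "00000002_000.png", "00000002_001.png"],
    "Patient ID": [1, 1, 1, 2, 2],
    "Follow-up #": [0, 1, 2, 0, 1],
    "View Position": ["PA", "PA", "AP", "PA", "AP"],
})

filtered = keep_patients_with_followups(toy)
views = split_by_view(filtered)
```

With the real data, `toy` would be replaced by `pd.read_csv("Data_Entry_2017.csv")`, and each entry of `views` could be written out with `to_csv` to produce files such as df_PA.csv and df_AP.csv.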

single_image_models scripts

Scripts used for single-image input models

  1. AP_X_ray_images_baseline_dataprocessing_v2.ipynb and PA_X_ray_images_baseline_dataprocessing_v2.ipynb

    1. For single-image preprocessing, we load the AP or PA dataframe (from df_AP.csv or df_PA.csv), link each entry to its image on Google Drive, and save the result to added_paths_AP.csv or added_paths_PA.csv. We then split each dataset into train, validation, and test sets, resize the images, and save them as pickle files.
    2. Relevant files for this script:
      1. df_AP.csv
      2. added_paths_AP.csv
      3. train_AP.pkl
      4. val_AP.pkl
      5. test_AP.pkl
      6. df_PA.csv
      7. added_paths_PA.csv
      8. train_PA.pkl
      9. val_PA.pkl
      10. test_PA.pkl
  2. Single_Xray_AP_results.ipynb and Single_Xray_PA_results.ipynb

    1. Storing and analyzing results for single AP and PA X-ray images
    2. Relevant files for this script:
      1. added_paths_PA.csv
      2. added_paths_AP.csv
      3. train_df_DenseNet.csv
      4. valid_df_DenseNet.csv
      5. test_df_DenseNet.csv
  3. APmodelling.py and PAmodelling.py

    1. To provide a point of comparison for DenseNet, ResNet, and MobileNet, we tested our datasets on a simple CNN model with 5 layers, 1000 units, and a kernel size of 7. The model uses a dropout rate of 40%, a softmax activation function, and 15 outputs. We trained it with the Adam optimizer and a categorical cross-entropy loss, and used accuracy as the metric. After running the CNN, we saved our results to pickle files.
    2. Relevant files for this script:
      1. train.pkl
      2. val.pkl
      3. test.pkl
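
A minimal Keras sketch of a baseline CNN along the lines described for APmodelling.py and PAmodelling.py is shown below. The kernel size of 7, the 1000-unit dense layer, the 40% dropout, the 15-way softmax output, the Adam optimizer, the categorical cross-entropy loss, and the accuracy metric come from the description above. The exact layer arrangement, the filter counts, and the 128x128 grayscale input shape are assumptions, and `build_baseline_cnn` is a hypothetical name, so this should not be read as the scripts' actual architecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 15             # from the README: the CNN has 15 outputs
INPUT_SHAPE = (128, 128, 1)  # assumption: grayscale images resized to 128x128


def build_baseline_cnn(input_shape=INPUT_SHAPE, num_classes=NUM_CLASSES):
    """Small CNN with kernel size 7, a 1000-unit dense layer and 40% dropout."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # Filter counts (16, 32) are illustrative, not taken from the scripts.
        layers.Conv2D(16, 7, padding="same", activation="relu"),
        layers.MaxPooling2D(4),
        layers.Conv2D(32, 7, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(1000, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_baseline_cnn()
# Forward pass on a dummy batch of two blank images.
probs = model.predict(np.zeros((2,) + INPUT_SHAPE, dtype="float32"), verbose=0)
```

Training would then call `model.fit` on the arrays loaded from train.pkl and val.pkl, with one-hot labels of length 15.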

three_image_models scripts

Scripts used for three-image input models

  1. BaseModelScript.ipynb
    1. Load images and get the outputs: X,y creation
    2. For both PA and AP
      1. Train, test, validate X,Y sets
      2. DenseNet modeling experiment with LSTM/without LSTM
    3. Relevant files for this script:
      1. PA_images.pkl
      2. AP_images.pkl
      3. PA_train.csv
      4. PA_test.csv
      5. PA_val.csv
      6. AP_train.csv
      7. AP_test.csv
      8. AP_val.csv
  2. DenseNetPAModellingFinal.ipynb and DenseNet_AP_Modeling.ipynb
    1. DenseNet169 in-depth modeling experiment with LSTM/without LSTM on PA and AP
    2. DenseNet169 with LSTM/without LSTM result ROC analysis
    3. DenseNet169 with LSTM/without LSTM result Loss analysis
    4. DenseNet169 with LSTM/without LSTM result Accuracy analysis
    5. Relevant files for this script:
      1. PA_train.csv
      2. PA_test.csv
      3. PA_val.csv
      4. AP_train.csv
      5. AP_test.csv
      6. AP_val.csv
      7. PA_images.pkl
      8. AP_images.pkl
  3. Modeling_MobileNetV2_AP_.ipynb and Modeling_MobileNetV2_PA_.ipynb
    1. MobileNetV2 in-depth modeling experiment with LSTM/without LSTM on PA and AP
    2. MobileNetV2 with LSTM/without LSTM result ROC analysis
    3. MobileNetV2 with LSTM/without LSTM result Loss analysis
    4. MobileNetV2 with LSTM/without LSTM result Accuracy analysis
    5. Relevant files for this script:
      1. PA_train.csv
      2. PA_test.csv
      3. PA_val.csv
      4. AP_train.csv
      5. AP_test.csv
      6. AP_val.csv
      7. PA_images.pkl
      8. AP_images.pkl
  4. Modeling_ResNetV2_AP_.ipynb and Modeling_ResNetV2_PA_.ipynb
    1. ResNet50V2 in-depth modeling experiment with LSTM/without LSTM on PA and AP
    2. ResNet50V2 with LSTM/without LSTM result ROC analysis
    3. ResNet50V2 with LSTM/without LSTM result Loss analysis
    4. ResNet50V2 with LSTM/without LSTM result Accuracy analysis
    5. Relevant files for this script:
      1. PA_train.csv
      2. PA_test.csv
      3. PA_val.csv
      4. AP_train.csv
      5. AP_test.csv
      6. AP_val.csv
      7. PA_images.pkl
      8. AP_images.pkl
  5. Loss_Acc_Plots.ipynb
    1. A summary of the loss and accuracy plots for the DenseNet, MobileNetV2, and ResNetV2 experiments on the architecture with and without LSTM
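
The "with LSTM / without LSTM" comparison used throughout these notebooks can be sketched as below. This is an illustration under assumptions, not the notebooks' code: a tiny convolutional stack stands in for the DenseNet169, MobileNetV2, or ResNet50V2 backbone so the example builds quickly, the non-LSTM variant simply flattens the three per-image feature vectors, and the 15-way softmax head mirrors the single-image description. The function names and layer sizes are hypothetical.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 3                # three follow-up X-rays per patient
IMG_SHAPE = (128, 128, 1)  # images are stored as (128,128) arrays; channel axis added
NUM_CLASSES = 15           # assumption: same label space as the single-image models


def build_feature_extractor(img_shape=IMG_SHAPE):
    """Tiny stand-in CNN; the notebooks use DenseNet169, MobileNetV2 or ResNet50V2."""
    return keras.Sequential([
        keras.Input(shape=img_shape),
        layers.Conv2D(8, 3, strides=2, activation="relu"),
        layers.Conv2D(16, 3, strides=2, activation="relu"),
        layers.GlobalAveragePooling2D(),
    ], name="cnn_features")


def build_sequence_model(use_lstm=True):
    """Apply the same CNN to each X-ray, then combine the three feature vectors."""
    inputs = keras.Input(shape=(SEQ_LEN,) + IMG_SHAPE)
    # TimeDistributed shares the CNN weights across the three time steps.
    feats = layers.TimeDistributed(build_feature_extractor())(inputs)
    if use_lstm:
        x = layers.LSTM(32)(feats)    # models the order of the follow-ups
    else:
        x = layers.Flatten()(feats)   # ignores order, just concatenates features
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(inputs, outputs)


batch = np.zeros((2, SEQ_LEN) + IMG_SHAPE, dtype="float32")
with_lstm = build_sequence_model(use_lstm=True).predict(batch, verbose=0)
without_lstm = build_sequence_model(use_lstm=False).predict(batch, verbose=0)
```

Keeping the two variants identical except for the layer that combines the per-image features makes the loss, accuracy, and ROC comparisons reflect the effect of the LSTM alone.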

Applied Dependencies

  1. Pandas
  2. Numpy
  3. Keras
  4. Tensorflow
  5. OS
  6. CSV
  7. Pickle
  8. tqdm
  9. Sklearn
  10. Collections
  11. PIL
  12. Matplotlib
  13. Seaborn
  14. glob
  15. CV2
  16. Time
  17. Google.colab

File Dependencies

Files stored in the data_csv_files directory:

  1. added_paths_AP.csv contains the full Google Drive file path of the X-ray image for each AP data point
  2. added_paths_PA.csv contains the full Google Drive file path of the X-ray image for each PA data point
  3. AP_test.csv contains the test set for AP
  4. PA_test.csv contains the test set for PA
  5. AP_val.csv contains the validation set for AP
  6. PA_val.csv contains the validation set for PA
  7. AP_train.csv contains the training set for AP
  8. PA_train.csv contains the training set for PA

Files required for modelling are available on Google Drive: https://drive.google.com/drive/folders/1SezfLewxe0jiSGxc2m1yLnNzMFrwHotQ?usp=sharing

  1. single_image_files: the files required to run the single-image baseline models for both the PA and AP datasets. This directory contains 2 sub-directories:

    1. data: PA datasets stored as pickle files (train.pkl, test.pkl, val.pkl) and AP datasets stored as pickle files (train_AP.pkl, test_AP.pkl, val_AP.pkl)
    2. pretrained-models: the pretrained model files taken from the Coursera model: pretrained_model.h5, densenet.hdf5
  2. images_3_followup: the files required to run the three-follow-up image models for both PA and AP images. The CSV files are in the data_csv_files directory; the images are saved as a dictionary that maps each image filename to its image array, a 2D array of size (128,128).

    1. PA_images.pkl: PA images stored as a dictionary mapping each image filename to its image array of size (128,128).
    2. AP_images.pkl: AP images stored as a dictionary mapping each image filename to its image array of size (128,128).
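
As a rough illustration of how such a pickle file can be consumed, the sketch below loads a filename-to-array dictionary and stacks three follow-up images into one model input. The helper names, the demo file name PA_images_demo.pkl, and the fake image contents are hypothetical; only the dictionary layout and the (128,128) array size come from the description above.

```python
import os
import pickle
import tempfile

import numpy as np


def load_image_dict(path):
    """Load a pickled dict that maps image filename -> 2D array of shape (128, 128)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def stack_followups(images, filenames):
    """Stack the given images into one array of shape (len(filenames), 128, 128, 1)."""
    seq = np.stack([images[name] for name in filenames]).astype("float32")
    return seq[..., np.newaxis]  # add a trailing channel axis


# Build a small fake dictionary so the example runs without the real files.
fake_images = {f"00000001_00{i}.png": np.full((128, 128), i, dtype="uint8")
               for i in range(3)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "PA_images_demo.pkl")
    with open(path, "wb") as fh:
        pickle.dump(fake_images, fh)
    images = load_image_dict(path)

# Sorting the filenames puts the follow-ups in order 000, 001, 002.
sequence = stack_followups(images, sorted(images))
print(sequence.shape)  # (3, 128, 128, 1)
```

With the real data, `path` would point at PA_images.pkl or AP_images.pkl, and the filenames for each patient would come from the corresponding train, validation, or test CSV.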