GitHub - charlie-becker/UMBC_CT_Project: An Approach to Tuning Hyperparameters in Parallel

Team 3 Project of the CyberTraining program at UMBC in 2019 http://cybertraining.umbc.edu/

Title: An Approach to Tuning Hyperparameters in Parallel - A Performance Study

Team members: Charlie Becker; Bin Wang; Will Mayfield; Sarah Murphy

Mentors: Dr. Matthias Gobbert; Carlos Barajas

This is a working example of a performance study completed on the 'Taki' HPC cluster at UMBC. It uses a combination of popular Python modules to tune hyperparameters in parallel. The data and base model configuration are borrowed from the Machine Learning in Python for Environmental Science Problems AMS Short Course, provided by David John Gagne from the National Center for Atmospheric Research. The repository for that course can be found at https://github.com/djgagne/ams-ml-python-course

Below are brief directions for reproducing the results in the technical report. Full results are presented and discussed in Technical_Report.pdf.

Workflow

After cloning this repository, first run download_data.py, which downloads the data into a data directory.

Next, run preprocess.py, which preprocesses the data and augments it to give a balanced dataset. It creates .npy files and places them in the data directory for easy access.

Next, run submit_2013.slurm (2013 partition), submit_2018.slurm (2018 partition), or submit_gpu.slurm (2018 GPU nodes) to submit the performance study to the cluster. These scripts call run_2013.py, run_2018.py, and run_gpu.py respectively, which is where the hyperparameters and additional SLURM arguments, such as the number of nodes, are defined. Specifically, cluster.scale(x) sets the number of nodes requested.
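To illustrate the idea behind scaling out a hyperparameter search, here is a minimal sketch that evaluates a small grid of hyperparameter combinations in parallel. It is not the code in run_2013.py, run_2018.py, or run_gpu.py: it swaps in a thread pool from the Python standard library, and n_workers only plays a role analogous to the x in cluster.scale(x). The function names, the search space, and the toy scoring function are all hypothetical placeholders for real model training.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_score(params):
    """Placeholder for 'train a model and return a validation score'."""
    lr, batch = params["learning_rate"], params["batch_size"]
    # Toy objective that peaks at learning_rate=0.01, batch_size=64.
    return -abs(lr - 0.01) - abs(batch - 64) / 1000.0


def grid(search_space):
    """Expand a dict of lists into a list of hyperparameter dicts."""
    keys = sorted(search_space)
    combos = product(*(search_space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def tune_in_parallel(search_space, score_fn, n_workers):
    """Score every combination using n_workers concurrent workers."""
    candidates = grid(search_space)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
}

best_params, best_score = tune_in_parallel(SEARCH_SPACE, toy_score, n_workers=4)
```

Because each combination is scored independently, adding workers shortens the wall-clock time of the search without changing which combination wins.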

Output for the study is written to slurm-2013.out, slurm-2018.out, or slurm-gpu.out, with error logs written to slurm-xxxx.err. Additionally, each process within each node produces training output in slurm-jobID.out, though this is unlikely to be useful.
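The workflow steps above can be sketched as a small driver script. This is only an illustration under several assumptions: the download script is named download_data.py, the .slurm files are submitted with SLURM's sbatch command, and everything is run from the repository root. The helper names workflow_commands and run_workflow are hypothetical and are not part of the repository.

```python
import subprocess
import sys

# Submit script for each partition, as named in the workflow description.
SUBMIT_SCRIPTS = {
    "2013": "submit_2013.slurm",
    "2018": "submit_2018.slurm",
    "gpu": "submit_gpu.slurm",
}


def workflow_commands(partition, python=sys.executable):
    """Return the commands for download, preprocessing, and job submission."""
    if partition not in SUBMIT_SCRIPTS:
        raise ValueError(f"unknown partition: {partition!r}")
    return [
        [python, "download_data.py"],  # assumed script name
        [python, "preprocess.py"],
        ["sbatch", SUBMIT_SCRIPTS[partition]],  # assumes sbatch is on PATH
    ]


def run_workflow(partition):
    """Run each step in order, stopping at the first failure."""
    for cmd in workflow_commands(partition):
        subprocess.run(cmd, check=True)
```

Calling run_workflow("gpu") on a login node would then download the data, preprocess it, and submit the GPU study, provided the assumptions above hold.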

Data augmentation

The RandomOverSampler class from imblearn.over_sampling was used to oversample the minority classes (non-tornadic data) fed into the deep neural network. This achieves an approximately 50/50 class split within the training data, which began as an approximately 95/5 split. The relevant script is dnn.py.
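To make the resampling step concrete, here is a minimal standard-library sketch of random oversampling: minority-class samples are duplicated at random until every class is as large as the majority class. It is a stand-in that shows the idea, not the imblearn implementation and not the code in dnn.py; the function name random_oversample and the toy 95/5 data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())

    paired = []
    for y, rows in by_class.items():
        # Draw extra rows with replacement to reach the majority-class size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        paired.extend((x, y) for x in rows + extra)

    rng.shuffle(paired)  # avoid long runs of a single class
    return [x for x, _ in paired], [y for _, y in paired]


# Toy data with an approximately 95/5 class split.
labels = [0] * 95 + [1] * 5
features = [[float(i)] for i in range(100)]

X_bal, y_bal = random_oversample(features, labels)
class_counts = Counter(y_bal)
```

After resampling, class_counts holds equal counts for both classes, which corresponds to the 50/50 split described above. Oversampling should be applied to the training data only, so that duplicated samples do not leak into validation or test sets.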

For the convolutional neural network, the input data are tensor images. We augment the minority classes (non-tornadic images) by duplicating, shuffling, and transforming the images through small-angle rotation while keeping the labels unchanged. This can be done in real time while training the model via ImageDataGenerator from Keras, or at the preprocessing stage using skimage.transform.rotate before the data are fed into the model. In the code, this is selected via two parameters, 'augmentation' and 'on_the_fly'. For example, augmentation==True with on_the_fly==False means that the augmented data are generated before training. The working script is cnn.py.
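The preprocessing-stage variant (augmentation==True, on_the_fly==False) can be sketched as follows. This is a pure-Python illustration, not the code in cnn.py: rotate_nearest is a simple nearest-neighbour stand-in for skimage.transform.rotate, and the names augment_minority, copies, and max_angle, as well as the 10-degree limit, are assumptions made for the example.

```python
import math
import random


def rotate_nearest(image, angle_deg):
    """Rotate a 2-D list-of-lists image about its centre (nearest neighbour, zero fill)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Map each output pixel back to its source location.
            dy, dx = y - cy, x - cx
            sx = int(round(cos_t * dx + sin_t * dy + cx))
            sy = int(round(-sin_t * dx + cos_t * dy + cy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def augment_minority(images, labels, minority_label, copies, max_angle=10.0, seed=0):
    """Add rotated copies of minority-class images; labels stay unchanged."""
    rng = random.Random(seed)
    paired = list(zip(images, labels))
    for img, lab in zip(images, labels):
        if lab != minority_label:
            continue
        for _ in range(copies):
            angle = rng.uniform(-max_angle, max_angle)  # small-angle rotation
            paired.append((rotate_nearest(img, angle), lab))
    rng.shuffle(paired)
    return [p[0] for p in paired], [p[1] for p in paired]


# Toy 3x3 "images": three majority samples and one minority sample.
base = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
images = [base, base, base, base]
labels = [0, 0, 0, 1]

aug_images, aug_labels = augment_minority(images, labels, minority_label=1, copies=2)
```

Generating the rotated copies once before training trades extra storage for a simpler training loop, whereas the on-the-fly route produces fresh transformations for each batch.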

Overall, we did not see a significant improvement in performance when using transformed data as opposed to resampled data only. However, the benefit of augmentation is highly dataset-specific and will vary from one dataset to another.

Folders and files

plotting
README.md
Technical_Report.pdf
cnn.py
dnn.py
download_data.py
preprocess.py
run_2013.py
run_2018.py
run_gpu.py
submit_2013.slurm
submit_2018.slurm
submit_gpu.slurm
