Parallel Data Processing on an HPC Cluster (On-Going Project)

This project implements a Python script that performs data processing on an HPC cluster. It reads a large dataset, applies data transformations, and writes the processed data back to disk. The goal is to showcase my ability to use HPC resources for efficient data processing.

Steps Implemented in this project:

Prepared the Dataset:

Read the data in chunks to handle large files efficiently.
Performed a simple data transformation, such as calculating income per capita.
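
The two steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names and the column names `total_income` and `population` are assumptions, and the input is assumed to be a CSV file with a header row. With pandas, `read_csv(..., chunksize=...)` is a common alternative for the chunked read.

```python
import csv
from itertools import islice


def read_in_chunks(path, chunk_size=10_000):
    """Yield lists of row dicts, at most chunk_size rows at a time."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def add_income_per_capita(rows):
    """Add an income_per_capita field to each row (hypothetical column names)."""
    for row in rows:
        population = float(row["population"])
        income = float(row["total_income"])
        # Leave the value empty when the population is zero.
        row["income_per_capita"] = income / population if population else None
    return rows


def process_file(in_path, out_path, chunk_size=10_000):
    """Stream in_path chunk by chunk and write the transformed rows to out_path."""
    writer = None
    with open(out_path, "w", newline="") as out_handle:
        for chunk in read_in_chunks(in_path, chunk_size):
            rows = add_income_per_capita(chunk)
            if writer is None:
                # Build the header from the first transformed row.
                writer = csv.DictWriter(out_handle, fieldnames=list(rows[0].keys()))
                writer.writeheader()
            writer.writerows(rows)
    return out_path
```

Because only one chunk is held in memory at a time, peak memory use depends on `chunk_size` rather than on the size of the file.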

Parallel Processing:

The script uses the 'multiprocessing' library to process multiple files in parallel.
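
One way to process several files in parallel with `multiprocessing` is a worker pool that maps a per-file function over the list of paths. The sketch below is an assumption about how this could look, not the project's script: `summarize_file`, `process_all`, and the column names `total_income` and `population` are hypothetical.

```python
import csv
import sys
from multiprocessing import Pool


def summarize_file(path):
    """Worker: return (path, rows with nonzero population, mean income per capita)."""
    total = 0.0
    count = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            population = float(row["population"])
            if population:
                total += float(row["total_income"]) / population
                count += 1
    mean = total / count if count else None
    return path, count, mean


def process_all(paths, workers=None):
    """Run summarize_file on every path using a pool of worker processes."""
    # workers=None lets Pool default to os.cpu_count().
    with Pool(processes=workers) as pool:
        return pool.map(summarize_file, paths)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this section on import.
    input_paths = sys.argv[1:]
    if input_paths:
        for path, count, mean in process_all(input_paths):
            print(path, count, mean)
```

Each file is handled by a single worker, so this layout pays off when there are at least as many input files as worker processes.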

Job Submission - The SLURM job script submits the data processing job to the HPC cluster, allocating resources as specified.
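
On the Python side, the script can match its worker count to what the job was allocated. SLURM exports `SLURM_CPUS_PER_TASK` to the job when `--cpus-per-task` is requested. The helper below is a hypothetical sketch (the name `worker_count` is not from the project) that reads this variable and falls back to a default when it is missing or malformed.

```python
import os


def worker_count(default=1):
    """Return the CPU count SLURM allocated to this task, or a fallback."""
    raw = os.environ.get("SLURM_CPUS_PER_TASK")
    if raw is None:
        # Not running under SLURM, or --cpus-per-task was not requested.
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

The result can be passed as the `processes` argument of `multiprocessing.Pool` so the pool does not oversubscribe the allocated cores.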

Future Implementation

Set Up the HPC Environment

Obtain access to the HPC cluster with the necessary permissions.
Load the required modules (e.g., Python, SLURM).

Test and Validate

Run jobs on the HPC cluster, monitor performance, and validate the results.
Apply any necessary optimizations to improve performance.
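
Since this step is still planned, the following is only a sketch of what timing and validation might look like. The helpers `timed` and `find_invalid_rows` are hypothetical, and the column names `total_income`, `population`, and `income_per_capita` are assumptions.

```python
import math
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def find_invalid_rows(rows, rel_tol=1e-9):
    """Return indices of rows whose income_per_capita disagrees with a recomputation."""
    bad = []
    for index, row in enumerate(rows):
        population = float(row["population"])
        if population == 0:
            continue  # nothing to check when the ratio is undefined
        expected = float(row["total_income"]) / population
        if not math.isclose(float(row["income_per_capita"]), expected, rel_tol=rel_tol):
            bad.append(index)
    return bad
```

Comparing the elapsed time of a run across different worker counts gives a first indication of whether adding processes actually helps.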

Repository files:

README.md
data_processing.ipynb
submit_jobs.sh
systemdesign.png

saikiranAnnam/Parallel-DP-HPC
