saikiranAnnam/Parallel-DP-HPC


Parallel Data Processing on an HPC Cluster (Ongoing Project)

Implementing a Python script to perform data processing on an HPC cluster. This project involves reading a large dataset, performing some data transformations, and writing the processed data back to disk. The goal is to showcase my ability to utilize HPC resources for efficient data processing.

Steps Implemented in this project:

  1. Prepared the Dataset:
  • Read the data in chunks to handle large files efficiently.
  • Performed a simple data transformation, such as calculating income per capita.
  2. Parallel Processing:
  • The script uses the 'multiprocessing' library to process multiple files in parallel.
  3. Job Submission - The SLURM job script submits the data processing job to the HPC cluster, allocating resources as specified.
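Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, assuming the dataset is a CSV read with pandas; the column names `income` and `population` and the file paths are placeholders, not the project's actual schema.

```python
import pandas as pd

def process_file(in_path, out_path, chunksize=100_000):
    """Read a large CSV in chunks, add an income-per-capita column,
    and append each processed chunk to the output file."""
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        # 'income' and 'population' are hypothetical column names.
        chunk["income_per_capita"] = chunk["income"] / chunk["population"]
        # Write the first chunk with a header, then append without one.
        chunk.to_csv(out_path, mode="w" if first else "a",
                     header=first, index=False)
        first = False
```

Because only one chunk is in memory at a time, this keeps memory usage bounded regardless of the input file's size.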
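The parallel-processing step can be sketched with a process pool. The per-file `transform` function below is a stand-in (in the real script it would run the chunked processing on each file); the file names are hypothetical.

```python
from multiprocessing import Pool

def transform(path):
    # Placeholder for per-file work; the real script would run the
    # chunked data processing on the file at 'path'.
    return f"processed:{path}"

def run_parallel(paths, workers=4):
    """Process multiple input files in parallel with a process pool."""
    with Pool(processes=workers) as pool:
        # map() distributes one file per task across the worker processes.
        return pool.map(transform, paths)

if __name__ == "__main__":
    print(run_parallel(["a.csv", "b.csv"]))
```

On a single cluster node, the worker count would typically be matched to the CPUs allocated by SLURM (e.g. via `--cpus-per-task`).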

Future Implementation

  1. Set up the HPC Environment
  • Access to an HPC cluster with the necessary permissions
  • Load the required modules (e.g., Python, SLURM)
  2. Test and Validate
  • Run jobs on the HPC cluster, monitor performance, and validate the results
  • Apply any necessary optimizations to improve performance
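A minimal SLURM batch script tying together the module loading above and the job-submission step might look like the sketch below. The job name, script filename (`process_data.py`), module name, and resource values are all assumptions; the real values depend on the cluster and the workload.

```shell
#!/bin/bash
#SBATCH --job-name=parallel-dp
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Load the required modules (module names are cluster-specific).
module load python

# Run the (hypothetically named) data processing script.
python process_data.py
```

The script would be submitted with `sbatch`, and SLURM allocates the requested CPUs and memory before running it on a compute node.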
