Large Dataset Import Microservice

This repo contains two minimum viable products (MVPs), each of which imports a .csv file of about 6 million records into PostgreSQL. The first method I created uses stateless sessions to stringify the data and loop through the data file; the second uses Spring Batch processing.
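As a rough illustration of the "loop through the data file" idea, here is a minimal plain-Java sketch that streams a CSV one line at a time so the full file never has to sit in memory. It is not the code in the "session" package: the class name `SequentialCsvLoop`, the `rowSink` callback (a stand-in for the per-row database insert), the header row, and the `id,amount` columns are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical sketch: sequential, row-by-row CSV processing.
class SequentialCsvLoop {

    // Reads the CSV one line at a time, skips the first line (assumed to be a
    // header), and passes each non-blank data row to rowSink as an array of fields.
    // Returns the number of data rows handled.
    static long forEachRow(Reader source, Consumer<String[]> rowSink) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line = reader.readLine(); // header row, assumed present
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // ignore blank lines
                }
                // Naive split: does not handle quoted fields containing commas.
                rowSink.accept(line.split(",", -1));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String csv = "id,amount\n1,9.50\n2,120.00\n"; // made-up sample data
        long rows = forEachRow(new StringReader(csv), row -> { /* insert row here */ });
        System.out.println(rows); // prints 2
    }
}
```

Handling one row at a time keeps memory use flat, but it also means one unit of work per record, which is one plausible reason a sequential loop is much slower than chunked, multithreaded processing.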

The batch processor with a ThreadPoolTaskExecutor runs in 2 minutes 33 seconds on average, while the stateless sessions parser/processor averages 40 minutes. Both methods will be improved in the future by adding a MultiResourcePartitioner to the Spring Batch configuration file and by splitting the large dataset into smaller files, so that multiple threads can work on different files at the same time.
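The idea behind chunking plus a thread pool can be sketched in plain Java. This is an analogy using `java.util.concurrent`, not the repository's Spring Batch configuration and not the Spring `ThreadPoolTaskExecutor` API: the class name `ChunkedImportSketch`, the chunk size, and the thread count are assumptions, and the "write" is a counter standing in for one batched database insert per chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: split rows into chunks and process the chunks in parallel.
class ChunkedImportSketch {

    // Splits rows into consecutive chunks of at most chunkSize rows each.
    static List<List<String>> chunk(List<String> rows, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            chunks.add(new ArrayList<>(rows.subList(start, end)));
        }
        return chunks;
    }

    // Submits each chunk to a fixed-size thread pool and returns the total
    // number of rows "written" once every chunk has finished.
    static long importAll(List<String> rows, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong written = new AtomicLong();
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (List<String> c : chunk(rows, chunkSize)) {
                // Stand-in for one batched database write per chunk.
                pending.add(CompletableFuture.runAsync(() -> written.addAndGet(c.size()), pool));
            }
            pending.forEach(CompletableFuture::join); // wait; rethrows worker failures
        } finally {
            pool.shutdown();
        }
        return written.get();
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row-" + i);
        }
        System.out.println(importAll(rows, 4, 3)); // prints 10
    }
}
```

Each chunk is an independent unit of work, so threads never contend over the same rows. Splitting the input into separate files, as planned with the MultiResourcePartitioner, applies the same principle one level up.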

This project:

  • Is a Spring Boot service that uses Spring Batch with Spring Data JPA and Hibernate.
  • Imports data from a CSV file (about 6 million records) to a PostgreSQL database.
  • Improves batch processing performance by using a ThreadPoolTaskExecutor for data chunking and multithreaded execution.
  • Supplies the data for a fraud detection model built with Python machine learning libraries.
  • Is intended to be launched through an API Gateway server (linked below).

    Instructions to Run

      1. Clone this repository to your local machine.
      2. Download the financial data from Kaggle. Add this data to "resource/data" and be sure to include the .csv file in your .gitignore!
      3. Within main/java/com there are two distinct packages, "batch" and "session", which are the batch processor and sessions processor respectively.
      4. Each package has its own main file that can be run.
      5. Once the application launches without issues, open Postman and send a request to the "/load" route on your configured port.
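If you would rather check the endpoint from code than from Postman, the last step can be sketched with Java 11's built-in HTTP client. The README only names the "/load" route, so the host (`localhost`), the default port (8080, Spring Boot's default), and the HTTP method (GET) are assumptions here; adjust them to match your configuration. The class name `LoadEndpointCheck` is hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: call the "/load" route of a locally running service.
class LoadEndpointCheck {

    // Builds a GET request for /load on the given local port.
    // GET is an assumption; the route may expect a different method.
    static HttpRequest buildLoadRequest(int port) {
        return HttpRequest.newBuilder(URI.create("http://localhost:" + port + "/load"))
                .GET()
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // Port defaults to 8080 unless one is passed as the first argument.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8080;
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(buildLoadRequest(port), HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
        } catch (IOException e) {
            // Reached when nothing is listening on the port, for example.
            System.out.println("Service not reachable: " + e);
        }
    }
}
```

Run it with your configured port as the first argument. A successful status code only means the route answered, so confirm the import itself by checking the row count in PostgreSQL.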

    Technologies Used

  • Java
  • Spring Boot for REST API
  • Spring Batch Processing (Open Source Data Processing Framework)
  • Maven
  • Factory Design Pattern within Batch Processor
  • Hibernate
  • Java Persistence API (JPA)
  • PostgreSQL
  • Gateway Server Communication. Gateway Server can be found here.