The purpose of this repo is to develop a data pipeline using Databricks. The primary goal is to create an efficient and functional pipeline that includes at least one data source and one data sink.
Dataset: Iris Dataset
The code performs the following ETL-Query operations:
Extract (E): Extracts a dataset in CSV format from a URL.
Transform (T): Uses Spark SQL to join two datasets and prepare them for analysis.
Load (L): Writes the transformed data to the destination data store using Delta Lake.
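The three steps above can be sketched in plain Python. This is a hypothetical, simplified illustration: it uses the standard-library `csv` module and inline sample data in place of the real URL download, Spark SQL join, and Delta Lake write, and all column and file names here are assumptions.

```python
# Simplified ETL sketch. The actual pipeline downloads a CSV from a URL,
# joins datasets with Spark SQL, and writes a Delta table; this stdlib
# version only mirrors the shape of those steps.
import csv
import io

# Extract: an inline sample stands in for the downloaded Iris CSV.
RAW_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
6.2,2.9,4.3,1.3,versicolor
"""

# Second (hypothetical) dataset to be joined with the first.
SPECIES_INFO = """species,label
setosa,Iris-setosa
versicolor,Iris-versicolor
"""

def extract(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(measurements, species_info):
    """Join the two datasets on the species column
    (the pipeline does this with a Spark SQL JOIN)."""
    lookup = {row["species"]: row for row in species_info}
    return [{**m, **lookup.get(m["species"], {})} for m in measurements]

def load(rows, path):
    """Write the joined rows out as CSV
    (the pipeline writes a Delta table instead)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

rows = transform(extract(RAW_CSV), extract(SPECIES_INFO))
load(rows, "iris_joined.csv")
```

In the Databricks version, `extract` would read the CSV into a Spark DataFrame, `transform` would register the DataFrames as temp views and run a SQL join, and `load` would call a Delta write on the result.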
To run the project, use the Makefile with the following commands:

- `make install` — install the required Python packages
- `make lint` — check code style
- `make test` — run the tests
- `make format` — format the code
Each of these commands should complete successfully.