The purpose of this repo is to develop a data pipeline using Databricks. The primary goal is to create an efficient and functional pipeline that includes at least one data source and one data sink.
Dataset: Iris Dataset
The code performs the following ETL-Query operations:
Extract (E): Extracts a dataset in CSV format from a URL.
Transform (T): Uses Spark SQL to join two datasets and prepare them for analysis.
Load (L): Writes the transformed data to the destination data store using Delta Lake.
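The three steps above can be sketched in plain Python. This is a hypothetical, simplified illustration: it uses the standard-library `csv` module and inline sample data in place of the real URL download, Spark SQL join, and Delta Lake write, and all column and file names here are assumptions.

```python
# Simplified ETL sketch. The actual pipeline downloads a CSV from a URL,
# joins datasets with Spark SQL, and writes a Delta table; this stdlib
# version only mirrors the shape of those steps.
import csv
import io

# Extract: an inline sample stands in for the downloaded Iris CSV.
RAW_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
6.2,2.9,4.3,1.3,versicolor
"""

# Second (hypothetical) dataset to be joined with the first.
SPECIES_INFO = """species,label
setosa,Iris-setosa
versicolor,Iris-versicolor
"""

def extract(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(measurements, species_info):
    """Join the two datasets on the species column
    (the pipeline does this with a Spark SQL JOIN)."""
    lookup = {row["species"]: row for row in species_info}
    return [{**m, **lookup.get(m["species"], {})} for m in measurements]

def load(rows, path):
    """Write the joined rows out as CSV
    (the pipeline writes a Delta table instead)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

rows = transform(extract(RAW_CSV), extract(SPECIES_INFO))
load(rows, "iris_joined.csv")
```

In the Databricks version, `extract` would read the CSV into a Spark DataFrame, `transform` would register the DataFrames as temp views and run a SQL join, and `load` would call a Delta write on the result.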
To run the project, use the Makefile with the following commands:

- `make install` — install the required Python packages
- `make lint` — check code style
- `make test` — run the tests
- `make format` — format the code
Each of these commands should complete successfully.