- Create a dataset in your own BigQuery project (ex: github)
- Copy the tables you need (sample_commits, languages) into the dataset you just created
- Download your service account key file (JSON) and place it in ./pyspark-docker-stack
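If you prefer to script this setup, here is a minimal sketch using the google-cloud-bigquery client. It assumes the source tables come from the public bigquery-public-data.github_repos dataset and reuses the khung-playground project id from the examples below; swap in your own.

```python
# Minimal sketch, assuming the public github_repos dataset as the source
# and GOOGLE_APPLICATION_CREDENTIALS pointing at your service account key.
from google.cloud import bigquery

client = bigquery.Client(project="khung-playground")  # your own project id

# Create the target dataset (ex: github)
client.create_dataset(bigquery.Dataset("khung-playground.github"), exists_ok=True)

# Copy the tables you need into the dataset you just created
for table in ("sample_commits", "languages"):
    client.copy_table(
        f"bigquery-public-data.github_repos.{table}",
        f"khung-playground.github.{table}",
    ).result()  # block until each copy job finishes
```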
Change your own parameters in the Dockerfile
- Your own Google service account key file (JSON):
  `ENV GOOGLE_API_KEY=khung-playground-cb7110dd8c95.json`
- The languages table you would like to analyze in your BigQuery:
  `ENV TABLE_LANGUAGE="khung-playground.github.languages"`
- The commits table you would like to analyze in your BigQuery:
  `ENV TABLE_COMMIT="khung-playground.github.commits"`
- The language you would like to analyze:
  `ENV LANGUAGE="Python"`
- Your Google Cloud project name:
  `ENV PARENT_PROJECT="khung-playground"`
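For context, here is a hedged sketch of how a Spark job could consume these ENV values through the spark-bigquery connector (the jar used with spark-submit below); it is illustrative, not the repository's actual run.py:

```python
# Hedged sketch: read one configured table via the spark-bigquery
# connector. The env var names match the Dockerfile settings above.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("github-analysis").getOrCreate()

languages = (
    spark.read.format("bigquery")
    .option("table", os.environ["TABLE_LANGUAGE"])            # source table
    .option("parentProject", os.environ["PARENT_PROJECT"])    # billing project
    .option("credentialsFile", os.environ["GOOGLE_API_KEY"])  # key file
    .load()
)
languages.printSchema()
```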
Build the Docker image locally (this step takes 10-15 minutes)
```
cd <project_root_path>/pyspark-docker-stack
docker build -t pyspark:khung .
```
Run the Docker image with some files mounted (the scripts, notebooks, and output directory)
```
cd <project_root_path>
docker run -it --rm -p 8888:8888 -v "${PWD}/notebooks":/home/jovyan/work \
  -v "${PWD}/spark_output":/home/jovyan/spark_output \
  -v "${PWD}/scripts":/home/jovyan/scripts \
  pyspark:khung
```
- I left the notebook file to showcase my logic for transforming the data in both PySpark & Spark SQL; you need to open the Jupyter URL in the browser after the container is running (a sketch of both styles follows below)
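To give a flavor of what the notebook covers, here is the same aggregation expressed both ways; the tiny in-memory table and the repo_name column are assumptions standing in for the real GitHub data, not a copy of the notebook:

```python
# Illustrative sketch: one aggregation in the DataFrame API and in
# Spark SQL; the in-memory rows stand in for the BigQuery tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("notebook-demo").getOrCreate()
langs = spark.createDataFrame(
    [("repo-a", "Python"), ("repo-a", "Go"), ("repo-b", "Python")],
    ["repo_name", "language"],
)

# PySpark DataFrame API
per_repo = langs.groupBy("repo_name").agg(F.count("*").alias("n_languages"))

# The equivalent in Spark SQL
langs.createOrReplaceTempView("languages")
per_repo_sql = spark.sql(
    "SELECT repo_name, COUNT(*) AS n_languages FROM languages GROUP BY repo_name"
)

per_repo.show()
per_repo_sql.show()
```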
- To run the whole process as a Spark script:
```
docker exec -it container_id /bin/bash
cd ~/scripts
spark-submit --jars "../spark-bigquery-latest_2.12.jar" run.py
```
- To run the unit tests on the functions (an example test is sketched below):
```
docker exec -it container_id /bin/bash
cd ~/scripts/tests
pytest test_utils
```
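For reference, a hypothetical example of what one of these tests could look like; the test name and the filtering logic are assumptions for illustration, not the project's actual test_utils contents:

```python
# Hypothetical test sketch (pytest + a local SparkSession); the real
# tests in scripts/tests/test_utils may differ.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_filter_language(spark):
    df = spark.createDataFrame(
        [("repo-a", "Python"), ("repo-b", "Go")], ["repo_name", "language"]
    )
    result = df.filter(df.language == "Python").collect()
    assert len(result) == 1
    assert result[0].repo_name == "repo-a"
```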
- Installing the great_expectations CLI is a must (https://docs.greatexpectations.io/docs/guides/setup/installation/local)
- You need to change the datasource base path to your own: in <project_root_path>/spark_output/great_expectations/great_expectations.yml, set
  `base_directory: <change the path to your own>`
- Run the Great Expectations checkpoint (a programmatic alternative is sketched after the commands):
```
cd <project_root_path>/spark_output/great_expectations
great_expectations checkpoint run github_dataset_checkpoint
```
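The same checkpoint can also be triggered from Python; this is a sketch assuming a CLI-era Great Expectations release where DataContext.run_checkpoint is available, not a tested path of this project:

```python
# Hedged sketch: run the checkpoint programmatically. Assumes a
# CLI-era Great Expectations release; the directory is the one
# configured above.
from great_expectations.data_context import DataContext

context = DataContext("<project_root_path>/spark_output/great_expectations")
result = context.run_checkpoint(checkpoint_name="github_dataset_checkpoint")
print("Checkpoint passed:", result.success)
```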