- Create a dataset in your own BigQuery project (ex: github)
- Copy the tables you need (sample_commits, languages) into the dataset you just created
- Download your service account key file (JSON) and place it in ./pyspark-docker-stack
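If you prefer to script this setup, here is a minimal sketch using the google-cloud-bigquery client. It assumes the source tables come from the public bigquery-public-data.github_repos dataset and reuses the khung-playground project id from the examples below; swap in your own.

```python
# Minimal sketch, assuming the public github_repos dataset as the source
# and GOOGLE_APPLICATION_CREDENTIALS pointing at your service account key.
from google.cloud import bigquery

client = bigquery.Client(project="khung-playground")  # your own project id

# Create the target dataset (ex: github)
client.create_dataset(bigquery.Dataset("khung-playground.github"), exists_ok=True)

# Copy the tables you need into the dataset you just created
for table in ("sample_commits", "languages"):
    client.copy_table(
        f"bigquery-public-data.github_repos.{table}",
        f"khung-playground.github.{table}",
    ).result()  # block until each copy job finishes
```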
Change your own parameters in the Dockerfile
- Your own Google service account key file (JSON):
  `ENV GOOGLE_API_KEY=khung-playground-cb7110dd8c95.json`
- The languages table you would like to analyze in your BigQuery:
  `ENV TABLE_LANGUAGE="khung-playground.github.languages"`
- The commits table you would like to analyze in your BigQuery:
  `ENV TABLE_COMMIT="khung-playground.github.commits"`
- The language you would like to analyze:
  `ENV LANGUAGE="Python"`
- Your Google Cloud project name:
  `ENV PARENT_PROJECT="khung-playground"`
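For context, here is a hedged sketch of how a Spark job could consume these ENV values through the spark-bigquery connector (the jar used with spark-submit below); it is illustrative, not the repository's actual run.py:

```python
# Hedged sketch: read one configured table via the spark-bigquery
# connector. The env var names match the Dockerfile settings above.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("github-analysis").getOrCreate()

languages = (
    spark.read.format("bigquery")
    .option("table", os.environ["TABLE_LANGUAGE"])            # source table
    .option("parentProject", os.environ["PARENT_PROJECT"])    # billing project
    .option("credentialsFile", os.environ["GOOGLE_API_KEY"])  # key file
    .load()
)
languages.printSchema()
```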
Build the Docker image locally (this step takes 10-15 minutes)
```
cd <project_root_path>/pyspark-docker-stack
docker build -t pyspark:khung .
```
Run the Docker image with some files mounted (the scripts, notebooks, and output directory)
```
cd <project_root_path>
docker run -it --rm -p 8888:8888 -v "${PWD}/notebooks":/home/jovyan/work \
  -v "${PWD}/spark_output":/home/jovyan/spark_output \
  -v "${PWD}/scripts":/home/jovyan/scripts \
  pyspark:khung
```
- I left the notebook file to showcase my logic for transforming the data in both PySpark & Spark SQL; you need to open the Jupyter URL in the browser after the container is running (a sketch of both styles follows below)
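To give a flavor of what the notebook covers, here is the same aggregation expressed both ways; the tiny in-memory table and the repo_name column are assumptions standing in for the real GitHub data, not a copy of the notebook:

```python
# Illustrative sketch: one aggregation in the DataFrame API and in
# Spark SQL; the in-memory rows stand in for the BigQuery tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("notebook-demo").getOrCreate()
langs = spark.createDataFrame(
    [("repo-a", "Python"), ("repo-a", "Go"), ("repo-b", "Python")],
    ["repo_name", "language"],
)

# PySpark DataFrame API
per_repo = langs.groupBy("repo_name").agg(F.count("*").alias("n_languages"))

# The equivalent in Spark SQL
langs.createOrReplaceTempView("languages")
per_repo_sql = spark.sql(
    "SELECT repo_name, COUNT(*) AS n_languages FROM languages GROUP BY repo_name"
)

per_repo.show()
per_repo_sql.show()
```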
- To run the whole process as a Spark script:
```
docker exec -it container_id /bin/bash
cd ~/scripts
spark-submit --jars "../spark-bigquery-latest_2.12.jar" run.py
```
- To run the unit tests on the functions (an example test is sketched below):
```
docker exec -it container_id /bin/bash
cd ~/scripts/tests
pytest test_utils
```
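For reference, a hypothetical example of what one of these tests could look like; the test name and the filtering logic are assumptions for illustration, not the project's actual test_utils contents:

```python
# Hypothetical test sketch (pytest + a local SparkSession); the real
# tests in scripts/tests/test_utils may differ.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_filter_language(spark):
    df = spark.createDataFrame(
        [("repo-a", "Python"), ("repo-b", "Go")], ["repo_name", "language"]
    )
    result = df.filter(df.language == "Python").collect()
    assert len(result) == 1
    assert result[0].repo_name == "repo-a"
```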
- Installing the great_expectations CLI is a must (https://docs.greatexpectations.io/docs/guides/setup/installation/local)
- You need to change the datasource base path to your own: in <project_root_path>/spark_output/great_expectations/great_expectations.yml, set
  `base_directory: <change the path to your own>`
- Run the Great Expectations checkpoint (a programmatic alternative is sketched after the commands):
```
cd <project_root_path>/spark_output/great_expectations
great_expectations checkpoint run github_dataset_checkpoint
```
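The same checkpoint can also be triggered from Python; this is a sketch assuming a CLI-era Great Expectations release where DataContext.run_checkpoint is available, not a tested path of this project:

```python
# Hedged sketch: run the checkpoint programmatically. Assumes a
# CLI-era Great Expectations release; the directory is the one
# configured above.
from great_expectations.data_context import DataContext

context = DataContext("<project_root_path>/spark_output/great_expectations")
result = context.run_checkpoint(checkpoint_name="github_dataset_checkpoint")
print("Checkpoint passed:", result.success)
```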