A serverless, event-driven approach to building a data quality pipeline with AWS Lambda and Great Expectations
Great Expectations (GE) is an open-source, Python-based data quality framework. It enables engineers to write tests, review reports, and assess the quality of data. It is a pluggable tool, meaning you can add new expectations and customize the final reports.
AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.
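For context, a Python Lambda function is a handler that receives the triggering event and a runtime context object. The sketch below shows that shape for a data quality check; it is not this project's handler. The `csv_body` event key, the `REQUIRED_COLUMNS` names, and the hand-written null check are assumptions made for illustration, and the check only stands in for a real Great Expectations validation.

```python
import csv
import io

# Hypothetical column names, used only for illustration.
REQUIRED_COLUMNS = ("id", "amount")


def validate_rows(rows):
    """Return a list of readable failures; an empty list means the batch passed."""
    failures = []
    for index, row in enumerate(rows):
        for column in REQUIRED_COLUMNS:
            # DictReader yields None for cells missing at the end of a row.
            if not (row.get(column) or "").strip():
                failures.append(f"row {index}: '{column}' is missing or empty")
    return failures


def handler(event, context):
    """Lambda entry point: receives the triggering event and a runtime context."""
    # Assumption: the event carries the CSV payload inline under "csv_body".
    reader = csv.DictReader(io.StringIO(event["csv_body"]))
    failures = validate_rows(list(reader))
    return {"success": not failures, "failures": failures}
```

In an event-driven setup the event would more likely point at an object in storage than carry the data inline; the inline payload just keeps the sketch self-contained.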
Unfortunately, AWS Lambda imposes certain quotas and limits on the size of the deployment package:
- 50 MB (zipped, for direct upload)
- 250 MB (unzipped). This quota applies to all the files you upload, including layers and custom runtimes.
As a result, deploying GE on Lambda takes some ingenuity. However, we can solve this problem by packaging and deploying the Lambda function as a container image, which can be up to 10 GB in size.
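To see how far a set of dependencies exceeds the zip-based quotas, it can help to measure the installed package directory. The following is a minimal sketch, assuming the quotas are counted in units of 1024 * 1024 bytes; `directory_size` and `quota_report` are hypothetical helper names, not part of this repository.

```python
import os

# Zip-based deployment quotas, assuming 1 MB = 1024 * 1024 bytes.
ZIPPED_LIMIT_BYTES = 50 * 1024 * 1024
UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def quota_report(unzipped_bytes, zipped_bytes):
    """Report whether each size fits the corresponding zip-based quota."""
    return {
        "unzipped_ok": unzipped_bytes <= UNZIPPED_LIMIT_BYTES,
        "zipped_ok": zipped_bytes <= ZIPPED_LIMIT_BYTES,
    }
```

For example, `directory_size` could be pointed at the virtualenv's `site-packages` folder to get the unzipped figure before deciding whether a container image is needed.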
Install packages in the virtualenv:
pipenv install --dev
Make sure Docker is installed:
docker --version
Run the following script to build the Docker image, run the container, and test the Lambda function locally (no AWS account needed):
script/docker.sh
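Once a Lambda container is running locally, it can also be invoked by hand. The sketch below assumes the image includes the AWS Lambda Runtime Interface Emulator and that the container's port is published on `localhost:9000`; neither is confirmed by this README, so check `script/docker.sh` for the actual setup. The function names are hypothetical.

```python
import json
import urllib.request

# Invocation path served by the AWS Lambda Runtime Interface Emulator.
INVOCATION_PATH = "/2015-03-31/functions/function/invocations"


def build_invocation_request(event, host="localhost", port=9000):
    """Build the POST request that sends an event to a local Lambda container."""
    # Assumption: host and port match how the container was started.
    url = f"http://{host}:{port}{INVOCATION_PATH}"
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(event, host="localhost", port=9000, timeout=30):
    """Send the event and return the function's JSON response."""
    request = build_invocation_request(event, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `invoke_local({})` with the container running should return whatever the handler returns for an empty event.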
Run the following script to export the HTML documentation generated by Great Expectations to a local folder. If no local path is specified, C:/great_expectations_data_docs/ is used as the default:
script/export_data_docs.sh example/local/path
Go to the exported folder and open the index.html file.
I personally recommend Serverless to deploy Lambda functions.
Alternatively, the infra folder contains a Terraform example that creates the AWS infrastructure.
I've also added an example script that tags and pushes the Docker image to Amazon ECR and automatically updates the Lambda function code with the newly pushed image:
script/naive_deploy.sh
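The last step of such a deploy, pointing the function at the newly pushed image, can also be sketched in Python with boto3. This is not the content of `script/naive_deploy.sh`, only an illustration of the same idea; the helper names and the example account, region, and repository values are hypothetical, and the sketch assumes the image has already been pushed to ECR.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Build the URI of an image stored in a private Amazon ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def update_lambda_image(function_name, image_uri, client=None):
    """Point an existing container-based Lambda function at a new image."""
    if client is None:
        # Imported lazily so the URI helper works without boto3 installed.
        import boto3

        client = boto3.client("lambda")
    return client.update_function_code(
        FunctionName=function_name,
        ImageUri=image_uri,
    )
```

A call might look like `update_lambda_image("my-function", ecr_image_uri("123456789012", "eu-west-1", "my-repo", "latest"))`, with every value replaced by your own.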
- Python | Programming language
- Pipenv | Dependency management
- Pre-Commit | Managing and maintaining hooks
- GitHub Actions | CI/CD
- Terraform | Infrastructure as Code
- Docker | Containerization and deployment
- Pandas | Data analysis and manipulation
- python-lambda-local | Run AWS Lambda functions on a local machine
- great_expectations | Data Quality framework
- AWS Lambda | Serverless compute service
- Unix shell | Command-line interpreter
- Made with ❤️ by @vittoriopolverino