This repo applies an infrastructure as code approach to deploying Snowflake resources using Terraform. It relies on the open source provider by Snowflake Labs and can create, alter and destroy users, roles and resources in Snowflake.
Making use of Snowflake's default and recommended roles, this project creates the majority of infrastructure with the SYSADMIN
role, users and roles are administered by the SECURITYADMIN
role.
In order to contribute or run this project, you will need:
- terraform v1.3.4
- terraform-provider-snowflake v0.50.0
- Python v3.10
- SnowSQL v1.2.23
- AWS Command Line Interface v2.8.12
- pre-commit
The naming convention for many resources, including the remote state bucket and lock table, incorporate the project name, for example your-project-remote-state
. To set this variable, run the following Python file and when prompted, enter your chosen project name. Avoid using special or escape characters.
python3 project_setup.py
Note: Please remember S3 is a global namespace and so choose something unique, perhaps including a hash. It is down to your discretion if you wish to obfuscate and alias a project should you wish to make your buckets harder to find.
As part of account setup, the grant for create integration
must be assigned to SYSADMIN for the modules to function correctly. By default, only the ACCOUNTADMIN can create storage integrations
and we wanted to abstract this high powered role away from automated deployment users and so grant integrations to SYSADMIN to handle all infra resources. Similarly, for SYSADMIN to execute tasks we must also grant that privilege.
To achieve this, run:
USE ROLE ACCOUNTADMIN;
GRANT CREATE INTEGRATION ON ACCOUNT to ROLE SYSADMIN;
GRANT EXECUTE TASK ON ACCOUNT to ROLE SYSADMIN;
ACCOUNTADMIN should be a break-glass solution and locked away.
The project uses the SnowSQL CLI for resource creation when the Terraform provider lacks the functionality; this includes table creation particularly when deploying Snowpipes. To download SnowSQL cli follow these instructions or if you have homebrew use:
brew cask install snowflake-snowsql
This will also create a config file in ~/.snowsql/config
used for authentication and setting default values such as role
and warehouse
.
Edit the file and create a Snowflake profile (connections is the default profile), e.g:
[connections]
accountname = myAccout
region = eu-west-1
username = YourUserName
password = YourPass
If you are using multiple Snowflake accounts or users you can create additional profiles in this file using the same structure:
[connections.iw]
accountname = infinityworkspartner
region = eu-west-1
username = YourUserName
password = YourPassword
To deploy Snowflake using Terraform, this project depends on user authentication by environment variables; to simplify this process we load the SnowSQL config credentials using a python script; the two env vars outputted are SNOWFLAKE_USER
and SNOWFLAKE_PASSWORD
.
The python script accepts two optional arguments, profile
and application
; these determine the Snowflake profile you wish to use and the env vars to export. If the flags are not called, they will default to using your connections
profile and output both terraform and SnowSQL env vars. The CLI arguments are case sensitive. The accepted values for application
are terraform
, snowsql
or all
, for example:
eval $(python3 load_snowflake_credentials.py --profile connections --application all)
NOTE: This must be run in an eval $( )
statement as the python script prints your vars to the terminal and eval
evaluates the export statement, loading them into your environment. If you do not use the eval
statement your creds will be printed in plain text to your terminal and not loaded into your environment variables.
Remember to execute this eval
statement for each terminal window you are working in.
This project uses pre-commit hooks as a means to keep code readable and a measure to prevent broken code from being rolled out.
Download pre-commit from their website or use:
brew install pre-commit
Before committing code for the first time, make sure to initialise the hooks:
pre-commit install
From now on, pre-commit checks will be run any time you make a commit on the project.
You may also optionally run these checks against all extant files in the project:
pre-commit run --all-files
Important note: if your code fails pre-commit checks, your commit will be cancelled. You'll need to fix the issues and commit again.
Code can be merged into the current development branch main by opening a pull request (PR). For public users, these pull requests will need to be created from your own fork of the repository.
An IW repository maintainer will review your PR. They may suggest code revision for style or clarity, or request that you add unit or integration test(s).
Environments are separated using Terraform workspaces. Each workspace has its own remote state file and the module ./terraform-config/
dynamically passes the correct AWS profile and Snowflake account details based upon the workspace selected.
Each environment's config can be added to terraform-config/providers.tf
and the workspace name must correspond exactly to the environment's map/key within this file.
locals {
environment = {
dev = {
name = "dev"
group_name = "nonprod"
aws_account = {
profile = "aws-dev"
id = "xxxxxxxxxxxx"
region = "eu-west-2"
}
snowflake_account = {
id = "infinityworkspartner"
region = "eu-west-1"
}
}
staging = {
name = "staging"
group_name = "nonprod"
aws_account = {
profile = "aws-staging"
id = "yyyyyyyyyyyy"
region = "eu-west-2"
}
snowflake_account = {
id = "infinityworkspartnerstaging"
region = "eu-west-1"
}
}
providers = {
snowflake_version = "0.50.0""
aws_version = "~> 4.39.0"
}
}
}
The AWS profile must also match the credentials profiles you wish to use.
After initialising Terraform you can create and/or select a workspace. When a workspace is created, it is automatically adopted. To keep things simple, edit the following command with your env name and run the following each time:
terraform init && terraform workspace new <env> || terraform workspace select <env>
terraform apply
If the workspace already exists, the follow-up command will select the workspace.
To begin we must create a remote state bucket and lock table within an AWS account; this is referenced to keep track of all changes made by Terraform and ensures stateful deployments.
The remote state bucket and lock table's name are comprised of your project name, this can be updated in ./aws/state_resources/s3/environment/dev/environment.tfvars
.
To run terraform locally you must have a valid aws session and set your AWS_PROFILE
.
In order to deploy the state bucket, navigate to ./aws/state_resources/s3/
and run:
terraform init && terraform workspace new dev || terraform workspace select dev
terraform apply
This will create a remote state bucket with the name <your-project>-remote-state-<env>
. Next for the lock table, again changing the environment.tfvars
variables:
cd ../dynamoDB
terraform init && terraform workspace new dev || terraform workspace select dev
terraform apply
rm -r .terraform && rm tfplan
Check with the aws CLI that <your-project>-lock-table
now exists:
aws dynamodb list-tables
Now we have our state infrastructure we can begin Terraforming Snowflake and its AWS counterparts.
Some resources are dependent on others already existing, for example schemas belong to databases, and stages belong to a schema within a database; thus we must deploy resources in a specific order:
- RBAC
- Warehouses
- Integrations
- Databases
- Schemas
- Stage
- Pipes
- Tasks
Each directory containing a resource has an associated main.tf
file which declares the provider; this provider includes the Snowflake account name, region and role which is adopted to create, modify and destroy infra. You must ensure you can adopt the appropriate roles required by Snowflake. With an infra as code approach, this project is designed to be deployed across multiple environments using separate accounts; the separation is controlled through configuration with environment.tfvars
files (though it is possible to deploy n
environments in a single account by appending resource names with the env).
Modules are a key element of what makes this project successful, particularly for automating data ingestion from cloud storage. Modules simplify the work flow and reduce maintenance of groups of resources that depend on each other. For example, the pipes-module will dynamically create the supporting infrastructure of an landing table, account integration, AWS iam role, external stage and the pipe itself, as a one-to-one relationship. Each data flow is decoupled and independent from one another.
Beware, when creating modules they can cause deployment and destruction conflicts; for example, independently granting permissions on roles to a resource will conflict with other module deployments as the Terraform provider looks for global permissions and finds the difference; having multiple grant objects will conflict with each other as the sql output is no longer a single source of truth when comparing one Terraform block to another. To avoid this, all resource grant statements much be in a single block.
To create users navigate to ./Snowflake/rbac/users/
and create a new .tf
file for each group of users, a recommended approach is to create one file per squad or business area. Within this file use the provider resource snowflake_user
and define the user's preferences:
resource "snowflake_user" "user_JohnSnow" {
name = "JohnSnow"
login_name = "JohnSnow"
default_role = "PUBLIC"
password = "replace"
must_change_password = "true"
comment = "Data consumer in the wall squad"
}
To grant the user a role we will reference the user's name from the above resource using Terraform outputs; create an output in outputs.tf
with the following structure:
output "JohnSnow_name" {
value = snowflake_role.user_JohnSnow.name
}
Including a default role, other than PUBLIC
, will not automatically grant said role; you must grant the role separately, otherwise a user will not be able to login.
The default password is currently "replace", this must be changed immediately after creation, by the user in the Snowflake console.
Note: Any plain text run through Terraform will be stored in the remote state and so a user's chosen password should not be included in Terraform and a generic one is provided; it is recommended to include the flag to force a user to change their password on first login.
Once this block has been written, run:
terraform init && terraform workspace new dev || terraform workspace select dev
terraform apply
This pattern of initialisation, planning and deployment is repeated across each directory to create different resources.
Roles are created in a similar way to users in ./Snowflake/rbac/roles/
. Each role comes with a grant block which will grant the role created to users or other roles. It is recommended that one role is created per file to maintain separation and ease of organisation.
To persist user changes, for example name or default roles, it is recommended to reference the user's name
value from the remote state of the user file. This is achieved by the data
block, terraform_remote_state
.
WARNING: As previously mentioned, grant statements will misbehave if instantiated multiple times to the same object. The provider appears to run a query to find where the grant statement has already been applied and finds the difference to the grant block you are running; this includes removing any granted permissions that were created in the console and not through code using this provider and remote state. It is recommended not to use the resource which grants AccountAdmin, when revoking or destroying it is possible to remove your own role access and lock yourself out.
The above pattern of declaring a resource, referencing the remote state for any dependencies and granting permissions to it is used in other resource creation, such as databases. At a minimum a role must have usage
granted to access a database.
It is recommended that databases have their own dedicated .tf
file which includes it's grant statements. Schemas should be separated by directory for each database created:
├── schemas
├── analytics_db
│ ├── marketing
│ ├── sales
│ └── customer
└── models_db
├── schema1
├── schema2
└── schema3
Creating roles, databases and warehouses are easy, creating Snowpipes requires a little finesse. To do the heavy lifting we use modules to abstract all the difficulty and configuration. The Snowpipe module lives in snowflake/infra/modules/snowpipe-module/
and new pipes are declared in snowflake/infra/pipes/
. This module creates the landing table, account integration, external stage, iam role, pipe and s3 bucket notification which are required to automate data ingestion. Pipes are decoupled from one another and have their own remote state file, which means you must also update the path and values of snowflake/infra/pipes/my-new-pipe/environment/backend-config.tfvars
to reflect this. This ensures the pipes and their dependencies are isolated from other pipes.
The pipe module is designed to consume data from either an entire bucket or a key within a bucket. The name of the key of the bucket is used in naming all the supporting resources and the pipe itself, e.g. for the finance
key, the pipe would be called FINANCE_DATA_PIPE
. If a bucket does not have a key and you wish to consume from the top level of the bucket, set has_key = false
though in this case it will still take the name FINANCE_DATA_PIPE
. The module variables determines the granularity a pipe can consume from and it's naming.
The default file format to be consumed is JSON; this can be changed to ingest tabular objects such as CSV files using the following variables:
- file_format
- record_delimiter
- field_delimiter
The following example shows how to customise your pipe requirements, including the file suffix; the suffix can include characters of the filename but is mainly used to declare the filetype to be consumed.
module "finance_snowpipe" {
source = "../../modules/snowpipe-module/"
s3_bucket_name = "marketing-analytics"
s3_key = "finance"
has_key = false
file_format = "CSV"
field_delimiter = "\t"
record_delimiter = "\n"
filter_suffix = ".csv"
}
For the case of structured data, such as a .csv
or .tsv.gz
, you must provide a table definition of the columns and datatype. The structure is:
field,type
col1,INTEGER
col2,VARCHAR
col3,FLOAT
and is located in the working directory where the pipe is to be created within the file table-definition.csv
. Terraform will read this and generate the DDL statement for the landing table. In the case of semi-structures data, like JSON, a single column will be created with the Snowflake variant
datatype.
All landing tables have a timestamp column to note when the record was ingested by Snowflake.
NOTE: During pipe creation, instantiating an iam role is not instantaneous, to get around the pipe looking for a role that is still being created, we use a local-exec statement and shell command to wait 10 seconds.
Once deployed, check a pipe's status using:
SELECT system$pipe_status('database_name.schema_name.pipe_name');
To make use of the modular pattens, the resources created by them are separated into their own directory and remote state file to isolate them from one another. This allows actions like terraform destroy
to remove the group of resources that support a pipe or stage without affecting other pipes. This one-to-one relationship increases reliability and depending on s3 key access, enhances security to a more granular level.
The recommended way to deploy a pipe is with the module; that said, it is useful to understand what is going on under-the-hood. The following summaries the steps to create the individual supporting resources:
- Create a database and schema where an external stage can live.
- If it does not already exist, create the landing table where data will be ingested into.
- Next we require an account integration to connect to an external AWS account.
- Once this has been established we can create the external stage, which is dependent on the account integration and s3 bucket.
- Now Snowflake has set its external ID we can create the IAM role which will allow Snowflake to read from the S3 bucket.
- Finally, create the Snowpipe. The pipe creates it's own AWS SQS service behind the scenes which Snowflake manages, this is configured to receive event notifications from your s3 bucket when files land and automatically copy them to the landing table.
Note: If you modify or recreate the account integration, a new snowflake_external_id
will be generated; you will need to destroy the existing resources that depend on the integration and redeploy them to re-establish the link.
Snowflake tasks can be used to schedule SQL. Within the modules directory is a task designed to export a table or view's contents to an external stage in S3. To separate warehouse modelling and table logic, the task export module is designed to unload columns from a single source, not to be joined with other tables; this allows the task to only be responsible for the export mechanism.
Below is an example to unload data from the table ANALYTICS.PUBLIC.TRANSACTIONS
to a stage in S3 at 03:00 each day:
module "export_table" {
source = "../../modules/tasks-export-module"
name = "EXPORT_TABLE"
external_stage = "S3_STAGE_NAME"
export_filename = "data.csv"
database = "ANALYTICS"
schema = "PUBLIC"
table = "TRANSACTIONS"
columns = "*"
schedule = "USING CRON 0 3 * * * UTC"
warehouse = "DEMO_WH"
comment = "Daily export of transaction data"
enabled = true
}
Within the directory aws/iam/ci-deployment-user/
you will find the Terraform resources required to create a CI deployment user and role to adopt. The user by default does not have any powers but the role allows the user to assume the deployment role and adopt the permissions required to work with the remote state and data lake buckets, DynamoDB lock table and IAM roles and policies. The permissions have largely been restricted to what is needed to create the accompanying resources Snowflake needs for integrations (IAM) and pipes (event notifications).
This deployment user is designed to be used after the initial setup of S3 buckets and DynamoDB table. This user is to be created during account setup.
There is also a Snowflake deployment user <your-project>_CI_DEPLOYMENT
, in snowflake/rbac/users/programmatic.tf
which will be required to deploy Snowflake resources. The project name will be set after running the project_setup.py
script as seen above.
To test your CI user and adopt the role locally, you can use the adopt_ci_user_and_role.sh
file. This requires the user's credentials to be in your ~/.aws/credentials
and have exported the profile to your environment variables.
To keep things simple as a contained project, GitHub actions has been chosen as the CI/CD service to automate deployment of Snowflake resources, and their AWS counterparts. You'll find the test and deployment processes in .github/workflows/
.
If you choose to use GitHub actions and the CI pipelines included here, you'll need to create the following GitHub secrets for your repo:
- SNOWFLAKE_ACCOUNT
- SNOWFLAKE_REGION
- SNOWFLAKE_USER
- SNOWFLAKE_PASSWORD
You will need to grant the CI user the SYSADMIN and SECURITYADMIN roles.
GRANT ROLE SYSADMIN TO USER SNOW_CANNON_CI_DEPLOYMENT;
GRANT ROLE SECURITYADMIN TO USER SNOW_CANNON_CI_DEPLOYMENT;
Authentication to AWS is handled by a GitHub openID provider integration found in ./aws/github_openID_provider
. This allows GitHub to assume your deployment role and pass short lived credentials which expire after the session has completed.
To understand when the event driven data movement via snowpipes fails, this project includes an observability module which creates all the necessary resources to publish pipe failures to Cloudwatch Logs and create a Metric, from here you can create alarms and down stream notifications.