3 Scheduling the StorageLoader

Yali Sassoon edited this page Aug 28, 2013 · 4 revisions

  1. Overview
  2. Scheduling StorageLoader only
  3. Scheduling EmrEtlRunner and StorageLoader
  4. Alternatives to cron
  5. Next steps

## 1. Overview

Once you have the load process working smoothly, you can schedule a daily (or more frequent) task to automate the storage process.

The standard way of scheduling the load process is as a daily cronjob. We provide two alternative shell scripts for you to use in your scheduling:

  1. snowplow-storage-loader.sh - this script runs the StorageLoader only
  2. snowplow-runner-and-loader.sh - this script runs the EmrEtlRunner immediately followed by the StorageLoader

We recommend the second script if you want to run the StorageLoader immediately after EmrEtlRunner has completed its work.

To consider each scheduling option in turn:

## 2. Scheduling StorageLoader only

The shell script /4-storage/storage-loader/bin/snowplow-storage-loader.sh runs the StorageLoader app only.

You need to edit this script and update the three variables at the top:

rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
LOADER_PATH=/path/to/snowplow/4-storage/snowplow-storage-loader
LOADER_CONFIG=/path/to/your-loader-config.yml

So for example if you installed RVM as the admin user, then you would set:

rvm_path=/home/admin/.rvm
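Before scheduling anything, it can help to confirm that the values you entered point at real locations. The following is a minimal sketch of such a pre-flight check; the `check_paths` helper is hypothetical and is not part of the Snowplow scripts, and it assumes the three variables have already been set in your shell.

```shell
# Hypothetical pre-flight helper (not shipped with Snowplow).
# Reports every argument that does not exist on disk and
# returns non-zero if at least one path is missing.
check_paths() {
  local missing=0
  local p
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage with the variables described above:
# check_paths "$rvm_path" "$LOADER_PATH" "$LOADER_CONFIG" || exit 1
```

Running a check like this by hand once is usually enough; a typo in a path would otherwise only show up when the cronjob first runs.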

Now, assuming you're using the excellent cronic as a wrapper for your cronjobs, and that both cronic and Bundler are on your path, you can configure your cronjob like so:

0 6   * * *   root    cronic /path/to/snowplow/4-storage/bin/snowplow-storage-loader.sh

This will run the StorageLoader daily at 6am, emailing any failures to you via cronic. Please make sure that your Snowplow events have been safely generated and stored in your In Bucket prior to 6am.
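Cronic itself does not send mail: it suppresses a job's output unless the job fails, and cron then mails whatever output remains to the address in `MAILTO` (or to the crontab's owner if `MAILTO` is unset). As a rough sketch, the helper below prints a system-crontab fragment that includes a `MAILTO` line so you can review it before installing it. The `print_cron_fragment` name and the example address are illustrative assumptions, and the script path is the placeholder used in this guide.

```shell
# Hypothetical helper: prints a system-crontab fragment (the format
# with a user field) to stdout. It does not install anything.
print_cron_fragment() {
  mailto="$1"   # address cron should mail job output to
  hour="$2"     # hour of day (0-23) at which to start the job
  script="$3"   # absolute path to the scheduling script
  printf 'MAILTO=%s\n' "$mailto"
  printf '0 %s   * * *   root    cronic %s\n' "$hour" "$script"
}

# Example: a 6am StorageLoader run, with failures mailed to a placeholder address.
print_cron_fragment "you@example.com" 6 \
  "/path/to/snowplow/4-storage/bin/snowplow-storage-loader.sh"
```

Printing the fragment rather than writing it straight to the crontab keeps the sketch free of side effects; how you install the result depends on your system.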

## 3. Scheduling EmrEtlRunner and StorageLoader

The shell script /4-storage/storage-loader/bin/snowplow-runner-and-loader.sh runs EmrEtlRunner, immediately followed by StorageLoader - i.e. it chains them together. At Snowplow, this is the scheduling option we use.

If you use this script, you can delete any separate cronjob for the EmrEtlRunner alone.

You need to edit this script and update the five variables at the top:

rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
RUNNER_PATH=/path/to/snowplow/3-enrich/snowplow-emr-etl-runner
LOADER_PATH=/path/to/snowplow/4-storage/snowplow-storage-loader
RUNNER_CONFIG=/path/to/your-runner-config.yml
LOADER_CONFIG=/path/to/your-loader-config.yml

So for example if you installed RVM as the admin user, then you would set:

rvm_path=/home/admin/.rvm

Using cronic as a wrapper, and with cronic and Bundler on your path, configure your cronjob like so:

0 4   * * *   root    cronic /path/to/snowplow/4-storage/bin/snowplow-runner-and-loader.sh

This will run the ETL job and then the database load daily at 4am, emailing any failures to you via cronic.
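The "ETL job and then the database load" ordering can be illustrated with a small sketch. It assumes you want the load to be skipped whenever the ETL stage exits with a non-zero status; the `run_then` helper is hypothetical, and the contents of the real script may differ.

```shell
# Hypothetical illustration of chaining two stages.
# $1: command line for the first stage (e.g. the ETL run)
# $2: command line for the second stage (e.g. the load)
# The second stage runs only if the first exits with status 0;
# otherwise the first stage's exit status is returned.
run_then() {
  if sh -c "$1"; then
    sh -c "$2"
  else
    rc=$?
    echo "first stage exited with status $rc; second stage not run" >&2
    return "$rc"
  fi
}

# Example usage with placeholder commands:
# run_then "/path/to/run-etl.sh" "/path/to/run-load.sh"
```

Propagating the first stage's exit status matters when the chain runs under cronic, because a non-zero exit is what causes the failure to be reported.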

## 4. Alternatives to cron

In place of cron, you could schedule StorageLoader using a continuous integration server such as Jenkins, or potentially use the Windows Task Scheduler.

These options are explored in a little more detail in the [Scheduling EmrEtlRunner](3-Scheduling-EmrEtlRunner) guide.

## 5. Next steps

You have set up the StorageLoader! Now you are ready to do some analysis.
