Getting started with EMR


An overview of Amazon Elastic MapReduce

Elastic MapReduce (EMR) is an Amazon service that makes it relatively straightforward to use Hadoop and Hadoop-powered tools (e.g. Hive, Pig, Mahout, Cascading) to process data stored in S3. With EMR you can spin up clusters of machines of varying sizes to process large volumes of data, and spin them down once querying is complete.

Because Snowplow data is stored on S3 and Snowplow data volumes are often very large, processing them using Hadoop on EMR is an attractive option for a wide range of analyses. This is especially true for people who wish to run machine learning algorithms on the Snowplow data set using Mahout, e.g. to develop recommendation engines or segment an audience by behaviour.

More details on EMR can be found on the Amazon website.

Getting started with EMR and Hive

In this guide we cover the steps necessary to get up and running with EMR and to query your data using Hive, the most straightforward of the Hadoop-powered tools listed above. This guide has two parts:

1. Setting up the command line tools. You can use these to fire up analysis clusters and launch jobs.
2. An introduction to using Hive with the command line tools: an example of using those tools to query your Snowplow data with Hive.

We plan to add a guide to using Mahout to segment users in Snowplow by behaviour in the near future.

Ready to get started?

Then set up the command line tools!


