-
Notifications
You must be signed in to change notification settings - Fork 0
Setting up R to perform more sophisticated analysis on your Snowplow data
HOME > SNOWPLOW SETUP GUIDE > [Step 5: Get started analysing Snowplow data](Getting started analyzing Snowplow data) > Setting up R to perform more sophisticated analysis on your data
- What is R, and why use it to analyze / visualize Snowplow data?
- Download and get started with R
- Connecting R to Snowplow data in Redshift
- Getting started analysing Snowplow data in R
- Next steps
R is free, open source software for performing statistical and graphical analysis.
R is in many respects a very strange analytics environment for the newbie. (It is not really a 'program' or 'service' as such.) It is a programming language, and as a result, can be daunting to use for business analysts who do not have development experience.
However, R is not a straightforward tool for developers to use either: many features of the language are unique to R, even amongst other interpreted languages (like Python) and functional languages (like Scala or Haskell).
In spite of its 'unusualness', there is one very good reasons to use R to analyze Snowplow (and other data sets): there is a huge amount you can do with R that is very difficult with traditional analytics programmes. To give just some examples:
- Advanced visualizations. R supports graphing data in many more ways, much more flexibly, than standard analytics packages like Excel or BI tools like Tableau
- Statistical analysis. R supports a staggering array of statistical analyzes: making it easy to run standard statistical tests on data to see if e.g. two groups of visitors behave in a way that is significantly different
- Algorithmic analysis. R libraries include a wide range of algorithmic analytical techniques including market basket analyzes, principle component analysis, to give just two that are relevant with web analytics data.
Over time, we plan to build out the [Analytics Cookbook] cookbook to include tutorials explaining how to perform the above analyzes in R using Snowplow data.
Back to top
## 2. Download and get started with RTo download and install R on Windows or Mac, visit the [download page] download-r, choose a nearby mirror and select the download appropriate for your platform. Then run the installer once the download is completed. You can then launch R from your start/applications menu.
To install R on Ubuntu, add the following line to your /etc/apt/sources.list
file:
deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/
but swap out <my.favorite.cran.mirror>
for your nearest mirror e.g. deb http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/
Then simply
sudo apt-get update
sudo apt-get install r-base
You can then launch R by typing
R
at the prompt.
Back to top
## 3. Connecting R to Snowplow data in RedshiftYou can pull Snowplow data stored in Redshift directly into R using the RPostgreSQL
package.
First, install the package. (This only needs to be done once for your R installation.) Run at the R command prompt:
install.packages("RPostgreSQL")
To use the library (e.g. at the beginning of an R session), enter the following at the command prompt:
library("RPostgreSQL")
Then:
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host="<<ENTER HOST DETAILS HERE>>", port="<<ENTER PORT DETAILS HERE>>",dbname="<<ENTER DB NAME HERE>>", user="<<ENTER USERNAME HERE>>", password="<<ENTER PASSWORD HERE>>")
Note: you can access the relevant host, port, dbname and username fields by logging into the AWS console [console.aws.amazon.com] aws-console, selecting Redshift and then clicking on the cluster you use for Snowplow:
You can now fetch Snowplow data directly from Redshift into a dataframe in R, by executing the dbGetQuery
statement to run a SQL statement against that data and return the data into R as a data frame. For example, the following query returns a list of visits to an ecommerce site, classified by whether the stage in the purchase funnel that each visit got to:
SELECT
a.domain_userid,
a.first_touch,
CASE WHEN d.shopper = 1 THEN 'shopper' WHEN c.checkout = 1 THEN 'checkout' WHEN b.basket = 1 THEN 'basket' ELSE 'window-shopper' END AS type
FROM (
SELECT
domain_userid,
MIN(DATE(collector_tstamp)) AS first_touch
FROM events_new
GROUP BY domain_userid
) a
LEFT JOIN (
SELECT
domain_userid,
1 AS basket
FROM events_new
WHERE ev_action = 'add-to-basket'
GROUP BY domain_userid
) b ON a.domain_userid = b.domain_userid
LEFT JOIN (
SELECT
domain_userid,
1 AS checkout
FROM events_new
WHERE ev_action = 'checkout'
GROUP BY domain_userid
) c ON a.domain_userid = c.domain_userid
LEFT JOIN (
SELECT
domain_userid,
1 AS shopper
FROM events_new
WHERE event='transaction'
GROUP BY domain_userid
) d ON a.domain_userid = d.domain_userid
We can pull that data into R by executing the following at the R command prompt:
visits <- dbGetQuery(con, "
SELECT
a.domain_userid,
a.first_touch,
CASE WHEN d.shopper = 1 THEN 'shopper' WHEN c.checkout = 1 THEN 'checkout' WHEN b.basket = 1 THEN 'basket' ELSE 'window-shopper' END AS type
FROM (
SELECT
domain_userid,
MIN(DATE(collector_tstamp)) AS first_touch
FROM events_new
GROUP BY domain_userid
) a
LEFT JOIN (
SELECT
domain_userid,
1 AS basket
FROM events_new
WHERE ev_action = 'add-to-basket'
GROUP BY domain_userid
) b ON a.domain_userid = b.domain_userid
LEFT JOIN (
SELECT
domain_userid,
1 AS checkout
FROM events_new
WHERE ev_action = 'checkout'
GROUP BY domain_userid
) c ON a.domain_userid = c.domain_userid
LEFT JOIN (
SELECT
domain_userid,
1 AS shopper
FROM events_new
WHERE event='transaction'
GROUP BY domain_userid
) d ON a.domain_userid = d.domain_userid
")
Back to top
## 4. Getting started analysing Snowplow data in RA guide to getting started using R to perform analysis can be found here on the Analytics Cookbook.
Back to top
## 5. Next stepsTO WRITE
Back ot top
Home | About | Project | Setup Guide | Technical Docs | Copyright © 2012-2013 Snowplow Analytics Ltd
HOME > SNOWPLOW SETUP GUIDE > Step 5: Getting started analysing Snowplow data
- [Step 1: Setup a Collector] (setting-up-a-collector)
- [Step 2: Setup a Tracker] (setting-up-a-tracker)
- [Step 3: Setup EmrEtlRunner] (setting-up-EmrEtlRunner)
- [Step 4: Setup the StorageLoader] (setting-up-storageloader)
- [Step 5: Analyze your data!] (Getting started analyzing Snowplow data)
- [5.1: setting up the default recipes and cubes] views
- [5.2: setting up Looker to visualize your data] looker
- [5.3: setting up ChartIO to visualize your data] chartio
- [5.4: setting up Excel to analyze and visualize your data] excel
- [5.5: setting up Tableau to perform OLAP analysis on your data] tableau
- [5.6: setting up R to perform more sophisticated data analysis] r
- [5.7: get started analyzing your data in EMR and Hive] (getting-started-with-EMR)
- [5.8: using Qubole to analyze your data in S3 using Hive and Pig] (Setting-up-Qubole-to-analyze-Snowplow-data-using-Apache-Hive)
Useful resources