---
title: "Exploratory Analysis - Instacart"
author: "Philipp Spachtholz"
output:
  html_document:
    fig_height: 4
    fig_width: 7
    theme: cosmo
---

### Welcome and good luck to you all in the Instacart Market Basket Competition!

Here is a first exploratory analysis of the competition dataset.
On its website, Instacart has a recommendation feature that suggests items a user may want to buy again. Our task is to predict which items will be reordered in the user's next order.

The dataset consists of information about 3.4 million grocery orders, distributed across 6 CSV files.


### Read in the data
```{r message=FALSE, warning=FALSE, results='hide'}
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)

orders <- fread('../input/orders.csv')
products <- fread('../input/products.csv')
order_products <- fread('../input/order_products__train.csv')
order_products_prior <- fread('../input/order_products__prior.csv')
aisles <- fread('../input/aisles.csv')
departments <- fread('../input/departments.csv')

```


```{r include=FALSE}
options(tibble.width = Inf)
```


Let's first have a look at these files:

### Peek at the dataset {.tabset}

#### orders

This file lists all orders in the dataset, one row per order.
For example, we can see that user 1 has 11 orders, 1 of which is in the train set and 10 of which are prior orders. The orders.csv file doesn't tell us which products were ordered; that information is in the order_products__train.csv and order_products__prior.csv files.

```{r}
kable(head(orders,12))
glimpse(orders)
```
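
To double-check statements like "user 1 has 11 orders, 1 train and 10 prior", the orders can be counted per user and eval_set. Here is a minimal sketch on a made-up toy table (`orders_toy` and its values are assumptions for illustration, not the real data):

```r
library(dplyr)

# Toy stand-in for orders.csv (values are invented for illustration)
orders_toy <- data.frame(
  order_id = 1:6,
  user_id  = c(1, 1, 1, 2, 2, 2),
  eval_set = c("prior", "prior", "train", "prior", "prior", "train")
)

# One row per user and eval_set, with the number of orders in each
orders_per_user <- orders_toy %>%
  count(user_id, eval_set, name = "n_orders")

orders_per_user
```

The same `count()` call should work unchanged on the real `orders` table.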


#### order_products_train

This file gives us information about which products (product_id) were ordered. It also contains the position (add_to_cart_order) in which each product was put into the cart and whether the product is a reorder (1) or not (0).

For example, we see below that order_id 1 had 8 products, 4 of which are reorders.

We still don't know what these products are. This information is in products.csv.

```{r}
kable(head(order_products,10))
glimpse(order_products)
```

#### products

This file contains the names of the products with their corresponding product_id. The aisle and department of each product are included as well.

```{r}
kable(head(products,10))
glimpse(products)
```

#### order_products_prior

This file is structurally the same as order_products__train.csv.

```{r}
kable(head(order_products_prior,10))
glimpse(order_products_prior)
```


#### aisles

This file contains the different aisles.

```{r}
kable(head(aisles,10))
glimpse(aisles)
```

#### departments

```{r}
kable(head(departments,10))
glimpse(departments)
```


### Recode variables
We should do some recoding and convert character variables to factors. 
```{r message=FALSE, warning=FALSE}
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
```

### When do people order?
Let's have a look at when people buy groceries online.


#### Hour of Day
There is a clear effect of hour of day on order volume. Most orders are placed between 8:00 and 18:00.
```{r warning=FALSE}
orders %>% 
  ggplot(aes(x=order_hour_of_day)) + 
  geom_histogram(stat="count",fill="red")
```

#### Day of Week
There is a clear effect of day of the week. Most orders are placed on days 0 and 1. Unfortunately, there is no information on which values represent which day, but one would assume that these two are the weekend.

```{r warning=FALSE}

orders %>% 
  ggplot(aes(x=order_dow)) + 
  geom_histogram(stat="count",fill="red")
```


### When do they order again?
People seem to order more often after exactly 1 week. 
```{r warning=FALSE}
orders %>% 
  ggplot(aes(x=days_since_prior_order)) + 
  geom_histogram(stat="count",fill="red")
```

### How many prior orders are there?
We can see that there are always at least 3 prior orders. 
```{r}
orders %>% 
  filter(eval_set == "prior") %>% 
  count(order_number) %>% 
  ggplot(aes(order_number, n)) + 
  geom_line(color = "red", size = 1) + 
  geom_point(size = 2, color = "red")
```
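
The plot only shows this indirectly, so a more direct check is to count the prior orders per user and take the minimum. A minimal sketch on made-up toy data (`orders_toy` is an assumption for illustration; on the real `orders` table the result should be 3 if the statement above holds):

```r
library(dplyr)

# Toy stand-in for orders.csv: two users with 3 and 4 prior orders
orders_toy <- data.frame(
  user_id      = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  order_number = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  eval_set     = c("prior", "prior", "prior", "train",
                   "prior", "prior", "prior", "prior", "train")
)

# Number of prior orders for each user
prior_per_user <- orders_toy %>%
  filter(eval_set == "prior") %>%
  count(user_id, name = "n_prior")

# Smallest number of prior orders any user has
min_prior <- min(prior_per_user$n_prior)
min_prior
```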

### How many items do people buy? {.tabset}
Let's have a look at how many items are in the orders. We can see that people most often order around 5 items. The distributions are comparable between the train and prior order sets.
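
To put a number on "most often", one can compute the most frequent basket size (the mode) and the median directly. This is only a sketch on invented toy data (`order_products_toy` is an assumption, not the real table), and it counts rows per order rather than relying on add_to_cart_order:

```r
library(dplyr)

# Toy stand-in for an order_products table (values are invented)
order_products_toy <- data.frame(
  order_id          = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  add_to_cart_order = c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3, 4)
)

# Number of items in each order
basket_sizes <- order_products_toy %>%
  count(order_id, name = "n_items")

# How many orders have each basket size, then the mode and the median
size_counts <- basket_sizes %>% count(n_items, name = "n_orders")
modal_size  <- size_counts$n_items[which.max(size_counts$n_orders)]
median_size <- median(basket_sizes$n_items)

modal_size
median_size
```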

#### Train set
```{r warning=FALSE}
order_products %>% 
  group_by(order_id) %>% 
  summarize(n_items = max(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") + 
  geom_rug()+
  coord_cartesian(xlim=c(0,80))
```

#### Prior orders set
```{r warning=FALSE}
order_products_prior %>% 
  group_by(order_id) %>% 
  summarize(n_items = max(add_to_cart_order)) %>%
  ggplot(aes(x=n_items))+
  geom_histogram(stat="count",fill="red") + 
  geom_rug() + 
  coord_cartesian(xlim=c(0,80))
```


### Bestsellers
Let's have a look at which products are sold most often (top 10). And the clear winner is:
**Bananas**

```{r fig.height=5.5}
tmp <- order_products %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 
kable(tmp)

tmp %>% 
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())

```

### How often do people order the same items again?
59% of the ordered items are reorders.
```{r warning=FALSE, fig.width=4}
tmp <- order_products %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count/sum(count))
kable(tmp)
  
tmp %>% 
  ggplot(aes(x=reordered,y=count,fill=reordered))+
  geom_bar(stat="identity")

```


### Most often reordered
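Now here it becomes really interesting. These 10 products have the highest probability of being reordered. Only products that were ordered more than 40 times are considered, so that the proportions are not based on just a handful of orders.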

```{r warning=FALSE, fig.height=5.5}
tmp <-order_products %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>% 
  filter(n>40) %>% 
  top_n(10,wt=proportion_reordered) %>% 
  arrange(desc(proportion_reordered)) %>% 
  left_join(products,by="product_id")

kable(tmp)

tmp %>% 
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95))
```
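
Why the minimum-count filter matters can be seen in a small sketch. The helper `reorder_rates()` and the toy table are hypothetical and only for illustration; the helper mirrors the pipeline above with the threshold as a parameter:

```r
library(dplyr)

# Toy stand-in for order_products (values are invented)
order_products_toy <- data.frame(
  product_id = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 3),
  reordered  = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Hypothetical helper: reorder proportion per product,
# keeping only products ordered more than `min_orders` times
reorder_rates <- function(df, min_orders = 0) {
  df %>%
    group_by(product_id) %>%
    summarize(proportion_reordered = mean(reordered), n = n()) %>%
    filter(n > min_orders) %>%
    arrange(desc(proportion_reordered))
}

# Without a threshold, product 2 (ordered only once) tops the list
unfiltered <- reorder_rates(order_products_toy)

# With a threshold, only products with enough orders remain
filtered <- reorder_rates(order_products_toy, min_orders = 3)
```

In the toy data, the product that was ordered a single time has a reorder proportion of 1 and would lead the ranking, which is exactly the kind of noise the `n>40` filter is there to remove.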


### Which item do people put into the cart first?
People seem to be quite certain about Multifold Towels: when they buy them, they put
them into the cart first 66% of the time.
```{r message=FALSE, fig.height=5.5}
tmp <- order_products %>% 
  group_by(product_id, add_to_cart_order) %>% 
  summarize(count = n()) %>% mutate(pct=count/sum(count)) %>% 
  filter(add_to_cart_order == 1, count>10) %>% 
  arrange(desc(pct)) %>% 
  left_join(products,by="product_id") %>% 
  select(product_name, pct, count) %>% 
  ungroup() %>% 
  top_n(10, wt=pct)

kable(tmp)

tmp %>% 
  ggplot(aes(x=reorder(product_name,-pct), y=pct))+
  geom_bar(stat="identity",fill="red")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.4,0.7))

```
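
The percentage above relies on a grouping detail: after `summarize()`, the data is still grouped by product_id, so `sum(count)` is the total per product and not the grand total. Here is a minimal sketch on invented toy data that makes this explicit (the `.groups` argument assumes dplyr 1.0.0 or later; `order_products_toy` is an assumption for illustration):

```r
library(dplyr)

# Toy stand-in for order_products: product 10 is put into the cart
# first in 3 of 4 orders, product 20 in 1 of 2 orders
order_products_toy <- data.frame(
  product_id        = c(10, 10, 10, 10, 20, 20),
  add_to_cart_order = c(1, 1, 1, 2, 1, 3)
)

first_in_cart <- order_products_toy %>%
  group_by(product_id, add_to_cart_order) %>%
  # drop only the last grouping level, so the result stays grouped by product_id
  summarize(count = n(), .groups = "drop_last") %>%
  # share of each cart position within the product
  mutate(pct = count / sum(count)) %>%
  ungroup() %>%
  filter(add_to_cart_order == 1)

first_in_cart
```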

### Association between time of last order and probability of reorder
This is interesting: if people order again on the same day, they reorder the same products more often, whereas when 30 days have passed, they tend to try out new things in their order.

```{r}
order_products %>% 
  left_join(orders,by="order_id") %>% 
  group_by(days_since_prior_order) %>%
  summarize(mean_reorder = mean(reordered)) %>%
  ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
  geom_bar(stat="identity",fill="red")
```


### Association between number of orders and probability of reordering
Products with a high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect. 

```{r message=FALSE}
order_products %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>%
  ggplot(aes(x=n,y=proportion_reordered))+
  geom_point()+
  geom_smooth(color="red")+
  coord_cartesian(xlim=c(0,2000))

```


### Organic vs Non-organic
What percentage of the ordered products is organic vs. not organic? A product counts as organic here if its name contains the word "organic".
```{r fig.width=4}
products <- products %>% 
  mutate(organic = ifelse(str_detect(str_to_lower(product_name), "organic"), "organic", "not organic"),
         organic = as.factor(organic))
    
tmp <- order_products %>% 
  left_join(products, by="product_id") %>% 
  group_by(organic) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))
kable(tmp)

tmp %>% 
  ggplot(aes(x=organic,y=count, fill=organic))+
  geom_bar(stat="identity")

```
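
Detecting organic products by name is a simple substring match, which has a possible caveat. The sketch below uses made-up product names (an assumption, not taken from products.csv) to compare a case-insensitive substring match with a stricter whole-word match via stringr's `regex()` modifier:

```r
library(stringr)

# Invented product names for illustration only
product_names <- c("Organic Whole Milk", "Banana",
                   "ORGANIC Baby Spinach", "Inorganic Salt")

# Case-insensitive substring match (same idea as lowercasing first)
is_organic <- str_detect(product_names, regex("organic", ignore_case = TRUE))

# Stricter variant: "organic" must appear as a whole word
is_organic_word <- str_detect(product_names,
                              regex("\\borganic\\b", ignore_case = TRUE))

data.frame(product_names, is_organic, is_organic_word)
```

Whether the stricter variant changes anything on the real product list is not something I checked; it is only meant to show how the matching rule could be tightened.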

### Reordering Organic vs Non-Organic
People reorder organic products more often than non-organic products.
```{r fig.width=4}
tmp <- order_products %>% 
  left_join(products, by = "product_id") %>% 
  group_by(organic) %>% 
  summarize(mean_reordered = mean(reordered))
kable(tmp)

tmp %>% 
  ggplot(aes(x=organic,fill=organic,y=mean_reordered))+geom_bar(stat="identity")
```


### Visualizing the Product Portfolio
Here I use the treemap package to visualize the structure of Instacart's product portfolio. In total there are 21 departments containing 134 aisles.

```{r}
library(treemap)

tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")

tmp2<-order_products %>% 
  group_by(product_id) %>% 
  summarize(count=n()) %>% 
  left_join(products,by="product_id") %>% 
  ungroup() %>% 
  group_by(department_id,aisle_id) %>% 
  summarize(sumcount = sum(count)) %>% 
  left_join(tmp, by = c("department_id", "aisle_id")) %>% 
  mutate(onesize = 1)

```

#### How are aisles organized within departments?
```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",title="",sortID="-sumcount", border.col="#FFFFFF",type="categorical", fontsize.legend = 0,bg.labels = "#FFFFFF")
```

#### How many unique products are offered in each department/aisle?
The size of the boxes shows the number of products in each category. 
```{r, fig.width=9, fig.height=6}
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")
```

#### How often are products from the department/aisle sold?
The size of the boxes shows the number of sales. 
```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")
```


### Exploring Customer Habits
Here I look for customers who just reorder the same products all the time. To find them, I look at all prior orders from the third order onward (the code filters on `order_number > 2`) in which the proportion of reordered items is exactly 1. This can easily be adapted to more lenient thresholds.
We can see there are in fact **3,487** customers who always just reorder products.

#### Customers reordering only
```{r}

tmp <- order_products_prior %>% 
  group_by(order_id) %>% 
  summarize(m = mean(reordered),n=n()) %>% 
  right_join(filter(orders,order_number>2), by="order_id")

tmp2 <- tmp %>% 
  filter(eval_set =="prior") %>% 
  group_by(user_id) %>% 
  summarize(n_equal = sum(m==1,na.rm=TRUE), percent_equal = n_equal/n()) %>% 
  filter(percent_equal == 1) %>% 
  arrange(desc(n_equal))

datatable(tmp2, class="table-condensed", style="bootstrap", options = list(dom = 'tp'))

```  
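
A more lenient threshold could look like the following sketch. The helper `habitual_customers()`, its `min_share` parameter, and the toy table are all hypothetical names introduced for illustration; the toy table stands in for the per-order summary `tmp` with its reorder proportion `m`:

```r
library(dplyr)

# Toy stand-in for the per-order summary (m = share of reordered items)
order_stats_toy <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 2),
  m       = c(1, 1, 1, 1, 0.9, 0.8)
)

# Hypothetical helper: users for whom every order has at least
# `min_share` reordered items (min_share = 1 reproduces the strict rule)
habitual_customers <- function(order_stats, min_share = 1) {
  order_stats %>%
    group_by(user_id) %>%
    summarize(n_match = sum(m >= min_share, na.rm = TRUE),
              percent_match = n_match / n()) %>%
    filter(percent_match == 1) %>%
    arrange(desc(n_match))
}

strict  <- habitual_customers(order_stats_toy)
lenient <- habitual_customers(order_stats_toy, min_share = 0.8)
```

With the strict rule only user 1 qualifies in the toy data; lowering `min_share` to 0.8 also lets user 2 through.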


#### The customer with the strongest habit
The coolest customer is id #99753, with 97 orders containing only reordered items. That's what I call a strong habit.
She/he seems to like Organic Milk :-)

```{r warning=FALSE}
uniqueorders <- filter(tmp, user_id == 99753)$order_id
tmp <- order_products_prior %>% 
  filter(order_id %in% uniqueorders) %>% 
  left_join(products, by="product_id")

datatable(select(tmp,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))
```  
<br>
Let's look at this customer's order in the train set. One would assume that it contains "Organic Whole Milk" and "Organic Reduced Fat Milk":
```{r warning=FALSE}
tmp <- orders %>% filter(user_id==99753, eval_set == "train")
tmp2 <- order_products %>%  
  filter(order_id %in% tmp$order_id) %>% 
  left_join(products, by="product_id")

datatable(select(tmp2,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 't'))
```

**Tadaaaa. Prediction 100% correct.**

<br><br>

**Thank you all for the nice comments and upvotes. You are great.**