-
Notifications
You must be signed in to change notification settings - Fork 0
/
reference markdown.Rmd
412 lines (309 loc) · 12.6 KB
/
reference markdown.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
---
title: "Exploratory Analysis - Instacart"
author: "Philipp Spachtholz"
output:
html_document:
fig_height: 4
fig_width: 7
theme: cosmo
---
### Welcome and good luck to you all at Instacart Market Basket Competition!
Here is a first exploratory analysis of the competition dataset.
On its website Instacart has a recommendation feature, suggesting the users some items that he/she may buy again. Our task is to predict which items will be reordered on the
next order.
The dataset consists of information about 3.4 million grocery orders, distributed across 6 csv files.
### Read in the data
```{r message=FALSE, warning=FALSE, results='hide'}
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(stringr)
library(DT)
orders <- fread('../input/orders.csv')
products <- fread('../input/products.csv')
order_products <- fread('../input/order_products__train.csv')
order_products_prior <- fread('../input/order_products__prior.csv')
aisles <- fread('../input/aisles.csv')
departments <- fread('../input/departments.csv')
```
```{r include=FALSE}
options(tibble.width = Inf)
```
Lets first have a look at these files:
### Peek at the dataset {.tabset}
#### orders
This file gives a list of all orders we have in the dataset. 1 row per order.
For example, we can see that user 1 has 11 orders, 1 of which is in the train set, and 10 of which are prior orders. The orders.csv doesn't tell us about which products were ordered. This is contained in the order_products.csv
```{r, result='asis'}
kable(head(orders,12))
glimpse(orders)
```
#### order_products_train
This file gives us information about which products (product_id) were ordered. It also contains information of the order (add_to_cart_order) in which the products were put into the cart and information of whether this product is a re-order(1) or not(0).
For example, we see below that order_id 1 had 8 products, 4 of which are reorders.
Still we don't know what these products are. This information is in the products.csv
```{r}
kable(head(order_products,10))
glimpse(order_products)
```
#### products
This file contains the names of the products with their corresponding product_id. Furthermore the aisle and deparment are included.
```{r}
kable(head(products,10))
glimpse(products)
```
#### order_products_prior
This file is structurally the same as the other_products_train.csv.
```{r, result='asis'}
kable(head(order_products_prior,10))
glimpse(order_products_prior)
```
#### aisles
This file contains the different aisles.
```{r, result='asis'}
kable(head(aisles,10))
glimpse(aisles)
```
#### departments
```{r, result='asis'}
kable(head(departments,10))
glimpse(departments)
```
### Recode variables
We should do some recoding and convert character variables to factors.
```{r message=FALSE, warning=FALSE}
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
```
### When do people order?
Let's have a look when people buy groceries online.
#### Hour of Day
There is a clear effect of hour of day on order volume. Most orders are between 8.00-18.00
```{r warning=FALSE}
orders %>%
ggplot(aes(x=order_hour_of_day)) +
geom_histogram(stat="count",fill="red")
```
#### Day of Week
There is a clear effect of day of the week. Most orders are on days 0 and 1. Unfortunately there is no
info regarding which values represent which day, but one would assume that this is the weekend.
```{r warning=FALSE}
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="red")
```
### When do they order again?
People seem to order more often after exactly 1 week.
```{r warning=FALSE}
orders %>%
ggplot(aes(x=days_since_prior_order)) +
geom_histogram(stat="count",fill="red")
```
### How many prior orders are there?
We can see that there are always at least 3 prior orders.
```{r}
orders %>% filter(eval_set=="prior") %>% count(order_number) %>% ggplot(aes(order_number,n)) + geom_line(color="red", size=1)+geom_point(size=2, color="red")
```
### How many items do people buy? {.tabset}
Let's have a look how many items are in the orders. We can see that people most often order around 5 items. The distributions are comparable between the train and prior order set.
#### Train set
```{r warning=FALSE}
order_products %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="red") +
geom_rug()+
coord_cartesian(xlim=c(0,80))
```
#### Prior orders set
```{r warning=FALSE}
order_products_prior %>%
group_by(order_id) %>%
summarize(n_items = last(add_to_cart_order)) %>%
ggplot(aes(x=n_items))+
geom_histogram(stat="count",fill="red") +
geom_rug() +
coord_cartesian(xlim=c(0,80))
```
### Bestsellers
Let's have a look which products are sold most often (top10). And the clear winner is:
**Bananas**
```{r fig.height=5.5}
tmp <- order_products %>%
group_by(product_id) %>%
summarize(count = n()) %>%
top_n(10, wt = count) %>%
left_join(select(products,product_id,product_name),by="product_id") %>%
arrange(desc(count))
kable(tmp)
tmp %>%
ggplot(aes(x=reorder(product_name,-count), y=count))+
geom_bar(stat="identity",fill="red")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())
```
### How often do people order the same items again?
59% of the ordered items are reorders.
```{r warning=FALSE, fig.width=4}
tmp <- order_products %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(reordered = as.factor(reordered)) %>%
mutate(proportion = count/sum(count))
kable(tmp)
tmp %>%
ggplot(aes(x=reordered,y=count,fill=reordered))+
geom_bar(stat="identity")
```
### Most often reordered
Now here it becomes really interesting. These 10 products have the highest probability of being reordered.
```{r warning=FALSE, fig.height=5.5}
tmp <-order_products %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
filter(n>40) %>%
top_n(10,wt=proportion_reordered) %>%
arrange(desc(proportion_reordered)) %>%
left_join(products,by="product_id")
kable(tmp)
tmp %>%
ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
geom_bar(stat="identity",fill="red")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95))
```
### Which item do people put into the cart first?
People seem to be quite certain about Multifold Towels and if they buy them, put
them into their cart first in 66% of the time.
```{r message=FALSE, fig.height=5.5}
tmp <- order_products %>%
group_by(product_id, add_to_cart_order) %>%
summarize(count = n()) %>% mutate(pct=count/sum(count)) %>%
filter(add_to_cart_order == 1, count>10) %>%
arrange(desc(pct)) %>%
left_join(products,by="product_id") %>%
select(product_name, pct, count) %>%
ungroup() %>%
top_n(10, wt=pct)
kable(tmp)
tmp %>%
ggplot(aes(x=reorder(product_name,-pct), y=pct))+
geom_bar(stat="identity",fill="red")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.4,0.7))
```
### Association between time of last order and probability of reorder
This is interesting: We can see that if people order again on the same day, they order the same product more often. Whereas when 30 days have passed, they tend to try out new things in their order.
```{r}
order_products %>%
left_join(orders,by="order_id") %>%
group_by(days_since_prior_order) %>%
summarize(mean_reorder = mean(reordered)) %>%
ggplot(aes(x=days_since_prior_order,y=mean_reorder))+
geom_bar(stat="identity",fill="red")
```
### Association between number of orders and probability of reordering
Products with a high number of orders are naturally more likely to be reordered. However, there seems to be a ceiling effect.
```{r message=FALSE}
order_products %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
ggplot(aes(x=n,y=proportion_reordered))+
geom_point()+
geom_smooth(color="red")+
coord_cartesian(xlim=c(0,2000))
```
### Organic vs Non-organic
What is the percentage of orders that are organic vs. not organic?
```{r fig.width=4}
products <- products %>%
mutate(organic=ifelse(str_detect(str_to_lower(products$product_name),'organic'),"organic","not organic"), organic= as.factor(organic))
tmp <- order_products %>%
left_join(products, by="product_id") %>%
group_by(organic) %>%
summarize(count = n()) %>%
mutate(proportion = count/sum(count))
kable(tmp)
tmp %>%
ggplot(aes(x=organic,y=count, fill=organic))+
geom_bar(stat="identity")
```
### Reordering Organic vs Non-Organic
People more often reorder organic products vs non-organic products.
```{r fig.width=4}
tmp <- order_products %>% left_join(products,by="product_id") %>% group_by(organic) %>% summarize(mean_reordered = mean(reordered))
kable(tmp)
tmp %>%
ggplot(aes(x=organic,fill=organic,y=mean_reordered))+geom_bar(stat="identity")
```
### Visualizing the Product Portfolio
Here is use to treemap package to visualize the structure of instacarts product portfolio. In total there are 21 departments containing 134 aisles.
```{r}
library(treemap)
tmp <- products %>% group_by(department_id, aisle_id) %>% summarize(n=n())
tmp <- tmp %>% left_join(departments,by="department_id")
tmp <- tmp %>% left_join(aisles,by="aisle_id")
tmp2<-order_products %>%
group_by(product_id) %>%
summarize(count=n()) %>%
left_join(products,by="product_id") %>%
ungroup() %>%
group_by(department_id,aisle_id) %>%
summarize(sumcount = sum(count)) %>%
left_join(tmp, by = c("department_id", "aisle_id")) %>%
mutate(onesize = 1)
```
#### How are aisles organized within departments?
```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="onesize",vColor="department",palette="Set3",title="",sortID="-sumcount", border.col="#FFFFFF",type="categorical", fontsize.legend = 0,bg.labels = "#FFFFFF")
```
#### How many unique products are offered in each department/aisle?
The size of the boxes shows the number of products in each category.
```{r, fig.width=9, fig.height=6}
treemap(tmp,index=c("department","aisle"),vSize="n",title="",palette="Set3",border.col="#FFFFFF")
```
#### How often are products from the department/aisle sold?
The size of the boxes shows the number of sales.
```{r, fig.width=9, fig.height=6}
treemap(tmp2,index=c("department","aisle"),vSize="sumcount",title="",palette="Set3",border.col="#FFFFFF")
```
### Exploring Customer Habits
Here i look for customers who just reorder the same products again all the time. To search those I look at all orders (excluding the first order), where the percentage of reordered items is exactly 1 (This can easily be adapted to look at more lenient thresholds).
We can see there are in fact **3,487** customers, just always reordering products.
#### Customers reordering only
```{r}
tmp <- order_products_prior %>%
group_by(order_id) %>%
summarize(m = mean(reordered),n=n()) %>%
right_join(filter(orders,order_number>2), by="order_id")
tmp2 <- tmp %>%
filter(eval_set =="prior") %>%
group_by(user_id) %>%
summarize(n_equal = sum(m==1,na.rm=T), percent_equal = n_equal/n()) %>%
filter(percent_equal == 1) %>%
arrange(desc(n_equal))
datatable(tmp2, class="table-condensed", style="bootstrap", options = list(dom = 'tp'))
```
#### The customer with the strongest habit
The coolest customer is id #99753, having 97 orders with only reordered items. That's what I call a strong habit.
She/he seems to like Organic Milk :-)
```{r warning=FALSE}
uniqueorders <- filter(tmp, user_id == 99753)$order_id
tmp <- order_products_prior %>%
filter(order_id %in% uniqueorders) %>%
left_join(products, by="product_id")
datatable(select(tmp,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 'tp'))
```
<br>
Let's look at his order in the train set. One would assume that he would buy "Organic Whole Milk" and "Organic Reduced Fat Milk":
```{r warning=FALSE}
tmp <- orders %>% filter(user_id==99753, eval_set == "train")
tmp2 <- order_products %>%
filter(order_id == tmp$order_id) %>%
left_join(products, by="product_id")
datatable(select(tmp2,-aisle_id,-department_id,-organic), style="bootstrap", class="table-condensed", options = list(dom = 't'))
```
**Tadaaaa. Prediction 100% correct.**
<br><br>
**Thank you all for the nice comments and upvotes. You are great.**