-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek2.Rmd
88 lines (68 loc) · 3.86 KB
/
week2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
title: "H~2~O framework introduction for predictive analysis"
output: html_document
---
## Exercise: Comparing Models
In this exercise our objective will be building a model to identify customers most likely to accept a direct marketing offer.
I'll be using H~2~O flow, which runs on the browser and is designed to streamline the modelling process. It offers the same functionaliity as the non GUI H~2~O, as well as more insightful output. I'll be using the dataset I imported in the last step, but would import it again here for reproducibility reasons.
```{r include = FALSE, echo=FALSE}
library(downloader)
library(h2o)
library(tidyverse)
```
```{r warning=FALSE}
localH20 <- h2o.init(ip = "127.0.0.1", port = 54321)
```
The above starts the h~2~o server on the ip given address. That is where we import our data, prepare for modelling, and draw our model.
In the previous exercise on H~2~O Flow, I created a neural net and GBM model, which I compared for performance and predicted the *y* column. In this next exercise which I'll be doing here with H~2~O as well, I'll be creating a classifier to know if a customer would accept a beneficial offer, or decline. I'll be using a Logistic Regression model for this classifier. Also, I'll be predicting how long a call would last.
### Upload file, and explore
```{r warning=FALSE, include=FALSE}
# dataUrl = "https://raw.githubusercontent.com/QUT-BDA-MOOC/FLbigdataStats/master/bank_customer_data.csv"
# download.file(url <- dataUrl, destfile = "BankCustomerData.csv")
user_data <- h2o.uploadFile("BankCustomerData.csv", destination_frame = "", parse = T, header = T,
sep = ",", na.strings = c("unknown"), progressBar = FALSE,
parse_type = "CSV" )
```
``` {r warning=FALSE}
summary(user_data)
class(user_data)
head(user_data)
table(as.data.frame(user_data)$y)
h2o.describe(user_data)
# Gets the index of a column, takes the columnNames, and the column to find the index.
getColumnIndex <- function(frameColumnNames, column) {
index = frameColumnNames %in% column
if (any(index)) {
return(which(index))
} else {
return(NULL)
}
}
# Exclude the length column from the data set
user_data_predict_offer_accept <- user_data[, -11] # excluding call length
(split_frame_offer <- h2o.splitFrame(user_data_predict_offer_accept, ratios = 0.75))
train_frame_offer <- split_frame_offer[[1]]
test_frame_offer <- split_frame_offer[[2]]
(frame_column_names_offer <- colnames(train_frame_offer))
outcomeIndex = getColumnIndex(frame_column_names_offer, "y")
predictorsIndices <- (1:length(frame_column_names_offer))[-outcomeIndex]
offer_glm <- h2o.glm(x=predictorsIndices, y = outcomeIndex, training_frame = train_frame_offer,
validation_frame = test_frame_offer, max_iterations = 100, solver = "L_BFGS",
family = "binomial", alpha = 1, intercept = T)
user_data_predict_call_length <- user_data[, -21] # excluding offer accept
(split_frame_call <- h2o.splitFrame(user_data_predict_call_length, ratios = 0.75))
train_frame_call <- split_frame_call[[1]]
test_frame_call <- split_frame_call[[2]]
(frame_column_names_call <- colnames(train_frame_call))
(outcomeIndex = getColumnIndex(frame_column_names_call, "length"))
(predictorsIndices <- (1:length(frame_column_names_call))[-outcomeIndex])
# could normalized the values before this normalizing
call_glm <- h2o.glm(x=predictorsIndices, y = outcomeIndex, training_frame = train_frame_call,
validation_frame = test_frame_call,max_iterations = 100, solver = "L_BFGS",
family = "gaussian", alpha = 1,intercept = T)
summary(offer_glm)
summary(call_glm)
# making the predictions, should use an appropriate data set, not this ones used for validation in training
h2o.predict(call_glm, test_frame_call)
h2o.predict(offer_glm, test_frame_offer)
```