---
title: "Distributed computing"
output:
  html_document:
    toc: true
    toc_float: true
---
```{r setup2, include = FALSE}
knitr::opts_chunk$set(cache = TRUE,
                      message = FALSE,
                      warning = FALSE)
```
```{r packages, message = FALSE, warning = FALSE, cache = FALSE}
library(tidyverse)
library(gapminder)
library(stringr)
theme_set(theme_bw())
```
# Objectives
* Illustrate the split-apply-combine analytical pattern
* Define parallel processing
* Introduce Hadoop and Spark as distributed computing platforms
* Introduce the `sparklyr` package
* Demonstrate how to use `sparklyr` for machine learning using the Titanic data set
# Split-apply-combine
A common analytical pattern is to
* **split** data into pieces,
* **apply** some function to each piece,
* **combine** the results back together again.
Examples of methods you have employed thus far using the `gapminder` dataset:
### `dplyr::group_by()` {#group-by}
```{r group_by}
gapminder %>%
  group_by(continent) %>%
  summarize(n = n())

gapminder %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))
```
### `for` loops {#for-loops}
```{r for_loop}
countries <- unique(gapminder$country)
lifeExp_models <- vector("list", length(countries))
names(lifeExp_models) <- countries
for(i in seq_along(countries)){
  lifeExp_models[[i]] <- lm(lifeExp ~ year,
                            data = filter(gapminder, country == countries[[i]]))
}
head(lifeExp_models)
```
### `nest()` and `map()` {#nest-map}
```{r nest}
# function to estimate linear model for gapminder subsets
le_vs_yr <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# split data into nests
(gap_nested <- gapminder %>%
    group_by(continent, country) %>%
    nest())

# apply a linear model to each nested data frame
(gap_nested <- gap_nested %>%
    mutate(fit = map(data, le_vs_yr)))

# combine the results back into a single data frame
library(broom)
(gap_nested <- gap_nested %>%
    mutate(tidy = map(fit, tidy)))

(gap_coefs <- gap_nested %>%
    select(continent, country, tidy) %>%
    unnest(tidy))
```
# Parallel computing
![From [An introduction to parallel programming using Python's multiprocessing module](http://sebastianraschka.com/Articles/2014_multiprocessing.html)](http://sebastianraschka.com/images/blog/2014/multiprocessing_intro/multiprocessing_scheme.png)
Parallel computing (or processing) is a type of computation in which many calculations or processes are carried out simultaneously.^[["Parallel computing"](https://en.wikipedia.org/wiki/Parallel_computing)] Rather than processing a problem in *serial* (or sequential) order, the computer splits the task into smaller parts that can be processed simultaneously using multiple processors. This is also called *multithreading*. By splitting the job into simultaneous operations running in *parallel*, you complete the operation more quickly, making the code more efficient.
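To make the idea concrete, here is a minimal sketch (an added illustration, not part of the original notes) using base R's `parallel` package. `slow_task()` is a stand-in for real work; `mclapply()` forks the job across cores (on Windows you would need a PSOCK cluster and `parLapply()` instead):

```{r parallel_example, eval = FALSE}
library(parallel)

# stand-in for an expensive computation
slow_task <- function(i) {
  Sys.sleep(1)
  i^2
}

# serial: roughly 4 seconds total
system.time(out_serial <- lapply(1:4, slow_task))

# parallel: roughly 1 second on a machine with at least 4 cores
system.time(out_parallel <- mclapply(1:4, slow_task, mc.cores = 4))
```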
## Why use parallel computing
* Imitates real life
    * In the real world, people use their brains to think in parallel - we multitask all the time without even thinking about it. Institutions are structured to process information in parallel, rather than in serial.
* More efficient (sometimes)
    * Throwing more resources at a problem will shorten its time to completion
* Tackle larger problems
    * More resources let you solve larger problems
* Use distributed resources
    * Why spend thousands of dollars beefing up your own computing equipment when you can instead rent computing resources from Google or Amazon for mere pennies?
## Why not to use parallel computing
* Limits to efficiency gains
    * [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law) - there are theoretical limits to how much you can speed up computations via parallel computing (see the formula after this list)
    * Diminishing returns over time
* Complexity
    * Writing parallel code can be more complicated than serial code
    * R does not natively implement parallel computing - you have to explicitly build it into your script
* Dependencies
    * Your computation may rely on the output from the first set of tasks to perform the second set of tasks
    * If you compute the problem in parallel fashion, the individual chunks do not communicate with one another
* Parallel slowdown
    * Parallel computing speeds up computations at a price
    * Once the problem is broken into separate threads, reading and writing data from the threads to memory or the hard drive takes time
    * Some tasks are not improved by splitting the process into parallel operations
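For reference (a standard result, not spelled out in the original notes): Amdahl's law says that if a fraction $p$ of a program's runtime can be parallelized across $n$ processors, the maximum speedup is

$$S(n) = \frac{1}{(1 - p) + \frac{p}{n}}$$

As $n \rightarrow \infty$ the speedup approaches $\frac{1}{1 - p}$, so a program that is 90% parallelizable can never run more than 10 times faster, no matter how many cores you throw at it.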
# `multidplyr` {#multidplyr}
[`multidplyr`](https://github.com/hadley/multidplyr) is a work-in-progress package that implements parallel computing locally using `dplyr`. Rather than performing computations using a single core or processor, it spreads the computation across multiple cores. The basic sequence of steps is:
1. Call `partition()` to split the dataset across multiple cores. This makes a partitioned data frame, or a `party df` for short.
1. Each dplyr verb applied to a `party df` performs the operation independently on each core. It leaves each result on each core, and returns another `party df`.
1. When you're done with the expensive operations that need to be done on each core, you call `collect()` to retrieve the data and bring it back to your local computer.
## `nycflights13::flights` {#flights}
Install `multidplyr` if you don't have it already.
```{r install_multidplyr, eval = FALSE}
devtools::install_github("hadley/multidplyr")
```
```{r multidplyr}
library(multidplyr)
library(nycflights13)
```
Next, partition the flights data by flight number, compute the average delay per flight, and then collect the results:
```{r flights_multi}
flights1 <- partition(flights, flight)
flights2 <- summarize(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)
```
The `dplyr` code looks the same as usual, but behind the scenes things are very different. `flights1` and `flights2` are `party df`s. These look like normal data frames, but have an additional attribute: the number of shards. In this example, it tells us that `flights2` is spread across three nodes, and the size on each node varies from 1275 to 1286 rows. `partition()` always makes sure a group is kept together on one node.
```{r flights2}
flights2
```
## Performance
For this size of data, using a local cluster actually makes performance slower.
```{r flights_performance}
system.time({
  flights %>%
    partition() %>%
    summarise(mean(dep_delay, na.rm = TRUE)) %>%
    collect()
})

system.time({
  flights %>%
    group_by() %>%
    summarise(mean(dep_delay, na.rm = TRUE))
})
```
That's because there's some overhead associated with sending the data to each node and retrieving the results at the end. For basic `dplyr` verbs, `multidplyr` is unlikely to give you significant speedups unless you have tens or hundreds of millions of data points. It might, however, if you're doing more complex operations.
### `gapminder` {#gapminder}
Let's now return to `gapminder` and estimate separate linear regression models of life expectancy based on year for each country. We will use `multidplyr` to split the work across multiple cores. Note that we need to use `cluster_library()` to load the `purrr` package on every node.
```{r multi_nest}
# split data into nests
gap_nested <- gapminder %>%
  group_by(continent, country) %>%
  nest()

# partition gap_nested across the cores
gap_nested_part <- gap_nested %>%
  partition(country)

# apply a linear model to each nested data frame
cluster_library(gap_nested_part, "purrr")

system.time({
  gap_nested_part %>%
    mutate(fit = map(data, function(df) lm(lifeExp ~ year, data = df)))
})
```
How does this compare to running the same operation serially on a single core?
```{r multi_nest_serial}
system.time({
  gap_nested %>%
    mutate(fit = map(data, function(df) lm(lifeExp ~ year, data = df)))
})
```
So it's roughly two times faster to run in parallel. Admittedly, you saved only a fraction of a second. This demonstrates that it doesn't always make sense to parallelize operations - only do so if you can make significant gains in computation speed. If each country had thousands of observations, the efficiency gains would have been more dramatic.
# Hadoop and Spark
* [Apache Hadoop](http://hadoop.apache.org/)
    * Open-source software library that enables distributed processing of large data sets across clusters of computers
    * Highly scalable (one server -> thousands of machines)
    * Several modules
        * Hadoop Distributed File System (HDFS) - distributed file system
        * Hadoop MapReduce - system for parallel processing of large data sets
* [Spark](http://spark.apache.org/) - general engine for large-scale data processing, including machine learning
# `sparklyr` {#sparklyr}
Learning to use Hadoop and Spark directly can be very complicated, as they use their own programming languages to specify functions and perform operations. In this class, we will interact with Spark through [`sparklyr`](http://spark.rstudio.com/), an R package from the authors of RStudio and the `tidyverse`. This allows us to:
* Connect to Spark from R using the `dplyr` interface
* Interact with SQL databases stored on a Spark cluster
* Implement distributed [statistical](cm015.html) [learning](cm016.html) algorithms
See [here](http://spark.rstudio.com/) for more detailed instructions for setting up and using `sparklyr`.
## Installation
You can install `sparklyr` from CRAN as follows:
```{r install_sparklyr, eval = FALSE}
install.packages("sparklyr")
```
You should also install a local version of Spark to run it on your computer:
```{r install_spark, eval = FALSE}
library(sparklyr)
spark_install(version = "2.0.0")
```
```{r spark_lib, include = FALSE}
library(sparklyr)
```
## Connecting to Spark
You can connect to both local instances of Spark as well as remote Spark clusters. Let's use the `spark_connect` function to connect to a local cluster built on our computer:
```{r connect_spark, cache = FALSE}
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
```
## Reading data
You can copy R data frames into Spark using the `dplyr` `copy_to` function. Let's replicate some of our work with the `flights` database from last class. First let's load the data frames into Spark:
```{r load_flights, cache = FALSE}
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines", overwrite = TRUE)
airports_tbl <- copy_to(sc, nycflights13::airports, "airports", overwrite = TRUE)
planes_tbl <- copy_to(sc, nycflights13::planes, "planes", overwrite = TRUE)
weather_tbl <- copy_to(sc, nycflights13::weather, "weather", overwrite = TRUE)
```
## Using `dplyr`
Interacting with a Spark database uses the same `dplyr` functions as you would with a data frame or SQL database.
```{r flights_filter}
flights_tbl %>%
  filter(dep_delay == 2)
```
We can string together operations using the pipe `%>%`. For instance, we can calculate the average delay for each plane using this code:
```{r flights_delay}
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
```
To join together two tables, use the standard set of `_join` functions:
```{r flights_join}
flights_tbl %>%
  select(-time_hour) %>%
  left_join(weather_tbl)
```
There really isn't much new here, so I'm not going to hammer this point home any further. Read [*Manipulating Data with dplyr*](http://spark.rstudio.com/dplyr.html) for more information on Spark-specific examples of `dplyr` code.
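One aside worth knowing (not in the original notes): `dplyr` verbs on a Spark table are translated to Spark SQL and executed on the cluster, not in R. You can inspect the generated query with `dbplyr`'s `show_query()`:

```{r show_query, eval = FALSE}
# the query is built lazily; show_query() prints the Spark SQL that would
# run on the cluster when the result is finally computed or collected
flights_tbl %>%
  group_by(tailnum) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE)) %>%
  show_query()
```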
# Machine learning with Spark
You can use `sparklyr` to fit a wide range of machine learning algorithms in Apache Spark. Rather than using `caret::train()`, you use a set of `ml_` functions depending on which algorithm you want to employ.
## Load the data
Let's continue using the Titanic dataset. First, download and install the `titanic` package which contains the data files we have been using for past machine learning exercises:
```{r titanic_install, eval = FALSE}
install.packages("titanic")
```
Next we need to load it into the local Spark cluster.
```{r load_titanic, cache = FALSE}
library(titanic)
(titanic_tbl <- copy_to(sc, titanic::titanic_train, "titanic", overwrite = TRUE))
```
## Tidy the data
You can use `dplyr` syntax to tidy and reshape data in Spark, as well as specialized functions from the [Spark machine learning library](http://spark.apache.org/docs/latest/ml-features.html).
### Spark SQL transforms
These are feature transforms (aka mutating or filtering the columns/rows) using Spark SQL. This allows you to create new features and modify existing features with `dplyr` syntax. Here let's modify 4 columns:
1. Family_Size - create a family size variable from the number of siblings/spouses (`SibSp`) and parents/children (`Parch`) aboard
1. Pclass - format passenger class as character, not numeric
1. Embarked - remove a small number of missing records
1. Age - impute missing ages with the average age
We use `sdf_register` at the end of the operation to store the table in the Spark cluster.
```{r titanic_tidy, cache = FALSE}
titanic2_tbl <- titanic_tbl %>%
  mutate(Family_Size = SibSp + Parch + 1L) %>%
  mutate(Pclass = as.character(Pclass)) %>%
  filter(!is.na(Embarked)) %>%
  mutate(Age = if_else(is.na(Age), mean(Age), Age)) %>%
  sdf_register("titanic2")
```
### Spark ML transforms
Spark also includes several functions to transform features. We can access several of them [directly through `sparklyr`](http://spark.rstudio.com/reference/sparklyr/latest/index.html). For instance, to bin `Family_Size` into a new `Family_Sizes` variable, use `ft_bucketizer`. Because this function comes from Spark, it is used within `sdf_mutate()`, not `mutate()`.
```{r titanic_tidy_ml, cache = FALSE}
titanic_final_tbl <- titanic2_tbl %>%
  mutate(Family_Size = as.numeric(Family_Size)) %>%
  sdf_mutate(
    Family_Sizes = ft_bucketizer(Family_Size, splits = c(1, 2, 5, 12))
  ) %>%
  mutate(Family_Sizes = as.character(as.integer(Family_Sizes))) %>%
  sdf_register("titanic_final")
```
> `ft_bucketizer` is equivalent to `cut` in R
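To see the parallel, here is the base R analogue (an added illustration, not in the original notes) of the `ft_bucketizer()` call above:

```{r cut_example, eval = FALSE}
# same bins as splits = c(1, 2, 5, 12): [1, 2), [2, 5), and [5, 12]
cut(c(1, 3, 6, 11),
    breaks = c(1, 2, 5, 12),
    right = FALSE,
    include.lowest = TRUE)
```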
### Train-validation split
Randomly partition the data into training/test sets.
```{r titanic_partition, cache = FALSE}
# Partition the data
partition <- titanic_final_tbl %>%
  mutate(Survived = as.numeric(Survived),
         SibSp = as.numeric(SibSp),
         Parch = as.numeric(Parch)) %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Family_Sizes) %>%
  sdf_partition(train = 0.75, test = 0.25, seed = 1234)

# Create table references
train_tbl <- partition$train
test_tbl <- partition$test
```
## Train the models
Spark ML includes several types of machine learning algorithms. We can use these algorithms to fit models using the training data, then evaluate model performance using the test data.
### Logistic regression
```{r titanic_logit}
# Model survival as a function of several predictors
ml_formula <- formula(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Family_Sizes)
# Train a logistic regression model
(ml_log <- ml_logistic_regression(train_tbl, ml_formula))
```
### Other machine learning algorithms
Run the same formula using the other machine learning algorithms. Notice that training times vary greatly between methods.
```{r titanic_models}
## Decision Tree
ml_dt <- ml_decision_tree(train_tbl, ml_formula)
## Random Forest
ml_rf <- ml_random_forest(train_tbl, ml_formula)
## Gradient Boosted Tree
ml_gbt <- ml_gradient_boosted_trees(train_tbl, ml_formula)
## Naive Bayes
ml_nb <- ml_naive_bayes(train_tbl, ml_formula)
## Neural Network
ml_nn <- ml_multilayer_perceptron(train_tbl, ml_formula, layers = c(11,15,2))
```
### Validation data
```{r titanic_validate}
# Bundle the models into a single list object
ml_models <- list(
  "Logistic" = ml_log,
  "Decision Tree" = ml_dt,
  "Random Forest" = ml_rf,
  "Gradient Boosted Trees" = ml_gbt,
  "Naive Bayes" = ml_nb,
  "Neural Net" = ml_nn
)

# Create a function for scoring
score_test_data <- function(model, data = test_tbl){
  pred <- sdf_predict(model, data)
  select(pred, Survived, prediction)
}

# Score all the models
ml_score <- map(ml_models, score_test_data)
```
## Compare results
Compare the model results. Examine performance metrics: accuracy and [area under the curve (AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). Also examine feature importance to see what features are most predictive of survival.
### Accuracy and AUC
Receiver operating characteristic (ROC) is a graphical plot that illustrates the performance of a binary classifier. It plots the true positive rate (TPR) against the false positive rate (FPR).
![From [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)](https://upload.wikimedia.org/wikipedia/commons/3/36/ROC_space-2.png)
![From [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)](https://upload.wikimedia.org/wikipedia/commons/6/6b/Roccurves.png)
The ideal model perfectly classifies all positive outcomes as true and all negative outcomes as false (i.e. TPR = 1 and FPR = 0). The line on the second graph is made by calculating predicted outcomes at different cutpoint thresholds (i.e. $.1, .2, .5, .8$) and connecting the dots. The diagonal line indicates expected true/false positive rates if you guessed at random. The area under the curve (AUC) summarizes how good the model is across these threshold points simultaneously. An area of 1 indicates that for any threshold value, the model always makes perfect predictions. This will almost never occur in real life. Good AUC values are between $.6$ and $.8$. While we cannot draw the ROC graph using Spark, we can extract the AUC values based on the predictions.
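To make the TPR/FPR trade-off concrete, here is a small hand-rolled illustration (added here, not in the original notes) using made-up predicted probabilities:

```{r roc_by_hand, eval = FALSE}
# made-up true labels and predicted probabilities
truth <- c(0, 0, 1, 1, 1, 0, 1, 0)
prob <- c(.10, .40, .35, .80, .90, .50, .70, .20)

# TPR and FPR for a given cutpoint
roc_point <- function(cutpoint) {
  pred <- as.numeric(prob > cutpoint)
  c(TPR = sum(pred == 1 & truth == 1) / sum(truth == 1),
    FPR = sum(pred == 1 & truth == 0) / sum(truth == 0))
}

# lowering the cutpoint raises both the true and false positive rates;
# connecting these points traces out the ROC curve
sapply(c(.1, .2, .5, .8), roc_point)
```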
```{r titanic_eval}
# Function for calculating accuracy
calc_accuracy <- function(data, cutpoint = 0.5){
  data %>%
    mutate(prediction = if_else(prediction > cutpoint, 1.0, 0.0)) %>%
    ml_classification_eval("prediction", "Survived", "accuracy")
}

# Calculate AUC and accuracy
perf_metrics <- data_frame(
  model = names(ml_score),
  AUC = 100 * map_dbl(ml_score, ml_binary_classification_eval, "Survived", "prediction"),
  Accuracy = 100 * map_dbl(ml_score, calc_accuracy)
)
perf_metrics

# Plot results
gather(perf_metrics, metric, value, AUC, Accuracy) %>%
  ggplot(aes(reorder(model, value), value, fill = metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Performance metrics",
       x = NULL,
       y = "Percent")
```
Overall it appears the tree-based models performed the best - they had the highest accuracy rates and AUC values.
### Feature importance
It is also interesting to compare the features that were identified by each model as being important predictors of survival. The tree-based models implement feature importance metrics (a la `randomForest::varImpPlot()`). Sex, fare, and age are some of the most important features.
```{r titanic_feature}
# Initialize results
feature_importance <- data_frame()

# Calculate feature importance
for(i in c("Decision Tree", "Random Forest", "Gradient Boosted Trees")){
  feature_importance <- ml_tree_feature_importance(sc, ml_models[[i]]) %>%
    mutate(Model = i) %>%
    mutate(importance = as.numeric(levels(importance))[importance]) %>%
    mutate(feature = as.character(feature)) %>%
    rbind(feature_importance, .)
}

# Plot results
feature_importance %>%
  ggplot(aes(reorder(feature, importance), importance, fill = Model)) +
  facet_wrap(~ Model) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Feature importance",
       x = NULL)
```
## Compare run times
The time to train a model is important. Use the following code to evaluate each model `n` times and plot the results. Notice that gradient boosted trees and neural nets take considerably longer to train than the other methods.
```{r titanic_compare_runtime}
# Number of reps per model
n <- 10

# Format model formula as character
format_as_character <- function(x){
  x <- paste(deparse(x), collapse = "")
  x <- gsub("\\s+", " ", paste(x, collapse = ""))
  x
}

# Create model statements with timers
format_statements <- function(y){
  y <- format_as_character(y[[".call"]])
  y <- gsub('ml_formula', ml_formula_char, y)
  y <- paste0("system.time(", y, ")")
  y
}

# Convert model formula to character
ml_formula_char <- format_as_character(ml_formula)

# Create n replicates of each model statement with timers
all_statements <- map_chr(ml_models, format_statements) %>%
  rep(., n) %>%
  parse(text = .)

# Evaluate all model statements
res <- map(all_statements, eval)

# Compile results
result <- data_frame(model = rep(names(ml_models), n),
                     time = map_dbl(res, function(x){as.numeric(x["elapsed"])}))

# Plot
result %>%
  ggplot(aes(time, reorder(model, time))) +
  geom_boxplot() +
  geom_jitter(width = 0.4, aes(color = model)) +
  scale_color_discrete(guide = FALSE) +
  labs(title = "Model training times",
       x = "Seconds",
       y = NULL)
```
### Run time for serial vs. parallel computing
Let's look just at the logistic regression model. Is it beneficial to estimate this using Spark, or would `glm` and a serial method work just as well?
```{r titanic_serial_parallel}
# get formulas for glm and ml_logistic_regression
logit_glm <- "glm(ml_formula, data = train_tbl, family = binomial)" %>%
  str_c("system.time(", ., ")")
logit_ml <- "ml_logistic_regression(train_tbl, ml_formula)" %>%
  str_c("system.time(", ., ")")

all_statements_serial <- c(logit_glm, logit_ml) %>%
  rep(., n) %>%
  parse(text = .)

# Evaluate all model statements
res_serial <- map(all_statements_serial, eval)

# Compile results
result_serial <- data_frame(model = rep(c("glm", "Spark"), n),
                            time = map_dbl(res_serial, function(x){as.numeric(x["elapsed"])}))

# Plot
result_serial %>%
  ggplot(aes(time, reorder(model, time))) +
  geom_boxplot() +
  geom_jitter(width = 0.4, aes(color = model)) +
  scale_color_discrete(guide = FALSE) +
  labs(title = "Model training times",
       x = "Seconds",
       y = NULL)
```
Alas, it is not beneficial in this instance. With a larger dataset, the advantage might become more apparent.
### What about cross-validation?
Where's the LOOCV? Where's the $k$-fold cross validation? Well, `sparklyr` is still under development. It doesn't allow you to do every single thing Spark can do. Spark contains [cross-validation functions](http://spark.apache.org/docs/latest/ml-tuning.html#cross-validation) - there just isn't an interface to them in `sparklyr` [yet](https://github.com/rstudio/sparklyr/issues/196). A real drag.
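In the meantime, you can hand-roll $k$-fold cross-validation with the tools `sparklyr` does expose. Here is a rough sketch (my own workaround, not from the original notes; it assumes your version of `sparklyr` provides `sdf_bind_rows()`):

```{r manual_cv, eval = FALSE}
# split the data into 5 roughly equal folds
folds <- titanic_final_tbl %>%
  mutate(Survived = as.numeric(Survived),
         SibSp = as.numeric(SibSp),
         Parch = as.numeric(Parch)) %>%
  sdf_partition(f1 = 0.2, f2 = 0.2, f3 = 0.2, f4 = 0.2, f5 = 0.2,
                seed = 1234)

# train on four folds, score the held-out fold, and average the AUCs
cv_auc <- map_dbl(seq_along(folds), function(i) {
  train <- do.call(sdf_bind_rows, folds[-i])
  fit <- ml_logistic_regression(train, ml_formula)
  sdf_predict(fit, folds[[i]]) %>%
    ml_binary_classification_eval("Survived", "prediction")
})
mean(cv_auc)
```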
# Acknowledgments {.toc-ignore}
```{r child='_ack_stat545.Rmd'}
```
* [Parallel Algorithms Advantages and Disadvantages](http://www.slideshare.net/lucky43/parallel-computing-advantages-and-disadvantages)
* Titanic machine learning example drawn from [Comparison of ML Classifiers Using Sparklyr](https://beta.rstudioconnect.com/content/1518/notebook-classification.html)
# Session Info {.toc-ignore}
```{r sessioninfo}
devtools::session_info()
```