# Appendix {-}
# Handling Large Local Data {#largelocaldata}
When the data is too large to fit in a computer's memory, we can use a big data analytics engine such as Spark on a cloud platform (see Chapter \@ref(bigdatacloudplatform)). However, even when the data fits in memory, it may be slow to read and manipulate with base R functions if it is relatively large. Some R packages can speed up the process at the cost of learning a new syntax, especially for data wrangling, while avoiding the hurdle of setting up a Spark cluster and working in an unfamiliar environment. This section presents some alternative R packages to read, write, and wrangle a data set that is relatively large but still fits in memory.
Load the R packages first:
```{r, message=FALSE}
# install packages from CRAN if you haven't
library(readr)
library(data.table)
```
## `readr`
You may be familiar with `read.csv()`, `read.table()` and `write.csv()` in base R. Here we introduce a more efficient package for reading and writing data: the `readr` package. The corresponding functions are `read_csv()`, `read_table()` and `write_csv()`. The commands look quite similar, but `readr` differs in the following respects:
1. It is roughly 10x faster. The trick is that `readr` uses C++ to process the data.
1. It doesn't change the column names. Names can start with a number, and "`.`" will not be substituted with "`_`". For example:
```{r}
read_csv("2015,2016,2017
1,2,3
4,5,6")
```
1. `readr` functions do not convert strings to factors by default, are able to parse dates and times and can automatically determine the data types in each column.
1. The killer feature, in my opinion, is that `readr` provides a **progress bar**. What makes you feel worse than waiting is not knowing how long you have to wait.
![](images/prograssbar.png)
The major function of `readr` is to turn flat files into data frames:
- `read_csv()`: reads comma delimited files
- `read_csv2()`: reads semicolon separated files (common in countries where `,` is used as the decimal mark)
- `read_tsv()`: reads tab delimited files
- `read_delim()`: reads in files with any delimiter
- `read_fwf()`: reads fixed width files. You can specify fields either by their widths with `fwf_widths()` or their positions with `fwf_positions()` (see the sketch after this list)
- `read_table()`: reads a common variation of fixed width files where columns are separated by white space
- `read_log()`: reads Apache style log files
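As a quick illustration of `read_fwf()`, here is a minimal sketch with made-up inline data; the field widths and column names are chosen only for the example:

```r
# two fields: a 3-character id (digit plus padding) and a 5-character name
dat <- read_fwf("01 apple\n02 berry",
                fwf_widths(c(3, 5), c("id", "fruit")))
print(dat)
```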
The good thing is that those functions have similar syntax. Once you learn one, the others become easy. Here we will focus on `read_csv()`.
The most important information for `read_csv()` is the path to your data:
```r
sim.dat <- read_csv("http://bit.ly/2P5gTw4")
head(sim.dat)
```
```pre
# A tibble: 6 x 19
age gender income house store_exp online_exp store_trans online_trans Q1
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
1 57 Female 1.21e5 Yes 529. 304. 2 2 4
2 63 Female 1.22e5 Yes 478. 110. 4 2 4
3 59 Male 1.14e5 Yes 491. 279. 7 2 5
4 60 Male 1.14e5 Yes 348. 142. 10 2 5
5 51 Male 1.24e5 Yes 380. 112. 4 4 4
6 59 Male 1.08e5 Yes 338. 196. 4 5 4
# ... with 10 more variables: Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>,
# Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
```
The function reads the file into R as a `tibble`. You can consider `tibble` as the next iteration of the data frame. Tibbles differ from data frames in the following respects:
- It never changes an input’s type (i.e., no more `stringsAsFactors = FALSE`!)
- It never adjusts the names of variables
- It has a refined print method that shows only the first 10 rows and all the columns that fit on the screen. You can also control the default print behavior by setting options (see the sketch below).
Refer to http://r4ds.had.co.nz/tibbles.html for more information about `tibble`.
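For example, here is a minimal sketch of adjusting the print behavior through the standard `tibble` options:

```r
# show at most 5 rows when printing a tibble
options(tibble.print_max = 5, tibble.print_min = 5)
# or control a single print call: 2 rows, all columns
print(sim.dat, n = 2, width = Inf)
```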
When you run `read_csv()` it prints out a column specification that gives the name and type of each column. To better understand how `readr` works, it is helpful to type in some toy data sets and check the results:
```{r}
dat <- read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")
print(dat)
```
You can also add comments on the top and tell R to skip those lines:
```{r}
dat <- read_csv("# I will never let you know that
# my favorite food is carrot
Date,Food,Mood
Monday,carrot,happy
Tuesday,carrot,happy
Wednesday,carrot,happy
Thursday,carrot,happy
Friday,carrot,happy
Saturday,carrot,extremely happy
Sunday,carrot,extremely happy",
skip = 2)
print(dat)
```
If you don't have column names, set `col_names = FALSE` and R will assign names "`X1`", "`X2`"... to the columns:
```{r}
dat <- read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", col_names = FALSE)
print(dat)
```
You can also pass `col_names` a character vector to be used as the column names. Try replacing `col_names = FALSE` with `col_names = c("Date", "Food", "Mood")` and see what happens.
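Here is a minimal sketch of that suggestion:

```r
dat <- read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy",
                col_names = c("Date", "Food", "Mood"))
print(dat)
```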
As mentioned before, you can use `read_csv2()` to read semicolon separated files:
```{r, message = FALSE}
dat <- read_csv2("Saturday; carrot; extremely happy \n
Sunday; carrot; extremely happy", col_names = FALSE)
print(dat)
```
Here "`\n`" is a convenient shortcut for adding a new line.
You can use `read_tsv()` to read tab delimited files:
```{r}
dat <- read_tsv("every\tman\tis\ta\tpoet\twhen\the\tis\tin\tlove\n",
col_names = FALSE)
print(dat)
```
Or more generally, you can use `read_delim()` and assign separating character:
```{r}
dat <- read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n",
delim = "|", col_names = FALSE)
print(dat)
```
Another situation you will often run into is missing values. In marketing surveys, people often use "99" to represent a missing value. You can tell R to set all observations with value "99" to missing when you read the data:
```{r}
dat <- read_csv("Q1,Q2,Q3
5, 4,99",
na = "99")
print(dat)
```
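The `na` argument also accepts a character vector, so you can flag several missing-value codes at once; a minimal sketch:

```r
# treat "99", "N/A", and empty strings as missing
dat <- read_csv("Q1,Q2,Q3
5,N/A,99",
                na = c("99", "N/A", ""))
print(dat)
```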
For writing data back to disk, you can use `write_csv()` and `write_tsv()`. Two characteristics of these functions increase the chances of the output file being read back in correctly:
- Encode strings in UTF-8
- Save dates and date-times in ISO8601 format so they are easily parsed elsewhere
For example:
```r
write_csv(sim.dat, "sim_dat.csv")
```
For other data types, you can use the following packages:
- `haven`: SPSS, Stata, and SAS data
- `readxl` and `xlsx`: Excel data (.xls and .xlsx)
- `DBI`: with a database-specific backend such as `RMySQL`, `RSQLite`, or `RPostgreSQL`, it lets you run SQL queries against a database and read the results directly into R
Some other useful materials:
- For getting data from the internet, you can refer to the book “XML and Web Technologies for Data Sciences with R”.
- [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html#Acknowledgements)
- `rio` package: https://github.com/leeper/rio
## `data.table`--- enhanced `data.frame`
What is `data.table`? It is an R package that provides an enhanced version of `data.frame`, the most commonly used data object in R. Before we move on, let's briefly review some basic characteristics and manipulations of the data frame:
- It is a set of rows and columns.
- Every row is of the same length, with one element per column
- Every column is of the same length and a single data type, but different columns can have different types
- It has characteristics of both a matrix and a list
- It uses `[]` to subset data
We will use the clothing customer data to illustrate. There are two dimensions in `[]`: the first indicates the row and the second indicates the column, separated by a comma.
```{r, results="hide"}
# read data
sim.dat <- readr::read_csv("http://bit.ly/2P5gTw4")
# subset the first two rows
sim.dat[1:2, ]
# subset the first two rows and column 3 and 5
sim.dat[1:2, c(3, 5)]
# get all rows with age > 70
sim.dat[sim.dat$age > 70, ]
# get rows with age > 68 and gender "Male"; select columns 3 and 4
sim.dat[sim.dat$age > 68 & sim.dat$gender == "Male", 3:4]
```
Remember that there are usually different ways to perform the same manipulation. For example, the following code presents three ways to calculate the average number of online transactions for males and females:
```r
# base R: tapply()
tapply(sim.dat$online_trans, sim.dat$gender, mean)
# base R: aggregate() with a formula
aggregate(online_trans ~ gender, data = sim.dat, mean)
# tidyverse: dplyr pipeline (requires library(dplyr))
sim.dat %>%
  group_by(gender) %>%
  summarise(Avg_online_trans = mean(online_trans))
```
There is no gold standard for choosing a specific function to manipulate data. The goal is to solve the real problem, not to master the tool itself, so just use whatever tool is convenient for you.
The way to use `[]` is straightforward, but the manipulations it supports are limited. If you need more complicated data reshaping or aggregation, there are other packages such as `dplyr`, `reshape2`, and `tidyr`, but their usage is not as uniform as `[]`: you often need to switch among functions. Keeping related operations together, such as subset, group, update, and join, allows for:
- concise, consistent and readable syntax irrespective of the set of operations you would like to perform to achieve your end goal
- performing data manipulation fluidly without the cognitive burden of having to change among different functions
- automatic internal optimization of operations, since the data required for each operation is known precisely
`data.table` is the package for that. If you are not familiar with other data manipulating packages and are interested in reducing programming time tremendously, then this package is for you.
Other than extending the function of `[]`, `data.table` has the following advantages:
- Offers fast import, subset, grouping, update, and joins for large data files
- It is easy to turn a data frame into a data table
- Can behave just like a data frame
You need to install and load the package first:
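```r
# install from CRAN if you haven't
# install.packages("data.table")
library(data.table)
```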
Use `data.table()` to convert the existing data frame `sim.dat` to a data table:
```{r}
dt <- data.table(sim.dat)
class(dt)
```
Calculate mean for counts of online transactions:
```{r}
dt[, mean(online_trans)]
```
You can't do the same thing using a data frame:
```r
sim.dat[,mean(online_trans)]
```
```html
Error in mean(online_trans) : object 'online_trans' not found
```
If you want to calculate the mean by group as before, set the "`by = `" argument:
```{r}
dt[ , mean(online_trans), by = gender]
```
You can group by more than one variable. For example, group by "`gender`" and "`house`":
```{r}
dt[ , mean(online_trans), by = .(gender, house)]
```
Assign column names for aggregated variables:
```{r}
dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
```
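You can also compute several summaries at once by naming them inside `.()`; a minimal sketch:

```r
# average and median counts of online transactions per group
dt[, .(avg = mean(online_trans), med = median(online_trans)),
   by = .(gender, house)]
```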
`data.table` can accomplish all operations that `aggregate()` and `tapply()` can do for a data frame.
- General setting of `data.table`
Different from a data frame, there are three arguments for a data table:
<center>
![](images/datable1.png)
</center>
It is analogous to SQL. You don't have to know SQL to learn `data.table`, but experience with SQL will help you understand it. In SQL, you select column `j` (using the `SELECT` keyword) for rows `i` (using the `WHERE` keyword), and `GROUP BY` specifies the variable used to group the observations.
<center>
![](images/rSQL.png)
</center>
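In code form, the general query structure summarized by the diagrams is:

```r
# dt[i, j, by]
#   i:  on which rows to operate   (like SQL WHERE)
#   j:  what to select or compute  (like SQL SELECT)
#   by: how to group the rows      (like SQL GROUP BY)
```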
Let's review our previous code:
```r
dt[ , mean(online_trans), by = gender]
```
The code above is equal to the following SQL:
```sql
SELECT
gender,
avg(online_trans)
FROM
sim.dat
GROUP BY
gender
```
R code:
```r
dt[ , .(avg = mean(online_trans)), by = .(gender, house)]
```
is equal to SQL:
```sql
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
GROUP BY
gender,
house
```
R code:
```r
dt[ age < 40, .(avg = mean(online_trans)), by = .(gender, house)]
```
is equal to SQL:
```sql
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
WHERE
age < 40
GROUP BY
gender,
house
```
You can see the analogy between `data.table` and `SQL`. Now let's focus on operations in data table.
- select row
```{r}
# select rows with age<20 and income > 80000
dt[age < 20 & income > 80000]
# select the first two rows
dt[1:2]
```
- select column
Selecting columns in `data.table` doesn't need `$`:
```{r}
# select column "age" but return it as a vector;
# the argument for row is empty, so the result
# will include all observations
ans <- dt[, age]
head(ans)
```
To return a `data.table` object, put column names in `list()`:
```r
# Select age and online_exp columns
# and return as a data.table instead
ans <- dt[, list(age, online_exp)]
head(ans)
```
Or you can put column names in `.()`, an alias for `list()`:
```r
ans <- dt[, .(age, online_exp)]
```
To select all columns from "`age`" to "`income`":
```{r}
ans <- dt[, age:income, with = FALSE]
head(ans,2)
```
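In recent versions of `data.table`, you can usually drop `with = FALSE` when selecting a range of columns; a minimal sketch, assuming a reasonably current version of the package:

```r
# select the columns from age to income directly
ans <- dt[, age:income]
head(ans, 2)
```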
Delete columns using `-` or `!`:
```r
# delete columns from age to online_exp
ans <- dt[, -(age:online_exp), with = FALSE]
ans <- dt[, !(age:online_exp), with = FALSE]
```
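Note that `-` and `!` return a new data table without those columns. To remove a column from `dt` itself, `data.table` provides the `:=` operator, which deletes by reference; a minimal sketch:

```r
# remove the column "age" in place (no copy is made)
dt[, age := NULL]
```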
- tabulation
In a data table, `.N` means count.
```{r}
# row count
dt[, .N]
```
If you specify a grouping variable, it will count by group:
```{r}
# counts by gender
dt[, .N, by= gender]
# for those younger than 30, count by gender
dt[age < 30, .(count=.N), by= gender]
```
Order the table:
```r
# get records with the highest 5 online expense:
head(dt[order(-online_exp)],5)
```
```pre
age gender income house store_exp online_exp store_trans ...
1: 40 Female 217599.7 No 7023.684 9479.442 10
2: 41 Female NA Yes 3786.740 8638.239 14
3: 36 Male 228550.1 Yes 3279.621 8220.555 8
4: 31 Female 159508.1 Yes 5177.081 8005.932 11
5: 43 Female 190407.4 Yes 4694.922 7875.562 6
...
```
Since a data table keeps some characteristics of a data frame, they share some operations:
```r
dt[order(-online_exp)][1:5]
```
You can also order the table by more than one variable. The following code orders the table by `gender`, then, within `gender`, by descending `online_exp`:
```r
dt[order(gender, -online_exp)][1:5]
```
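`data.table` also provides `setorder()`, which sorts the rows by reference instead of creating a sorted copy; a minimal sketch:

```r
# reorder dt in place by gender, then by descending online_exp
setorder(dt, gender, -online_exp)
head(dt, 5)
```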
- Use `fread()` to import data
Besides `read.csv()` in base R, we have introduced `read_csv()` in `readr`. `read_csv()` is much faster and provides a progress bar, which makes the user feel much better (at least it makes me feel better). `fread()` in `data.table` further increases the efficiency of reading data. The following are three examples of reading the same data file `topic.csv`. The file includes text data scraped from an agriculture forum, with 209,670 rows and 6 columns:
```r
system.time(topic <- read.csv("http://bit.ly/2zam5ny"))
```
```html
user system elapsed
3.561 0.051 4.888
```
```r
system.time(topic <- readr::read_csv("http://bit.ly/2zam5ny"))
```
```html
user system elapsed
0.409 0.032 2.233
```
```r
system.time(topic <- data.table::fread("http://bit.ly/2zam5ny"))
```
```html
user system elapsed
0.276 0.096 1.117
```
It is clear that `read_csv()` is much faster than `read.csv()`, and `fread()` is a little faster than `read_csv()`. As the file size increases, the difference becomes more significant. Note that `fread()` reads the file as a `data.table` by default.
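If you prefer to keep working with a plain data frame, `fread()` accepts a `data.table` argument that can be turned off; a minimal sketch:

```r
# return a data.frame instead of a data.table
topic <- data.table::fread("http://bit.ly/2zam5ny", data.table = FALSE)
class(topic)
```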
# R code for data simulation
## Customer Data for Clothing Company {#appendixdata1}
The simulation is not very straightforward, so we will break it into three parts:
1. Define data structure: variable names, variable distribution, customer segment names, segment size
1. Variable distribution parameters: mean and standard deviation
1. Iterate across segments and variables. Simulate data according to specific parameters assigned
Organizing the code this way makes it easy to change specific parts of the simulation. For example, if we want to change the distribution of one variable, we can just change the corresponding part of the code.
Here is code to define data structure:
```r
# set a random number seed to
# make the process repeatable
set.seed(12345)
# define the number of observations
ncust <- 1000
# create a data frame for simulated data
seg_dat <- data.frame(id = as.factor(c(1:ncust)))
# assign the variable names
vars <- c("age", "gender", "income", "house", "store_exp",
"online_exp", "store_trans", "online_trans")
# assign distribution for each variable
vartype <- c("norm", "binom", "norm", "binom", "norm", "norm",
"pois", "pois")
# names of 4 segments
group_name <- c("Price", "Conspicuous", "Quality", "Style")
# size of each segment
group_size <- c(250, 200, 200, 350)
```
The next step is to define the variable distribution parameters. There are 4 customer segments and 8 variables, and different segments correspond to different parameter values. Let's store the means and standard deviations in two 4 × 8 matrices:
```r
# matrix for mean
mus <- matrix( c(
# Price
60, 0.5, 120000,0.9, 500,200,5,2,
# Conspicuous
40, 0.7, 200000,0.9, 5000,5000,10,10,
# Quality
36, 0.5, 70000, 0.4, 300, 2000,2,15,
# Style
25, 0.2, 90000, 0.2, 200, 2000,2,20),
ncol=length(vars), byrow=TRUE)
```
```r
# matrix for standard deviation
sds<- matrix( c(
# Price
3,NA,8000,NA,100,50,NA,NA,
# Conspicuous
5,NA,50000,NA,1000,1500,NA,NA,
# Quality
7,NA,10000,NA,50,200,NA,NA,
# Style
2,NA,5000,NA,10,500,NA,NA),
ncol=length(vars), byrow=TRUE)
```
Now we are ready to simulate data using the parameters defined above:
```r
# simulate non-survey data
sim.dat <- NULL
set.seed(2016)
# loop on customer segment (i)
for (i in seq_along(group_name)) {
# add this line in order to monitor the process
cat(i, group_name[i], "\n")
# create an empty data frame to store relevant data
seg <- data.frame(matrix(NA, nrow = group_size[i],
ncol = length(vars)))
# Simulate data within segment i
for (j in seq_along(vars)) {
# loop on every variable (j)
if (vartype[j] == "norm") {
# simulate normal distribution
seg[, j] <- rnorm(group_size[i], mean = mus[i,
j], sd = sds[i, j])
} else if (vartype[j] == "pois") {
# simulate poisson distribution
seg[, j] <- rpois(group_size[i], lambda = mus[i,
j])
} else if (vartype[j] == "binom") {
# simulate binomial distribution
seg[, j] <- rbinom(group_size[i], size = 1,
prob = mus[i, j])
} else {
# if the distribution name is not one of the above, stop
# and return a message
stop("Don't have type:", vartype[j])
}
}
sim.dat <- rbind(sim.dat, seg)
}
```
Now let's edit the simulated data a little by adding labels to the 0/1 binomial variables:
```r
# assign variable names
names(sim.dat) <- vars
# assign factor levels to segment variable
sim.dat$segment <- factor(rep(group_name, times = group_size))
# recode gender and house variable
sim.dat$gender <- factor(sim.dat$gender, labels = c("Female",
"Male"))
sim.dat$house <- factor(sim.dat$house, labels = c("No",
"Yes"))
# store_trans and online_trans are at least 1
sim.dat$store_trans <- sim.dat$store_trans + 1
sim.dat$online_trans <- sim.dat$online_trans + 1
# age is integer
sim.dat$age <- floor(sim.dat$age)
```
In the real world, data always includes some noise, such as missing values and wrong imputations. So we will add some noise to the data:
```r
# add missing values
idxm <- as.logical(rbinom(ncust, size = 1, prob = sim.dat$age/200))
sim.dat$income[idxm] <- NA
# add wrong imputations and outliers
set.seed(123)
idx <- sample(1:ncust, 5)
sim.dat$age[idx[1]] <- 300
sim.dat$store_exp[idx[2]] <- -500
sim.dat$store_exp[idx[3:5]] <- c(50000, 30000, 30000)
```
So far we have created part of the data. You can check it using `summary(sim.dat)`. Next, we will move on to simulate survey data.
```r
# number of survey questions
nq <- 10
# mean matrix for different segments
mus2 <- matrix( c( 5,2,1,3,1,4,1,4,2,4, # Price
1,4,5,4,4,4,4,1,4,2, # Conspicuous
5,2,3,4,3,2,4,2,3,3, # Quality
3,1,1,2,4,1,5,3,4,2), # Style
ncol=nq, byrow=TRUE)
# assume the standard deviation is 0.2 for all
sd2 <- 0.2
sim.dat2 <- NULL
set.seed(1000)
# loop for customer segment (i)
for (i in seq_along(group_name)) {
# to check progress, uncomment the following line:
# cat(i, group_name[i], "\n")
# create an empty data frame to store data
seg <- data.frame(matrix(NA, nrow = group_size[i],
ncol = nq))
# simulate data within segment
for (j in 1:nq) {
# simulate normal distribution
res <- rnorm(group_size[i], mean = mus2[i,
j], sd = sd2)
# set upper and lower limit
res[res > 5] <- 5
res[res < 1] <- 1
# convert continuous values to discrete integers
seg[, j] <- floor(res)
}
sim.dat2 <- rbind(sim.dat2, seg)
}
names(sim.dat2) <- paste("Q", 1:10, sep = "")
sim.dat <- cbind(sim.dat, sim.dat2)
sim.dat$segment <- factor(rep(group_name, times = group_size))
```
<!--
## Customer Satisfaction Survey Data from Airline Company {#appendixdata2}
```r
# Create a matrix of factor loadings This pattern
# is called bifactor because it has a general
# factor for separate components. For example,
# 'Ease of making reservation' has general factor
# loading 0.33, specific factor loading 0.58 The
# outcome variables are formed as combinations of
# these general and specific factors
loadings <- matrix(c (
# Ticketing
.33, .58, .00, .00, # Ease of making reservation
.35, .55, .00, .00, # Availability of preferred seats
.30, .52, .00, .00, # Variety of flight options
.40, .50, .00, .00, # Ticket prices
# Aircraft
.50, .00, .55, .00, # Seat comfort
.41, .00, .51, .00, # Roominess of seat area
.45, .00, .57, .00, # Availability of Overhead
.32, .00, .54, .00, # Cleanliness of aircraft
# Service
.35, .00, .00, .50, # Courtesy of flight attendant
.38, .00, .00, .57, # Friendliness
.60, .00, .00, .50, # Helpfulness
.52, .00, .00, .58, # Food and drinks
# General
.43, .10, .30, .30, # Overall satisfaction
.35, .50, .40, .20, # Purchase again
.25, .50, .50, .20), # Willingness to recommend
nrow=15,ncol=4, byrow=TRUE)
# Matrix multiplication produces the correlation
# matrix except for the diagonal
cor_matrix <- loadings %*% t(loadings)
# Diagonal set to ones
diag(cor_matrix) <- 1
# use the mvtnorm package to randomly generate a
# data set with a given correlation pattern
library(mvtnorm)
# mean vectors of the 3 airline companies
mu1 = c(5, 6, 5, 6, 7, 8, 6, 7, 5, 5, 5, 5, 6, 6, 6)
mu2 = c(3, 3, 2, 3, 5, 4, 5, 6, 8, 8, 8, 8, 3, 3, 3)
mu3 = c(2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8)
# set random seed
set.seed(123456)
# respondent ID
resp.id <- 1:1000
library(MASS)
rating1 <- mvrnorm(length(resp.id), mu = mu1, Sigma = cor_matrix)
rating2 <- mvrnorm(length(resp.id), mu = mu2, Sigma = cor_matrix)
rating3 <- mvrnorm(length(resp.id), mu = mu3, Sigma = cor_matrix)
# truncates scale to be between 1 and 9
rating1[rating1 > 9] <- 9
rating1[rating1 < 1] <- 1
rating2[rating2 > 9] <- 9
rating2[rating2 < 1] <- 1
rating3[rating3 > 9] <- 9
rating3[rating3 < 1] <- 1
# Round to single digit
rating1 <- data.frame(round(rating1, 0))
rating2 <- data.frame(round(rating2, 0))
rating3 <- data.frame(round(rating3, 0))
rating1$ID <- resp.id
rating2$ID <- resp.id
rating3$ID <- resp.id
rating1$Airline <- rep("AirlineCo.1", length(resp.id))
rating2$Airline <- rep("AirlineCo.2", length(resp.id))
rating3$Airline <- rep("AirlineCo.3", length(resp.id))
rating <- rbind(rating1, rating2, rating3)
# assign names to the variables in the data frame
names(rating) <- c("Easy_Reservation", "Preferred_Seats",
"Flight_Options", "Ticket_Prices", "Seat_Comfort",
"Seat_Roominess", "Overhead_Storage", "Clean_Aircraft",
"Courtesy", "Friendliness", "Helpfulness", "Service",
"Satisfaction", "Fly_Again", "Recommend", "ID",
"Airline")
```
-->
## Swine Disease Breakout Data {#appendixdata3}
```r
# simulation code; each iteration j writes five files (shown for j = 1):
#   sim1_da1.csv: dummy variables with response, without group indicator
#   sim1_da2.csv: dummy variables with group indicator, without response
#   sim1_da3.csv: dummy variables without response or group indicator
#   sim1.csv: the raw simulated data (answers A/B/C)
#   dummy.sim1.csv: all dummy variables, including the baseline level
nf <- 800
for (j in 1:20) {
set.seed(19870 + j)
x <- c("A", "B", "C")
sim.da1 <- NULL
for (i in 1:nf) {
# each farm answers 120 questions with categories A/B/C
sim.da1 <- rbind(sim.da1, sample(x, 120, replace = TRUE))
}
sim.da1 <- data.frame(sim.da1)
col <- paste("Q", 1:120, sep = "")
row <- paste("Farm", 1:nf, sep = "")
colnames(sim.da1) <- col
rownames(sim.da1) <- row
# use class.ind() function in nnet package to encode
# dummy variables
library(nnet)
dummy.sim1 <- NULL
for (k in 1:ncol(sim.da1)) {
tmp = class.ind(sim.da1[, k])
colnames(tmp) = paste(col[k], colnames(tmp))
dummy.sim1 = cbind(dummy.sim1, tmp)
}
dummy.sim1 <- data.frame(dummy.sim1)
# set 'C' as the baseline: delete the baseline dummy variables
base.idx <- 3 * c(1:120)
dummy1 <- dummy.sim1[, -base.idx]
# simulate the linear predictor (link function) for one
# value of the effect size r at a time; here r = 0.1
# (other settings for r are commented out below)
s1 <- c(rep(c(1/10, 0, -1/10), 40),
rep(c(1/10, 0, 0), 40),
rep(c(0, 0, 0), 40))
link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/10
# Other settings ---------------------------
# r = 0.25
# s1 <- c(rep(c(1/4, 0, -1/4), 40),
# rep(c(1/4, 0, 0), 40),
# rep(c(0, 0, 0), 40))
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/4
# r = 0.5
# s1 <- c(rep(c(1/2, 0, -1/2), 40),
# rep(c(1/2, 0, 0), 40),
# rep(c(0, 0, 0), 40))
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/2
# r = 1
# s1 <- c(rep(c(1, 0, -1), 40),
# rep(c(1, 0, 0), 40),
# rep(c(0, 0, 0), 40))
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3
# r = 2
# s1 <- c(rep(c(2, 0, -2), 40),
# rep(c(2, 0, 0), 40),
# rep(c(0, 0, 0), 40))
#
# link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/0.5
# calculate the outbreak probability
hp1 <- exp(link1)/(exp(link1) + 1)
# based on the probability hp1, simulate response
# variable: res
res <- rep(9, nf)
for (i in 1:nf) {
res[i] <- sample(c(1, 0), 1, prob = c(hp1[i], 1 -
hp1[i]))
}
# da1 with response variable, without group indicator
# da2 without response variable, with group indicator
# da3 without response variable, without group indicator
dummy1$y <- res
da1 <- dummy1
y <- da1$y
ind <- NULL
for (i in 1:120) {
ind <- c(ind, rep(i, 2))
}
da2 <- rbind(da1[, 1:240], ind)
da3 <- da1[, 1:240]
# save simulated data
write.csv(da1, paste("sim", j, "_da", 1, ".csv", sep = ""),
row.names = F)
write.csv(da2, paste("sim", j, "_da", 2, ".csv", sep = ""),
row.names = F)
write.csv(da3, paste("sim", j, "_da", 3, ".csv", sep = ""),
row.names = F)
write.csv(sim.da1, paste("sim", j, ".csv", sep = ""),
row.names = F)
write.csv(dummy.sim1, paste("dummy.sim", j, ".csv",
sep = ""), row.names = F)
}
```