---
title: Data Mining Analysis of Airbnb Rental Properties in Monserrat Buenos Aires
  Argentina
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
    encoding=encoding,
    output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
author: "Putranegara Riauwindu"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
This GitHub repository contains a data mining analysis of Airbnb rental properties in Monserrat, Buenos Aires, Argentina. The analysis delivers two key outputs to support decision-making for property owners and prospective tenants.
**1. Property Descriptive Analytics:**
- Property Overview: Summary statistics, visualizations, mapping, and word cloud analysis.
- Clustering: Grouping of properties based on similar characteristics.
**2. Property Predictive Analytics:**
- Price Prediction Model: A model to forecast property prices.
- Amenities Prediction Model: Predicting available amenities for a property.
- Property Feature Prediction Model: Predictive modeling of property features.
- Review Score Prediction Model: Forecasting review scores for properties.
These comprehensive data mining analysis results serve as valuable assets for property owners and prospective tenants in Monserrat. By leveraging data-driven insights, users can make informed decisions and enhance their ability to choose suitable rental properties.
Explore this repository to access the analysis code, datasets, and detailed documentation. Make data-driven choices for your property investments or find the perfect Airbnb rental in Monserrat with confidence.
## Importing Relevant Libraries
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(readr)
library(naniar)
library(jsonlite)
library(tidyr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(scales)
library(ggbeeswarm)
library(rpart)
library(rpart.plot)
library(corrplot)
library(caret)
library(e1071)
library(forecast)
library(FNN)
```
## Importing Dataset
```{r message=FALSE}
master_data <- read_csv("buenos.csv")
# Working copy; master_data stays untouched for the host and description tables
data <- master_data
```
## Step I: Data Preparation & Exploration
### 1. Missing Values (including Data Cleaning and Manipulation)
Explanations of what we did and why are given at each step below, with a summary at the end of the steps.
1. Filtering to Monserrat Neighborhood
```{r}
data <- data %>%
filter(neighbourhood_cleansed=='Monserrat')
```
2. Checking for Missing Values
```{r}
miss_var_summary(data)
```
3. Removing variables with more than 50% missing values: neighbourhood_group_cleansed, bathrooms, calendar_updated, and license
```{r}
data <- data%>%
select(-neighbourhood_group_cleansed, -bathrooms, -calendar_updated, -license)
```
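The 50% cutoff above was applied by naming the columns by hand. As a minimal sketch, assuming a small made-up data frame (`toy_listings` and its columns are hypothetical and not part of the Airbnb data), the same rule could be applied programmatically with base R:

```{r}
# Toy data frame (hypothetical) with different shares of missing values
toy_listings <- data.frame(
  id      = 1:4,
  license = c(NA, NA, NA, "A1"),  # 75% missing
  beds    = c(1, NA, 2, 3)        # 25% missing
)
# Share of missing values per column
toy_pct_missing <- colMeans(is.na(toy_listings))
# Keep only the columns with at most 50% missing values
toy_kept <- toy_listings[, toy_pct_missing <= 0.5, drop = FALSE]
names(toy_kept)
```

Computing the share from the data keeps the threshold explicit, so the set of dropped columns adjusts on its own if the source file changes.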
4. Checking for other variables that might not be useful for this particular analysis
```{r}
head(data)
```
- Removing the following variables from the dataset:
listing_url, scrape_id, last_scraped, source, description, neighborhood_overview, picture_url, host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calendar_last_scraped, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms, first_review, and last_review.
- Collecting the descriptive variables and the URL columns into their own data frame, which can later be left-joined on the id column.
- Collecting the host information into its own data frame, which can later be left-joined on the id column.
```{r}
# Removing non-essential variable from main dataset
data <- data %>%
select(- listing_url, - scrape_id, - last_scraped, - source, - description, - neighborhood_overview, - picture_url, - host_id, - host_url, - host_name, - host_since, - host_location, - host_about, - host_response_time, - host_response_rate, - host_acceptance_rate, - host_is_superhost, - host_thumbnail_url, - host_picture_url, -host_neighbourhood, - host_listings_count, - host_total_listings_count, - host_verifications, - host_has_profile_pic, - host_identity_verified, - calendar_last_scraped, -calculated_host_listings_count, -calculated_host_listings_count_entire_homes, -calculated_host_listings_count_private_rooms, -calculated_host_listings_count_shared_rooms, -first_review, -last_review)
# Creating host related information dataset
host <- master_data %>%
filter(neighbourhood_cleansed=="Monserrat") %>%
select(id, host_id, host_url, host_name, host_since, host_location, host_about,
host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost,
host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count,
host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified,
calculated_host_listings_count, calculated_host_listings_count_entire_homes,
calculated_host_listings_count_private_rooms,
calculated_host_listings_count_shared_rooms)
# Creating Description information dataset
desc <- master_data %>%
filter(neighbourhood_cleansed=="Monserrat")%>%
select(id, description, neighborhood_overview, picture_url, listing_url)
```
5. Checking for missing value in the main dataset
```{r}
miss_var_summary(data)
```
Based on the above information, the following inferences and decisions were made:
1. neighbourhood: this variable adds no value to the analysis because it duplicates the information in "neighbourhood_cleansed". It will be removed.
2. All review scores: observations with missing review scores will be removed. Reviews are subjective user input, so imputing them would introduce bias into the dataset.
3. reviews_per_month: observations with missing values will be removed, as imputing them might introduce bias into the dataset.
4. bedrooms, beds, and bathrooms_text: observations with missing values will be removed. These values are property-specific, and imputing them could misrepresent the characteristics of the property.
```{r}
# Removing neighbourhood column
data <- data %>%
select (-neighbourhood)
# Removing missing observations from above-mentioned variables
data <- subset(data, complete.cases(review_scores_accuracy,
review_scores_checkin,
review_scores_cleanliness,
review_scores_communication,
review_scores_location,
review_scores_value,
review_scores_rating,
reviews_per_month,
bedrooms,
beds,
bathrooms_text))
```
6. Manipulating bathroom data
```{r}
# extract the numerical value from the bathrooms_text variable
data$bathrooms <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text))
# create a new variable to indicate whether the bathroom is shared or not
data$shared_bathroom <- ifelse(grepl("shared", data$bathrooms_text, ignore.case = TRUE), "Yes", "No")
# handle cases where bathrooms_text is "shared bath" or missing
data$bathrooms[grepl("shared", data$bathrooms_text, ignore.case = TRUE) |
is.na(data$bathrooms_text)] <- NA
# handle cases where bathrooms_text is "0 bath" or "0.5 bath"
data$bathrooms[data$bathrooms == 0] <- 0.5
data$bathrooms[data$bathrooms == 0.5 & grepl("shared", data$bathrooms_text, ignore.case = TRUE)] <- NA
# handle cases where bathrooms_text is "X shared bath" or "X.X shared bath"
# (the same index is used on both sides so the replacement lengths always match)
shared_idx <- grepl("shared", data$bathrooms_text, ignore.case = TRUE) &
  !grepl("0\\.5", data$bathrooms_text) &
  !is.na(data$bathrooms_text)
data$bathrooms[shared_idx] <- as.numeric(gsub("[^[:digit:]./]", "", data$bathrooms_text[shared_idx]))
# replace missing values with the median number of bathrooms
data$bathrooms[is.na(data$bathrooms)] <- median(data$bathrooms, na.rm = TRUE)
```
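The parsing rules above are easier to follow on a few example strings. The following is a minimal sketch of an alternative parse, not the exact logic of the chunk above: `toy_bath_text` is made up, and counting a "half-bath" label as 0.5 bathrooms is an assumption.

```{r}
# Toy strings (hypothetical) in the style of bathrooms_text
toy_bath_text <- c("1 bath", "1.5 baths", "2 shared baths", "Shared half-bath", "0 baths")
# Pull the leading number out of each string; strings without one become NA
toy_bath_num <- suppressWarnings(as.numeric(sub("^([0-9.]+).*$", "\\1", toy_bath_text)))
# Assumption: a "half-bath" label counts as 0.5 bathrooms
toy_bath_num[grepl("half-bath", toy_bath_text, ignore.case = TRUE)] <- 0.5
# Flag shared bathrooms separately from the count
toy_bath_shared <- grepl("shared", toy_bath_text, ignore.case = TRUE)
data.frame(text = toy_bath_text, bathrooms = toy_bath_num, shared = toy_bath_shared)
```

Keeping the count and the shared flag in separate columns means neither has to be overwritten to encode the other.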
7. Manipulating amenities data
```{r}
# split the string column into a list column
data$amenities_list <- lapply(data$amenities, jsonlite::fromJSON)
# specify the maximum length of the list
max_len <- max(lengths(data$amenities_list))
# pad shorter lists with NA values
data$amenities_list <- lapply(data$amenities_list, `length<-`, max_len)
# convert the list column to wide format
data <- unnest_wider(data, col = amenities_list, names_sep = "_")
# converting all amenities columns into categorical
amenity_cols <- grep("^amenities_list_", names(data), value = TRUE)
data[amenity_cols] <- lapply(data[amenity_cols], as.factor)
```
8. Merging the previously split datasets into a new merged dataset
```{r}
merged_data <- left_join(data,host, by='id')
merged_data <- left_join(merged_data, desc, by='id')
```
9. Grouping the amenities for simplification
```{r}
merged_data$Kitchen <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Kitchen", x)) > 0, 1, 0) })
merged_data$Wifi <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Wifi", x)) > 0, 1, 0) })
merged_data$Air_conditioning <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Air conditioning", x)) > 0, 1, 0) })
merged_data$Elevator <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Elevator", x)) > 0, 1, 0) })
merged_data$Dishes_and_silverware <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Dishes and silverware", x)) > 0, 1, 0) })
merged_data$Washer <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Washer", x)) > 0, 1, 0) })
merged_data$Body_soap <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Body soap", x)) > 0, 1, 0) })
merged_data$Microwave <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Microwave", x)) > 0, 1, 0) })
merged_data$Paid_parking_off_premises <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("Paid parking off premises", x)) > 0, 1, 0) })
merged_data$TV <- apply(merged_data[, 41:107], 1, function(x) { ifelse(sum(grepl("TV", x)) > 0, 1, 0) })
```
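The indicator columns above depend on the amenity columns sitting at positions 41 to 107. As a minimal sketch of a position-independent alternative, the flags could be derived directly from the parsed JSON. The helper `has_amenity()` and the strings in `toy_amenities` are hypothetical and not part of the original analysis.

```{r}
library(jsonlite)
# Toy amenities strings (hypothetical) in the JSON-array format of the amenities column
toy_amenities <- c('["Wifi", "Kitchen", "TV"]', '["Elevator", "Wifi"]', '[]')
# Parse each JSON string into a vector of amenity names
toy_parsed <- lapply(toy_amenities, jsonlite::fromJSON)
# Hypothetical helper: 1 if any amenity of a listing contains the pattern, else 0
has_amenity <- function(parsed, pattern) {
  vapply(parsed, function(x) as.integer(any(grepl(pattern, x, fixed = TRUE))), integer(1))
}
has_amenity(toy_parsed, "Wifi")
has_amenity(toy_parsed, "Kitchen")
```

Because the helper works on the parsed list, it does not need to know how many amenity columns the widest listing produced.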
10. Grouping Availability information for simplification
Availability was binned into two binary variables:
- short term: (availability_30 + availability_60 + availability_90)/3. If the value is at or above the mean of this short-term column, the flag is 1; otherwise it is 0.
- long term: availability_365. If the value is at or above the mean of the availability_365 column, the flag is 1; otherwise it is 0.
A value of 1 means that the property has availability for that short or long term, while 0 means that it does not.
Two further columns were created in this step to aid the analysis:
- years: the number of years elapsed between the host's host_since date and 2023-05-05.
- total_amenities: the number of the grouped amenities above that each listing offers.
```{r}
merged_data <- merged_data %>%
  mutate(mean_short = (availability_30 + availability_60 + availability_90) / 3) %>%
  mutate(short_term_availability = ifelse(mean_short < mean(mean_short), 0, 1)) %>%
  mutate(long_term_availability = ifelse(availability_365 < mean(availability_365), 0, 1)) %>%
  mutate(start_date = as.Date(host_since)) %>%
  mutate(end_date = as.Date("2023-05-05")) %>%
  # request days explicitly so the division by 365 below is always valid
  mutate(years = as.numeric(difftime(end_date, start_date, units = "days"))) %>%
  mutate(years = years / 365) %>%
  mutate(total_amenities = Kitchen + Wifi + Air_conditioning + Elevator + Dishes_and_silverware + Washer +
           Body_soap + Microwave + Paid_parking_off_premises)
```
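To make the binning rule concrete, here is a minimal sketch on three made-up listings (`toy_avail` and its values are hypothetical). It follows the same comparison as the chunk above: values below the column mean map to 0, all others to 1.

```{r}
# Toy availability figures (hypothetical)
toy_avail <- data.frame(
  availability_30  = c(0, 10, 30),
  availability_60  = c(0, 20, 60),
  availability_90  = c(0, 30, 90),
  availability_365 = c(50, 100, 360)
)
# Average of the three short-horizon availability counts
toy_avail$mean_short <- with(toy_avail, (availability_30 + availability_60 + availability_90) / 3)
# 0 when below the column mean, 1 otherwise
toy_avail$short_term_availability <- ifelse(toy_avail$mean_short < mean(toy_avail$mean_short), 0, 1)
toy_avail$long_term_availability <- ifelse(toy_avail$availability_365 < mean(toy_avail$availability_365), 0, 1)
toy_avail
```

Note that a mean-based cutoff is relative to the sample, so the same listing could be flagged differently if the set of listings changes.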
11. Housekeeping on host data, price data, and property/room type data
```{r}
# Remove "N/A" value in the host data
merged_data <- subset(merged_data, host_response_time != "N/A")
merged_data <- subset(merged_data, host_response_rate != "N/A")
merged_data <- subset(merged_data, host_acceptance_rate != "N/A")
# Converting host response rate and acceptance rate into numeric
merged_data$host_response_rate <- as.numeric(gsub("%", "", merged_data$host_response_rate))/100
merged_data$host_acceptance_rate <- as.numeric(gsub("%", "", merged_data$host_acceptance_rate))/100
# Preparing price data
merged_data$price <- gsub("\\$|,", "", merged_data$price)
merged_data$price <- as.numeric(merged_data$price)
# Converting room and property type data into categorical
merged_data$property_type <- as.factor(merged_data$property_type)
merged_data$room_type <- as.factor(merged_data$room_type)
```
Observations with an "N/A" value in host_response_time, host_response_rate, or host_acceptance_rate were removed because they make up a small proportion of the dataset and imputing them might introduce bias.
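The string clean-up in this step can be checked on a few example values. A minimal sketch, assuming made-up inputs (`toy_price_text` and `toy_rate_text` are hypothetical) in the style of the raw price and rate columns:

```{r}
# Toy strings (hypothetical) in the style of the raw price and rate columns
toy_price_text <- c("$1,250.00", "$80.00")
toy_rate_text  <- c("95%", "N/A", "100%")
# Strip the dollar sign and thousands separator, then convert to numeric
toy_price_num <- as.numeric(gsub("\\$|,", "", toy_price_text))
# Treat "N/A" as missing before stripping the percent sign
toy_rate_text[toy_rate_text == "N/A"] <- NA
toy_rate_num <- as.numeric(gsub("%", "", toy_rate_text)) / 100
toy_price_num
toy_rate_num
```

Recoding "N/A" to a real missing value first, as in this sketch, keeps the conversion free of coercion warnings and leaves the choice between dropping and imputing to a later step.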
12. Converting variables into categorical
```{r}
merged_data$room_type <- as.factor(merged_data$room_type)
merged_data$instant_bookable <- as.factor(merged_data$instant_bookable)
merged_data$shared_bathroom <- as.factor(merged_data$shared_bathroom)
merged_data$host_response_time <- as.factor(merged_data$host_response_time)
merged_data$host_is_superhost <- as.factor(merged_data$host_is_superhost)
merged_data$host_identity_verified <- as.factor(merged_data$host_identity_verified)
merged_data$Kitchen <- as.factor(merged_data$Kitchen)
merged_data$Wifi <- as.factor(merged_data$Wifi)
merged_data$Air_conditioning <- as.factor(merged_data$Air_conditioning)
merged_data$Elevator <- as.factor(merged_data$Elevator)
merged_data$Dishes_and_silverware <- as.factor(merged_data$Dishes_and_silverware)
merged_data$Washer <- as.factor(merged_data$Washer)
merged_data$Body_soap <- as.factor(merged_data$Body_soap)
merged_data$Microwave <- as.factor(merged_data$Microwave)
merged_data$Paid_parking_off_premises <- as.factor(merged_data$Paid_parking_off_premises)
merged_data$short_term_availability <- as.factor(merged_data$short_term_availability)
merged_data$long_term_availability <- as.factor(merged_data$long_term_availability)
```
13. Selecting columns that we want to focus on
```{r}
data_new <- merged_data %>%
select(id, name, latitude, longitude, property_type, room_type, price,
accommodates, bedrooms, beds,bathrooms,shared_bathroom, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin,
review_scores_communication, review_scores_location, review_scores_value,
instant_bookable, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware,
Washer, Body_soap, Microwave, Paid_parking_off_premises,
total_amenities, short_term_availability,
long_term_availability, years, host_id, host_response_time,
host_response_rate, host_acceptance_rate, host_is_superhost, host_identity_verified)
```
14. Exporting the data to CSV format to share with the rest of the team
```{r}
write.csv(data_new, file = "data_new.csv", row.names = FALSE)
```
TLDR:
We removed several variables that we believed would not add much value to the analysis. We also removed observations with "N/A" or missing values because we believed the data could not be imputed without introducing significant bias. Finally, we did some feature engineering on several variables to simplify the modeling and analysis.
### 2. Summary Statistics
For the Airbnb data on the Monserrat neighborhood, it is useful to look at the following variables:
- Price
- Bedrooms
- Bathrooms: Private and Shared
- Accommodates
- Overall Review Scores
- Total Amenities
broken down by room type. Below are the summary statistics for each of these variables.
1. Summary Statistics: Price
```{r}
price_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_price = mean(price, na.rm = TRUE),
sd_price = sd(price, na.rm = TRUE),
median_price = median(price, na.rm = TRUE),
min_price = min(price, na.rm = TRUE),
max_price = max(price, na.rm = TRUE))
price_stats
```
2. Summary Statistics: Bathrooms Private
```{r}
bathrooms_private_stats <- data_new %>%
filter(shared_bathroom=="No") %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bathrooms = mean(bathrooms, na.rm = TRUE),
sd_bathrooms = sd(bathrooms, na.rm = TRUE),
median_bathrooms = median(bathrooms, na.rm = TRUE),
min_bathrooms = min(bathrooms, na.rm = TRUE),
max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_private_stats
```
2. Summary Statistics: Bathrooms Shared
```{r}
bathrooms_shared_stats <- data_new %>%
filter(shared_bathroom=="Yes") %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bathrooms = mean(bathrooms, na.rm = TRUE),
sd_bathrooms = sd(bathrooms, na.rm = TRUE),
median_bathrooms = median(bathrooms, na.rm = TRUE),
min_bathrooms = min(bathrooms, na.rm = TRUE),
max_bathrooms = max(bathrooms, na.rm = TRUE))
bathrooms_shared_stats
```
3. Summary Statistics: Bedrooms
```{r}
bedrooms_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_bedrooms = mean(bedrooms, na.rm = TRUE),
sd_bedrooms = sd(bedrooms, na.rm = TRUE),
median_bedrooms = median(bedrooms, na.rm = TRUE),
min_bedrooms = min(bedrooms, na.rm = TRUE),
max_bedrooms = max(bedrooms, na.rm = TRUE))
bedrooms_stats
```
4. Summary Statistics: Accommodates
```{r}
accommodates_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_accommodates = mean(accommodates, na.rm = TRUE),
sd_accommodates = sd(accommodates, na.rm = TRUE),
median_accommodates = median(accommodates, na.rm = TRUE),
min_accommodates = min(accommodates, na.rm = TRUE),
max_accommodates = max(accommodates, na.rm = TRUE))
accommodates_stats
```
5. Summary Statistics: Overall Review Scores
```{r}
review_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_review = mean(review_scores_rating, na.rm = TRUE),
sd_review= sd(review_scores_rating, na.rm = TRUE),
median_review = median(review_scores_rating, na.rm = TRUE),
min_review = min(review_scores_rating, na.rm = TRUE),
max_review = max(review_scores_rating, na.rm = TRUE))
review_stats
```
6. Summary Statistics: Total Amenities
```{r}
amenities_stats <- data_new %>%
group_by(room_type) %>%
summarise(
observation = n(),
mean_amenities = mean(total_amenities, na.rm = TRUE),
sd_amenities= sd(total_amenities, na.rm = TRUE),
median_amenities = median(total_amenities, na.rm = TRUE),
min_amenities = min(total_amenities, na.rm = TRUE),
max_amenities = max(total_amenities, na.rm = TRUE))
amenities_stats
```
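The summary chunks in this section all follow the same pattern, so they could be generated by one function. A minimal sketch, assuming dplyr is available: the helper `summarise_by_group()` and the data in `toy_rooms` are hypothetical and not part of the original analysis.

```{r}
library(dplyr)
# Hypothetical helper: grouped summary statistics for one numeric column
summarise_by_group <- function(df, group_col, value_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(
      observation  = n(),
      mean_value   = mean({{ value_col }}, na.rm = TRUE),
      sd_value     = sd({{ value_col }}, na.rm = TRUE),
      median_value = median({{ value_col }}, na.rm = TRUE),
      min_value    = min({{ value_col }}, na.rm = TRUE),
      max_value    = max({{ value_col }}, na.rm = TRUE)
    )
}
# Toy data (hypothetical) to show the call
toy_rooms <- data.frame(
  room_type = c("Private room", "Private room", "Entire home/apt"),
  price     = c(10, 30, 100)
)
toy_summary <- summarise_by_group(toy_rooms, room_type, price)
toy_summary
```

With such a helper, each table in this section would reduce to a single call, for example `summarise_by_group(data_new, room_type, bedrooms)`.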
Summary
Monserrat offers vacationers an array of Airbnb properties. Entire homes or apartments are by far the most common listing type, far outnumbering private rooms and shared rooms. Hotel rooms come out as the most expensive option on average, with entire homes or apartments a close second.
For travelers who value privacy, a listing that specifies a private bathroom is essential. Listings with private bathrooms average about one bathroom across room types, while those with shared bathrooms average about two. Private rooms have the highest average number of bedrooms, at around two per listing. Entire homes or apartments, on the other hand, have the highest average guest capacity (accommodates), typically around three people.
In terms of amenities, hotel rooms have the highest mean total_amenities, closely followed by entire homes or apartments. Despite these differences, all room types share a similar mean review rating, which suggests that listing quality is broadly consistent across room types.
With all these options to choose from, Monserrat has something to offer every type of traveler.
### 3. Data Visualization
For the Airbnb data on the Monserrat neighborhood, it is useful to visualize the following:
- Population
- Price
- Overall Review
- Amenities Count
- Price Trends on Different Accommodates
broken down by room type. Below are the visualizations for each of these variables.
1. Room Type Population
```{r, warning=FALSE}
ggplot(data_new, aes(x = room_type, y = after_stat(count), fill = room_type)) +
geom_bar(alpha = 0.7, width = 0.5) +
labs(x = "Room Type", y = "Count", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Number of Property Based on Room Type")
```
2. Room Type Price
Note: there are two outlier price points for the "Entire home/apt" room type (262,857 USD and 216,521 USD). These two outliers were removed to make the visualization easier to read.
```{r}
# Remove the two highest prices among the entire home/apt listings
top_two_prices <- tail(sort(data_new$price[data_new$room_type == "Entire home/apt"]), 2)
data_new_clean <- data_new %>%
  filter(!(room_type == "Entire home/apt" & price %in% top_two_prices))
ggplot(data_new_clean, aes(x = room_type, y = price, fill = room_type)) +
geom_boxplot(alpha = 0.7, width = 0.5) +
labs(x = "Room Type", y = "Price", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Price Distribution by Room Type")
```
3. Overall Review Score per Room Type
```{r}
ggplot(data_new, aes(x = room_type, y = review_scores_rating, fill = room_type)) +
geom_violin(scale = "width", alpha = 0.7) +
labs(x = "Room Type", y = "Review Scores Rating", fill = "Room Type") +
scale_fill_manual(values = c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Review Scores Rating Distribution by Room Type")
```
4. Distribution of Amenities Number per Room Type
```{r}
# Calculate the sum of each amenity by room type
amenities_sum_by_roomtype <- data_new %>%
select(room_type, Kitchen, Wifi, Air_conditioning, Elevator, Dishes_and_silverware, Washer, Body_soap, Microwave) %>%
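# factors store level codes, so convert through character to recover the 0/1 values
mutate(across(Kitchen:Microwave, ~ as.numeric(as.character(.x)))) %>%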
group_by(room_type) %>%
summarize_all(sum)
# Reshape data to long format for plotting
amenities_sum_by_roomtype_long <- amenities_sum_by_roomtype %>%
pivot_longer(cols = -room_type, names_to = "amenity", values_to = "count") %>%
arrange(room_type, desc(count))
# Create stacked bar plot
ggplot(amenities_sum_by_roomtype_long, aes(x = amenity, y = count, fill = room_type)) +
geom_col() +
scale_fill_manual(values = c("#F8766D", "#00BA38", "#619CFF", "#DA3B3A")) +
labs(x = "Amenities", y = "Number of Listings", fill = "Room Type") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Amenities by Room Type")
```
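When indicator columns such as the amenity flags are stored as factors, care is needed when turning them back into numbers for counting. A minimal sketch with a made-up vector (`toy_flag` is hypothetical):

```{r}
# Toy 0/1 indicator stored as a factor (hypothetical)
toy_flag <- factor(c(0, 1, 1, 0))
# as.numeric() on a factor returns the internal level codes (1 and 2 here)
toy_sum_codes <- sum(as.numeric(toy_flag))
# Converting through character first recovers the original 0/1 values
toy_sum_values <- sum(as.numeric(as.character(toy_flag)))
toy_sum_codes   # 6: sum of the level codes 1, 2, 2, 1
toy_sum_values  # 2: number of listings with the flag set
```

Summing level codes inflates each count by the number of rows in the group, which would distort any bar chart built on those sums.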
5. Price Trend on Different Accommodates Capacity per Room Type
```{r}
my_colors <- c("#1F77B4", "#FF7F0E", "#2CA02C", "#D62728")
ggplot(data_new, aes(x = accommodates, y = price, color = room_type)) +
geom_point(alpha = 0.7, size = 3) +
scale_color_manual(values = my_colors) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
labs(x = "Accommodates", y = "Price", color = "Room Type") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
legend.position = "bottom",
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Scatterplot of Price and Accommodates by Room Type")
```
Summary
Vacationers in Monserrat have an array of Airbnb properties to choose from. Entire homes or apartments dominate the market with around 400 listings, followed by private rooms with around 100 listings. In contrast, there are relatively few hotel rooms and shared rooms.
On price, hotel rooms are the most expensive option, followed by entire homes or apartments, and shared rooms are the cheapest. Entire homes or apartments also show the broadest range of prices of any room type, so budget-conscious travelers can still find lower-priced options among them.
Review ratings are relatively consistent across room types in Monserrat, with no marked differences among them. However, entire homes or apartments have the broadest range of review ratings, spanning from 1 to 4.7. This highlights the importance of reading reviews thoroughly before booking.
If amenities are essential, entire homes or apartments are the go-to option in Monserrat: in the chart above, they account for the highest amenity counts of any room type. From Wi-Fi to essential kitchen supplies, these properties cater to a wide range of travelers.
Interestingly, guest capacity (accommodates) does not appear to affect rental prices for any room type in Monserrat. Larger groups may therefore be able to enjoy a budget-friendly stay without paying more for the same property.
All in all, Monserrat is an excellent location for vacationers, with Airbnb properties offering something for everyone.
### 4. Mapping
```{r}
m <- leaflet() %>% addTiles() %>% addCircles(data = data_new, lng= ~longitude , lat= ~latitude)%>% addProviderTiles(providers$JusticeMap.income)
m
```
Description:
The Monserrat neighborhood lies close to the nature reserve and Laguna de los Patos. Beyond nature, Monserrat has notable landmarks such as the Casa Rosada and Plaza de Mayo. The former is the presidential palace of Argentina and serves as the executive office of the President; the latter is a historic public square that has been the site of many important political events in Argentina's history.
### 5. Word Cloud
```{r}
# Split neighborhood_overview column into words and create a new dataframe
words <- master_data %>%
select(neighborhood_overview) %>%
unnest_tokens(word, neighborhood_overview)
# Create a custom list of stop words (English stop words plus "de" and "la")
custom_stopwords <- tibble(word = c(stop_words$word, "de", "la"))
# Remove stop words and create a word frequency table
word_freq <- words %>%
  anti_join(custom_stopwords, by = "word") %>%
  count(word, sort = TRUE)
# Set the size of the graphics device
options(repr.plot.width = 8, repr.plot.height = 8)
# Generate a word cloud
wordcloud(words = word_freq$word, freq = word_freq$n, min.freq = 1,
max.words = 200, random.order = FALSE, rot.per = 0.5,
colors = brewer.pal(8, "Dark2"))
```
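Scraped listing text can contain HTML residue such as line-break tags, which tokenization turns into spurious words like "br". A minimal sketch of stripping tags before counting words, using base R on made-up strings (`toy_overview` is hypothetical):

```{r}
# Toy overview text (hypothetical) containing HTML line breaks
toy_overview <- c("Near Plaza de Mayo.<br />Close to San Telmo",
                  "Historic centre<br/>of Buenos Aires")
# Replace anything that looks like an HTML tag with a space
toy_clean <- gsub("<[^>]+>", " ", toy_overview)
# Lower-case the text and split on runs of non-letter characters
toy_tokens <- unlist(strsplit(tolower(toy_clean), "[^[:alpha:]]+"))
toy_tokens <- toy_tokens[toy_tokens != ""]
# Check that the tag name no longer appears as a word
"br" %in% toy_tokens
toy_tokens
```

The same `gsub()` call could be applied to the neighborhood_overview column before `unnest_tokens()`, or "br" could simply be added to the custom stop-word list.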
The words in the word cloud mostly relate to neighborhoods and landmarks in Buenos Aires, and their prominence reflects how frequently they appear in the neighborhood overview column of the Buenos Aires Airbnb dataset.
"Br" most likely comes from HTML line-break tags (`<br />`) left in the scraped text rather than from a Spanish word, so its prominence reflects formatting residue and not neighborhood content. "San" means "Saint" in Spanish and is common in place names, so its appearance in the word cloud suggests that the neighborhood overview column refers to streets, districts, or landmarks carrying this title.
"Telmo" refers to the San Telmo neighborhood in Buenos Aires, which is known for its historic architecture, tango culture, and antique markets. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this neighborhood and its characteristics.
"Buenos" and "Aires" refer to the city of Buenos Aires, which is the capital of Argentina and one of the largest cities in South America. The appearance of these terms in the word cloud suggests that the neighborhood overview column may include descriptions of different neighborhoods and landmarks within the city.
"Mayo" refers to the Plaza de Mayo, which is a public square in the heart of Buenos Aires that is known for its historical and political significance. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of this landmark and its role in the city's history.
"Plaza" refers to public squares and plazas, which are common features in many neighborhoods in Buenos Aires. Its appearance in the word cloud suggests that the neighborhood overview column may include descriptions of different plazas and their characteristics.
## Step II: Prediction
The multiple regression model was constructed in the following steps.
1. Defining data
```{r}
MLR <- data_new
```
2. Convert the price variable using a log transformation
```{r}
MLR$price <- log(MLR$price)
```
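Because the model is fitted on log(price), anything it predicts is on the log scale. As a minimal sketch with made-up prices (the `toy_` values are illustrative assumptions, not taken from the dataset), `exp()` maps log-scale values back to the original price units, and the log transform visibly compresses the range of a right-skewed variable:

```r
# Hypothetical nightly prices, for illustration only
toy_price <- c(20, 35, 50, 120, 400)

# Natural-log transform, as applied to MLR$price above
toy_log_price <- log(toy_price)

# exp() is the inverse of log(), so it recovers the original units
toy_back <- exp(toy_log_price)
print(all.equal(toy_back, toy_price))

# The spread of the values shrinks considerably on the log scale
print(diff(range(toy_price)))
print(diff(range(toy_log_price)))
```

Any price prediction reported to a reader would need the same `exp()` step applied to the output of `predict()`.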
3. Checking the number of unique values of the categorical variables.
Using the length() and unique() functions, we counted the unique values of each candidate variable. Based on the results, we removed id, host_id, name, latitude, and longitude from the dataset used for multiple linear regression (MLR) because they are not relevant to the regression.
Furthermore, property_type is a subtype of room_type, so it was removed as well.
```{r}
# Counting the number of unique values in each variable
length(unique(MLR$id))
length(unique(MLR$host_id))
length(unique(MLR$name))
length(unique(MLR$latitude))
length(unique(MLR$longitude))
length(unique(MLR$property_type))
length(unique(MLR$room_type))
length(unique(MLR$host_response_time))
# Removing the id, host_id, name, property_type, latitude, and longitude variables
MLR_clean <- subset(MLR, select=c(-id, -host_id, -name, -property_type, -latitude, -longitude))
```
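The repeated `length(unique(...))` calls can also be written as a single pass over the columns. The sketch below uses a small made-up data frame (`toy_listings` and its values are assumptions for illustration) and flags columns whose number of unique values equals the number of rows, which is the pattern an identifier column shows:

```r
# Hypothetical listings, for illustration only
toy_listings <- data.frame(
  id = c(101, 102, 103, 104),
  host_id = c(1, 1, 2, 3),
  room_type = c("Entire home/apt", "Private room",
                "Entire home/apt", "Entire home/apt"),
  stringsAsFactors = FALSE
)

# Count unique values in every column at once
unique_counts <- sapply(toy_listings, function(col) length(unique(col)))
print(unique_counts)

# Columns with one unique value per row behave like identifiers
id_like <- names(unique_counts)[unique_counts == nrow(toy_listings)]
print(id_like)
```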
4. Check the numeric variables' correlation.
According to the results below, three pairs of variables have a correlation coefficient >= 0.80: review_scores_rating & review_scores_accuracy, review_scores_rating & review_scores_value, and review_scores_accuracy & review_scores_value. Therefore, review_scores_rating and review_scores_value were removed from the dataset, keeping review_scores_accuracy as the representative of the three.
```{r}
library(corrplot)
# Calculating correlation between numeric variables
Corr <- cor(MLR_clean %>%
select(c(accommodates, bedrooms, beds,
minimum_nights, maximum_nights,number_of_reviews,
review_scores_rating, review_scores_accuracy,
review_scores_cleanliness,review_scores_checkin,
review_scores_communication, review_scores_location,
review_scores_value, bathrooms, host_response_rate,
host_acceptance_rate, years)))
print(Corr)
# Plotting the correlation
corrplot(Corr, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Removing the review_scores_rating and review_scores_value variable
MLR_fix <- subset(MLR_clean, select=c(-review_scores_rating, -review_scores_value))
```
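Reading pairs above the 0.80 threshold off a printed matrix is error-prone, so the pairs can also be listed programmatically. This is a sketch on simulated data (the `toy` data frame and its columns `a`, `b`, `c` are assumptions, not the Airbnb variables); the same `which(..., arr.ind = TRUE)` pattern would apply to a correlation matrix such as `Corr`:

```r
set.seed(1)

# Simulated data: b is built from a, c is independent
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.1)
toy$c <- rnorm(100)

toy_corr <- cor(toy)

# Keep only the upper triangle so each pair appears once
# and the diagonal (always 1) is excluded
high <- which(abs(toy_corr) >= 0.80 & upper.tri(toy_corr), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(toy_corr)[high[, "row"]],
  var2 = colnames(toy_corr)[high[, "col"]],
  r = toy_corr[high]
)
print(high_pairs)
```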
5. Data Partitioning.
Using the `sample()` function, 60% of the rows of the MLR_fix data frame were randomly assigned to train.df, and the remaining 40% were assigned to valid.df.
```{r}
set.seed(62)
train.index <- sample(c(1:nrow(MLR_fix)), nrow(MLR_fix)*0.6)
train.df <- MLR_fix[train.index, ]
valid.df <- MLR_fix[-train.index, ]
```
6. Creating multiple regression model with all variables in training dataset
```{r}
MLR_all <- lm(price~ ., data=train.df)
summary(MLR_all)
```
7. Performing stepwise regression.
```{r echo=FALSE, include=FALSE}
MLR.step <- step(MLR_all, direction = "backward")
```
8. Assess the accuracy of the model against both the training set and the validation set
```{r}
# Accuracy against training dataset
pred_tm <- predict(MLR.step, train.df)
accuracy(pred_tm, train.df$price)
# Accuracy against validation dataset
pred_vm <- predict(MLR.step, valid.df)
accuracy(pred_vm, valid.df$price)
# RMSE gap between training and validation dataset
RMSE_gap <- (0.5370704-0.3925947)/0.3925947
print(RMSE_gap)
# MAE gap between training and validation dataset
MAE_gap <- (0.3370304-0.3060701)/0.3060701
print(MAE_gap)
```
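The gap calculation above types the error values in by hand. A small set of helper functions keeps the arithmetic in one place; the sketch below defines hypothetical helpers (`rmse`, `mae`, and `relative_gap` are names introduced here, not functions from a package), checks them on made-up numbers, and then reproduces the two gaps from the figures already shown in the chunk:

```r
# Hypothetical helper functions, for illustration only
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))
relative_gap <- function(err_a, err_b) (err_a - err_b) / err_b

# Made-up log-price values to exercise the helpers
toy_actual <- c(3.0, 3.5, 4.0, 4.5)
toy_pred <- c(3.1, 3.4, 4.3, 4.1)
print(rmse(toy_actual, toy_pred))
print(mae(toy_actual, toy_pred))

# Relative gaps using the error values reported in the chunk above
rmse_gap <- relative_gap(0.5370704, 0.3925947)
mae_gap <- relative_gap(0.3370304, 0.3060701)
print(round(c(rmse_gap = rmse_gap, mae_gap = mae_gap), 3))
```

Because price was log-transformed, these errors are in log units rather than in currency.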
## Step III: Classification
### Classification Part I: K Nearest Neighbors
The KNN predictive model was constructed using the following steps to predict whether a given rental property in Monserrat has Kitchen amenities.
1. Picking the third observation of the rental properties in the Monserrat neighborhood and removing its amenities information to create the test observation.
```{r}
rental <- data_new[3, ] %>%
select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin, review_scores_communication,
review_scores_location, review_scores_value, years,
host_response_rate, host_acceptance_rate)
```
2. Building a new numeric data frame for the KNN model. The numeric predictors chosen here were price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, and host_acceptance_rate, to align with what the team had chosen previously. Only numeric predictors were used because KNN relies on distances between observations. Kitchen (the outcome) and id are carried along in the data frame but are not used as predictors.
```{r}
knn_var <- data_new %>%
select(price, accommodates, bedrooms, beds, bathrooms, minimum_nights,
maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin, review_scores_communication,
review_scores_location, review_scores_value, years,
host_response_rate, host_acceptance_rate, Kitchen, id)
```
3. Partitioning Dataset into Training and Validation set
```{r}
# Setting seed for reproducibility
set.seed(250)
# Random sampling the dataset index without replacement with 60% for training set
train_index_knn <- sample(c(1:nrow(knn_var)), nrow(knn_var)*0.6)
# Partition the dataset into training and validation set based on the index sampling
train_df_knn <- knn_var[train_index_knn, ]
valid_df_knn <- knn_var[-train_index_knn, ]
```
4. Normalizing the Dataset
Normalization was done because the predictor variables are on different scales, and unscaled variables with large ranges would dominate the distance calculation.
```{r}
# Initializing normalized training, validation data, complete dataframe to originals
train_norm_df_knn <- train_df_knn
valid_norm_df_knn <- valid_df_knn
knn_var_norm<- knn_var
# Using preProcess() from the caret package to compute centering and scaling values from the training predictors only
norm_values_knn <- preProcess(train_df_knn[,1:18], method=c("center", "scale"))
train_norm_df_knn[,1:18] <- predict(norm_values_knn, train_df_knn[,1:18])
valid_norm_df_knn[,1:18] <- predict(norm_values_knn, valid_df_knn[,1:18])
knn_var_norm[,1:18] <- predict(norm_values_knn, knn_var[,1:18])
# Normalizing rental dataframe
rental_norm <- predict(norm_values_knn, rental)
```
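The key point in this step is that the centering and scaling values come from the training set and are then reused for the validation set and the new observation. As a rough base-R sketch of the same idea (the `toy_train` and `toy_valid` data frames are made up, and `scale()` stands in for caret's `preProcess()` here):

```r
# Hypothetical training and validation predictors
toy_train <- data.frame(x = c(1, 2, 3, 4, 5), y = c(10, 20, 30, 40, 50))
toy_valid <- data.frame(x = c(3, 6), y = c(30, 60))

# Means and standard deviations are estimated on the training set only
train_mean <- colMeans(toy_train)
train_sd <- apply(toy_train, 2, sd)

# The same training statistics are applied to both sets
toy_train_z <- scale(toy_train, center = train_mean, scale = train_sd)
toy_valid_z <- scale(toy_valid, center = train_mean, scale = train_sd)

print(round(colMeans(toy_train_z), 10))
print(toy_valid_z)
```

A validation value equal to the training mean maps to 0, while a value outside the training range simply gets a larger z-score.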
5. Building KNN Predictive Model with arbitrary k=7
```{r}
# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=7)
# Checking the summary of the knn model prediction, including the 7 nearest neighbors index and distance
attributes(rental_nn)
```
6. Determining optimal value for k
```{r, warning=FALSE}
# Initialize a data frame with two columns: k, and accuracy
accuracy_df_knn <- data.frame(k=seq(1,14,1), accuracy=rep(0,14))
# Compute knn for different k on validation
for(i in 1:14){
knn.pred <- knn(train_norm_df_knn[,1:18], valid_norm_df_knn[,1:18],
cl = train_norm_df_knn$Kitchen, k=i)
accuracy_df_knn[i,2] <- confusionMatrix(knn.pred, valid_norm_df_knn$Kitchen)$overall[1] %>% round(3)
}
accuracy_df_knn
```
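Rather than reading the best k off the printed table, it can be extracted from the accuracy data frame. The accuracies below are invented for illustration (they are not the results of this analysis); the sketch also shows how to detect ties, since `which.max()` returns only the first maximum:

```r
# Hypothetical validation accuracies, for illustration only
toy_accuracy <- data.frame(
  k = 1:6,
  accuracy = c(0.801, 0.815, 0.822, 0.840, 0.840, 0.833)
)

# which.max() returns the position of the first maximum
best_k <- toy_accuracy$k[which.max(toy_accuracy$accuracy)]
print(best_k)

# All k values that share the highest accuracy
tied_k <- toy_accuracy$k[toy_accuracy$accuracy == max(toy_accuracy$accuracy)]
print(tied_k)
```

When several k values tie, which one to keep is a judgment call that the accuracy table alone does not settle.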
7. Building KNN Predictive Model with optimum k=4
The optimum k=4 was chosen because it gave the highest accuracy when the model was tested against the validation set.
```{r}
# Creating knn model to predict whether rental has kitchen amenities
rental_nn <- knn(train=train_norm_df_knn[,1:18], test=rental_norm, cl=train_norm_df_knn$Kitchen, k=4)
# Checking the summary of the knn model prediction, including the 4 nearest neighbors index and distance
attributes(rental_nn)
```
8. Checking with actual data
```{r}
# Actual value for the third observation (column 24 is assumed here to be Kitchen)
data_new[3,24]
```
Summary
To predict whether certain rental properties in the Monserrat neighborhood had kitchen amenities or not, a KNN predictive model was constructed through a series of steps. Firstly, the third observation of a rental property in Monserrat was selected, and its amenities information was removed to create a test observation. Next, a new numeric dataframe was built for KNN model building using a range of predictors such as price, accommodates, bedrooms, beds, bathrooms, minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, years, host_response_rate, and host_acceptance_rate. Only numeric predictors were chosen as KNN relies on a distance matrix for modeling.
Furthermore, the dataset was partitioned into training and validation sets, and normalization was done to account for the different scale of each predictor variable. Following this, a KNN predictive model was built with an arbitrary k value of 7. These steps laid the foundation for creating a model that could predict which rental properties in the Monserrat neighborhood would have kitchen amenities.
To refine the model, an optimal value for k was determined by testing the model against the validation set for k from 1 to 14; the k with the highest validation accuracy, k=4, was selected. Finally, a KNN model was built using k=4 and used to predict whether the third observation has Kitchen amenities, and the prediction was compared with the actual value recorded in the dataset.
### Classification Part II: Naive Bayes
The Naive Bayes modeling was done through the following steps:
1. Create a new dataset with variable of focus for Naive Bayes modeling
```{r}
# Importing data
merged2 <- data_new
# Create a vector of column names to keep
keep_vars <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "instant_bookable", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability","review_scores_rating")
# Subset the merged2 dataframe to keep only the selected columns
merged2 <- subset(merged2, select = keep_vars)
```
2. Binning the numerical variables into categorical variables using the cut function, with equal-frequency bins wherever quantile breaks are used and fixed breaks for bedrooms, beds, and bathrooms
```{r}
# Binning 'accommodates'
quantiles <- quantile(merged2$accommodates, probs = c(0.5))
breaks <- c(0, quantiles, Inf)
labels <- c("Small", "Large")
merged2$accommodates <- cut(merged2$accommodates, breaks = breaks, labels = labels)
table(merged2$accommodates)
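# Binning 'bedrooms' (with breaks 0, 1, 2, Inf the bins are (0,1], (1,2], and (2,Inf], so the labels here and for 'beds' are nominal names rather than exact ranges)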
merged2$bedrooms <- cut(merged2$bedrooms, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))
# Binning 'beds'
merged2$beds <- cut(merged2$beds, breaks = c(0, 1, 2, Inf), labels = c("1-2", "3-4", "5+"))
# Binning 'bathrooms'
merged2$bathrooms <- cut(merged2$bathrooms, breaks = c(0, 1, 2, 3, Inf), labels = c("1", "2", "3", "4+"))
# Binning 'minimum_nights'
# Add the noise once, so the breaks are computed on the same values that are cut
min_nights_noisy <- merged2$minimum_nights + runif(nrow(merged2), -0.0001, 0.0001)
merged2$minimum_nights <- cut(min_nights_noisy,
                              breaks = quantile(min_nights_noisy, probs = seq(0, 1, 0.25)),
                              labels = c("1", "2", "3", "4+"),
                              include.lowest = TRUE)
# Add small amount of noise to 'maximum_nights'
merged2$maximum_nights <- merged2$maximum_nights + runif(nrow(merged2), -0.0001, 0.0001)
# Binning 'maximum_nights' (include.lowest keeps the minimum value from becoming NA)
merged2$maximum_nights <- cut(merged2$maximum_nights, breaks = quantile(merged2$maximum_nights, probs = seq(0, 1, 0.25)), labels = c("1-3", "4-7", "8-14", "15+"), include.lowest = TRUE)
# Binning 'number_of_reviews'
merged2$number_of_reviews <- cut(merged2$number_of_reviews, breaks = quantile(merged2$number_of_reviews, probs = seq(0, 1, 0.25)), labels = c("1-7", "8-23", "24-56", "57+"), include.lowest = TRUE)
# Binning 'host_response_rate'
# Jitter once and reuse the result for both the breaks and the values being cut
response_rate_jittered <- jitter(merged2$host_response_rate)
merged2$host_response_rate <- cut(response_rate_jittered,
                                  breaks = quantile(response_rate_jittered,
                                                    probs = seq(0, 1, 0.25),
                                                    na.rm = TRUE),
                                  labels = c("<75%", "75-94%", "95-99%", "100%"),
                                  include.lowest = TRUE)
# Binning 'review_scores_rating'
# Add jitter to the data
merged2$review_scores_rating<- jitter(merged2$review_scores_rating, amount = 0.001)
quantiles <- quantile(merged2$review_scores_rating, probs = seq(0, 1, 0.25), na.rm = TRUE)
if (length(unique(quantiles)) == length(quantiles)) {
# Bin the data
merged2$review_scores_rating <- cut(merged2$review_scores_rating,
breaks = quantiles,
labels = c("<80", "80-90", "90-95", "95+"),
include.lowest = TRUE)
} else {
cat("Quantiles are not unique. Please consider using different probabilities or jitter amount.")
}
# Binning 'Price'
# Calculate the quantiles for equal frequency binning
quantiles <- quantile(merged2$price, probs = seq(0, 1, length.out = 3 + 1), na.rm = TRUE, type = 5)
# Generate labels for the bins
bin_labels <- c("Low", "Medium", "High")
# Bin the data
merged2$price <- cut(merged2$price, breaks = quantiles, labels = bin_labels, include.lowest = TRUE)
```
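Equal-frequency binning with `quantile()` and `cut()` has one common pitfall: `cut()` intervals are open on the left by default, so the smallest value falls outside the first bin unless `include.lowest = TRUE` is set. A minimal sketch on made-up values (the vector `toy_x` is an assumption for illustration):

```r
# Hypothetical numeric variable with 12 distinct values
toy_x <- c(12, 15, 18, 22, 25, 30, 41, 55, 60, 75, 90, 120)

# Tercile breaks: minimum, 1/3 quantile, 2/3 quantile, maximum
toy_breaks <- quantile(toy_x, probs = seq(0, 1, length.out = 4))

# With include.lowest = TRUE every value lands in a bin
toy_bins <- cut(toy_x, breaks = toy_breaks,
                labels = c("Low", "Medium", "High"),
                include.lowest = TRUE)
print(table(toy_bins))

# Without it, the minimum value is left unbinned (NA)
toy_bins_default <- cut(toy_x, breaks = toy_breaks)
print(sum(is.na(toy_bins_default)))
```

When the data contain many repeated values, the quantile breaks can coincide and `cut()` will stop with an error about non-unique breaks, which is why the chunk above adds small amounts of noise before binning.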
3. Creating Proportional Barplot for Feature Selection to be loaded into Naive Bayes Model
```{r}
# Select the categorical variables
variables <- c("property_type", "room_type", "accommodates", "bedrooms", "beds", "price", "minimum_nights", "maximum_nights", "number_of_reviews", "bathrooms", "shared_bathroom", "Kitchen", "Wifi", "Air_conditioning", "Elevator", "Dishes_and_silverware", "Washer", "Body_soap", "Microwave", "Paid_parking_off_premises", "host_response_rate", "host_identity_verified", "short_term_availability", "long_term_availability", "review_scores_rating")
# Reshape the dataset
merged2_long <- merged2 %>%
select(one_of(variables), instant_bookable) %>%
gather(key = "variable", value = "value", -instant_bookable)
# Create the faceted barplot
p <- ggplot(merged2_long, aes(x = value, fill = instant_bookable)) +
geom_bar(position = "dodge") +
theme_minimal() +
facet_wrap(~variable, scales = "free_x", ncol = 5) +
xlab("Value") +
ylab("Count") +
scale_fill_discrete(name = "Instant Bookable")
print(p)
```
Based on the barplot, long_term_availability, short_term_availability, Air_conditioning, beds, number_of_reviews, price, minimum_nights, and review_scores_rating appear to have little predictive power in a Naive Bayes model, because their distributions are relatively similar across the two instant_bookable classes. These variables are removed in the next step, except beds, which the code below retains.
4. Removing Variable with Weak Predictive Power
```{r}
# List of variables to remove
variables_to_remove <- c("long_term_availability", "short_term_availability", "Air_conditioning", "number_of_reviews", "price", "minimum_nights", "review_scores_rating")
# Remove the variables
merged2 <- merged2 %>%
select(-one_of(variables_to_remove))
```
5. Building the Naive Bayes Prediction Model
```{r, warning=FALSE}
# Set the seed for reproducibility
set.seed(42)
# Create a 60-40 split for training and testing sets
train_index <- createDataPartition(merged2$instant_bookable, p = 0.6, list = FALSE)
train_set <- merged2[train_index, ]
test_set <- merged2[-train_index, ]
# Build the Naive Bayes model using naiveBayes() function
nb_model <- naiveBayes(instant_bookable ~ ., data = train_set)
# Summary of the model
print(nb_model)
# Generate predictions for the test set
predictions <- predict(nb_model, test_set)
# Convert predictions and test_set$instant_bookable to factors
predictions_factor <- factor(predictions, levels = c("FALSE", "TRUE"))
test_set_factor <- factor(test_set$instant_bookable, levels = c("FALSE", "TRUE"))
# Create the confusion matrix
cm <- confusionMatrix(predictions_factor, test_set_factor)
# Print the confusion matrix
print(cm)
```
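To make the model's output easier to interpret, the following sketch works through a Naive Bayes posterior by hand on a tiny made-up dataset (the `toy_nb` rows and the helper `nb_score` are assumptions for illustration, and no Laplace smoothing is applied). For each class it multiplies the prior by the conditional probability of each observed feature value, then normalizes the scores so they sum to 1:

```r
# Hypothetical listings, for illustration only
toy_nb <- data.frame(
  instant_bookable = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  room_type = c("Entire", "Entire", "Private", "Private",
                "Private", "Entire", "Private", "Entire"),
  Wifi = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Unnormalized score for one class, for a listing with
# room_type = "Entire" and Wifi = TRUE
nb_score <- function(df, class_value) {
  in_class <- df$instant_bookable == class_value
  prior <- mean(in_class)                               # P(class)
  p_room <- mean(df$room_type[in_class] == "Entire")    # P(Entire | class)
  p_wifi <- mean(df$Wifi[in_class])                     # P(Wifi | class)
  prior * p_room * p_wifi
}

scores <- c("TRUE" = nb_score(toy_nb, TRUE), "FALSE" = nb_score(toy_nb, FALSE))

# Normalize so the two posteriors sum to 1
posterior <- scores / sum(scores)
print(round(posterior, 3))
```

The conditional probability tables printed by `print(nb_model)` play the role of `p_room` and `p_wifi` in this sketch.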
6. Predicting for a Fictional Apartment