---
title: "R Notebook"
output:
  html_document:
    df_print: paged
  html_notebook: default
  pdf_document: default
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
# Customer Lifetime Value Prediction
### Problem Statement:
To predict the Customer Lifetime Value for an insurance company offering vehicle insurance.
### Description:
Customer Lifetime Value (CLV) is a metric commonly used by companies and financial institutions to assign a numeric value to their customers and thereby inform their strategy for increasing the company's profits.
It is defined as the total monetary value that a customer represents to a bank or any financial entity over the entire course of their relationship.
#### The formula used to calculate CLV: CLV = (Annual revenue per customer × Customer relationship in years) - Customer acquisition cost
The difficulty arises because some segments of customers invest a lot of money in a company over a short period of time, while others invest small sums over a longer period. If a company were to focus only on the short-term, high-paying customers, it would miss out on the gradual but constant revenue from the latter kind of customer. Both kinds of customers might be of high value to the company, hence the need to account for both of them, as well as for other factors.
In the case of insurance, customers fall into several categories, and companies design different policies because not all categories of customers will want the same policy. Some customers might go for greater coverage, while others might go for less. This does not mean that the customers with less coverage are less valuable to the company, as we must also take into account the cost of acquiring these customers.
The insurance company must therefore study their existing customers considering all these factors to find out which category of customers to target.
The dataset contains historical data of the customers already acquired by the company and the CLV for each of these customers has been computed. We must use this previously computed CLV along with the independent variables to predict the category of customers who will be profitable to the company.
Because it accounts for all of these factors, this metric of customer-oriented evaluation is widely used.
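As a toy illustration of the CLV formula above (all figures are hypothetical):
```{r}
# Hypothetical illustration of the CLV formula; all figures are made up.
annual_revenue_per_customer <- 1200  # premium revenue collected per year
relationship_years <- 5              # expected length of the relationship
acquisition_cost <- 700              # cost of acquiring this customer

clv <- (annual_revenue_per_customer * relationship_years) - acquisition_cost
clv  # 5300
```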
### Aim:
To establish the relationship between the explanatory variables and the target variable and thereby to propose a model that can predict the target variable.
In this case, the objective is to study how the outcome variable (CLV) is related to the independent variables and the subsequent model thus proposed should help the company to make an informed decision with regard to the kind of customer to target.
It is a regression task: to predict how valuable a given customer will be to an insurance company.
## Exploratory Data Analysis
Following are the packages used.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(car)
library(zoo)
library(lmtest)
library(dplyr)
library(stringr)
library(caret)
library(ggplot2)
library(timeDate)
library(plotly)
library(readxl)
library(gganimate)
library(corrplot)
library(Hmisc)
library(vtree)
library(DataExplorer)
library(caTools)
library(nortest)
library(modelr)
```
### Reading the dataset
```{r}
Marketing_Customer_Value_Analysis_2 <- read_excel('C:/Users/HP/Downloads/Marketing-Customer-Value-Analysis 2.xlsx')
```
### Setting seed for reproducibility and overview of dataset
```{r}
set.seed(223)
head(Marketing_Customer_Value_Analysis_2)
```
### Histogram of the target variable (CLV)
This shows the distribution of the target variable; the y-axis shows its probability density.
It tells us the monetary value that the customers represent to the company.
```{r}
Insurance_Dataset <- data.frame(Marketing_Customer_Value_Analysis_2)
hist(Insurance_Dataset$Customer.Lifetime.Value,
breaks = 800,
freq = FALSE,
main = "Histogram of CLV", xlab = "CLV", border = "Blue")
```
This plot indicates that the distribution is heavily positively skewed, meaning that an overwhelming majority of the customers hold lower customer lifetime value to the company. A very small number of customers are in the higher bracket of lifetime value.
The "ideal" customers are few in number, so if the company is to turn a profit it must also cater to the more numerous customers with lower CLV.
### Description of the dataset
```{r}
Insurance_Dataset %>% introduce()
Insurance_Dataset %>% plot_intro()
```
There are no null values in this dataset.
### CATEGORICAL VARIABLES VISUALIZATION
#### To visualize the effect of state on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=State ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="State",y = "Customer Life Time Value", fill="State") +
ggtitle("Sum of CLV contribution by State ")
```
Here we look at how much effect a customer's state has on CLV; in other words, whether a customer from a particular state is more valuable to the company than one from another state.
From the chart above it would appear that the company should focus its efforts on states like California or Oregon, since the total CLV from these states is higher. As the chart below shows, the larger populations of these states drive these high totals.
```{r}
count_state <- table(Insurance_Dataset$State)
barplot(count_state,
main = "Count plot of State",col = "Blue",
xlab = "State", ylab = "Count")
```
Let us explore this by considering the mean of the CLV by state in the subsequent chart. This measure will account for the larger populations of states like California and Oregon.
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(State = Insurance_Dataset$State),
FUN = mean)
ggplot(data = aggData, aes(x = State, y = prop.table(x), fill = State, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'State', y = 'CLV in Percentage', fill = 'State') +
ggtitle("Mean contribution to CLV by State")
```
When the mean of the CLV is computed, we can see that no particular state is any more economically valuable than the others: the customers from each state contribute equally, on average, to the target variable (CLV).
This tells us that State is a weak indicator variable for CLV.
#### To visualize the effect of Education on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=Education ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Education",y = "Customer Life Time Value", fill="Education") +
ggtitle("Contribution of CLV by Education")
```
```{r}
count_education <- table(Insurance_Dataset$Education)
barplot(count_education,
main = "Count plot of Education",col = "Blue",
xlab = "Education", ylab = "Count")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Education = Insurance_Dataset$Education),
FUN = mean)
ggplot(data = aggData, aes(x = Education, y = prop.table(x), fill = Education, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Education', y = 'CLV in Percentage', fill = 'Education') +
ggtitle("Contribution to CLV by Education")
```
In the first plot it appears as though the contribution of customers with Doctor's and Master's qualifications is much lower than that of customers with other qualifications, which is counter-intuitive.
But once we factor in the count of customers at each education level, we see that there are simply more customers with Bachelor's, College and High School qualifications, so the total contribution from these categories is larger, as the second chart shows.
The third chart supports this: the average contribution of each qualification level is almost the same, indicating that, per customer, Doctor's and Master's policyholders are just as valuable as the other groups despite being fewer in number.
#### To visualize the effect of Coverage on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=Coverage ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Coverage",y = "Customer Life Time Value", fill="Coverage") +
ggtitle("Contribution to CLV by Coverage")
```
```{r}
count_coverage <- table(Insurance_Dataset$Coverage)
barplot(count_coverage,
main = "Count plot of Coverage",col = "Blue",
xlab = "Coverage", ylab = "Count")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Coverage = Insurance_Dataset$Coverage),
FUN = mean)
ggplot(data = aggData, aes(x = Coverage, y = prop.table(x), fill = Coverage, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Coverage', y = 'CLV in Percentage', fill = 'Coverage') +
ggtitle(" Mean CLV Contribution by Coverage")
```
From the first chart it would appear that the Basic coverage plan contributes the most to CLV; that is because it has the most takers, as the second chart confirms.
The third chart, however, shows that even though Premium coverage accounts for the least total CLV, on average a customer with Premium coverage contributes more to CLV.
#### To visualize the effect of Employment Status on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=EmploymentStatus ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Employment Status",y = "Customer Life Time Value", fill="Employment Status") +
ggtitle("Contribution to CLV by Employment Status")
```
```{r}
count_employmentstatus <- table(Insurance_Dataset$EmploymentStatus)
barplot(count_employmentstatus,
main = "Count plot of Employment Status",col = "Blue",
xlab = "Employment Status", ylab = "Count")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(EmploymentStatus = Insurance_Dataset$EmploymentStatus),
FUN = mean)
ggplot(data = aggData, aes(x = EmploymentStatus, y = prop.table(x), fill = EmploymentStatus, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Employment Status', y = 'CLV in Percentage', fill = 'Employment Status') +
ggtitle("Contribution to CLV by Employment Status")
```
In the first chart, it is evident that employed customers are of greater value to the company than the other categories. The inference is straightforward: employed customers are more likely to be able to afford the premiums and therefore contribute a major chunk of the CLV.
But in the third chart, when we look at the average contribution by employment status, we notice that all categories contribute equally to CLV.
The employed category has such a high total contribution simply because the number of employed customers is high.
#### To visualize the effect of Location Code on CLV.
```{r}
ggplot(Insurance_Dataset,aes (x=Location.Code ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Location Code",y = "Customer Life Time Value", fill="Location Code") +
ggtitle("Contribution to CLV by Location Code")
```
```{r}
count_locationcode <- table(Insurance_Dataset$Location.Code)
barplot(count_locationcode,
main = "Count plot of Location Code",col = "Blue",
xlab = "Location Code", ylab = "Count")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Location.Code = Insurance_Dataset$Location.Code),
FUN = mean)
ggplot(data = aggData, aes(x = Location.Code, y = prop.table(x), fill = Location.Code, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Location Code', y = 'CLV in Percentage', fill = 'Location Code') +
ggtitle("Contribution to CLV by Location Code")
```
In the first chart it appears that customers from suburban locations contribute more to CLV than those from other areas.
The second chart makes clear that this is because of the higher number of subscribers from suburban areas.
The third chart shows that, on average, all location codes contribute equally to CLV, so Location Code is a weak predictor of CLV on its own.
#### Effect on CLV by State and Location Code
```{r}
p1 <- plot_ly(Insurance_Dataset, x = ~State, y = ~Customer.Lifetime.Value, type = 'bar', color = ~Location.Code)
layout(p1, title = 'CLV w.r.t State and Location Code', yaxis = list(title = 'CLV'))
```
California and Oregon outperform the other states in every location code with regard to CLV.
#### To visualize the effect of Marital Status on CLV
```{r warning=FALSE}
ggplot(Insurance_Dataset, aes(x = Marital.Status, y = Customer.Lifetime.Value)) + geom_bar(stat="summary", fun="sum", width=0.5, fill = "Blue") +
labs(x="Marital Status",y = "Customer Life Time Value", fill="Marital Status") +
ggtitle("Visualization of CLV wrt Marital Status")
```
```{r}
count_maritalstatus <- table(Insurance_Dataset$Marital.Status)
barplot(count_maritalstatus,
main = "Count plot of Marital Status",col = "Blue",
xlab = "Marital Status", ylab = "Count")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Marital.Status = Insurance_Dataset$Marital.Status),
FUN = mean)
ggplot(data = aggData, aes(x = Marital.Status, y = prop.table(x), fill = Marital.Status, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'MaritalStatus', y = 'CLV in Percentage', fill = 'Marital Status') +
ggtitle("Contribution to CLV by Marital Status")
```
We might erroneously conclude from the first and second charts that married customers have a higher CLV, but the third chart shows that on average there is no difference between the contributions of each sub-category to CLV.
#### To visualize the effect of Policy Type on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=Policy.Type ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Policy Type",y = "Customer Life Time Value", fill="Policy Type") +
ggtitle("Contribution to CLV by Policy Type")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Policy.Type)) +
geom_bar(stat="count", width=0.5, fill = "Blue") +
labs(x="Policy Type",y = "Count", fill="Policy Type") +
ggtitle("Count of Policy Type")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Policy.Type = Insurance_Dataset$Policy.Type),
FUN = mean)
ggplot(data = aggData, aes(x = Policy.Type, y = prop.table(x), fill = Policy.Type, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Policy Type', y = 'CLV in Percentage', fill = 'Policy Type') +
ggtitle("Mean CLV contribution by Policy Type")
```
Similar results are obtained as before. Initially it may appear that the Personal Auto policy is the majority contributor, but that is driven by its larger customer count; the third chart suggests that, on average, customers who have purchased the Special Auto policy have a greater CLV.
#### To visualize the effect of gender on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=Gender)) +
geom_bar(stat="count", width=0.5, fill = "Blue") +
labs(x="Gender",y = "Count") +
ggtitle("Count of Gender")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Gender ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Gender",y = "Customer Life Time Value", fill="Gender") +
ggtitle("Contribution to CLV by Gender")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Gender ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="mean", width=0.5, fill = "Blue")+
labs(x="Gender",y = "Customer Life Time Value", fill="Gender") +
ggtitle("Mean Contribution to CLV by Gender")
```
On average, female customers contribute slightly more to CLV than male customers, and there are also more female subscribers.
#### To visualize the effect of Sales Channel on CLV.
```{r}
ggplot(Insurance_Dataset,aes (x=Sales.Channel ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Sales Channel",y = "Customer Life Time Value", fill="Sales Channel") +
ggtitle("Contribution to CLV by Sales Channel")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Sales.Channel)) +
geom_bar(stat="count", width=0.5, fill = "Blue") +
labs(x="Policy Type",y = "Count") +
ggtitle("Count of Sales Channel")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Sales.Channel = Insurance_Dataset$Sales.Channel),
FUN = mean)
ggplot(data = aggData, aes(x = Sales.Channel, y = prop.table(x), fill = Sales.Channel, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Sales Channel', y = 'CLV in Percentage', fill = 'Sales Channel') +
ggtitle("CLV Distribution by Sales Channel")
```
In total, the customers procured through agents contribute the most to CLV.
From the third chart, however, all sales channels contribute roughly equally on average, which makes Sales Channel a weak predictor of CLV.
The insurance company should therefore promote whichever channel costs the least to sustain.
#### To visualize the effect of Vehicle Class on CLV
```{r}
ggplot(Insurance_Dataset,aes (x=Vehicle.Class ,
y=Customer.Lifetime.Value)) + geom_bar(stat="summary",fun="sum", width=0.5, fill = "Blue")+
labs(x="Vehicle Class",y = "Customer Life Time Value", fill="Vehicle Class") +
ggtitle("Contribution to CLV by Vehicle Class")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Vehicle.Class)) +
geom_bar(stat="count", width=0.5, fill = "Blue") +
labs(x="Vehicle Class",y = "Count") +
ggtitle("Count of Vehicle Class")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Vehicle.Class = Insurance_Dataset$Vehicle.Class),
FUN = mean)
ggplot(data = aggData, aes(x = Vehicle.Class, y = prop.table(x), fill = Vehicle.Class, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Vehicle Class', y = 'CLV in Percentage', fill = 'Vehicle Class') +
ggtitle("CLV Distribution by Vehicle Class")
```
These charts show that although customers owning Luxury and Luxury SUV vehicles are a small fraction of the total, on average they account for almost 50% of CLV. We can therefore conclude that if a customer owns a Luxury car or Luxury SUV, there is a high likelihood that he or she will have a high CLV.
#### Effect on CLV by Marital Status and Vehicle Class
```{r}
p1 <- plot_ly(Insurance_Dataset, x = ~Marital.Status, y = ~Customer.Lifetime.Value, type = 'bar', color = ~Vehicle.Class)
layout(p1, title = 'CLV w.r.t Marital Status and Vehicle Class', yaxis = list(title = 'CLV'))
```
When we consider Marital Status and Vehicle Class we notice that across all the marital statuses the customers owning Four-Door cars and SUVs are better contributors to CLV.
#### To visualize the effect of Vehicle Size on CLV.
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Vehicle.Size = Insurance_Dataset$Vehicle.Size),
FUN = sum)
ggplot(data = aggData, aes(x = Vehicle.Size, y = prop.table(x), fill = Vehicle.Size, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Vehicle Size', y = 'CLV in Percentage', fill = 'Vehicle Size') +
ggtitle("CLV Distribution by Vehicle Size")
```
```{r}
ggplot(Insurance_Dataset,aes (x=Vehicle.Size)) +
geom_bar(stat="count", width=0.5, fill = "Blue") +
labs(x="Vehicle Size",y = "Count") +
ggtitle("Count of Vehicle Size")
```
```{r}
aggData <- aggregate(x = Insurance_Dataset$Customer.Lifetime.Value,
by=list(Vehicle.Size = Insurance_Dataset$Vehicle.Size),
FUN = mean)
ggplot(data = aggData, aes(x = Vehicle.Size, y = prop.table(x), fill = Vehicle.Size, label = scales::percent(prop.table(x)))) +
geom_bar(stat="identity", position = "dodge") +
geom_text(stat = 'identity', position = position_dodge(.9), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Vehicle Size', y = 'CLV in Percentage', fill = 'Vehicle Size') +
ggtitle("CLV Distribution by Vehicle Size")
```
As we can see, Vehicle Size is a weak predictor because all of its sub-categories contribute equally to CLV on average.
### EDA OF NUMERIC INDEPENDENT VARIABLES VS CLV
#### To visualize the correlation between the variables
A correlation heat map is plotted for all the numeric variables. This also checks for multi-collinearity between variables.
```{r}
autoCorr <- Insurance_Dataset[,c(3,10,13:17,22)]
colnames(autoCorr) <- c("Customer Lifetime Value", "Income", "Months Premium Auto", "Months Since Last Claim", "Months Since Policy Inception",
"Open Complaints", "Num of Policies", "Total Claim Amt.")
autoCorr <- cor(autoCorr)
# Plot the correlation table
corrplot(autoCorr, method = "color", order = "hclust")
```
As is evident from the correlation plot, Monthly Premium Auto and Total Claim Amount are moderately correlated, while the other variables are weakly correlated.
Other than the Monthly Premium Auto and Total Claim Amount pair, negligible multicollinearity is seen among the independent variables.
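For reference, the exact coefficient behind that moderate cell in the heat map can be printed directly:
```{r}
# Pearson correlation between Monthly Premium Auto and Total Claim Amount.
cor(Insurance_Dataset$Monthly.Premium.Auto,
    Insurance_Dataset$Total.Claim.Amount)
```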
#### To explore the relationship between Income and Total Claim Amount
```{r}
plot(x=Insurance_Dataset$"Income", y=Insurance_Dataset$"Total.Claim.Amount", col="Blue", cex=1, xlab="Income",
ylab="Total Claim Amount",main="Scatterplot of Income vs TCA")
```
This chart shows no positive or negative linear relationship between the variables, which suggests that they are independent of each other.
#### To visualize the relationship between Monthly Premium Auto and Total Claim Amount
```{r}
plot(x=Insurance_Dataset$"Monthly.Premium.Auto", y=Insurance_Dataset$"Total.Claim.Amount", col="Blue", cex=1, xlab="Monthly Premium Auto",
ylab="Total Claim Amount",main="Scatterplot of MPA vs TCA")
```
Here we see the relationship between MPA and TCA: a few clusters show a positive linear relationship, as evidenced by the upward slopes in the chart.
#### To visualize the effect of Monthly Premium Auto on CLV
```{r}
plot(x=Insurance_Dataset$"Monthly.Premium.Auto", y=Insurance_Dataset$"Customer.Lifetime.Value", col="Blue", cex=1, xlab="Monthly Premium Auto",
ylab="Customer Lifetime Value",main="Scatterplot of MPA vs CLV")
```
From the scatterplot it is evident that the higher the MPA, the higher the CLV.
#### To visualize the effect of Total Claim Amount on CLV.
```{r}
plot(x=Insurance_Dataset$"Total.Claim.Amount", y=Insurance_Dataset$"Customer.Lifetime.Value", col="Blue", cex=1, xlab="Total.Claim.Amount",
ylab="Customer Lifetime Value", main="Scatterplot of TCA vs CLV")
```
The scatterplot is inconclusive: there is no evidence of a linear relationship between Total Claim Amount and CLV. With no clear upward or downward slope in the chart, the two variables appear to be independent of each other.
## Feature Engineering, Feature Selection and Model Building
### Introduction
With the EDA complete, feature engineering, feature selection and model building were carried out.
### Feature Engineering:
1. Sqrt and log transformations were used in trying out various models.
2. Variations of the relationship between “Monthly Premium Auto” and “Number of Policies” were tried out, but their effect was redundant; see the sketch below.
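A minimal sketch of how one such variation could be specified (the interaction term and model object name are illustrative, not the exact trials we ran):
```{r eval=FALSE}
# Illustrative only: an interaction between Monthly Premium Auto and
# Number of Policies; terms like this proved redundant in our trials.
fit_interaction <- lm(Customer.Lifetime.Value ~
                        Monthly.Premium.Auto * Number.of.Policies,
                      data = Insurance_Dataset)
summary(fit_interaction)
```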
### Feature Selection
Feature selection using Stepwise Regression, Random Forest and ANOVA was carried out. The selected features were used in the various models that were tried, with and without transformation. The details are in the report below.
```{r pressure1, echo=FALSE, fig.cap="Feature Selection", out.width = '100%'}
knitr::include_graphics("C:/Users/HP/Documents/R/bobo/1.jpeg")
```
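For orientation, the stepwise branch of the selection could be run along these lines (a sketch assuming the MASS package; the exact formula, scope and excluded columns used in our trials varied):
```{r eval=FALSE}
# Sketch of AIC-based stepwise selection (assumes the MASS package;
# note that loading MASS masks dplyr::select).
library(MASS)
full_model <- lm(Customer.Lifetime.Value ~ . - Customer - Effective.To.Date,
                 data = Insurance_Dataset)
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)
```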
### Model Building
#### 1. Random Forest.
One of the models chosen was Random Forest. Since the data had many outliers, Random Forest was selected for its resilience to them: the algorithm itself is not inherently robust to outliers, but the base learner on which it is built, the decision tree, is. The R² and adjusted R² values (in percent) of the RF model were 96% or higher across all trials with different variable sets, as shown in the table below:
```{r pressure2, echo=FALSE, fig.cap="Summary ", out.width = '100%'}
knitr::include_graphics("C:/Users/HP/Documents/R/bobo/2.jpeg")
```
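For reference, a representative trial looked roughly like the following (a sketch assuming the randomForest package; the excluded columns and hyperparameters varied per trial as tabulated above):
```{r eval=FALSE}
# Sketch of one Random Forest trial (assumes the randomForest package).
library(randomForest)
rf_fit <- randomForest(Customer.Lifetime.Value ~ . - Customer - Effective.To.Date,
                       data = Insurance_Dataset,
                       ntree = 500, importance = TRUE)
rf_fit              # prints % variance explained for regression forests
varImpPlot(rf_fit)  # which predictors the forest leans on
```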
Based on research carried out online for this model, Random Forest can be biased towards specific factors (such as categorical variables with many levels); moreover, it produced exceptionally high scores for every experiment, which did not seem realistic. As a result, it was decided not to go ahead with Random Forest.
* In trial 2, the variables Customer, State, Response, Effective To Date, Income, Policy Type and Income_Bin were excluded. These exclusions were arrived at through trial and error, as the model gave better results without them.
#### 2. Linear Regression.
The second model was Linear Regression. A summary of the various models tried is given below:
##### SUMMARY : REGRESSION
```{r pressure3, echo=FALSE, fig.cap="Not Scaled", out.width = '100%'}
knitr::include_graphics("C:/Users/HP/Documents/R/bobo/3.jpeg")
```
```{r pressure4, echo=FALSE, fig.cap="Scaled", out.width = '100%'}
knitr::include_graphics("C:/Users/HP/Documents/R/bobo/4.jpeg")
```
## CONCLUSIONS
Various models were tried and their performance measures tabulated as given above. The models were built initially without feature selection and later using feature selection methods. A summary of the features selected in the various models is shown in a subsequent section.
#### Without feature selection or feature engineering:
1. If outliers are not removed, performance is extremely poor, even after removing the least significant features.
2. If outliers are removed from the indicator variables, the Adjusted R² improves to approx. 63%. Removing insignificant variables or applying binning has hardly any effect on the Adjusted R² value. Removing these outliers discards approx. 8% of the records.
3. When outliers are removed from the response variable, CLV, performance dramatically improves to 93%. Binning does not change this figure. Removing these outliers discards approx. 12% of the records.
4. Even with CLV outliers removed, outliers remain in Total Claim Amount and Monthly Premium Auto. Removing outliers from these features as well discards a total of 16.88% of the records, yet performance differs from point 3 only from the third decimal place onwards, so removing all outliers brings only marginal improvement.
5. Steps 3 and 4 were carried out only to check the effect of removing outliers from the response variable; this approach was not adopted.
#### With Feature Selection and/or sqrt/log transformation:
1. The Adjusted R² value remains around 63% irrespective of whether outliers are removed and whether binning is done. Therefore, we retain all outliers as indicators of variation in the data and do not carry out binning; instead, we scale the data.
2. Applying a sqrt transformation to the response variable raises the Adjusted R² to approx. 79-80%.
3. Applying a log transformation to the response variable yields Adjusted R² values in the region of 89-90%.
4. The best values are obtained in the model where all the features are used and the log transformation is applied to the response variable.
5. “Monthly Premium Auto” was omitted from modelling due to its correlation with “Total Claim Amount”.
#### Pertinent Take-aways:
1. Binning has a negligible effect on performance.
2. Transforming the response variable (sqrt/log) significantly improves performance compared with the untransformed variants.
3. Removing outliers improved performance but also caused a significant loss of data, so outliers were retained.
4. Removing the least significant features by hand hardly improved performance, so formal feature selection techniques were employed instead.
5. Converting “Number of Open Complaints” and “Number of Policies” to factors improved the accuracy of the models.
### Best Model
The best model was the one which used all the features and applied the log transformation to the response variable.
#### Code for the model with the best performance
```{r}
df<- read.csv("C:\\Users\\HP\\Downloads\\Marketing-Customer-Value-Analysis.csv")
str(df)
glimpse(df)
```
```{r}
#-----------------------------------Min Max normalization all numeric variables (Scaling)---------------------------#
#Income
df$Income<- (df$Income-min(df$Income))/(max(df$Income)-min(df$Income))
```
```{r}
#Months.Since.Last.Claim
df$Months.Since.Last.Claim<- (df$Months.Since.Last.Claim-min(df$Months.Since.Last.Claim))/(max(df$Months.Since.Last.Claim)-min(df$Months.Since.Last.Claim))
```
```{r}
#Months.Since.Policy.Inception
df$Months.Since.Policy.Inception<- (df$Months.Since.Policy.Inception-min(df$Months.Since.Policy.Inception))/(max(df$Months.Since.Policy.Inception)-min(df$Months.Since.Policy.Inception))
```
```{r}
#Total.Claim.Amount
df$Total.Claim.Amount<- (df$Total.Claim.Amount-min(df$Total.Claim.Amount))/(max(df$Total.Claim.Amount)-min(df$Total.Claim.Amount))
```
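The four chunks above apply the same min-max formula column by column; an equivalent, more compact form is sketched below (shown with eval=FALSE so the columns are not scaled twice).
```{r eval=FALSE}
# Equivalent, compact form of the four normalization chunks above.
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
num_cols <- c("Income", "Months.Since.Last.Claim",
              "Months.Since.Policy.Inception", "Total.Claim.Amount")
df[num_cols] <- lapply(df[num_cols], min_max)
```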
```{r}
# Converting categorical features to factors.
df$State <- as.factor(df$State)
df$Response <- as.factor(df$Response)
df$Coverage <- as.factor(df$Coverage)
df$Education <- as.factor(df$Education)
df$EmploymentStatus <- as.factor(df$EmploymentStatus)
df$Gender <- as.factor(df$Gender)
df$Location.Code <- as.factor(df$Location.Code)
df$Marital.Status <- as.factor(df$Marital.Status)
df$Policy.Type <- as.factor(df$Policy.Type)
df$Renew.Offer.Type <- as.factor(df$Renew.Offer.Type)
df$Policy <- as.factor(df$Policy)
df$Sales.Channel <- as.factor(df$Sales.Channel)
df$Vehicle.Class <- as.factor(df$Vehicle.Class)
df$Vehicle.Size <- as.factor(df$Vehicle.Size)
```
```{r}
# Converting Number of Open Complaints and Number of Policies to factors as well.
df$Number.of.Open.Complaints <- as.factor(df$Number.of.Open.Complaints)
df$Number.of.Policies <- as.factor(df$Number.of.Policies)
str(df)
```
```{r}
#-----------------------------------Log transformation only on CLV---------------------------#
df$Customer.Lifetime.Value=log(df$Customer.Lifetime.Value)
```
```{r}
#-----------------------------------Splitting the data to test and train.---------------------------#
split <- sample.split(df$Customer.Lifetime.Value, SplitRatio = 0.7)
train <- subset(df, split == TRUE)
test <- subset(df, split == FALSE)
head(train)
```
```{r}
#Training the model with normalized data columns and log transformed clv
fit3<- lm(Customer.Lifetime.Value ~ State+Response+Coverage+
Education+EmploymentStatus+Gender+
Income+Location.Code+Marital.Status+
Months.Since.Last.Claim+Months.Since.Policy.Inception+
Number.of.Open.Complaints+Number.of.Policies+Policy+ Renew.Offer.Type+Sales.Channel+Total.Claim.Amount+Vehicle.Class+Vehicle.Size , data=train)
summary(fit3)
```
```{r}
#Finding RSE
sigma(fit3)
```
```{r}
# computing test MSE
test %>%
add_predictions(fit3) %>%
summarise(MSE = mean((Customer.Lifetime.Value - pred)^2))
```
```{r}
# computing train MSE
train %>%
add_predictions(fit3) %>%
summarise(MSE = mean((Customer.Lifetime.Value - pred)^2))
```
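Because the model is fit on log(CLV), the MSE values above are on the log scale; predictions can be brought back to the original dollar scale with exp(). A short sketch:
```{r}
# Back-transform log-scale predictions to the original CLV scale.
pred_clv <- exp(predict(fit3, newdata = test))
head(pred_clv)
```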
```{r}
#Plotting the model.
plot(fit3)
```
```{r}
##################################### Checking of Assumption ############################################
# 1. In the residuals vs fitted plot there is no funnel shape in the residuals, so the assumption of homoskedasticity is satisfied.
# 2. The points in the centre of the Q-Q plot follow the reference line. The trailing portion deviates slightly, but the leading portion deviates significantly, indicating non-adherence to normality. A log transformation has therefore been applied to the target variable; its histogram, displayed below, resembles the normal curve.
# 3. Residuals are spread equally along the ranges of the predictors (a horizontal line with randomly spread points), again indicating homoskedasticity.
# 4. Even though there are some extreme values, the regression line is more or less straight.
```
```{r}
# Plot of Log transformed CLV
hist(df$Customer.Lifetime.Value,
breaks = 800,
freq = FALSE,
main = "CLV Histogram", xlab = "CLV", border = "Blue")
```
```{r}
# Residuals should be uncorrelated: there should be no autocorrelation.
# Null H0: the residuals from the linear regression are uncorrelated.
# The D-W statistic should be close to 2.
durbinWatsonTest(fit3)
# Since the p-value is > 0.05, we fail to reject H0 (no autocorrelation).
```
```{r}
# Checking multicollinearity
vif(fit3)
# Ideally VIF values should be within 2, and in no case should they exceed 10.
# Since all values fall between 1 and 4, there is no evidence of serious multicollinearity.
```
```{r}
# After checking the assumptions of the linear regression model, we can say they appear to be largely satisfied.
```
### Additional Data that could have better predicted the outcome variable
1. In order to better analyze the CLV of a customer, the company needs to make sure the customer stays with them. This can be measured using the Customer Retention Rate (CRR): the number of customers retained by a company over a certain time period, expressed as a percentage of the company's existing customers who remain loyal within that time frame. A worked example follows this list.
CRR = [(E - N) / S] × 100
where:
* S = the number of existing customers at the start of the time period
* E = the total number of customers at the end of the time period
* N = the number of new customers added within the time period
2. If a customer has made no claims for a period of n years, the company can provide perks such as a "no claim bonus", which would reduce the premium payable at the next renewal while keeping the insurance cover at the same or a higher value. We have data on "Months Since Last Claim"; additional data on customers' responses to a "no claim bonus" offer would help us analyze customer preferences and hence better predict customers with consistent CLV.
3. If the cost of sustaining each Sales Channel were provided, we could have analyzed the sustainability cost of each channel and determined which Sales Channel contributes most effectively to CLV. Based on the outcome, we could suggest courses of action for the company to reduce such overheads.
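A worked example of the CRR formula from item 1 (all figures are hypothetical):
```{r}
# Hypothetical CRR computation; figures are made up for illustration.
S <- 1000  # existing customers at the start of the period
E <- 950   # total customers at the end of the period
N <- 150   # new customers added during the period
CRR <- ((E - N) / S) * 100
CRR  # 80: 80% of the starting customers were retained
```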
### Contribution of team members
To coordinate the team's activities, formal meetings were held every day at 1200 hrs and 1900 hrs, besides informal meetings on Teams, WhatsApp calls, phone calls and innumerable chats at all hours of the day.
Initially everybody carried out EDA. The results were discussed, and further courses of action on the EDA were agreed and carried out. Once the EDA was satisfactorily done, feature selection and model building began. Joel, Alex and Tashi carried out feature selection using Stepwise Regression. Sarah, Aishwarya and Bart carried out feature selection using Random Forest and ANOVA. Sarah, Aishwarya and Bart built models using Random Forest, implementing various models with and without transformation as brought out in the report above. Joel, Alex and Tashi constructed various models using Regression, with and without transformation as well as with and without scaling, as brought out in the report above. Both sub-teams also experimented with the retention and deletion of outliers and with the effect of binning.
Finally, having compared the various results and zeroed in on the best-performing model, the report was prepared, drawing from the EDA and model building done earlier. All members sat together and constructed the report, vetting all aspects of it collectively.
All members were involved at every stage of the process from beginning to end.
### References
https://www.geeksforgeeks.org/random-forest-approach-in-r-programming/
https://github.com/abhiyerasi/CLV-Auto-Insurance
http://www.sthda.com/english/wiki/two-way-anova-test-in-r
http://www.sthda.com/english/wiki/kruskal-wallis-test-in-r
https://www.kaggle.com/juancarlosventosa/models-to-improve-customer-retention
https://medium.com/@aravanshad/gradient-boosting-versus-random-forest-cfa3fa8f0d80
https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why
https://medium.com/@TheDataGyan/day-8-data-transformation-skewness-normalization-and-much-more-4c144d370e55
https://anshikaaxena.medium.com/how-skewed-data-can-skrew-your-linear-regression-model-accuracy-and-transfromation-can-help-62c6d3fe4c53
http://r-statistics.co/Variable-Selection-and-Importance-With-R.html
https://dataaspirant.com/feature-selection-techniques-r/