STAT660-01_f18-team-2_project2_data_analysis_by_YL.sas
*******************************************************************************;
**************** 80-character banner for column width reference ***************;
* (set window width to banner width to calibrate line length to 80 characters *;
*******************************************************************************;
*
This file uses the following analytic dataset to address several research
questions regarding the decisions of bank clients to subscribe to a term
deposit at a Portuguese banking institution.
Dataset Name: bank_analysis created in external file
STAT660-01_f18-team-2_project2_data_preparation.sas, which is assumed to be in
the same directory as this file
See included file for dataset properties
;
* environmental setup;
* set relative file import path to current directory (using standard SAS trick);
X "cd ""%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))))""";
* load external file that generates the analytic dataset bank_analysis;
%include '.\STAT660-01_f18-team-2_project2_data_preparation.sas';
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question: How is the duration of the last phone call distributed across subscription outcomes?'
;
title2
"Rationale: According to the data dictionary, the duration attribute strongly affects the response variable, which is the outcome of subscription."
;
footnote1
"The boxplots of phone call duration by subscription outcome show highly skewed distributions."
;
footnote2
"In addition, it is hard to judge whether the difference is significant by only eyeballing the boxplots."
;
*
Note: This essentially examines duration against the response variable y in
the original bank_subscriber and bank_nonsubscriber datasets.
Methodology: Use boxplots to examine the distribution of the attribute of
interest, and to visually compare that attribute between the two subgroups.
Limitations: Judging a difference by eye is sometimes hard.
Followup Steps: Run a statistical test comparing the two subgroups on the same
attribute.
;
proc sort
data=bank_analysis
;
by
y
;
run;
proc univariate
data = bank_analysis
plot
;
var duration
;
by
y
;
run;
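*
Illustrative sketch, not part of the original analysis: PROC SGPLOT with a
VBOX statement is one possible way to draw the side-by-side boxplots described
in the methodology above. It assumes bank_analysis contains the numeric column
duration and the grouping column y, as used in the preceding steps.
;
proc sgplot
data = bank_analysis
;
vbox duration / category = y
;
run;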
title2
"To compare duration between the two subscription outcomes statistically, use a t test"
;
footnote1
"The t test yields a very small p value, indicating that mean duration differs significantly between the two subscription outcomes"
;
footnote2
"That is to say, the duration of the last phone call is associated with the outcome of subscription, consistent with the data dictionary"
;
proc ttest
data = bank_analysis
;
var
duration
;
class
y
;
run;
title;
footnote;
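*
Illustrative sketch, not part of the original analysis: because the duration
distributions are highly skewed, a Wilcoxon rank-sum test is one possible
nonparametric cross-check of the t test above. It assumes y has exactly two
levels in bank_analysis.
;
title1
'Sketch: Wilcoxon rank-sum test of duration by subscription outcome'
;
proc npar1way
data = bank_analysis
wilcoxon
;
class
y
;
var
duration
;
run;
title;
footnote;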
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question: How do the social and economic attributes affect the outcome of subscription?'
;
title2
'Rationale: This would help to determine how important the social and economic attributes are to the response variable'
;
footnote1
"A logistic regression model reveals that 4 out of 5 social/economic attributes are significant."
;
footnote2
"That is to say, the quarterly employment variation rate, the monthly consumer price index, the monthly consumer confidence index and the quarterly number of employees are associated with the clients' decision to subscribe to a term deposit."
;
footnote3
"In other words, the broader economic environment is strongly associated with clients' decisions on whether to put their money in the bank."
;
*
Note: This compares the columns of social/economic attributes in the bank_se
dataset to the outcome of subscription column y in the data_subscriber and
data_nonsubscriber datasets.
Methodology: The outcome/response variable is binary, thus a logistic
regression model was used to see which of the social/economic attributes
affect the subscription decision of bank clients.
Limitations: This model only takes the social/economic attributes into
consideration to show their effects. A combined model that also includes the
campaign activities of the bank and client-level information might change the
significance of these SE attributes.
Followup Steps: A full model with all the possible attributes should be
fitted.
;
proc logistic
data = bank_analysis
;
model
y = emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed
;
run;
title;
footnote;
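*
Illustrative sketch, not part of the original analysis: one possible step
toward the combined model mentioned in the limitations above. It adds the
campaign variables campaign, previous and poutcome, which are used elsewhere
in this file, to the social/economic attributes. Treating poutcome as a CLASS
variable is an assumption carried over from that other step.
;
title1
'Sketch: Logistic regression combining social/economic and campaign attributes'
;
proc logistic
data = bank_analysis
;
class
poutcome
;
model
y = emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed
campaign previous poutcome
;
run;
title;
footnote;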
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question: How do the campaign activities of the bank affect the decision of customers to subscribe?'
;
title2
'Rationale: This would help to determine the efficiency of the bank campaigns'
;
footnote1
"A logistic regression model was built to check how the previous campaigns and contacts affect the decision of subscription"
;
footnote2
"Small p values indicate that these attributes are significant"
;
footnote3
"In other words, the campaign activities of the bank do influence the clients' decision to subscribe to a term deposit"
;
*
Note: This compares the campaign activities of the bank with the response
variable.
Methodology: A logistic regression model is used because the response variable
is binary.
Limitations: Although logistic regression makes relatively few assumptions, it
still has assumptions that must hold for the model to be used properly. On the
other hand, many machine learning algorithms make fewer assumptions and can
predict accurately, for example kNN for numeric predictor variables, and
neural networks.
Followup Steps: Other machine learning prediction tools might help to produce
simpler and more accurate predictions.
;
proc logistic
data=bank_analysis
;
class
poutcome
;
model
y = campaign previous poutcome
;
run;
title;
footnote;
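*
Illustrative sketch, not part of the original analysis: one possible way to
try the kNN idea from the followup steps above is PROC DISCRIM with the
METHOD=NPAR and K= options. The choice k = 5 and the use of only the numeric
predictors campaign and previous are assumptions made for illustration.
CROSSVALIDATE requests leave-one-out error estimates and may be slow on large
datasets.
;
title1
'Sketch: k-nearest-neighbor classification of subscription outcome'
;
proc discrim
data = bank_analysis
method = npar
k = 5
crossvalidate
;
class
y
;
var
campaign previous
;
run;
title;
footnote;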