forked from stat660-f18/team-1_project2
-
Notifications
You must be signed in to change notification settings - Fork 0
/
STAT660-01_f18-team-1_project2_data_analysis_by_JD.sas
executable file
·193 lines (146 loc) · 5.25 KB
/
STAT660-01_f18-team-1_project2_data_analysis_by_JD.sas
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
*******************************************************************************;
**************** 80-character banner for column width reference ***************;
* (set window width to banner width to calibrate line length to 80 characters *;
*******************************************************************************;
*
This file uses the following analytic dataset to address several research
questions regarding loan amounts and statistics on loans
Dataset Name: STAT660-01_f18-team-1_project2_data_preparation.sas
See included file for dataset properties
;
* environmental setup;
* set relative file import path to current directory (using standard SAS trick);
X "cd ""%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))))""";
* load external file that generates analytic datasets Loanstat_analytic_file_v1,
Loanstat_analytic_file_h1, Loanstat_analytic_file_v1_sorted, and Loanstat_
analytic_file_h1_sorted;
%include '.\STAT660-01_f18-team-1_project2_data_preparation.sas';
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question:What are the top three members that had the highest annual income?'
;
title2
'Rationale: This should help identify three members that earn the most annually.'
;
footnote1
"From the output, we notice that the top three members that has the highest annual income are member 1301, 1135, and 1238, with income of 340000, 285000, and 267525, respectively."
;
footnote2
"Given the result, we can add some steps to figure out why some members with high income still borrow money from LendingClub. Try to find out the purpose of it."
;
*
Note: This compares column "annual_income" and "member_id" from dataset
Loanstat1 and Loanstat3
Methodology: Use PROC PORT to sort the annual income in the combined dataset
descendingly and use PROC PRINT to output the top 3 member id accordingly.
Limitations: The highest top 3 annual income data are not visualized.
Follow Up: We can add PROC SGPLOT statement to plot a bar graph, making the
annual income differences more explicit.
;
proc sort
data=Loanstat_analytic_file_v1
out=Loanstat_analytic_file_v1_sorted
;
by
descending annual_inc
;
run;
proc print
data=Loanstat_analytic_file_v1_sorted(obs=3)
;
id
member_id
;
var
annual_inc
;
run;
proc sgplot
data=Loanstat_analytic_file_v1_sorted(obs=3)
;
vbar
member_id/response=annual_inc
;
run;
title;
footnote;
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question: What is the average loan amount for each state?'
;
title2
'Rationale: This shows the lending situation from each state.'
;
footnote1
"From the output, we can see the 5-number summaries for loan amount in each state. For example, the mean loan amount for California is 15063.24."
;
footnote2
"Focusing on the mean loan amount, the lowest one is in MS. Investigation could be performed to find out the reason of it."
;
*
Note: This compares column "loan_amount", "state", and "member_id" from dataset
Loanstat1 and Loanstat2
Methodology: Use the PROC MEANS statement to compute the mean loan amount.
Limitations: We cannot know the distribution of the loan amount for each state.
Follow Up: Add min, median, and max in the PROC MEANS statement to compute the
five-number summaries.
;
proc means
data=Loanstat_analytic_file_h1
;
var
loan_amnt
;
class
addr_state
;
run;
title;
footnote;
*******************************************************************************;
* Research Question Analysis Starting Point;
*******************************************************************************;
title1
'Research Question: What is the purpose of the highest loan amount?'
;
title2
'Rationale: This would help identify what is the use of the largest amount of money borrowed.'
;
footnote1
"From the output, we notice that the highest loan amount is 40000. The money is used for credit card."
;
footnote2
"We should try to understand the meaning for each purpose in the datasets, since the description of purpose is not detailed"
;
*
Note: This compares column "loan_amount", "purpose",and "member_id" from
dataset Loanstat1 and Loanstat2
Methodology: Use PROC SORT statement to sort the loan amount descendingly and
find out the purpose of it.
Limitations: The currency symbol for the loan amount is not clear.
Follow Up: Use PROC FORMAT to add dollar sign in the data output.
;
proc sort
data=Loanstat_analytic_file_h1
out=Loanstat_analytic_file_h1_sorted
;
by
descending loan_amnt
;
run;
proc print
data=Loanstat_analytic_file_h1_sorted(obs=1)
;
id
purpose
;
var
loan_amnt
;
run;
title;
footnote;