*******************************************************************************;
**************** 80-character banner for column width reference ***************;
* (set window width to banner width to calibrate line length to 80 characters *;
*******************************************************************************;
*
This file uses the following analytic dataset to address several research
questions regarding absenteeism at work from July 2007 to July 2010 at a
courier company in Brazil

Dataset Name: Absenteeism_analytic_file created in external file
STAT660-01_f18-team-2_project1_data_preparation.sas, which is assumed to be
in the same directory as this file

See included file for dataset properties
;
* environmental setup;
* set relative file import path to current directory (using standard SAS trick);
X "cd ""%substr(%sysget(SAS_EXECFILEPATH),1,%eval(%length(%sysget(SAS_EXECFILEPATH))-%length(%sysget(SAS_EXECFILENAME))))""";
* load external file that generates analytic dataset Absenteeism_analytic_file;
%include '.\STAT660-01_f18-team-2_project1_data_preparation.sas';
title1
'Research Question: What does the distribution of worker absenteeism look like?'
;
title2
'Rationale: This gives a general picture of absenteeism in the company, including the mean and median absent hours and the seasonal changes in absenteeism.'
;
footnote1
'Based on the summary table, the average absenteeism at work is 6.99 hours, with a median of 3 hours.'
;
footnote2
'The histogram shows that the distribution of absenteeism is right-skewed, with most absences lasting between 0 and 10 hours.'
;
footnote3
'After grouping by employee, we can see that a few employees have many absent hours while others have very few, which explains the skewness of the histogram.'
;
*
Methodology: First use PROC MEANS to compute the mean and median of the
variable of interest, and then use PROC UNIVARIATE to draw a histogram of the
distribution of absenteeism time in hours.

Limitations: Because the experimental unit (the employee) and the measurement
unit (a single absence record) differ, the mean and median of variables that
describe employee properties have no practical meaning.

Follow-up Steps: Use a CLASS statement in PROC MEANS to get the summary
statistics for each employee.
;
proc means
        mean median maxdec=2
        data=absenteeism_analytic_file
    ;
    var
        Absenteeism_time_in_hours
    ;
run;

proc univariate
        noprint
        data=absenteeism_analytic_file
    ;
    var
        Absenteeism_time_in_hours
    ;
    histogram;
run;

proc means
        mean median maxdec=2
        data=absenteeism_analytic_file
    ;
    class
        id
    ;
    var
        Absenteeism_time_in_hours
    ;
run;
title;
footnote;
title1
'Research Question: Is daily workload a factor that affects the absenteeism of employees?'
;
title2
'Rationale: The answer to this question will help the company manager evaluate whether the employees are overloaded.'
;
footnote1
'The simple logistic regression gives a p-value of 0.3866 for the coefficient of daily workload, which suggests that daily workload is not a significant factor affecting absenteeism.'
;
footnote2
'A simple linear regression of absent hours on daily workload led to the same conclusion: workload is not a statistically significant predictor of absenteeism at work.'
;
*
Methodology: Fit a logistic regression to see whether daily workload affects
employee absenteeism. Because the raw dataset has no binary variable indicating
absenteeism, a new variable called absence was created and used as the response
in the logistic regression.

Limitations: Because every employee has repeated measurements in the dataset,
the observations are not independent, so the regression results are unreliable:
employees who are absent more often carry more weight when evaluating factors
of absenteeism.

Possible Follow-up Steps: Compute the total absenteeism for every worker and
create a subset of the dataset that has one row for each of the 36 employees,
with a binary variable indicating whether he or she was ever absent.
;
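proc logistic
        data=Absenteeism_analytic_file
    ;
    * model the probability that absence = 1, assuming absence is coded 0/1
      (by default, PROC LOGISTIC models the lower ordered response value);
    model
        absence(event='1') = Work_load_Average_day
    ;
run;

* name the input dataset explicitly instead of relying on the most recently
  created dataset, and end the interactive PROC GLM step with QUIT;
proc glm
        data=Absenteeism_analytic_file
    ;
    model
        absenteeism_time_in_hours = Work_load_Average_day
    ;
run;
quit;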
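
*
[Sketch, not part of the original analysis] One possible way to carry out the
follow-up step noted above: collapse the file to one row per employee and flag
whether that employee was ever absent. The names absence_by_employee,
total_absent_hours, and ever_absent are hypothetical and introduced only for
illustration. The sketch assumes absence is a numeric 0/1 indicator, as
suggested by its use in WHERE statements elsewhere in this file.
;
proc sql;
    create table absence_by_employee as
        select
             id
            ,sum(Absenteeism_time_in_hours) as total_absent_hours
            ,max(absence) as ever_absent
        from
            Absenteeism_analytic_file
        group by
            id
    ;
quit;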
title;
footnote;
title1
'Research Question: Which employees have more absenteeism?'
;
title2
'Rationale: Identifying the employees with the most absenteeism helps decide whether their absences are due to personal issues or to reasons common among other employees. In other words, it helps decide whether such an employee is an outlier for our linear regression analysis.'
;
footnote1
'The frequency table of absence reasons for each worker is given. In this table, a few workers have only a single entry, with the reason listed as NA, which means that they were never absent from work.'
;
footnote2
'The modified table considers only actual absences and removes the workers who never missed work. The third employee had 111 leaves, which means that he or she could be an outlier in determining the total absenteeism at work in this company.'
;
footnote3
'The bar graph helps to quickly identify the employees with the most absences. The worker with ID 3 is certainly an observation we want to look into during analysis.'
;
*
Methodology: Use a two-way frequency table to take a glance at the common
absenteeism reasons for each employee. Each row represents the situation for
one worker.

Limitations: It is hard to compare the frequency distributions of different
workers from the table alone.

Possible Follow-up Steps: Plot a histogram of the reason frequencies and then
compare the histograms between workers. Data visualization often helps to
quickly identify differences.
;
proc freq
        data=Absenteeism_analytic_file
    ;
    tables
        id*reason_for_absence
        / nopercent norow nocol
    ;
    format
        reason_for_absence reasonofabsence.
    ;
run;

proc freq
        data=Absenteeism_analytic_file
    ;
    tables
        id*reason_for_absence
        / nopercent norow nocol
    ;
    where
        absence = 1
    ;
    format
        reason_for_absence reasonofabsence.
    ;
run;

proc sgplot
        data=absenteeism_analytic_file
    ;
    vbar
        ID
    ;
    where
        absence = 1
    ;
run;
quit;
title;
footnote;
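
*
[Sketch, not part of the original analysis] One possible way to carry out the
follow-up step noted for the last research question: draw one bar chart of
absence-reason counts per employee so the charts can be compared side by side.
The 3-by-4 panel layout and the title text are arbitrary choices made for
illustration. The sketch assumes the format reasonofabsence. is defined by the
included data-preparation file, as it is used earlier in this file.
;
title1
'Sketch: Frequency of absence reasons for each employee'
;
proc sgpanel
        data=Absenteeism_analytic_file
    ;
    where
        absence = 1
    ;
    panelby
        id
        / columns=4 rows=3
    ;
    vbar
        reason_for_absence
    ;
    format
        reason_for_absence reasonofabsence.
    ;
run;
title;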