---
title: 'DSC Report: 2024'
author: "Tim Dennis"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
    toc_depth: 3
  word_document:
    toc: true
    toc_depth: '3'
  html_document:
    toc: true
    toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(googledrive)
library(tidyverse)
library(scales)
options(stringsAsFactors = FALSE)
library(readr)
library(RColorBrewer)
library(viridis)
library(kableExtra)
library(tidyr)
library(htmlTable)
library(stringr)
library(janitor)
library(lubridate)
library(calecopal)
library(tidytext) # for NLP
library(wordcloud) # to render wordclouds
library(DT) # for dynamic tables
library(tm)
library(topicmodels)
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(forcats)
# Load the extrafont package
library(extrafont)
library(showtext)
#font_import()
loadfonts(device = "pdf")
# Use showtext to handle fonts
showtext_auto()
# Define the custom theme
custom_theme <- theme_minimal() +
theme(
text = element_text(family = "Arial", color = "#333333"),
axis.title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 10, face = "italic", hjust = 1),
panel.grid.major = element_blank(), # Remove major grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
panel.background = element_rect(fill = "#FFFFFF", color = NA),
plot.background = element_rect(fill = "#FFFFFF", color = NA)
)
# Set the custom theme as the default theme
theme_set(custom_theme)
```
```{r read-data, include=FALSE, message=FALSE}
# Source the R script
#source("src/data_cleaning.R")
source("src/data_clean2.R")
# Load the cleaned and merged data
dsc_consult <- readRDS("data/dsc_consult_merged.rds")
# Load the standardized data
load("data/standardized_data.rda")
load("data/ucla_workshops.rda")
dataverse <- read_csv('data/datasets_files_published_monthly.csv')
```
## Events & Workshops
The DSC puts on events for the UCLA community and, in collaboration with other campuses, for the larger UC system. We manage the [UCLA Carpentries program](https://www.library.ucla.edu/about/programs/the-carpentries/) and provide a community for over 15 instructors on campus. We have also catalyzed a UC-wide Carpentries community and, during the pandemic, developed collaborative programming and workshop events. The success of this model led to the collaborative development of [UC Love Data Week](https://uc-love-data-week.github.io/) in 2021 and subsequent years. In a similar vein, [UC GIS Week](https://uc-gis-ucop.hub.arcgis.com/pages/uc-gis-week-2023) was started in 2020 by GIS professionals in the UC system. We think these UC collaborative educational events will be a permanent fixture even as we offer more traditional local instruction.
Our workshops typically address skills gaps in data science and foundational coding for researchers, staff, and librarians. We also contribute to curricula and train-the-trainer best practices through a global network.
To get a sense of the growth of events organized, taught, and designed by DSC instructors, let us look at attendance over time:
### Attendance over time
```{r workshops_year, echo=FALSE}
# Count workshop records per year; rows with a missing date are dropped
dsc_workshops %>%
  count(Year = year(date), name = "Number") %>%
  drop_na() %>%
  ggplot(aes(reorder(Year, -Year), Number)) +
  geom_col(fill = "#2774AE") +
  coord_flip() +
  labs(x = "Year", y = "Attendance", title = "DSC Event Attendance by Year") +
  theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
  geom_text(aes(label = Number), color = "white", hjust = 1.2)
```
### Departments, Schools & Units
```{r number-depts, echo=FALSE}
num_depts_wkshp <- ucla_workshops %>% drop_na(standardized_department) %>% filter(institution == "UCLA") %>% distinct(standardized_department) %>% nrow()
num_depts_wkshp_2023 <- ucla_workshops %>% filter(date >= "2023-01-01" & date <= "2023-12-31") %>% drop_na(standardized_department) %>% filter(institution == "UCLA") %>% distinct(standardized_department) %>% nrow()
```
Since 2017, our workshops have drawn attendees from **`r num_depts_wkshp`** different departments, schools, centers or units at UCLA.
The departments that attend our workshops most often are:
```{r departments_attendance, echo=FALSE}
top_departments <- ucla_workshops %>%
filter(!is.na(standardized_department) & standardized_department != "NULL" & standardized_department != "" & standardized_department != "NA") %>%
count(standardized_department, sort = TRUE, name = "Attendance") %>%
head(15)
# Plotting
ggplot(top_departments, aes(x = reorder(standardized_department, -Attendance), y = Attendance)) +
geom_col(fill = "#2774AE") +
coord_flip() +
labs(x = "Department", y = "Attendance",
title = "Top Departments by Attendance",
subtitle = "2017-2024") +
theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
geom_text(aes(label = Attendance), color = "white", hjust = 1.2)
```
### Affiliation of Attendee: 2017-24
The chart below shows the affiliation of learners who attend our workshops.
```{r attendee_status, echo=FALSE}
dsc_workshops %>%
  # filter(date >= "2020-07-01" & date <= "2021-06-30") %>%
  count(status, sort = TRUE, name = "Attendance") %>%
  drop_na(status) %>%
  head(5) %>%
  # Percentages are shares within the five largest status groups shown
  mutate(Percentage = Attendance / sum(Attendance) * 100) %>%
  ggplot(aes(reorder(status, -Attendance), Attendance)) +
  geom_col(fill = "#2774AE") +
  coord_flip() +
  labs(x = "Status", y = "Attendance", title = "Attendance by Status", subtitle = "2017-2024") +
  theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), color = "white", hjust = 1.2)
```
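Because `Percentage` in the chunk above is computed after `head(5)`, each label is a share of the five groups shown, not of all attendance. The sketch below uses hypothetical counts (the `toy` data and the `share_all` and `share_top` columns are illustrative, not taken from the workshop data) to show how the two denominators differ.

```r
library(dplyr)

# Hypothetical attendance counts by status (not real DSC data)
toy <- tibble(
  status = c("Graduate", "Staff", "Undergraduate", "Faculty"),
  Attendance = c(50, 25, 15, 10)
)

shares <- toy %>%
  # Share of all attendance, computed before trimming to the top groups
  mutate(share_all = Attendance / sum(Attendance) * 100) %>%
  slice_max(Attendance, n = 2) %>%
  # Share within the retained groups only, as in the chart above
  mutate(share_top = Attendance / sum(Attendance) * 100)

shares
```

With these toy numbers the largest group is 50% of all attendance but about two thirds of the two retained groups, so the choice of denominator is worth stating alongside the chart.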
## Consultations
We work with researchers one-on-one to help them accomplish their research goals. Prior to the pandemic, we provided consultations both online and in person, depending on the researcher's preference. We estimate that before March 2020 approximately 10-15% of our consultations took place online. During the pandemic, we moved the service online only, and in 2023 we restarted in-person consulting on a smaller scale as a pilot. Regardless of how users access the service, demand has grown over time. With the move to the first floor of YRL in 2024, we anticipate more demand due to the improved visibility of our location and the addition of more walk-in hours.
```{r consulting, echo=FALSE}
# Count consultations per calendar year
dsc_consult %>%
  filter(!is.na(start_date_time)) %>%
  group_by(year = year(start_date_time)) %>%
  summarize(consults = n()) %>%
  arrange(year) %>%
  ggplot(aes(year, consults)) +
  geom_col(fill = "#2774AE") +
  coord_flip() +
  labs(x = "Year", y = "Consults") +
  theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
  geom_text(aes(label = consults), color = "white", hjust = 1.2)
```
### Consulting by Department
```{r number-depts-consult, echo=FALSE}
num_depts <- dsc_consult %>% drop_na(department) %>% distinct(department) %>% nrow()
num_depts_2023 <- dsc_consult %>% filter(year(start_date_time) == 2023) %>% drop_na(department) %>% distinct(department) %>% nrow()
```
Since 2019, when we started capturing more user information, our consultations have come from **`r num_depts`** different departments, schools, or centers. We include **The Library** in this number because, as a secondary and sometimes tertiary referral point, we often work with liaison librarians and internal units needing data support. In 2023, we provided data services to **`r num_depts_2023`** campus departments.
Historically, we didn't require patrons to provide departmental information in our appointment scheduler, so the data may be incomplete. While we've normalized departments, schools, and centers as much as possible, some variability remains. Despite this, the data shows that DSC services are interdisciplinary, aligning with our vision to broaden the library's data services.
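As an illustration of the kind of normalization described above, the sketch below maps free-text department entries onto a standard label with `case_when()`. The sample values, the matching patterns, and the `dept_clean` and `standardized` columns are hypothetical. The actual rules live in the cleaning script and are not shown in this report.

```r
library(dplyr)
library(stringr)

# Hypothetical free-text entries, as patrons might type them
raw <- tibble(
  department = c("Poli Sci", "political science", "Library ", "UCLA Library", NA)
)

normalized <- raw %>%
  mutate(
    # Trim whitespace and lowercase so that patterns match consistently
    dept_clean = str_to_lower(str_squish(department)),
    standardized = case_when(
      str_detect(dept_clean, "^poli") ~ "Political Science",
      str_detect(dept_clean, "library") ~ "The Library",
      TRUE ~ department # leave unmatched or missing values unchanged
    )
  )

count(normalized, standardized)
```

Entries that match no pattern pass through unchanged, which is one reason some variability can remain after normalization.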
```{r top_ten_depts, echo=FALSE}
dsc_consult %>%
select(department) %>%
drop_na() %>%
filter(department != 'DSC') %>%
rename(Departments = department) %>%
count(Departments) %>%
rename(Number = n) %>%
arrange(desc(Number)) %>%
head(15) %>%
ggplot(aes(x=reorder(Departments, -Number), y=Number)) +
geom_col(fill = "#2774AE") +
coord_flip() +
scale_fill_manual(values=cal_palette("kelp1")) +
labs(x= "Department", y="Consults", title = "Consulting by Department", subtitle = "FY 2019-2024") +
theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
geom_text(aes(label=Number), color = "white", hjust = 1.2)
```
### Researcher Status
When provided, we also collect information on our users' status.
```{r consults_status, echo=FALSE}
# Combine specific categories into standardized names
dsc_consult <- dsc_consult %>%
mutate(ucla_affiliation = case_when(
ucla_affiliation %in% c("Graduate", "Graduate Student", "Visiting Graduate Student") ~ "Graduate Student",
ucla_affiliation %in% c("Undergrauate 3rd & Undergraduate", "Undergraduate") ~ "Undergraduate",
TRUE ~ ucla_affiliation
))
# Plot the data, showing only the top 6 categories
dsc_consult %>%
select(ucla_affiliation) %>%
drop_na() %>%
count(ucla_affiliation) %>%
rename(Number = n) %>%
arrange(desc(Number)) %>%
top_n(6, Number) %>% # Show only the top 6 categories
ggplot(aes(x = reorder(ucla_affiliation, -Number), y = Number)) +
geom_col(fill = "#2774AE") +
coord_flip() +
labs(x = "Status", y = "Number", title = "Consultations by Status") +
theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
geom_text(aes(label = Number), color = "white", hjust = 1.2, nudge_y = -0.1)
```
## DataSquad Work Overview
The DataSquad team supports data services at UCLA through a range of activities. Their work falls into two categories: direct interactions with patrons and tasks assigned by DSC staff consultants, often managed through Trello boards.
### Direct Consults with Researchers
```{r echo=FALSE}
# Summarize direct interactions
direct_interactions <- dsc_consult %>%
filter(group == "Datasquad", !(department %in% "DSC")) %>%
group_by(ucla_affiliation) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Get total number of consults including NAs
total_consults <- sum(direct_interactions$count)
```
DataSquad members frequently interact with faculty, staff, and students to provide consultations and support. The total number of consults provided by DataSquad is [`r total_consults`]{style="font-size: 20px; font-weight: bold;"}.
```{r echo=FALSE}
# Extract year from start_date_time and filter for Datasquad and non-DSC department
yearly_interactions <- dsc_consult %>%
filter(group == "Datasquad", !(department %in% "DSC")) %>%
mutate(year = year(ymd_hms(start_date_time))) %>%
group_by(year) %>%
summarise(total_count = n()) %>%
arrange(desc(year))
# Plot for Direct Interactions by Year
ggplot(yearly_interactions, aes(x = year, y = total_count)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Total Consults by Year for DataSquad", x = "Year", y = "Number of Consults") +
theme_minimal() +
coord_flip()
```
The following chart summarizes the number of consults by patron type for the top 4 categories:
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Get top 4 patron types excluding NAs
top_4_interactions <- direct_interactions %>%
filter(!is.na(ucla_affiliation)) %>%
top_n(4, count)
# Define the color palette
#colors <- c("#2774AE", "#FFD700", "#FF4500", "#32CD32")
# Plot for Direct Interactions (Top 4)
ggplot(top_4_interactions, aes(x = reorder(ucla_affiliation, -count), y = count)) +
geom_col(fill = "steelblue") +
#scale_fill_manual(values = colors) +
labs(title = "Top 4 Direct Interactions by Status for DataSquad", x = "Patron Type", y = "Number of Consults") +
theme_minimal() +
coord_flip()
```
### DataSquad Assigned Work via Trello
In addition to direct consultations with researchers, DataSquad handles a significant number of project tasks assigned by the DSC team via our Trello board. These tasks include various research support activities, managed and tracked by DSC staff who act as mentors. The following chart provides an overview of the tasks handled by DataSquad over the past year:
```{r eval=TRUE, include=FALSE}
# Filter data for the past year and department == "DSC"
trello_subset_filtered <- trello_subset %>%
  mutate(start_date_time = ymd_hms(start_date_time)) %>%
  # Compare Date with Date; comparing a POSIXct value to a Date compares the raw numbers
  filter(as_date(start_date_time) >= today() - years(1), department == "DSC")
# Summarize tasks by month
monthly_tasks <- trello_subset_filtered %>%
mutate(month = floor_date(start_date_time, "month")) %>%
group_by(month) %>%
summarise(total_tasks = n())
# Ensure the month column is of class Date
monthly_tasks <- monthly_tasks %>%
mutate(month = as.Date(month))
# View the summarized data
print(monthly_tasks)
```
```{r echo=FALSE}
# Plot for Monthly Tasks
ggplot(monthly_tasks, aes(x = month, y = total_tasks)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Tasks Handled by DataSquad Over the Past Year",
x = "Month",
y = "Number of Tasks") +
theme_minimal() +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
```{=html}
<!--
### DSC Consulting Requests by Topic
When a researcher requests a meeting, we ask for the reason for the appointment so we can be prepared. Below is a graph using text analysis to extract the top terms and bigrams from those reasons. One interesting aspect this illustrates is the prominence of tools and software needed to accomplish research tasks, highlighting the importance of software in data-intensive research.
-->
```
```{r echo=FALSE, eval=FALSE}
# Source the text analysis script
source("src/text_analysis.R")
# Load the combined tokens
combined_tokens <- readRDS("data/combined_tokens.rds")
# View the first few rows
#head(combined_tokens)
```
```{r consult_topics, echo=FALSE, eval=FALSE}
remove_bigrams <- c("data", "project", "analysis", "geospatial", "geospatial data",
"data gis", "wrangling", "cleaning", "discuss", "meeting",
"research", "https", "time", "create", "files", "google",
"data data", "collection", "cleaning data", "planning",
"management", "set", "forward", "sharing", "tool", "file",
"analysis coding", "management planning", "twitter", "access",
"statistical", "hoping", "manipulation", "consulting", "smartcard", "smartcard inline", "inline", "https docs", "library", "drive", "docs google", "datasquad", "docs")
# Remove specific words, e.g., "data"
filtered_tokens <- combined_tokens %>%
filter(!bigram %in% remove_bigrams)
# Combine "programming", "coding", and "coding programming" into "coding/programming"
filtered_tokens <- filtered_tokens %>%
mutate(bigram = if_else(bigram %in% c("programming",
"coding", "coding programming", "code"),
"coding/programming", bigram))
# Count the frequency of each bigram
bigram_counts <- filtered_tokens %>%
count(bigram, sort = TRUE)
# View the top bigrams
#head(bigram_counts, 20)
# Pick top 15
top_bigrams <- bigram_counts %>%
top_n(15, n) %>%
arrange(desc(n)) %>%
mutate(bigram = tools::toTitleCase(bigram))
# Visualize the top bigrams with similar styling to the first chart
ggplot(top_bigrams, aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "#2774AE") +
coord_flip() +
labs(title = "Top 15 Topics in Consulting Requests",
x = "Topics",
y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
geom_text(aes(label = n), color = "white", hjust = 1.2)
```
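The bigram extraction itself happens in `src/text_analysis.R`, which is not reproduced here. As a rough sketch of how such a step can work, the example below tokenizes hypothetical appointment reasons into bigrams with `tidytext::unnest_tokens()` and counts them. The `reasons` data frame and its `id` and `text` columns are assumptions for illustration only.

```r
library(dplyr)
library(tidytext)

# Hypothetical appointment reasons (not real patron text)
reasons <- tibble(
  id = 1:3,
  text = c(
    "Help with data cleaning in R",
    "Data cleaning and data visualization",
    "Questions about data visualization in Python"
  )
)

bigrams <- reasons %>%
  # Split each reason into overlapping two-word tokens (lowercased by default)
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(bigrams, 3)
```

A removal list such as `remove_bigrams` above would then be applied to the `bigram` column before plotting.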
## Data Repositories & Infrastructure
### UCLA Dataverse
A dataset in UCLA Dataverse is a collection of files. The chart below shows how the number of published datasets has grown since we started Dataverse in 2019. The large jump in 2020 reflects the addition of metadata from the existing Social Science Data Archive collection that we had previously curated in Dataverse.
```{r dataverse-datasets-published, echo=FALSE}
# Custom theme
custom_theme <- theme_minimal() +
theme(
text = element_text(family = "Arial", color = "#333333"),
axis.title = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 10, face = "italic", hjust = 1),
panel.grid.major = element_line(color = "#D3D3D3"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#FFFFFF", color = NA),
plot.background = element_rect(fill = "#FFFFFF", color = NA)
)
# Create the plot
ggplot(dataverse, aes(x = ym(date), y = datasets_published)) +
geom_line(color = "#2774AE", linewidth = 1.5) +
#geom_point(color = "#2774AE", size = 3) +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
labs(
title = "Datasets Published Over Time",
subtitle = "Monthly Data Publication Trends",
x = "Year",
y = "Number of Datasets Published",
caption = "Source: UCLA Data Science Center"
) + custom_theme
```
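One way to locate a jump like the 2020 increase is to difference the monthly totals. The sketch below assumes that `datasets_published` holds a cumulative count per month and that `date` is a year-month string, as the `ym(date)` call above suggests. The `toy` values and the `month` and `added` columns are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical monthly totals; assumes the column holds cumulative counts
toy <- tibble(
  date = c("2019-11", "2019-12", "2020-01", "2020-02"),
  datasets_published = c(10, 12, 150, 153)
)

monthly_change <- toy %>%
  mutate(
    month = ym(date),
    # Datasets added since the previous month (NA for the first month)
    added = datasets_published - lag(datasets_published)
  ) %>%
  arrange(desc(added))

monthly_change
```

Sorting by `added` puts the month with the largest increase first, which makes a one-time bulk load easy to distinguish from steady growth.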
File growth in Dataverse is another metric to consider when assessing the usage of UCLA Dataverse as a resource on campus. Each file in Dataverse is addressable by a unique DOI.
```{r dataverse-files-published, echo=FALSE}
# Reuses custom_theme (with major grid lines) as defined in the previous chunk
# Create the plot
ggplot(dataverse, aes(x = ym(date), y = files_published)) +
geom_line(color = "#2774AE", linewidth = 1.5) +
#geom_point(color = "#2774AE", size = 3) +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
labs(
title = "Files Published Over Time",
subtitle = "Monthly File Publication Trends",
x = "Year",
y = "Number of Files Published",
caption = "Source: UCLA Data Science Center"
) +
custom_theme
```
### Geospatial Services