---
title: "Data curation"
format: pdf
editor: visual
---
## Data curation file for the pedagogical extended prone project
This file shows an example of how to curate your data: how to load the raw file, process it into a usable form, and evaluate the quality of the data set, including the possible presence of outliers.
First, let's load our toolbox: the packages and the custom functions for curating the data.
```{r}
library(here)
source(here("code", "R", "packages.R"))
source(here("code", "R", "functions_data.R"))
```
Let's load the data:
```{r}
demog <- upload_demog(here("random_data", "random_demog.xlsx"))
```
```{r}
demog
```
What are the columns of the table?
```{r}
names(demog)
```
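Once the column names are known, it can help to check them against the columns you expect and to look at their types. The chunk below is a minimal sketch on a toy data frame: `toy_demog` and `expected_columns` are hypothetical stand-ins, not the project's real data or schema.

```{r}
# Toy data frame standing in for demog (hypothetical values, not project data)
toy_demog <- data.frame(
  id  = 1:3,
  age = c(34, 51, 47),
  bmi = c(22.1, 27.4, 31.0)
)

# Hypothetical list of columns we expect to find in the table
expected_columns <- c("id", "age", "sex", "bmi")

# Expected columns that are absent from the table
missing_columns <- setdiff(expected_columns, names(toy_demog))
missing_columns

# Storage class of each column, to spot numbers read as text
column_classes <- sapply(toy_demog, class)
column_classes
```

Here `setdiff()` returns the expected names that are not found, so an empty result means every expected column is present.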
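Are there any missing values?
```{r}
# Count the missing values in each column, reshape to one row per variable,
# and keep only the variables with at least one missing value.
# Assumes tidyr is installed (pivot_longer() is called with its namespace).
number_of_variable_with_missing_values <- demog %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  tidyr::pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing != 0) %>%
  dim()  # first element = number of variables with missing values
```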
`r number_of_variable_with_missing_values[1]` variables have missing values in the demog data frame.
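To see which variables are affected and how many values each one is missing, a per-column count is often enough. The chunk below is a minimal base R sketch on a toy data frame; `toy_demog` and its values are hypothetical and only stand in for the real `demog` table.

```{r}
# Toy data frame standing in for demog (hypothetical values, not project data)
toy_demog <- data.frame(
  id  = 1:5,
  age = c(34, NA, 51, 47, NA),
  bmi = c(22.1, 27.4, NA, 31.0, 24.8)
)

# Number of missing values in each column
missing_per_variable <- colSums(is.na(toy_demog))

# Keep only the columns that have at least one missing value
variables_with_missing <- missing_per_variable[missing_per_variable > 0]
variables_with_missing
```

The same two lines should work on `demog` itself, assuming it is a regular data frame or tibble.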
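Let's examine the distribution of the `bmi` column:
```{r}
# The ECDF ranges from 0 to 1, so the y axis is a proportion, not a percentage
ggplot(demog, aes(x = bmi)) +
  stat_ecdf() +
  xlab("BMI in kg/m^2") +
  ylab("Cumulative proportion") +
  ggtitle("Empirical cumulative distribution function of the BMI variable")
```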
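A long flat tail at either end of the curve suggests possible outliers. One common way to flag them numerically is the 1.5 × IQR rule; this is an illustrative choice, not necessarily the rule used in this project. The chunk below is a minimal sketch on a toy vector, where `toy_bmi` and its values are hypothetical.

```{r}
# Toy BMI values (hypothetical, not project data)
toy_bmi <- c(19.5, 21.0, 22.4, 23.1, 24.8, 25.3, 26.7, 28.0, 54.2)

# First and third quartiles, ignoring missing values
quartiles <- unname(quantile(toy_bmi, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- quartiles[2] - quartiles[1]

# Fences of the 1.5 x IQR rule
lower_fence <- quartiles[1] - 1.5 * iqr
upper_fence <- quartiles[2] + 1.5 * iqr

# Values outside the fences are flagged as possible outliers
possible_outliers <- toy_bmi[!is.na(toy_bmi) &
                               (toy_bmi < lower_fence | toy_bmi > upper_fence)]
possible_outliers
```

A flagged value is only a candidate for review: whether it is a data entry error or a genuine extreme value has to be checked against the raw file.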