From 37e7ff4e922431f1cb95701fd21dd8bca39b85c1 Mon Sep 17 00:00:00 2001
From: Aleksander Chlebowski <114988527+chlebowa@users.noreply.github.com>
Date: Wed, 24 Jan 2024 17:03:34 +0100
Subject: [PATCH] 251 update vignettes@main (#276)

Closes #251

---------

Signed-off-by: Aleksander Chlebowski <114988527+chlebowa@users.noreply.github.com>
Co-authored-by: Vedha Viyash <49812166+vedhav@users.noreply.github.com>
---
 R/teal_data.R                           | 12 ++--
 man/cdisc_data.Rd                       |  3 +-
 man/teal_data.Rd                        |  7 ++-
 vignettes/teal-data-reproducibility.Rmd | 84 +++++++++++++++++++------
 vignettes/teal-data.Rmd                 | 35 ++++++-----
 5 files changed, 100 insertions(+), 41 deletions(-)

diff --git a/R/teal_data.R b/R/teal_data.R
index e455f67a1..5e37ba35a 100644
--- a/R/teal_data.R
+++ b/R/teal_data.R
@@ -6,11 +6,15 @@
 #' Universal function to pass data to teal application.
 #'
 #' @param ... any number of objects (presumably data objects) provided as `name = value` pairs.
-#' @param join_keys (`join_keys`) object or a single (`join_key_set`) object
 #'
-#' (optional) object with dataset column relationships used for joining.
-#' If empty then no joins between pairs of objects
-#' @param code (`character`, `language`) code to reproduce the datasets.
+#' @param join_keys (`join_keys`) object or a single (`join_key_set`) object.
+#'
+#' (optional) object with dataset column relationships used for joining.
+#' If empty then no joins between pairs of objects.
+#'
+#' @param code (`character`, `language`) optional code to reproduce the datasets provided in `...`.
+#' Note this code is not executed and the `teal_data` may not be reproducible
+#'
 #' @param check (`logical`) reproducibility check - whether to perform a check that the pre-processing
 #' code included in the object definitions actually produces those objects.
 #' If `check` is true and preprocessing code is empty an error will be thrown.
diff --git a/man/cdisc_data.Rd b/man/cdisc_data.Rd
index 315e1b28c..cecd579eb 100644
--- a/man/cdisc_data.Rd
+++ b/man/cdisc_data.Rd
@@ -20,7 +20,8 @@ cdisc_data(
 If empty then it would be automatically derived basing on intersection of datasets primary keys.
 For ADAM datasets it would be automatically derived.}
 
-\item{code}{(\code{character}, \code{language}) code to reproduce the datasets.}
+\item{code}{(\code{character}, \code{language}) optional code to reproduce the datasets provided in \code{...}.
+Note this code is not executed and the \code{teal_data} may not be reproducible}
 
 \item{check}{(\code{logical}) reproducibility check - whether to perform a check that the pre-processing
 code included in the object definitions actually produces those objects.
diff --git a/man/teal_data.Rd b/man/teal_data.Rd
index 6f30c45b9..a8b6b6008 100644
--- a/man/teal_data.Rd
+++ b/man/teal_data.Rd
@@ -14,12 +14,13 @@ teal_data(
 \arguments{
 \item{...}{any number of objects (presumably data objects) provided as \code{name = value} pairs.}
 
-\item{join_keys}{(\code{join_keys}) object or a single (\code{join_key_set}) object
+\item{join_keys}{(\code{join_keys}) object or a single (\code{join_key_set}) object.
 
 (optional) object with dataset column relationships used for joining.
-If empty then no joins between pairs of objects}
+If empty then no joins between pairs of objects.}
 
-\item{code}{(\code{character}, \code{language}) code to reproduce the datasets.}
+\item{code}{(\code{character}, \code{language}) optional code to reproduce the datasets provided in \code{...}.
+Note this code is not executed and the \code{teal_data} may not be reproducible}
 
 \item{check}{(\code{logical}) reproducibility check - whether to perform a check that the pre-processing
 code included in the object definitions actually produces those objects.
diff --git a/vignettes/teal-data-reproducibility.Rmd b/vignettes/teal-data-reproducibility.Rmd
index c1b289c48..a54e85618 100644
--- a/vignettes/teal-data-reproducibility.Rmd
+++ b/vignettes/teal-data-reproducibility.Rmd
@@ -1,7 +1,9 @@
 ---
 title: "teal_data reproducibility"
 author: "NEST CoreDev"
-output: rmarkdown::html_vignette
+output:
+  rmarkdown::html_vignette:
+    toc: true
 vignette: >
   %\VignetteIndexEntry{teal_data reproducibility}
   %\VignetteEngine{knitr::rmarkdown}
@@ -10,34 +12,50 @@ vignette: >
 
 # Reproducibility of `teal_data` objects
 
-Reproducibility is a primary function of the `qenv` class, which `teal_data` inherits from. Every data modification in `teal_data` object is performed in an encapsulated environment, separate from the global environment.
+Reproducibility is a primary function of the `qenv` class, which `teal_data` inherits from.
+Every data modification in a `teal_data` object is performed in an encapsulated environment, separate from the global environment.
 
-It is important to note that the reproducibility of this object is limited only to the data-code relationship. Other aspects such as the reliability of the data source, reproducibility of the R session (including package versions), and creation and use of objects from other environments (e.g. `.GlobalEnvironment`) cannot be verified properly by the `teal_data`.
+It is important to note that the reproducibility of this object is limited only to the data-code relationship.
+Other aspects such as the reliability of the data source, reproducibility of the R session (including package versions), and creation and use of objects from other environments (e.g. `.GlobalEnv`) cannot be verified properly by `teal_data`.
+It is advisable to always begin analysis in a new session and run all code that pertains to the analysis within the `teal_data` object.
 
-## Verification status
+## Verification
+
+### Verification status
+
+Every `teal_data` object has a _verification status_, which is a statement of whether the contents of the `env` can be reproduced by `code`.
+From this perspective, `teal_data` objects that are instantiated empty are _verified_ but ones instantiated with data and code are _unverified_ because the code need not be reproducible.
+Obviously, `teal_data` objects instantiated with data only are _unverified_ as well.
 
-`teal_data` objects that are instantiated empty are created as _verified_. Objects can be modified only through `eval_code()` and `within()` functions to ensure that the code is saved in the object and always evaluated in the correct environment. Evaluating code in a `teal_data` object _does not_ change its verification status.
+When evaluating code in a `teal_data` object, the code that is stored is the same as the code that is executed, so it is reproducible by definition.
+Therefore, evaluating code in a `teal_data` object _does not_ change its verification status.
+
+The verification status is always printed when inspecting a `teal_data` object.
+Also, when retrieving code, unverified objects add a warning to the code stating that it has not passed verification.
 
 ```{r, message=FALSE, error=TRUE}
 library(teal.data)
 
-my_data <- teal_data()
-my_data <- within(my_data, data <- data.frame(x = 11:20))
-my_data <- within(my_data, data$id <- seq_len(nrow(data)))
-my_data # is verified
-```
+data_empty <- teal_data()
+data_empty # is verified
+data_empty <- within(data_empty, i <- head(iris))
+data_empty # remains verified
 
-Find out more in the `teal_data()` documentation.
+data_with_data <- teal_data(i = head(iris), code = "i <- head(iris)")
+data_with_data # is unverified
+data_with_data <- within(data_with_data, i$rand <- sample(nrow(i)))
+data_with_data # remains unverified
+cat(get_code(data_with_data)) # warning is prepended
+```
 
-## Verification
+### Verification process
 
-`teal_data` objects instantiated with data and code run the risk of the code not being reproducible.
-Such objects will be created in an _unverified_ state.
-To confirm that the code exactly reproduces the variables stored in the object, one must run the `verify()` function.
-If the verification succeeds, the object's state will be changed to _verified_, otherwise an error will be raised.
-When retrieving code, unverified objects will always add a warning to the code stating that it has not passed verification.
+In order to confirm that the code stored in `teal_data` exactly reproduces the contents of the environment, one must run the `verify()` function.
+This causes the code to be evaluated and the results to be compared to the contents of the environment.
+If the code executes without errors and the results are the same as the contents already present in the environment, the verification is successful and the object's state will be changed to _verified_.
+Otherwise an error will be raised.
 
-### verified
+#### verified
 
 ```{r}
 library(teal.data)
@@ -55,7 +73,7 @@ data_right <- teal_data(
 (data_right_verified <- verify(data_right)) # returns verified object
 ```
 
-### unverified
+#### unverified
 
 ```{r, message=FALSE, error=TRUE}
 data_wrong <- teal_data(
@@ -67,3 +85,31 @@ data_wrong <- teal_data(
 verify(data_wrong) # fails verification, raises error
 ```
 
+## Retrieving code
+
+The `get_code` function is used to retrieve the code stored in a `teal_data` object.
+A simple `get_code()` will return the entirety of the code but using the `datanames` argument allows for obtaining a subset of the code that only deals with some of the objects stored in `teal_data`.
+
+```{r}
+library(teal.data)
+
+data <- within(teal_data(), {
+  i <- iris
+  m <- mtcars
+  head(i)
+})
+cat(get_code(data)) # retrieve all code
+cat(get_code(data, datanames = "i")) # retrieve code for `i`
+```
+
+Note that when retrieving code for a specific dataset, the result is only the code used to _create_ that dataset, not code that _uses_ it.
+
+## Tracking object dependencies
+
+Calling `get_code` with `datanames` specified initiates an analysis of the stored code, in which object dependencies are automatically discovered.
+If object `x` is created with an expression that uses object `y`, the lines that create object `y` must also be returned.
+This is quite effective when objects are created by simple assignments like `x <- foo(y)`.
+However, in rare cases discovering dependencies is impossible, _e.g._ when opening connections to databases or when objects are created by side effects (functions acting on their calling environment implicitly rather than returning a value that is then assigned).
+In such cases the code author must manually tag code lines that are required for a dataset by adding a special comment to the lines: `# @linksto x` will cause the line to be included when retrieving code for `x`.
+
+See `?get_code` for a detailed explanation and examples.
diff --git a/vignettes/teal-data.Rmd b/vignettes/teal-data.Rmd
index 0547a5b74..687b136d9 100644
--- a/vignettes/teal-data.Rmd
+++ b/vignettes/teal-data.Rmd
@@ -10,18 +10,21 @@ vignette: >
 
 # Introduction
 
-The `teal.data` package enables `teal` application developers to convert their data into a format which can be used
-inside `teal` applications. `teal_data` class inherits from [`qenv`](https://insightsengineering.github.io/teal.code/latest-tag/articles/qenv.html) and is meant to be used for reproducibility purposes.
+The `teal.data` package specifies the data format used in `teal` applications.
+The `teal_data` class inherits from [`qenv`](https://insightsengineering.github.io/teal.code/latest-tag/articles/qenv.html) and is meant to be used for reproducibility purposes.
 
 ## Quick Start
 
-To create a `teal_data` class object, use the `teal_data` function. `teal_data` has a number of methods which allow
-to set and get relevant information from private class slots.
+To create an object of class `teal_data`, use the `teal_data` function.
+`teal_data` has a number of methods to manage relevant information in private class slots.
 
-```{r, results = 'hide'}
+```{r, results = 'hide', message = FALSE}
 library(teal.data)
 
+# create teal_data object
 my_data <- teal_data()
+
+# run code within teal_data to create data objects
 my_data <- within(
   my_data,
   {
@@ -30,34 +33,35 @@ my_data <- within(
   }
 )
 
-# get variables from teal_data environment
+# get objects stored in teal_data
 my_data[["data1"]]
 my_data[["data1"]]
 
 # get reproducible code
 get_code(my_data)
 
-# set/get datanames
+# get or set datanames
 datanames(my_data) <- c("data1", "data2")
 datanames(my_data)
 
-# concise print
+# print
 print(my_data)
 ```
 
 ## `teal_data` characteristics
 
-`teal_data` class object keeps the following information:
-- `env` - an environment containing data.
-- `code` - a string containing code to reproduce `env` (details in [teal_data reproducibility ](teal-data-reproducibility.html)).
-- `datanames` - a character vector with names of datasets.
+A `teal_data` object keeps the following information:
+
+- `env` - an environment containing data.
+- `code` - a string containing code to reproduce `env` (details in [reproducibility](teal-data-reproducibility.html)).
+- `datanames` - a character vector listing objects of interest to `teal` modules (details in [this `teal` vignette](https://insightsengineering.github.io/teal/latest-tag/articles/including-data-in-teal-applications.html)).
 - `join_keys` - a `join_keys` object defining relationships between datasets (details in [Join Keys](join-keys.html)).
 
 ### Reproducibility
 
 The primary function of `teal_data` is to provide reproducibility of data.
-We recommend to initialize empty `teal_data`, which marks object as _verified_, and create data sets with code evaluated in the object, using `within` or `eval_code`.
+We recommend to initialize empty `teal_data`, which marks object as _verified_, and create datasets by evaluating code in the object, using `within` or `eval_code`.
 Read more in [teal_data Reproducibility](teal-data-reproducibility.html).
 
 ```{r}
@@ -70,7 +74,10 @@ my_data # is verified
 
 ### Relational data models
 
-The `teal_data` class can be extended to support relational data. The `join_keys` function can be used to specify relationships between datasets. See more in [join_keys](join-keys.html).
+The `teal_data` class supports relational data.
+Relationships between datasets can be described by joining keys and stored in a `teal_data` object.
+These relationships can be read or set with the `join_keys` function.
+See more in [join_keys](join-keys.html).
 
 ```{r}
 my_data <- teal_data()