diff --git a/inst/dev-guide/dg_debug_rtables.Rmd b/inst/dev-guide/dg_debug_rtables.Rmd
new file mode 100644
index 000000000..a89cc0d71
--- /dev/null
+++ b/inst/dev-guide/dg_debug_rtables.Rmd
@@ -0,0 +1,91 @@
+---
+title: "Debugging in `rtables` and Beyond"
+author: "Davide Garolini"
+date: '`r Sys.Date()`'
+output:
+  html_document:
+    theme: spacelab
+editor_options:
+  chunk_output_type: console
+knit: (function(inputFile, encoding) {
+  rmarkdown::render(inputFile, encoding = encoding, output_dir = ".")})
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## Debugging
+
+This is a short and non-comprehensive guide to debugging `rtables`. Regardless, it is to be considered valid for personal use at your discretion.
+
+#### Coding in Practice
+
+* Good code is easy to read and makes problems easy to find
+* Code is not clever if it is impossible to debug
+
+#### Some Definitions
+
+* __Coding Error__ - Code does not do what you intended -> Bug in the punch card
+* __Unexpected Input__ - Defensive programming FAIL FAST FAIL LOUD (FFFL) -> useful and not too time consuming
+* __Bug in Dependency__ -> avoid dependencies if you can!
+
+#### Considerations About FFFL
+
+Errors should be as close as possible to the source. For example, bad inputs should be found very early. The worst possible case is software that silently gives incorrect results. Common things that we can catch early are missing values, column `length == 0`, or `length > 1`.
+
+#### General Suggestions
+
+* A robust code base does not attempt possibly problematic operations.
+* Read error messages.
+* `debugcall` lets you add the signature (formals).
+* `trace` is powerful because you can specify the reaction.
+* The `tracer` expression is very good and precise for finding where a problem happens.
+
+`options(error = recover)` is one of the best debugging tools, as it allows you to step into any point of the function call sequence when an error occurs.
+
+`dump.frames` and `debugger`: the former saves the call frames to a file or an object, and you then call `debugger` to step through them as you would with `recover`.
+
+#### `warn` Global Option
+
+- `<0` warnings are ignored
+- `0` warnings are stored and printed after the top-level function call returns
+- `1` warnings are printed immediately as they occur
+- `>=2` warnings are turned into errors
+
+`<<-` within `recover` or `debugger` lets you assign an object to the global environment for later inspection.
+
+#### lo-fi debugging
+
+* `print()` / `cat()` is always a low-level debugging option. It is helpful for server jobs where maybe only terminal or console output is available and no `browser()` can be used. For example, you can print the position or state of a function at a certain point until you find the breakpoint.
+* commenting out blocks -> does not work inside pipes (you can use `identity()`, a step that does nothing but does not break the pipe)
+* `browser()` bombing
+
+#### Regression Tests
+
+Almost every bug should become a regression test.
+
+#### Debugging with Pipes
+
+* Pipes make code nicer to write but horrible to debug
+* The tee pipe `%T>%` lets you print an intermediate value midway through a pipeline
+* `debug_pipe()` -> like the tee pipe, but it enters `browser()`
+
+#### Shiny Debugging
+
+More difficult due to reactivity.
+
+#### General Suggestion
+
+DO NOT BE CLEVER WITH CODE - ONLY IF YOU HAVE TO, CLEVER IS ALSO SUBJECTIVE AND IT WILL CHANGE WITH TIME.
+
+## Debugging in `rtables`
+
+We invite the smart developer to use the provided examples as a way to get an "interactive" and dynamic view of the internal algorithms as they are routinely executed when constructing tables with `rtables`.
This is achieved by using `browser()` and `debugonce()` on internal and exported functions (`rtables:::` or `rtables::`), as we will see in a moment. We invite you to continuously and autonomously explore the multiple `S3` and `S4` objects that constitute the complexity and power of `rtables`. To do so, we will use the following functions: + +* `methods(generic_function)`: This function lists the methods that are available for a generic function. Specifically for `S4` generic functions, `showMethods(generic_function)` gives more detailed information about each method (e.g. inheritance). +* `class(object)`: This function returns the class of an object. If the class is not one of the built-in classes in R, you can use this information to search for its documentation and examples. `help(class)` may be informative as it will call the documentation of the specific class. Similarly, the `?` operator will bring up the documentation page for different `S4` methods. For `S3` methods it is necessary to postfix the class name with a dot (e.g. `?summary.lm`). +* `getClass(class)`: This describes the type of class in a compact way, the slots that it has, and the relationships that it may have with the other classes that may inherit from or be inherited by it. With `getClass(object)` we can see to which values the slots of the object are assigned. It is possible to use `str(object, max.level = 2)` to see less formal and more compact descriptions of the slots, but it may be problematic when there are one or more objects in the class slots. Hence, the maximum number of levels should always be limited to 2 or 3 (`max.level = 2`). Similarly, `attributes()` can be used to retrieve some information, but we need to remember that storing important variables in this way is not encouraged. Information regarding the type of class can be retrieved with `mode()` and indirectly by `summary()` and `is.S4()`. +*`getAnywhere(function)` is very useful to get the source code of internal functions and specific generics. It works very well with `S3` methods, and will display the relevant namespace for each of the methods found. Similarly, `getMethod(S4_generic, S4_class)` can retrieve the source code of class-specific `S4` methods. +* `eval(debugcall(generic_function(obj)))`: this is a very useful way to browse a `S4` method, specifically for a defined object, without having to manually insert `browser()` into the code. It is also possible to do similarly with R > 3.4.0 where `debug*()` calls can have the triggering signature (class) specified. Both of these are modern and simplified wrappers of the tracing function `trace()`. diff --git a/inst/dev-guide/dg_split_machinery.Rmd b/inst/dev-guide/dg_split_machinery.Rmd new file mode 100644 index 000000000..0f8632a22 --- /dev/null +++ b/inst/dev-guide/dg_split_machinery.Rmd @@ -0,0 +1,997 @@ +--- +title: "Split Machinery" +author: "Davide Garolini" +date: '`r Sys.Date()`' +output: + html_document: + theme: spacelab + toc: true + toc_float: + collapsed: false +editor_options: + chunk_output_type: console +knit: (function(inputFile, encoding) { + rmarkdown::render(inputFile, encoding = encoding, output_dir = ".")}) +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +## Disclaimer + +This article is intended for use by developers only and will contain low-level explanations of the topics covered. For user-friendly vignettes, please see the [Articles](https://insightsengineering.github.io/rtables/main/articles/index.html) page on the `rtables` website. 
+ +Any code or prose which appears in the version of this article on the `main` branch of the repository may reflect a specific state of things that can be more or less recent. This guide describes very important pieces of the split machinery that are unlikely to change. Regardless, we invite the reader to keep in mind that the current repository code may have drifted from the following material in this document, and it is always the best practice to read the code directly on `main`. + +Please keep in mind that `rtables` is still under active development, and it has seen the efforts of multiple contributors across different years. Therefore, there may be legacy mechanisms and ongoing transformations that could look different in the future. + +Being that this a working document that may be subjected to both deprecation and updates, we keep `xxx` comments to indicate placeholders for warnings and to-do's that need further work. + +## Introduction + +The scope of this article is understanding how `rtables` creates facets by splitting the incoming data into hierarchical groups that go from the root node to singular `rcell`s. The latter level, also called the leaf-level, contains the final partition that is subjected to analysis functions. More details from the user perspective can be found in the [Split Functions vignette](https://insightsengineering.github.io/rtables/main/articles/split_functions.html) and in function documentation like `?split_rows_by` and `?split_funcs`. + +The following article will describe how the split machinery works in the row domain. Further information on how the split machinery works in the column domain will be covered in a separate article. + +## Process and Methods + +Beforehand, we encourage the reader to familiarize themselves with the Debugging in `rtables`(xxx link here) article from the `rtables` Developers Guide. This document is generally valid for R programming, but has been tailored to study and understand complex packages that rely heavily on S3 and S4 object programming like `rtables`. + +Here, we explore and study the split machinery with a growing amount of complexity, following relevant functions and methods throughout their execution. By going from basic to complex and by discussing important and special cases, we hope to be able to give you a good understanding of how the split machinery works. + +In practice, the majority of the split engine resides in the source file `R/split_funs.R`, with occasional incursion into `R/make_split_fun.R` for custom split function creation and rarer references to other more general tabulation files. + +## `do_split` + +The split machinery is so fundamental to `rtables` that relevant functions like `do_split` are executed even when no split is requested. The following example shows how we can enter `do_split` and start understanding the class hierarchy and the main split engine. + +```{r, message=FALSE} +library(rtables) +# debugonce(rtables:::do_split) # Uncomment me to enter the function!!! +basic_table() %>% + build_table(DM) +``` + +In the following code, we copied the `do_split` function code to allow the reader to go through the general structure with enhanced comments and sections. Each section in the code reflects roughly one section of this article. 
+ +```{r, eval=FALSE} +# rtables 0.6.2 +### NB This is called at EACH level of recursive splitting +do_split <- function(spl, + df, + vals = NULL, + labels = NULL, + trim = FALSE, + spl_context) { +# - CHECKS - # + ## This will error if, e.g., df does not have columns + ## required by spl, or generally any time the split (spl) + ## can not be applied to df + check_validsplit(spl, df) + +# - SPLIT FUNCTION - # + ## In special cases, we need to partition data (split) + ## in a very specific way, e.g. depending on the data or + ## external values. These can be achieved by using a custom + ## split function. + + ## note the <- here!!! + if (!is.null(splfun <- split_fun(spl))) { + ## Currently split functions take df, vals, labels and + ## return list(values = ..., datasplit = ..., labels = ...), + ## with an optional additional 'extras' element + if (func_takes(splfun, ".spl_context")) { + ret <- tryCatch( + splfun(df, spl, vals, labels, + trim = trim, + .spl_context = spl_context + ), + error = function(e) e + ) ## rawvalues(spl_context)) + } else { + ret <- tryCatch(splfun(df, spl, vals, labels, trim = trim), + error = function(e) e + ) + } + if (is(ret, "error")) { + stop( + "Error applying custom split function: ", ret$message, "\n\tsplit: ", + class(spl), " (", payloadmsg(spl), ")\n", + "\toccured at path: ", + spl_context_to_disp_path(spl_context), "\n" + ) + } + } else { +# - .apply_split_inner - # + ## This is called when no split function is provided. Please note that this function + ## will also probably be called when the split function is provided, as long as the + ## main splitting method is not willingly modified by the split function. + ret <- .apply_split_inner(df = df, spl = spl, vals = vals, labels = labels, trim = trim) + } + +# - EXTRA - # + ## this adds .ref_full and .in_ref_col + if (is(spl, "VarLevWBaselineSplit")) { + ret <- .add_ref_extras(spl, df, ret) + } + +# - FIXUPVALS - # + ## This: + ## - guarantees that ret$values contains SplitValue objects + ## - removes the extras element since its redundant after the above + ## - ensures datasplit and values lists are named according to labels + ## - ensures labels are character not factor + ret <- .fixupvals(ret) + +# - RETURN - # + ret +} +``` + +We will see where and how input parameters are used. The most important parameters are `spl` and `df` - the split objects and the input `data.frame`, respectively. + +### Checks and Classes + +We will start by looking at the first function called from `do_split`. This will give us a good overview of how the split itself is defined. This function is, of course, the check function (`check_validsplit`) that is used to verify if the split is valid for the data. In the following we will describe the split-class hierarchy step-by-step, but we invite the reader to explore this further on their own as well. + +Let's first search the package for `check_validsplit`. You will find that it is defined as a generic in `R/split_funs.R`, where it is applied to the following "split" classes: `VarLevelSplit`, `MultiVarSplit`, `VAnalyzeSplit`, `CompoundSplit`, and `Split`. Another way to find this information, which is more useful for more spread out and complicated objects, is by using `showMethods(check_validsplit)`. The virtual class `VAnalyzeSplit` (by convention virtual classes start with "V") defines the main parent of the analysis split which we discuss in detail in the related vignette `vignette()` (xxx). 
From this, we can see that the `analyze()` calls actually mimic split objects as they create different results under a specific final split (or node). Now, notice that `check_validsplit` is also called in another location, the main `R/tt_dotabulation.R` source file. This is again something related to making "analyze" rows as it mainly checks for `VAnalyzeSplit` (link to tabulation dev guide xxx). We will discuss the other classes as they appear in our examples (link to class hierarchy xxx).
+
+For the moment, we see with `class(spl)` (from the main `do_split` function) that we are dealing with an `AllSplit` object. By calling `showMethods(check_validsplit)` we produce the following:
+
+```
+# rtables 0.6.2
+Function: check_validsplit (package rtables)
+spl="AllSplit"
+    (inherited from: spl="Split")
+spl="CompoundSplit"
+spl="MultiVarSplit"
+spl="Split"
+spl="VAnalyzeSplit"
+spl="VarLevelSplit"
+```
+
+This means that each of the listed classes has a dedicated definition of `check_validsplit` that may largely differ from the others. Only the class `AllSplit` does not have its own method definition, as it inherits the one defined for the `Split` class. Therefore, we understand that `AllSplit` is a subclass (child) of `Split`. `Split` is one of the first virtual classes defined in the package, and it is the only one whose name does not include the "V" prefix. These classes are defined along with their constructors in `R/00tabletrees.R`. Reading about how `AllSplit` is structured can be useful in understanding how split objects are expected to work. Please see the comments in the following:
+
+```{r, eval=FALSE}
+# rtables 0.6.2
+setClass("AllSplit", contains = "Split")
+
+AllSplit <- function(split_label = "",
+                     cfun = NULL,
+                     cformat = NULL,
+                     cna_str = NA_character_,
+                     split_format = NULL,
+                     split_na_str = NA_character_,
+                     split_name = NULL,
+                     extra_args = list(),
+                     indent_mod = 0L,
+                     cindent_mod = 0L,
+                     cvar = "",
+                     cextra_args = list(),
+                     ...) {
+  if (is.null(split_name)) { # If the split has no name
+    if (nzchar(split_label)) { # (std is "")
+      split_name <- split_label
+    } else {
+      split_name <- "all obs" # No label, a standard split with all
+      # observations is assigned.
+    }
+  }
+  new("AllSplit",
+    split_label = split_label,
+    content_fun = cfun,
+    content_format = cformat,
+    content_na_str = cna_str,
+    split_format = split_format,
+    split_na_str = split_na_str,
+    name = split_name,
+    label_children = FALSE,
+    extra_args = extra_args,
+    indent_modifier = as.integer(indent_mod),
+    content_indent_modifier = as.integer(cindent_mod),
+    content_var = cvar,
+    split_label_position = "hidden",
+    content_extra_args = cextra_args,
+    page_title_prefix = NA_character_,
+    child_section_div = NA_character_
+  )
+}
+```
+
+We can also print this information by calling `getClass("AllSplit")` for the general slot definition, or by calling `getClass(spl)`. Note that the first call will also give a lot of information about the class hierarchy. For more information regarding class hierarchy, please refer to the relevant article (xxx). We will discuss the majority of the slots by the end of this document. Now, let's see if we can find some of the values described in the constructor within our object. To do so, we will show the more compact representation given by `str`. When there are multiple and hierarchical slots that contain objects themselves, calling `str` will be much less or not at all informative if the maximum level of nesting is not set (e.g. `max.level = 2`).
+ +```{r, eval=FALSE} +# rtables 0.6.2 +Browse[2]> str(spl, max.level = 2) +Formal class 'AllSplit' [package "rtables"] with 17 slots + ..@ payload : NULL + ..@ name : chr "all obs" + ..@ split_label : chr "" + ..@ split_format : NULL + ..@ split_na_str : chr NA + ..@ split_label_position : chr "hidden" + ..@ content_fun : NULL + ..@ content_format : NULL + ..@ content_na_str : chr NA + ..@ content_var : chr "" + ..@ label_children : logi FALSE + ..@ extra_args : list() + ..@ indent_modifier : int 0 + ..@ content_indent_modifier: int 0 + ..@ content_extra_args : list() + ..@ page_title_prefix : chr NA + ..@ child_section_div : chr NA +``` + +Details about these slots will become necessary in future examples, and we will deal with them at that time. Now, we gave you a hint of the complex class hierarchy that makes up `rtables`, and how to explore it autonomously. Let's go forward in `do_split`. In our case, with `AllSplit` inherited from `Split`, we are sure that the called function will be the following (read the comment!): + +```{r, eval=FALSE} +# rtables 0.6.2 +## Default does nothing, add methods as they become required +setMethod( + "check_validsplit", "Split", + function(spl, df) invisible(NULL) +) +``` + +### Split Functions and `.apply_split_inner` + +Before diving into custom split functions, we need to take a moment to analyze how `.apply_split_inner` works. This function is routinely called whether or not we have a split function. Let's see why this is the case by entering it with `debugonce(.apply_split_inner)`. Of course, we are still currently browsing within `do_split` in debug mode from the first example. We print and comment on the function in the following: + +```{r, eval=FALSE} +# rtables 0.6.2 +.apply_split_inner <- function(spl, df, vals = NULL, labels = NULL, trim = FALSE) { + # - INPUTS - # + # In this case .applysplit_rawvals will attempt to find the split values if vals is NULL. + # Please notice that there may be a non-mutually exclusive set or subset of elements that + # will constitute the split. 
+ + # - SPLIT VALS - # + ## Try to calculate values first - most of the time we can + if (is.null(vals)) { + vals <- .applysplit_rawvals(spl, df) + } + + # - EXTRA PARAMETERS - # + # This call extracts extra parameters from the split, according to the split values + extr <- .applysplit_extras(spl, df, vals) + + # If there are no values to do the split upon, we return an empty final split + if (is.null(vals)) { + return(list( + values = list(), + datasplit = list(), + labels = list(), + extras = list() + )) + } + + # - DATA SUBSETTING - # + dpart <- .applysplit_datapart(spl, df, vals) + + # - LABEL RETRIEVAL - # + if (is.null(labels)) { + labels <- .applysplit_partlabels(spl, df, vals, labels) + } else { + stopifnot(names(labels) == names(vals)) + } + + # - TRIM - # + ## Get rid of columns that would not have any observations, + ## but only if there were any rows to start with - if not + ## we're in a manually constructed table column tree + if (trim) { + hasdata <- sapply(dpart, function(x) nrow(x) > 0) + if (nrow(df) > 0 && length(dpart) > sum(hasdata)) { # some empties + dpart <- dpart[hasdata] + vals <- vals[hasdata] + extr <- extr[hasdata] + labels <- labels[hasdata] + } + } + + # - ORDER RESULTS - # + # Finds relevant order depending on spl_child_order() + if (is.null(spl_child_order(spl)) || is(spl, "AllSplit")) { + vord <- seq_along(vals) + } else { + vord <- match( + spl_child_order(spl), + vals + ) + vord <- vord[!is.na(vord)] + } + + ## FIXME: should be an S4 object, not a list + ret <- list( + values = vals[vord], + datasplit = dpart[vord], + labels = labels[vord], + extras = extr[vord] + ) + ret +} +``` + +After reading through `.apply_split_inner`, we see that there are some fundamental functions - defined strictly for internal use (by convention they start with ".") - that are generics and depend on the kind of split in input. `R/split_funs.R` is very kind and groups generic definitions at the beginning of the file. These functions are the main dispatchers for the majority of the split machinery. This is a clear example that shows how using `S4` logic enables better clarity and flexibility in programming, allowing for easy extension of the program. For compactness we also show the `showMethods` result for each generic. + +```{r, eval=FALSE} +# rtables 0.6.2 +# Retrieves the values that will constitute the splits (facets), not necessarily a unique list. +# They could come from the data cuts for example -> it can be anything that produces a set of strings. +setGeneric( + ".applysplit_rawvals", + function(spl, df) standardGeneric(".applysplit_rawvals") +) +# Browse[2]> showMethods(.applysplit_rawvals) +# Function: .applysplit_rawvals (package rtables) +# spl="AllSplit" +# spl="ManualSplit" +# spl="MultiVarSplit" +# spl="VAnalyzeSplit" +# spl="VarLevelSplit" +# spl="VarStaticCutSplit" +# Nothing here is inherited from the virtual class Split!!! + +# Contains the subset of the data (default, but these can overlap and can also NOT be mutually exclusive). +setGeneric( + ".applysplit_datapart", + function(spl, df, vals) standardGeneric(".applysplit_datapart") +) +# Same as .applysplit_rawvals + +# Extract the extra parameter for the split +setGeneric( + ".applysplit_extras", + function(spl, df, vals) standardGeneric(".applysplit_extras") +) +# Browse[2]> showMethods(.applysplit_extras) +# Function: .applysplit_extras (package rtables) +# spl="AllSplit" +# (inherited from: spl="Split") +# spl="Split" +# This means there is only a function for the virtual class Split. 
+# So all splits behave the same!!!
+
+# Split label retrieval and assignment if visible.
+setGeneric(
+  ".applysplit_partlabels",
+  function(spl, df, vals, labels) standardGeneric(".applysplit_partlabels")
+)
+# Browse[2]> showMethods(.applysplit_partlabels)
+# Function: .applysplit_partlabels (package rtables)
+# spl="AllSplit"
+#     (inherited from: spl="Split")
+# spl="MultiVarSplit"
+# spl="Split"
+# spl="VarLevelSplit"
+
+setGeneric(
+  "check_validsplit", # our friend
+  function(spl, df) standardGeneric("check_validsplit")
+)
+# Note: check_validsplit is an internal function but may one day be exported.
+# This is why it does not have the "." prefix.
+
+setGeneric(
+  ".applysplit_ref_vals",
+  function(spl, df, vals) standardGeneric(".applysplit_ref_vals")
+)
+# Browse[2]> showMethods(.applysplit_ref_vals)
+# Function: .applysplit_ref_vals (package rtables)
+# spl="Split"
+# spl="VarLevWBaselineSplit"
+```
+
+Now, we know that `.applysplit_rawvals` is the function that will be called first. This is because we did not specify any `vals`, so it is `NULL` and the split values have to be computed from the data. It is an `S4` generic function, as can be seen with `showMethods(.applysplit_rawvals)`, and the method dispatched for our `AllSplit` object can be seen in the following:
+
+```{r, eval=FALSE}
+# rtables 0.6.2
+Browse[3]> getMethod(".applysplit_rawvals", "AllSplit")
+Method Definition:
+
+function (spl, df)
+obj_name(spl)
+
+Signatures:
+        spl
+target  "AllSplit"
+defined "AllSplit"
+
+# What is obj_name -> slot in spl
+Browse[3]> obj_name(spl)
+[1] "all obs"
+
+# coming from
+Browse[3]> getMethod("obj_name", "Split")
+Method Definition:
+
+function (obj)
+obj@name ##### Slot that we could see from str(spl, max.level = 2)
+
+Signatures:
+        obj
+target  "Split"
+defined "Split"
+```
+
+Then we have `.applysplit_extras`, which simply extracts the extra arguments from the split objects and assigns them to their relative split values. This function will be covered in more detail in a later section. If still no split values are available, the function will exit here with an empty split. Otherwise, the data will be divided into different splits or data subsets (facets) with `.applysplit_datapart`. In our current example, the resulting list comprises the whole input dataset (do `getMethod(".applysplit_datapart", "AllSplit")` and the list will be evident: `function (spl, df, vals) list(df)`).
+
+Next, split labels are checked. If they are not present, split values (`vals`) will be used with `.applysplit_partlabels`, transformed into `as.character(vals)` if applied to a `Split` object. Otherwise, the inserted labels are checked against the names of split values.
+
+Lastly, the split values are ordered according to `spl_child_order`. In our case, which concerns the general `AllSplit`, the sorting will not happen, i.e. it will depend simply on the number of split values (`seq_along(vals)`).
+
+## A Simple Split
+
+In the following, we demonstrate how row splits work using the features that we have already described. We will add two splits and see how the behavior of `do_split` changes. Note that if we do not add an `analyze` call the split will behave as before, giving an empty table with all observations. By default, calling `analyze` on a variable will calculate the mean for each data subset that has been generated by the splits. We want to go beyond the first call of `do_split` that is by design applied on all observations, with the purpose of generating the root split that contains all data and all splits (indeed `AllSplit`).
To achieve this we use `debug(rtables:::do_split)` instead of `debugonce(rtables:::do_split)` as we will need to step into each of the splits. Alternatively, it is possible to use the more powerful `trace` function to enter in cases where input is from a specific class. To do so, the following can be used: `trace("do_split", quote(if(!is(spl, "AllSplit")) browser()), where = asNamespace("rtables"))`. Note that we specify the namespace with `where`. Multiple tracer elements can be added with `expression(E1, E2)`, which is the same as `c(quote(E1), quote(E2))`. Specific _steps_ can be specified with the `at` parameter. Remember to call `untrace("do_split", quote(if(!is(spl, "AllSplit")) browser()), where = asNamespace("rtables"))` once finished to remove the trace. + +```{r, message=FALSE} +# rtables 0.6.2 +library(rtables) +library(dplyr) + +# This filter is added to avoid having too many calls to do_split +DM_tmp <- DM %>% + filter(ARM %in% names(table(DM$ARM)[1:2])) %>% # limit to two + filter(SEX %in% c("M", "F")) %>% # limit to two + mutate(SEX = factor(SEX), ARM = factor(ARM)) # to drop unused levels + +# debug(rtables:::do_split) +lyt <- basic_table() %>% + split_rows_by("ARM") %>% + split_rows_by("SEX") %>% + analyze("BMRKR1") # analyze() is needed for the table to have non-label rows + +lyt %>% + build_table(DM_tmp) +# undebug(rtables:::do_split) +``` + +Before continuing, we want to check the formal class of `spl`. + +```{r, eval=FALSE} +# rtables 0.6.2 +Browse[2]> str(spl, max.level = 2) +Formal class 'VarLevelSplit' [package "rtables"] with 20 slots + ..@ value_label_var : chr "ARM" + ..@ value_order : chr [1:2] "A: Drug X" "B: Placebo" + ..@ split_fun : NULL + ..@ payload : chr "ARM" + ..@ name : chr "ARM" + ..@ split_label : chr "ARM" + ..@ split_format : NULL + ..@ split_na_str : chr NA + ..@ split_label_position : chr "hidden" + ..@ content_fun : NULL + ..@ content_format : NULL + ..@ content_na_str : chr NA + ..@ content_var : chr "" + ..@ label_children : logi NA + ..@ extra_args : list() + ..@ indent_modifier : int 0 + ..@ content_indent_modifier: int 0 + ..@ content_extra_args : list() + ..@ page_title_prefix : chr NA + ..@ child_section_div : chr NA +``` + +From this, we can directly infer that the class is different now (`VarLevelSplit`) and understand that the split label will be hidden (`split_label_position` slot). Moreover, we see a specific value order with specific split values. `VarLevelSplit` also seems to have three more slots than `AllSplit`. What are they precisely? + +```{r, eval=FALSE} +# rtables 0.6.2 +slots_as <- getSlots("AllSplit") # inherits virtual class Split and is general class for all splits +# getClass("CustomizableSplit") # -> Extends: "Split", Known Subclasses: Class "VarLevelSplit", directly +slots_cs <- getSlots("CustomizableSplit") # Adds split function +slots_vls <- getSlots("VarLevelSplit") + +slots_cs[!(names(slots_cs) %in% names(slots_as))] +# split_fun +# "functionOrNULL" +slots_vls[!(names(slots_vls) %in% names(slots_cs))] +# value_label_var value_order +# "character" "ANY" +``` + +Remember to always check the constructor and class definition in `R/00tabletrees.R` if exploratory tools do not suffice. Now, `check_validsplit(spl, df)` will use a different method than before (`getMethod("check_validsplit", "VarLevelSplit")`). It uses the internal utility function `.checkvarsok` to check if `vars`, i.e. the `payload`, is actually present in `names(df)`. 
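+
+To make the FFFL idea concrete, here is a minimal sketch (our own illustration, not the actual `rtables` source; the helper name `check_payload_vars` is hypothetical) of the kind of validation `.checkvarsok` performs: fail fast, with an explicit message, if the split payload variables are missing from the data.
+
+```{r, eval=FALSE}
+# Illustrative sketch of a .checkvarsok-style validation, assuming the
+# exported accessor spl_payload(); not the actual rtables implementation.
+check_payload_vars <- function(spl, df) {
+  vars <- spl_payload(spl) # e.g. "ARM" for split_rows_by("ARM")
+  missing_vars <- setdiff(vars, names(df))
+  if (length(missing_vars) > 0) {
+    stop(
+      "Split variable(s) not found in data: ",
+      paste(missing_vars, collapse = ", ")
+    )
+  }
+  invisible(NULL) # mirror the "do nothing" convention of the Split method
+}
+```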
+
+The next relevant function will be `.apply_split_inner`, and we will see exactly what changes using `debugonce(.apply_split_inner)`. Of course, this function is called directly as no custom split function is provided. Since parameter `vals` is not specified (`NULL`), the split values are retrieved from `df` by using the split payload to select the relevant column (`varvec <- df[[spl_payload(spl)]]`). Whenever no split values are specified, they are retrieved from the selected column as unique values (`character`) or levels (`factor`).
+
+Next, `.applysplit_datapart` creates a named list of facets or data subsets. In this case, the result is actually a mutually exclusive partition of the data. This is because we did not specify any split values and as such the column content was retrieved via `unique` (in case of a character vector) or `levels` (in case of factors). `.applysplit_partlabels` is a bit less linear as it has to take into account the possibility of having specified labels in the payload. Instead of looking at the function source code with `getMethod(".applysplit_partlabels", "VarLevelSplit")`, we can enter the `S4` generic function in debugging mode as follows:
+
+```{r, eval=FALSE}
+# rtables 0.6.2
+eval(debugcall(.applysplit_partlabels(spl, df, vals, labels)))
+# We leave it to the smart developer to see how the labels are assigned
+
+# Remember to undebugcall() similarly!
+```
+
+In our case, the final labels are `vals` because they were not explicitly assigned. Their order is retrieved from the split object (`spl_child_order(spl)`) and matched with current split values. The returned list is then processed as it was before.
+
+If we continue with the next call of `do_split`, the same procedure is followed for the second split (`SEX`). This split is applied to the partition that was created by the first (`ARM`) split. The main `df` is now constituted by a subset (facet) of the total data, determined by the first split. This will be repeated iteratively for as many data splits as requested. Before concluding this iteration, we take a moment to discuss in detail how `.fixupvals(partinfo)` works. This is not a generic function and the source code can be easily accessed. We suggest running through it with `debugonce(.fixupvals)` to understand what it does in practice. The fundamental aspects of `.fixupvals(partinfo)` are as follows:
+
+* Ensures that labels are character and not factor.
+* Ensures that the data splits and the list of values are named according to labels.
+* Guarantees that `ret$values` contains `SplitValue` objects.
+* Removes the list element `extra` since it is now included in the `SplitValue`.
+
+Note that this function can occasionally be called more than once on the same return object (a named list for now). Of course, after the first call only checks are applied.
+ +```{r, eval=FALSE} +# rtables 0.6.2 + +# Can find the following core function: +# vals <- make_splvalue_vec(vals, extr, labels = labels) +# ---> Main list of SplitValue objects: iterative call of +# new("SplitValue", value = val, extra = extr, label = label) + +# Structure of ret before calling .fixupvals +Browse[2]> str(ret, max.level = 2) +List of 4 + $ values : chr [1:2] "A: Drug X" "B: Placebo" + $ datasplit:List of 2 + ..$ A: Drug X : tibble [121 × 8] (S3: tbl_df/tbl/data.frame) + ..$ B: Placebo: tibble [106 × 8] (S3: tbl_df/tbl/data.frame) + $ labels : Named chr [1:2] "A: Drug X" "B: Placebo" + ..- attr(*, "names")= chr [1:2] "A: Drug X" "B: Placebo" + $ extras :List of 2 + ..$ : list() + ..$ : list() + +# Structure of ret after the function call +Browse[2]> str(.fixupvals(ret), max.level = 2) +List of 3 + $ values :List of 2 + ..$ A: Drug X :Formal class 'SplitValue' [package "rtables"] with 3 slots + ..$ B: Placebo:Formal class 'SplitValue' [package "rtables"] with 3 slots + $ datasplit:List of 2 + ..$ A: Drug X : tibble [121 × 8] (S3: tbl_df/tbl/data.frame) + ..$ B: Placebo: tibble [106 × 8] (S3: tbl_df/tbl/data.frame) + $ labels : Named chr [1:2] "A: Drug X" "B: Placebo" + ..- attr(*, "names")= chr [1:2] "A: Drug X" "B: Placebo" + +# The SplitValue object is fundamental +Browse[2]> str(ret$values) +List of 2 + $ A: Drug X :Formal class 'SplitValue' [package "rtables"] with 3 slots + .. ..@ extra: list() + .. ..@ value: chr "A: Drug X" + .. ..@ label: chr "A: Drug X" + $ B: Placebo:Formal class 'SplitValue' [package "rtables"] with 3 slots + .. ..@ extra: list() + .. ..@ value: chr "B: Placebo" + .. ..@ label: chr "B: Placebo" +``` + +### Pre-Made Split Functions + +We start by examining a split function that is already defined in `rtables`. Its scope is filtering out specific values as follows: + +```{r, message=FALSE} +library(rtables) +# debug(rtables:::do_split) # uncomment to see into the main split function +basic_table() %>% + split_rows_by("SEX", split_fun = drop_split_levels) %>% + analyze("BMRKR1") %>% + build_table(DM) +# undebug(rtables:::do_split) + +# This produces the same output as before (when filters were used) +``` + +After the root split, we enter the split based on `SEX`. As we have specified a split function, we can retrieve the split function by using `splfun <- split_fun(spl)` and enter an if-else statement for the two possible cases: whether there is split context or not. In both cases, an error catching framework is used to give informative errors in case of failure. Later we will see more in depth how this works. + +We invite the reader to always keep an eye on `spl_context`, as it is fundamental to more sophisticated splits, e.g. in the cases where the split itself depends mainly on preceding splits or values. When the split function is called, please take a moment to look at how `drop_split_levels` is defined. You will see that the function is fundamentally a wrapper of `.apply_split_inner` that drops empty factor levels, therefore avoiding empty splits. + +```{r, eval=FALSE} +# rtables 0.6.2 +> drop_split_levels +function(df, + spl, + vals = NULL, + labels = NULL, + trim = FALSE) { + # Retrieve split column + var <- spl_payload(spl) + df2 <- df + + ## This call is exactly the one we used when filtering to get rid of empty levels + df2[[var]] <- factor(df[[var]]) + + ## Our main function! + .apply_split_inner(spl, df2, + vals = vals, + labels = labels, + trim = trim + ) +} +``` + +There are many pre-made split functions included in `rtables`. 
A list of these functions can be found in the [Split Functions vignette](https://insightsengineering.github.io/rtables/main/articles/split_functions.html), or via `?split_funcs`. We leave it to the developer to look into how some of these split functions work, in particular `trim_levels_to_map` may be of interest. + +### Creating Custom Split Functions + +Now we will create a custom split function. Firstly, we will see how the system manages error messages. For a general understanding of how custom split functions are created, please read the [Custom Split Functions section](https://insightsengineering.github.io/rtables/main/articles/advanced_usage.html#custom-split-functions) of the Advanced Usage vignette or see `?custom_split_funs`. In the following code we use `browser()` to enter our custom split functions. We invite the reader to activate `options(error = recover)` to investigate cases where we encounter an error. Note that you can revert to default behavior by restarting your `R` session, by caching the default option value, or by using `callr` to retrieve the default as follows: `default_opts <- callr::r(function(){options()}); options(error = default_opts$error)`. + +```{r} +# rtables 0.6.2 +# Table call with only the function changing +simple_table <- function(DM, f) { + lyt <- basic_table() %>% + split_rows_by("ARM", split_fun = f) %>% + analyze("BMRKR1") + + lyt %>% + build_table(DM) +} +# First round will fail because there are unused arguments +exploratory_split_fun <- function(df, spl) NULL +# debug(rtables:::do_split) +err_msg <- tryCatch(simple_table(DM, exploratory_split_fun), error = function(e) e) +# undebug(rtables:::do_split) + +message(err_msg$message) +``` + +The commented debugging lines above will allow you to inspect the error. Alternatively, using the recover option will allow you the possibility to select the frame number, i.e. the trace level, to enter. Selecting the last frame number (10 in this case) will allow you to see the value of `ret` from `rtables:::do_split` that causes the error and how the informative error message that follows is created. + +```{r, eval=FALSE} +# rtables 0.6.2 +# Debugging level +10: tt_dotabulation.R#627: do_split(spl, df, spl_context = spl_context) + +# Original call and final error +> simple_table(DM, exploratory_split_fun) +Error in do_split(spl, df, spl_context = spl_context) : + Error applying custom split function: unused arguments (vals, labels, trim = trim) # This is main error + split: VarLevelSplit (ARM) # Split reference + occured at path: root # Path level (where it occurred) +``` + +The previous split function fails because `exploratory_split_fun` is given more arguments than it accepts. A simple way to avoid this is to add `...` to the function call. Now let's construct an interesting split function (and error): + +```{r} +# rtables 0.6.2 +f_brakes_if <- function(split_col = NULL, error = FALSE) { + function(df, spl, ...) { # order matters! 
more than naming + # browser() # To check how it works + if (is.null(split_col)) { # Retrieves the default + split_col <- spl_variable(spl) # Internal accessor to split obj + } + my_payload <- split_col # Changing split column value + + vals <- levels(df[[my_payload]]) # Extracting values to split + datasplit <- lapply(seq_along(vals), function(i) { + df[df[[my_payload]] == vals[[i]], ] + }) + names(datasplit) <- as.character(vals) + + # Error + if (isTRUE(error)) { + # browser() # If you need to check how it works + mystery_error_values <- sapply(datasplit, function(x) mean(x$BMRKR1)) + if (any(mystery_error_values > 6)) { + stop( + "It should not be more than 6! Should it be? Found in split values: ", + names(datasplit)[which(mystery_error_values > 6)] + ) + } + } + + # Handy function to return a split result!! + make_split_result(vals, datasplit, vals) + } +} +simple_table(DM, f_brakes_if()) # works! +simple_table(DM, f_brakes_if(split_col = "STRATA1")) # works! + +# simple_table(DM, f_brakes_if(error = TRUE)) # does not work, but returns an informative message + +# Error in do_split(spl, df, spl_context = spl_context) : +# Error applying custom split function: It should not be more than 6! Should it be? Found in split values: B: Placebo +# split: VarLevelSplit (ARM) +# occurred at path: root +``` + +Now we will take a moment to dwell on the machinery included in `rtables` to create custom split functions. Before doing so, please read the relevant documentation at `?make_split_fun`. Most of the pre-made split functions included in `rtables` are or will be written with `make_split_fun` as it is a more stable constructor for such functions than was previously used. We invite the reader to take a look at `make_split_fun.R`. The majority of the functions here should be understandable with the knowledge you have gained from this guide so far. It is important to note that if no core split function is specified, which is commonly the case, `make_split_fun` calls `do_base_split` directly, which is a minimal wrapper of the well-known `do_split`. `drop_facet_levels`, for example, is a pre-processing function that at its core simply removes empty factor levels from the split "column", thus avoiding showing empty lines. + +It is also possible to provide a list of functions, as it can be seen in the examples of `?make_split_fun`. Note that pre- and post-processing requires a list as input to support the possibility of combining multiple functions. In contrast, the core splitting function must be a single function call as it is not expected to have stacked features. This rarely needs to be modified and the majority of the included split functions work with pre- or post-processing. Included post-processing functions are interesting as they interact with the split object, e.g. by reordering the facets or by adding an overall facet (`add_overall_facet`). The attentive reader will have noticed that the core function relies on `do_split` and many of the post-processing functions rely on `make_split_result`, which is the best way to get the correct split return structure. Note that modifying the core split only works in the row space at the moment. 
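+
+As a small illustration of the pre/post structure (our own sketch, not an example taken from the package), the following combines the pre-processing function `drop_facet_levels` with a custom post-processing step that reverses the facet order via `make_split_result`. Post-processing functions receive the split result, the split object, and the full data frame, as in the `drop_empties` example shown later in this article.
+
+```{r, eval=FALSE}
+# Sketch only: reverse the facet order after the core split has run.
+reverse_facets <- function(splret, spl, fulldf, ...) {
+  idx <- rev(seq_along(splret$values))
+  make_split_result(
+    splret$values[idx],
+    splret$datasplit[idx],
+    splret$labels[idx]
+  )
+}
+
+rev_split_fun <- make_split_fun(
+  pre = list(drop_facet_levels), # drop empty factor levels before splitting
+  post = list(reverse_facets) # then reorder the resulting facets
+)
+
+basic_table() %>%
+  split_rows_by("SEX", split_fun = rev_split_fun) %>%
+  analyze("BMRKR1") %>%
+  build_table(DM)
+```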
+ +#### `.spl_context` - Adding Context to Our Splits + +The best way to understand what split context does, and how to use it, is to read the [Leveraging `.spl_context` section](https://insightsengineering.github.io/rtables/main/articles/advanced_usage.html#leveraging--spl_context) of the Advanced Usage vignette, and to use `browser()` within a split function to see how it is structured. As `.spl_context` is needed for rewriting core functions, we propose a wrapper of `do_base_split` here, which is a handy redirection to the standard `do_split` without the split function part (i.e. it is a wrapper of `.apply_split_inner`, the real core splitting machinery). Out of curiosity, we set `trim = TRUE` here. This trimming only works when there is a mixed table (some values are 0s and some have content), for which it will trim 0s. This is rarely the case, and we encourage using the replacement functions `trim_levels_to_group` and `trim_levels_to_map` for trimming. Nowadays, it should even be impossible to set it differently from `trim = FALSE`. + +(write an issue informative error for not list xxx). +```{r, eval=FALSE} +# rtables 0.6.2 +browsing_f <- function(df, spl, .spl_context, ...) { + # browser() + # do_base_split(df, spl, ...) # order matters!! This would fail if done + do_base_split(spl = spl, df = df, vals = NULL, labels = NULL, trim = TRUE) +} + +fnc_tmp <- function(innervar) { # Exploring trim_levels_in_facets (check its form) + function(ret, ...) { + # browser() + for (var in innervar) { # of course AGE is not here, so nothing is dropped!! + ret$datasplit <- lapply(ret$datasplit, function(df) { + df[[var]] <- factor(df[[var]]) + df + }) + } + ret + } +} + +basic_table() %>% + split_rows_by("ARM") %>% + split_rows_by("STRATA1") %>% + split_rows_by_cuts("AGE", + cuts = c(0, 50, 100), + cutlabels = c("young", "old") + ) %>% + split_rows_by("SEX", split_fun = make_split_fun( + pre = list(drop_facet_levels), # This is dropping the SEX levels (AGE is upper level) + core_split = browsing_f, + post = list(fnc_tmp("AGE")) # To drop these we should use a split_fun in the above level + )) %>% + summarize_row_groups() %>% + build_table(DM) + +# The following is the .spl_contest printout: +Browse[1]> .spl_context + split value full_parent_df all_cols_n all obs +1 root root c("S1", .... 356 TRUE, TR.... +2 ARM A: Drug X c("S6", .... 121 TRUE, TR.... +3 STRATA1 A c("S14",.... 36 TRUE, TR.... +4 AGE young c("S14",.... 36 TRUE, TR.... + +# NOTE: make_split_fun(pre = list(drop_facet_levels)) and drop_split_levels +# do the same thing in this case +``` + +Here we can see what the split column variable is (`split`, first column) at this level of the splitting procedure. `value` is the current split value that is being dealt with. For the next column, let's see the number of rows of these data frames: `sapply(.spl_context$full_parent_df, nrow) # [1] 356 121 36 36`. Indeed, the `root` level contains the full input data frame, while the other levels are subgroups of the full data according to the split value. `all_cols_n` shows exactly the numbers just described. `all obs` is the current filter applied to the columns. Applying this to the root data (or the row subgroup data) reveals the current column-wise facet (or row-wise for a row split). It is also possible to use the same information to make complex splits in the column space by using the full data frame and the value splits to select the interested values. This is something we will change and simplify within `rtables` as the need becomes apparent. 
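+
+As a further illustration (again our own sketch, with level names that assume the `DM` data used throughout this article), the split function below reads the last row of `.spl_context` to decide which `SEX` levels to keep, depending on the enclosing `ARM` facet, and then delegates the actual split to the pre-made `drop_split_levels`.
+
+```{r, eval=FALSE}
+# Sketch only: a context-dependent split that inspects .spl_context.
+context_dependent_split <- function(df, spl, .spl_context, ...) {
+  # The last row of .spl_context describes the innermost enclosing facet
+  parent_value <- .spl_context$value[nrow(.spl_context)]
+  keep <- if (identical(parent_value, "A: Drug X")) c("F", "M") else "F"
+  df2 <- df[df$SEX %in% keep, ]
+  df2$SEX <- factor(df2$SEX, levels = keep)
+  drop_split_levels(df2, spl) # delegate the actual splitting
+}
+
+basic_table() %>%
+  split_rows_by("ARM") %>%
+  split_rows_by("SEX", split_fun = context_dependent_split) %>%
+  analyze("BMRKR1") %>%
+  build_table(DM)
+```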
+ +### Extra Arguments: `extra_args` + +This functionality is well-known and used in the setting of analysis functions (a somewhat complicated example of this can be found in the [Example Complex Analysis Function vignette](https://insightsengineering.github.io/rtables/main/articles/example_analysis_coxreg.html#constructing-the-table)), but we will show here how this can also apply to splits. + +```{r, eval=FALSE} +# rtables 0.6.2 + +# Let's use the tracer!! +my_tracer <- quote(if (length(spl@extra_args) > 0) browser()) +trace( + what = "do_split", + tracer = my_tracer, + where = asNamespace("rtables") +) + +custom_mean_var <- function(var) { + function(df, labelstr, na.rm = FALSE, ...) { + # browser() + mean(df[[var]], na.rm = na.rm) + } +} + +DM_ageNA <- DM +DM_ageNA$AGE[1] <- NA + +basic_table() %>% + split_rows_by("ARM") %>% + split_rows_by("SEX", split_fun = drop_split_levels) %>% + summarize_row_groups( + cfun = custom_mean_var("AGE"), + extra_args = list(na.rm = TRUE), format = "xx.x", + label_fstr = "label %s" + ) %>% + # content_extra_args, c_extra_args are different slots!! (xxx) + split_rows_by("STRATA1", split_fun = keep_split_levels("A")) %>% + analyze("AGE") %>% # check with the extra_args (xxx) + build_table(DM_ageNA) +# You can pass extra_args down to other splits. It is possible this will not not +# work. Should it? That is why extra_args lives only in splits (xxx) check if it works +# as is. Difficult to find an use case for this. Maybe it could work for the ref_group +# info. That does not work with nesting already (fairly sure that it will break stuff). +# Does it make sense to have more than one ref_group at any point of the analysis? No docs, +# send a warning if users try to nest things with ref_group (that is passed around via +# extra_args) + +# As we can see that was not possible. What if we now force it a bit? +my_split_fun <- function(df, spl, .spl_context, ...) { + spl@extra_args <- list(na.rm = TRUE) + # does not work because do_split is not changing the object + # the split does not do anything with it + drop_split_levels(df, spl) +} # does not work + +basic_table() %>% + split_rows_by("ARM") %>% + split_rows_by("SEX", split_fun = my_split_fun) %>% + analyze("AGE", inclNAs = TRUE, afun = mean) %>% # include_NAs is set FALSE + build_table(DM_ageNA) +# extra_args is in available in cols but not in rows, because different columns +# may need it for different col space. Row-wise it seems not necessary. +# The only thing that works is adding it to analyze (xxx) check if it is worth adding + +# We invite the developer now to test all the test files of this package with the tracer on +# therefore -> extra_args is not currently used in splits (xxx could be wrong) +# could be not being hooked up +untrace(what = "do_split", where = asNamespace("rtables")) + +# Let's try with the other variables identically +my_tracer <- quote(if (!is.null(vals) || !is.null(labels) || isTRUE(trim)) { + print("A LOT TO SAY") + message("CANT BLOCK US ALL") + stop("NOW FOR SURE") + browser() +}) +trace( + what = "do_split", + tracer = my_tracer, + where = asNamespace("rtables") +) +# Run tests by copying the above in setup-fakedata.R (then devtools::test()) +untrace( + what = "do_split", + where = asNamespace("rtables") +) +``` + +As we have demonstrated, all of the above seem like impossible cases and are to be considered as vestigial and to be deprecated. 
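+
+For completeness, the route that does work today is passing `extra_args` to `analyze()` (or to `summarize_row_groups()`, as above), where the extra arguments are forwarded to the analysis function. The sketch below recreates the `DM_ageNA` data frame from the previous chunk so that it is self-contained.
+
+```{r, eval=FALSE}
+# Sketch: extra_args supplied to analyze() reaches the analysis function.
+DM_ageNA <- DM
+DM_ageNA$AGE[1] <- NA
+
+basic_table() %>%
+  split_rows_by("SEX", split_fun = drop_split_levels) %>%
+  analyze("AGE",
+    afun = function(x, na.rm = FALSE, ...) {
+      rcell(mean(x, na.rm = na.rm), format = "xx.x")
+    },
+    extra_args = list(na.rm = TRUE)
+  ) %>%
+  build_table(DM_ageNA)
+```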
+ +## `MultiVarSplit` & `CompoundSplit` Examples + +The final part of this article is still under construction, hence the non-specific mentions and the to do list. +xxx `CompoundSplit` generates facets from one variable (e.g. cumulative distributions) while `MultiVarSplit` uses different variables for the split. See `AnalyzeMultiVars`, which inherits from `CompoundSplit` for more details on how it analyzes the same facets multiple times. `MultiVarColSplit` works with `analyze_colvars`, which is out of the scope of this article. `.set_kids_sect_sep` adds things between children (can be set from split). + +First, we want to see how the `MultiVarSplit` class behaves for an example case taken from `?split_rows_by_multivar`. + +```{r, eval=FALSE} +# rtables 0.6.2 + +my_tracer <- quote(if (is(spl, "MultiVarSplit")) browser()) +trace( + what = "do_split", + tracer = my_tracer, + where = asNamespace("rtables") +) +# We want also to take a look at the following: +debugonce(rtables:::.apply_split_inner) +lyt <- basic_table() %>% + split_cols_by("ARM") %>% + split_rows_by_multivar(c("BMRKR1", "BMRKR1"), + varlabels = c("SD", "MEAN") + ) %>% + split_rows_by("COUNTRY", + split_fun = keep_split_levels("PAK") + ) %>% # xxx for #690 #691 + summarize_row_groups() %>% + analyze(c("AGE", "SEX")) + +build_table(lyt, DM) + +# xxx check empty space on top -> check if it is a bug, file it +untrace( + what = "do_split", + where = asNamespace("rtables") +) +``` + +If we print the output, we will notice that the two groups (one called "SEX" and the other "STRATA1") are identical along the columns. This is because no subgroup was actually created. This is an interesting way to personalize splits with the help of custom split functions and their split context, and to have widely different subgroups in the table. + +We invite the reader to try to understand why `split_rows_by_multivar` can have other row splits under it (see `xxx` comment in the previous code), while `split_cols_by_multivar` does not. This is a known bug at the moment, and we will work towards a fix for this. Known issues are often linked in the source code by their GitHub issue number (e.g. `#690`). + +Lastly, we will briefly show an example of a split by cut function and how to replace it to solve the empty age groups problem as we did before. We propose the same simplified situation: + +```{r} +# rtables 0.6.2 + +cutfun <- function(x) { + # browser() + cutpoints <- c(0, 50, 100) + names(cutpoints) <- c("", "Younger", "Older") + cutpoints +} + +tbl <- basic_table(show_colcounts = TRUE) %>% + split_rows_by("ARM", split_fun = drop_and_remove_levels(c("B: Placebo", "C: Combination"))) %>% + split_rows_by("STRATA1") %>% + split_rows_by_cutfun("AGE", cutfun = cutfun) %>% + # split_rows_by_cuts("AGE", cuts = c(0, 50, 100), + # cutlabels = c("young", "old")) %>% # Works the same + split_rows_by("SEX", split_fun = drop_split_levels) %>% + summarize_row_groups() %>% # This is degenerate!!! + build_table(DM) + +tbl +``` + +For both row split cases (`*_cuts` and `*_cutfun`), we have empty levels that are not dropped. This is to be expected and can be avoided by using a dedicated split function. Intentionally looking at the future split is possible in order to determine if an element is present in it. At the moment it is not possible to add `spl_fun` to dedicated split functions like `split_rows_by_cuts`. + +Note that in the previous table we only used `summarize_row_groups`, with no `analyze` calls. 
This rendered the table nicely, but it is not the standard method to use as `summarize_row_groups` is intended *only* to decorate row groups, i.e. rows with labels. Internally, these rows are called content rows and that is why analysis functions in `summarize_row_groups` are called `cfun` instead of `afun`. Indeed, the tabulation machinery also presents these two differently as is described in the [Tabulation with Row Structure section](https://insightsengineering.github.io/rtables/main/articles/tabulation_concepts.html#tabulation-with-row-structure) of the Tabulation vignette. + +We can try to construct the split function for cuts manually with `make_split_fun`: + +```{r, eval=FALSE} +my_count_afun <- function(x, .N_col, .spl_context, ...) { + # browser() + out <- list(c(length(x), length(x) / .N_col)) + names(out) <- .spl_context$value[nrow(.spl_context)] # workaround (xxx #689) + in_rows( + .list = out, + .formats = c("xx (xx.x%)") + ) +} +# ?make_split_fun # To check for docs/examples + +# Core split +cuts_core <- function(spl, df, vals, labels, .spl_context) { + # browser() # file an issue xxx + # variables that are split on are converted to factor during the original clean-up + # cut split are not doing it but it is an exception. xxx + # young_v <- as.numeric(df[["AGE"]]) < 50 + # current solution: + young_v <- as.numeric(as.character(df[["AGE"]])) < 50 + make_split_result(c("young", "old"), + datasplit = list(df[young_v, ], df[!young_v, ]), + labels = c("Younger", "Older") + ) +} +drop_empties <- function(splret, spl, fulldf, ...) { + # browser() + nrows_data_split <- vapply(splret$datasplit, nrow, numeric(1)) + to_keep <- nrows_data_split > 0 + make_split_result( + splret$values[to_keep], + splret$datasplit[to_keep], + splret$labels[to_keep] + ) +} +gen_split <- make_split_fun( + core_split = cuts_core, + post = list(drop_empties) +) + +tbl <- basic_table(show_colcounts = TRUE) %>% + split_rows_by("ARM", split_fun = keep_split_levels(c("A: Drug X"))) %>% + split_rows_by("STRATA1") %>% + split_rows_by("AGE", split_fun = gen_split) %>% + analyze("SEX") %>% # It is the last step!! No need of BMRKR1 right? + # split_rows_by("SEX", split_fun = drop_split_levels, + # child_labels = "hidden") %>% # close issue #689. would it work for + # analyze_colvars? probably (xxx) + # analyze("BMRKR1", afun = my_count_afun) %>% # This is NOT degenerate!!! BMRKR1 is only placeholder + build_table(DM) + +tbl +``` + +Alternatively, we could choose to prune these rows out with `prune_table`! + +```{r} +# rtables 0.6.2 + +tbl <- basic_table(show_colcounts = TRUE) %>% + split_rows_by("ARM", split_fun = keep_split_levels(c("A: Drug X"))) %>% + split_rows_by("STRATA1") %>% + split_rows_by_cuts( + "AGE", + cuts = c(0, 50, 100), + cutlabels = c("young", "old") + ) %>% + split_rows_by("SEX", split_fun = drop_split_levels) %>% + summarize_row_groups() %>% # This is degenerate!!! # we keep it until #689 + build_table(DM) + +tbl + +# Trying with pruning +prune_table(tbl) # (xxx) what is going on here? it is degenerate so it has no real leaves +# It is degenerate -> what to do? 
+# The same mechanism is applied in the case of NULL leaves, they are rolled up in the +# table tree +``` + +(xxx) add the pre-proc with z-scoring diff --git a/inst/dev-guide/dg_tabulation.Rmd b/inst/dev-guide/dg_tabulation.Rmd new file mode 100644 index 000000000..2fbd2059b --- /dev/null +++ b/inst/dev-guide/dg_tabulation.Rmd @@ -0,0 +1,192 @@ +--- +title: "Tabulation" +author: "Davide Garolini" +date: '`r Sys.Date()`' +output: + html_document: + theme: spacelab + toc: true + toc_float: + collapsed: false +editor_options: + chunk_output_type: console +knit: (function(inputFile, encoding) { + rmarkdown::render(inputFile, encoding = encoding, output_dir = ".")}) +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +## Disclaimer + +This article is intended for use by developers only and will contain low-level explanations of the topics covered. For user-friendly vignettes, please see the [Articles](https://insightsengineering.github.io/rtables/main/articles/index.html) page on the `rtables` website. + +Any code or prose which appears in the version of this article on the `main` branch of the repository may reflect a specific state of things that can be more or less recent. This guide describes very important aspects of the tabulation process that are unlikely to change. Regardless, we invite the reader to keep in mind that the current repository code may have drifted from the following material in this document, and it is always the best practice to read the code directly on `main`. + +Please keep in mind that `rtables` is still under active development, and it has seen the efforts of multiple contributors across different years. Therefore, there may be legacy mechanisms and ongoing transformations that could look different in the future. + +Being that this a working document that may be subjected to both deprecation and updates, we keep `xxx` comments to indicate placeholders for warnings and to-do's that need further work. + +## Introduction + +Tabulation in `rtables` is a process that takes a pre-defined layout and applies it to data. The layout object, with all of its splits (see xxx link split machinery article) and `analyze`s, can be applied to different data to produce valid tables. This process happens principally within the `tt_dotabulation.R` file and the user-facing function `build_table` that resides in it. We will occasionally use functions and methods that are present in other files, like `colby_construction.R` or `make_subset_expr.R`. We assume the reader is already familiar with the documentation for `build_table`. We suggest reading the split machinery vignette (xxx link) prior to this one, as it is instrumental in understanding how the layout object, which is essentially built out of splits, is tabulated when data is supplied. + +## Tabulation + +We enter into `build_table` using `debugonce` to see how it works. + +```{r, eval=FALSE} +# rtables 0.6.2 +library(rtables) +debugonce(build_table) + +# A very simple layout +lyt <- basic_table() %>% + split_rows_by("STRATA1") %>% + split_rows_by("SEX", split_fun = drop_split_levels) %>% + split_cols_by("ARM") %>% + analyze("BMRKR1") + +# lyt must be a PreDataTableLayouts object +is(lyt, "PreDataTableLayouts") + +lyt %>% build_table(DM) +``` + +Now let's look within our `build_table` call. After the initial check that the layout is a pre-data table layout, it checks if the column layout is defined (`clayout` accessor), i.e. it does not have any column split. 
+
+After this, there are a couple of defensive programming calls that perform checks and transformations now that we finally have the data. These can be divided into two categories: those that mainly concern the layout, which are defined as generics, and those that concern the data, which are plain functions since they do not depend on the layout class. Indeed, the layout is structured and can be divided into `clayout` and `rlayout` (column and row layout). The former is used to create `cinfo`, which is the general object and container for the column splits and related information. The latter contains the obligatory all-data split, i.e. the root split (accessible with `root_spl`), and the row split vectors, which are the successive splits applied in row space. In the following, we consider the initial checks and defensive programming.
+
+```{r, eval=FALSE}
+## do checks and defensive programming now that we have the data
+lyt <- fix_dyncuts(lyt, df) # Create the splits that depend on the data
+lyt <- set_def_child_ord(lyt, df) # With the data available, set the same order for all splits
+lyt <- fix_analyze_vis(lyt) # Checks if the last analyze split should be visible
+# If there is only one you will not get the variable name, otherwise you get it if you
+# have multiple variables. The default is NA. This can only be done now because we are
+# sure to have the whole layout.
+df <- fix_split_vars(lyt, df, char_ok = is.null(col_counts))
+# checks that the split variables are present
+
+lyt[] # preserves names - warns if the names are longer, recycles the name value if there is only one
+lyt@.Data # might not preserve the names; it works only when the class inherits from list
+# We suggest testing these behaviors extensively in order to choose the appropriate one
+```
+
+Along with the various checks and defensive programming, we find `PreDataAxisLayout`, which is a virtual class that both row and column layouts inherit from. Virtual classes are handy for grouping classes that need to share things, such as labels, or for functions that need to be applicable to all of their subclasses. See more information about the `rtables` class hierarchy in the dedicated article here (xxx add).
+
+Now, we continue with `build_table`. After the checks, we notice `TreePos()`, which is a constructor for an object that retains a representation of the tree position along with split values and labels. This is mainly used by `create_colinfo`, which we enter now with `debugonce(create_colinfo)`. This function creates the object that represents the column splits and everything else related to the columns. In particular, the column counts are calculated in this function. The parameter inputs are as follows:
+
+```{r, eval=FALSE}
+cinfo <- create_colinfo(
+  lyt, # Main layout with col split info
+  df, # df used for splits and col counts if no alt_counts_df is present
+  rtpos, # TreePos (does not change outside of this function)
+  counts = col_counts, # If we want to overwrite the calculations with df/alt_counts_df
+  alt_counts_df = alt_counts_df, # alternative data for col counts
+  total = col_total, # calculated from build_table inputs (nrow of df or alt_counts_df)
+  topleft # topleft information added into build_table
+)
+```
+
+`create_colinfo` resides in `make_subset_expr.R`. Here, we see that if `topleft` is provided to `build_table`, it will override the one in `lyt`.
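+
+Before stepping through `create_colinfo` itself, it can help to look at its end product from the outside. The following sketch (again assuming the `DM` example data and `rtables` 0.6.2) builds a small table and inspects the resulting column information with exported accessors; the exact printed output will depend on the installed version.
+
+```{r, eval=FALSE}
+# rtables 0.6.2 -- sketch, not run during rendering
+library(rtables)
+
+lyt <- basic_table(show_colcounts = TRUE) %>%
+  split_cols_by("ARM") %>%
+  analyze("BMRKR1")
+
+tbl <- build_table(lyt, DM)
+
+cinfo <- col_info(tbl) # the InstantiatedColumnInfo produced via create_colinfo()
+class(cinfo)
+col_counts(tbl)        # leaf-level column counts stored in cinfo
+coltree(tbl)           # the LayoutColTree underlying the columns
+```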
+
+Entering `create_colinfo`, we will see the following calls:
+
+```{r, eval=FALSE}
+clayout <- clayout(lyt) # Extracts column split and info
+
+if (is.null(topleft)) {
+  topleft <- top_left(lyt) # If top_left is not present in build_table, it is taken from lyt
+}
+
+ctree <- coltree(clayout, df = df, rtpos = rtpos) # Main constructor of LayoutColTree
+# The above is a generic whose principal method here is
+# setMethod("coltree", "PreDataColLayout", ...) (located in `tree_accessor.R`).
+# This call restructures the information from clayout, df, and rtpos
+# into a more compact column tree layout. Part of this design is related
+# to past implementations.
+
+cexprs <- make_col_subsets(ctree, df) # extracts expressions in a compact fashion.
+# WARNING: removing NAs at this step is automatic. This should
+# be coupled with a warning for NAs in the split (xxx)
+
+colextras <- col_extra_args(ctree) # retrieves extra_args from the tree; it may not be used
+```
+
+Next in the function is the determination of the column counts. Currently, this happens only at the leaf level, but it could certainly be calculated independently for all levels (this is an open issue in `rtables`, i.e. how to print the totals of other levels). The precedence rules for column counts may not be documented yet (xxx todo). The main use case is analyzing a participation-level dataset, with multiple records per subject, while retaining the total number of subjects per column, often taken from a subject-level dataset, as the column counts. Originally, counts could only be supplied as a vector, but users often want the possibility to use `alt_counts_df` instead. The `cinfo` object (`InstantiatedColumnInfo`) is created with all the above information.
+
+If we continue inside `build_table`, we see `.make_ctab` used for the root split. This is a general procedure that generates the initial root split as a content row (`ctab`), i.e. a row that contains only a label. From `?summarize_row_groups`, you know that this is how `rtables` defines label rows, i.e. as content rows. `.make_ctab` is very similar to the function that actually creates the table rows, `.make_tablerows`. Note that this function uses `parent_cfun` and `.make_caller` to retrieve the content function inserted at the levels above. Here the structural handling of the table object and the row-creation engine are separated, with a `.make_tablerows` call dividing the two. If you search the package, you will find that this function is called only twice: once in `.make_ctab` and once in `.make_analyzed_tab`. These two are the final elements of the table construction: the creation of rows.
+
+Going back to `build_table`, you will see that the row layout is actually a list of split vectors. The fundamental line, `kids <- lapply(seq_along(rlyt), function(i) {`, allows us to appreciate this. Going forward, we see how `recursive_applysplit` is applied to each split vector. It may be worthwhile to check what this vector looks like in our test case.
+
+```{r, eval=FALSE}
+# rtables 0.6.2
+# A very simple layout
+lyt <- basic_table() %>%
+  split_rows_by("STRATA1") %>%
+  split_rows_by("SEX", split_fun = drop_split_levels) %>%
+  split_cols_by("ARM") %>%
+  analyze("BMRKR1")
+
+rlyt <- rtables:::rlayout(lyt)
+str(rlyt, max.level = 2)
+
+Formal class 'PreDataRowLayout' [package "rtables"] with 2 slots
+  ..@ .Data     :List of 2 # rlyt is a rtables object (PreDataRowLayout) that is also a list!
+ ..@ root_split:Formal class 'RootSplit' [package "rtables"] with 17 slots # another object! + # If you do summarize_row_groups before anything you act on the root split. We need this to + # have a place for the content that is valid for the whole table. + +str(rtables:::root_spl(rlyt), max.level = 2) # it is still a split + +str(rlyt[[1]], max.level = 3) # still a rtables object (SplitVector) that is a list +Formal class 'SplitVector' [package "rtables"] with 1 slot + ..@ .Data:List of 3 + .. ..$ :Formal class 'VarLevelSplit' [package "rtables"] with 20 slots + .. ..$ :Formal class 'VarLevelSplit' [package "rtables"] with 20 slots + .. ..$ :Formal class 'AnalyzeMultiVars' [package "rtables"] with 17 slots +``` + +The last print is very informative. We can see from the layout construction that this object is built with 2 `VarLevelSplit`s on the rows and one final `AnalyzeMultiVars`, which is the leaf analysis split that has the final level rows. The second split vector is the following `AnalyzeVarSplit`. + +xxx To get multiple split vectors, you need to escape the nesting with `nest = FALSE` or by adding a `split_rows_by` call after an `analyze` call. + +```{r, eval=FALSE} +# rtables 0.6.2 +str(rlyt[[2]], max.level = 5) +Formal class 'SplitVector' [package "rtables"] with 1 slot + ..@ .Data:List of 1 + .. ..$ :Formal class 'AnalyzeVarSplit' [package "rtables"] with 21 slots + .. .. .. ..@ analysis_fun :function (x, ...) + .. .. .. .. ..- attr(*, "srcref")= 'srcref' int [1:8] 1723 5 1732 5 5 5 4198 4207 + .. .. .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilealias', 'srcfile' + .. .. .. ..@ default_rowlabel : chr "Var3 Counts" + .. .. .. ..@ include_NAs : logi FALSE + .. .. .. ..@ var_label_position : chr "default" + .. .. .. ..@ payload : chr "VAR3" + .. .. .. ..@ name : chr "VAR3" + .. .. .. ..@ split_label : chr "Var3 Counts" + .. .. .. ..@ split_format : NULL + .. .. .. ..@ split_na_str : chr NA + .. .. .. ..@ split_label_position : chr(0) + .. .. .. ..@ content_fun : NULL + .. .. .. ..@ content_format : NULL + .. .. .. ..@ content_na_str : chr(0) + .. .. .. ..@ content_var : chr "" + .. .. .. ..@ label_children : logi FALSE + .. .. .. ..@ extra_args : list() + .. .. .. ..@ indent_modifier : int 0 + .. .. .. ..@ content_indent_modifier: int 0 + .. .. .. ..@ content_extra_args : list() + .. .. .. ..@ page_title_prefix : chr NA + .. .. .. ..@ child_section_div : chr NA +``` + +Continuing in `recursive_applysplit`, this is made up of two main calls: one to `.make_ctab` which makes the content row and calculates the counts if specified, and `.make_split_kids`. This eventually contains `recursive_applysplit` which is applied if the split vector is built of `Split`s that are not `analyze` splits. It being a generic is very handy here to switch between different downstream processes. In our case (`rlyt[[1]]`) we will call the method `getMethod(".make_split_kids", "Split")` twice before getting to the analysis split. There, we have a (xxx) multi-variable split which applies `.make_split_kids` to each of its elements, in turn calling the main `getMethod(".make_split_kids", "VAnalyzeSplit")` which would in turn go to `.make_analyzed_tab`. + +There are interesting edge cases here for different split cases, like `split_by_multivars` or when one of the splits has a reference group. In the internal code here, it is called `baseline`. 
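+
+From the user's perspective, this machinery is what powers `ref_group` in `split_cols_by`. The following sketch (assuming the `DM` example data and `rtables` 0.6.2, with `"B: Placebo"` being one of the `ARM` levels) shows how an analysis function can request the special `.ref_group` and `.in_ref_col` arguments to compare each column against the baseline column.
+
+```{r, eval=FALSE}
+# rtables 0.6.2 -- sketch, not run during rendering
+library(rtables)
+
+lyt_ref <- basic_table() %>%
+  split_cols_by("ARM", ref_group = "B: Placebo") %>%
+  analyze("AGE", afun = function(x, .ref_group, .in_ref_col) {
+    in_rows(
+      "Mean difference vs placebo" = non_ref_rcell(
+        mean(x) - mean(.ref_group), # difference from the baseline column
+        is_ref = .in_ref_col,       # blanked out in the reference column itself
+        format = "xx.xx"
+      )
+    )
+  })
+
+build_table(lyt_ref, DM)
+```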
+
+If we follow the `baseline` variable across the function layers, we will see that where the split happens (`do_split`, in `getMethod(".make_split_kids", "Split")`) a second split is made for the reference group. This is done to make the reference group data available in each row, so that, for example, differences from the reference group can be calculated.
+
+Now we move towards `.make_tablerows`, and here the analysis functions become key, as this is the place where they are applied. First, the external `tryCatch` is used to catch errors at a higher level, so as to differentiate the two major blocks. The function parameters here are quite intuitive, with the exception of `spl_context`. This is a fundamental parameter that keeps information about the splits so that it is visible to analysis functions. If you look into this value, you will see that it is carried and updated everywhere a split happens, except for columns. Column-related information is added last, in `gen_onerv`, which is the lowest level where a single result value is produced. From `.make_tablerows` we go to `gen_rowvalues`, aside from some handling of rows and referential footnotes. `gen_rowvalues` unpacks the `cinfo` object and crosses it with the incoming row split information to generate rows. In particular, `rawvals <- mapply(gen_onerv,` maps over the columns to generate a list of values corresponding to a table row. Looking at the final `if` in `gen_onerv`, we see `if (!is(val, "RowsVerticalSection"))`, in which case the function `in_rows` is called. We invite the reader to explore the building blocks of `in_rows`, and how `.make_tablerows` constructs a content row (`ContentRow`) or a data row (`DataRow`) depending on whether it is called from `.make_ctab` or `.make_analyzed_tab`.
+
+`.make_tablerows` either makes a content table or an "analysis table".
+`gen_rowvalues` generates a list of stacks (`RowsVerticalSection`, potentially more than one row!) for each column.
+
+To add: conceptual part -> calculating things by column, putting the columns side by side, then slicing them by rows and reassembling -> rtables is row dominant
diff --git a/rtables.Rproj b/rtables.Rproj
index 828602dd8..eaa6b8186 100644
--- a/rtables.Rproj
+++ b/rtables.Rproj
@@ -6,7 +6,7 @@ AlwaysSaveHistory: Default
 EnableCodeIndexing: Yes
 UseSpacesForTab: Yes
-NumSpacesForTab: 4
+NumSpacesForTab: 2
 Encoding: UTF-8
 RnwWeave: Sweave