Closes #2142 Supersede get_summary_records() and enhance `derive_su…

…mmary_records()` (#2158) * feat: #2142 intiial superseding step * rough draft without missing_values implementation * rough draft of new enhancement * feat: #2142 get a good clean slate * feat: #2142 working enhanced function * feat: #2142 add appropriate test suite * feat: #2142 run styler, lintr, add news and roxygen documentation * chore: #2142 spelling/grammar * chore: #2142 fix test * should we inform superseded * retain deprecated arguments to pass cicd * chore: #2142 add remotes for admiraldev for proper branching strategy * min dev versioning * Update DESCRIPTION * chore: #2142 address feedback * upversion our description page to match current version up on github * docs: little note for running website versions * feat: #2142 rename filter to filter_add * Update R/derive_summary_records.R Co-authored-by: Ben Straub <[email protected]> * feat: #2142 get checks appropriately running * roxygen stuff and vignettes * chore: #2142 roxygen stuff * get past check-templates * finally get past templates * feat: #2142 clear up missing_values usage * chore: #2142 rename filter to filter_add internally in codebase too * chore: #2142 adopt and address all other feedback * missed a renaming * feat: #2142 remove extra fluff for missing values * update news blurb * chore: #2142 update documentation based on feedback --------- Co-authored-by: Zelos Zhu <[email protected]> Co-authored-by: Daniel Sjoberg <[email protected]> Co-authored-by: Jerry Johnson <[email protected]> Co-authored-by: Ben Straub <[email protected]>
pharmaverse · Nov 9, 2023 · 205e0d1 · 205e0d1
1 parent 708d232
commit 205e0d1
Show file tree

Hide file tree

Showing 32 changed files with 439 additions and 104 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -38,8 +38,9 @@ LazyData: true
 Roxygen: list(markdown = TRUE)
 RoxygenNote: 7.2.3
 Depends: R (>= 4.0)
+Remotes: pharmaverse/admiraldev 
 Imports:
-    admiraldev (>= 0.4.0),
+    admiraldev (>= 0.5.0.9000),
     dplyr (>= 0.8.4),
     hms (>= 0.5.3),
     lifecycle (>= 0.1.0),

diff --git a/NEWS.md b/NEWS.md
@@ -17,6 +17,10 @@ character vector (`'--DTC'`), was imputed. (#2146)
 were enhanced such that more than one summary variable can be derived, e.g.,
 `AVAL` as the sum and `ADT` as the maximum of the contributing records. (#1792)
 
+- `derive_summary_records()` was enhanced with the following optional arguments: `dataset_add`, `dataset_ref`, `missing_values`. These arguments respectively, generate summary variables from additional datasets, retain/add specific records from a reference dataset, and impute user-defined missing values. `derive_param_exposure()` was enhanced with `dataset_add` as well. (#2142)
+
+- The argument `dataset` is now optional for `derive_summary_records()` and `derive_param_exposure()`. (#2142)
+
 - The "joined" functions (`derive_vars_joined()`, `derive_var_joined_exist_flag()`,
 `filter_joined()`, and `event_joined()`) were unified: (#2126)
     - The `dataset_add` and `filter_add` arguments were added to
@@ -37,6 +41,7 @@ were enhanced such that more than one summary variable can be derived, e.g.,
 allow more control of the selection of records. It creates a temporary variable
 for the event number, which can be used in `order`. (#2140)
 
+
 ## Breaking Changes
 
 - `derive_extreme_records()` the `dataset_add` argument is now mandatory. (#2139)
@@ -45,6 +50,8 @@ for the event number, which can be used in `order`. (#2140)
 `analysis_var` and `summary_fun` were deprecated in favor of `set_values_to`.
 (#1792)
 
+- In `derive_summary_records()` and `derive_param_exposure()` the argument `filter` was renamed to `filter_add` (#2142)
+
 - In `derive_var_merged_summary()` the arguments `new_var`, `analysis_var`, and
 `summary_fun` were deprecated in favor of `new_vars`. (#1792)
 

diff --git a/R/derive_param_exposure.R b/R/derive_param_exposure.R
@@ -4,17 +4,43 @@
 #' start (`ASTDT(M)`)and end date (`AENDT(M)`) as the minimum and maximum date by `by_vars`.
 #'
 #' @param dataset
-#'   `r roxygen_param_dataset(expected_vars = c("by_vars", "analysis_var"))`
-#'   `PARAMCD` is expected as well,
-#'   + Either `ASTDTM` and `AENDTM` or `ASTDT` and `AENDT` are also expected.
+#'   `r roxygen_param_dataset(expected_vars = c("by_vars"))`
 #'
-#' @param filter Filter condition
+#' @param dataset_add Additional dataset
 #'
-#'   The specified condition is applied to the input dataset before deriving the
-#'   new parameter, i.e., only observations fulfilling the condition are taken
-#'   into account.
+#'   The variables specified for `by_vars`, `analysis_var`, `PARAMCD`,
+#'   alongside either `ASTDTM` and `AENDTM` or `ASTDT` and `AENDT` are also expected.
+#'   Observations from the specified dataset are going to be used to calculate and added
+#'   as new records to the input dataset (`dataset`).
 #'
-#'   *Permitted Values:* a condition
+#'
+#' @param filter
+#'
+#'  `r lifecycle::badge("deprecated")` Please use `filter_add` instead.
+#'
+#'   Filter condition as logical expression to apply during
+#'   summary calculation. By default, filtering expressions are computed within
+#'   `by_vars` as this will help when an aggregating, lagging, or ranking
+#'   function is involved.
+#'
+#'   For example,
+#'
+#'   + `filter = (AVAL > mean(AVAL, na.rm = TRUE))` will filter all `AVAL`
+#'   values greater than mean of `AVAL` with in `by_vars`.
+#'   + `filter = (dplyr::n() > 2)` will filter n count of `by_vars` greater
+#'   than 2.
+#'
+#' @param filter_add Filter condition as logical expression to apply during
+#'   summary calculation. By default, filtering expressions are computed within
+#'   `by_vars` as this will help when an aggregating, lagging, or ranking
+#'   function is involved.
+#'
+#'   For example,
+#'
+#'   + `filter_add = (AVAL > mean(AVAL, na.rm = TRUE))` will filter all `AVAL`
+#'   values greater than mean of `AVAL` with in `by_vars`.
+#'   + `filter_add = (dplyr::n() > 2)` will filter n count of `by_vars` greater
+#'   than 2.
 #'
 #' @param input_code Required parameter code
 #'
@@ -95,6 +121,7 @@
 #' # Cumulative dose
 #' adex %>%
 #'   derive_param_exposure(
+#'     dataset_add = adex,
 #'     by_vars = exprs(USUBJID),
 #'     set_values_to = exprs(PARAMCD = "TDOSE", PARCAT1 = "OVERALL"),
 #'     input_code = "DOSE",
@@ -106,6 +133,7 @@
 #' # average dose in w2-24
 #' adex %>%
 #'   derive_param_exposure(
+#'     dataset_add = adex,
 #'     by_vars = exprs(USUBJID),
 #'     filter = VISIT %in% c("WEEK 2", "WEEK 24"),
 #'     set_values_to = exprs(PARAMCD = "AVDW224", PARCAT1 = "WEEK2-24"),
@@ -118,19 +146,22 @@
 #' # Any dose adjustment?
 #' adex %>%
 #'   derive_param_exposure(
+#'     dataset_add = adex,
 #'     by_vars = exprs(USUBJID),
 #'     set_values_to = exprs(PARAMCD = "TADJ", PARCAT1 = "OVERALL"),
 #'     input_code = "ADJ",
 #'     analysis_var = AVALC,
 #'     summary_fun = function(x) if_else(sum(!is.na(x)) > 0, "Y", NA_character_)
 #'   ) %>%
 #'   select(-ASTDTM, -AENDTM)
-derive_param_exposure <- function(dataset,
+derive_param_exposure <- function(dataset = NULL,
+                                  dataset_add,
                                   by_vars,
                                   input_code,
                                   analysis_var,
                                   summary_fun,
                                   filter = NULL,
+                                  filter_add = NULL,
                                   set_values_to = NULL) {
   by_vars <- assert_vars(by_vars)
   analysis_var <- assert_symbol(enexpr(analysis_var))
@@ -155,25 +186,36 @@ derive_param_exposure <- function(dataset,
     )
   }
 
-  assert_data_frame(dataset,
+  assert_data_frame(dataset, required_vars = by_vars, optional = TRUE)
+  assert_data_frame(dataset_add,
     required_vars = expr_c(by_vars, analysis_var, exprs(PARAMCD), dates)
   )
-  filter <- assert_filter_cond(enexpr(filter), optional = TRUE)
+
+  if (!missing(filter)) {
+    deprecate_warn(
+      "1.0.0",
+      I("derive_param_exposure(filter = )"),
+      "derive_param_exposure(filter_add = )"
+    )
+    filter_add <- assert_filter_cond(enexpr(filter), optional = TRUE)
+  }
+  filter_add <- assert_filter_cond(enexpr(filter_add), optional = TRUE)
   assert_varval_list(set_values_to, required_elements = "PARAMCD")
   assert_param_does_not_exist(dataset, set_values_to$PARAMCD)
   assert_character_scalar(input_code)
   params_available <- unique(dataset$PARAMCD)
   assert_character_vector(input_code, values = params_available)
   assert_s3_class(summary_fun, "function")
 
-  if (is.null(filter)) {
-    filter <- TRUE
+  if (is.null(filter_add)) {
+    filter_add <- TRUE
   }
 
   derive_summary_records(
     dataset,
+    dataset_add,
     by_vars = by_vars,
-    filter = PARAMCD == !!input_code & !!filter,
+    filter_add = PARAMCD == !!input_code & !!filter_add,
     set_values_to = exprs(
       !!analysis_var := {{ summary_fun }}(!!analysis_var),
       !!!set_dtm,

diff --git a/R/derive_summary_records.R b/R/derive_summary_records.R
@@ -11,13 +11,29 @@
 #' retain those common values in the newly derived records. Otherwise new value
 #' will be set to `NA`.
 #'
-#' @param dataset  `r roxygen_param_dataset(expected_vars = c("by_vars", "analysis_var"))`
+#' @param dataset  `r roxygen_param_dataset(expected_vars = c("by_vars"))`
+#'
+#' @param dataset_add Additional dataset
+#'
+#'    The variables specified for `by_vars` are expected.
+#'   Observations from the specified dataset are going to be used to calculate and added
+#'   as new records to the input dataset (`dataset`).
+#'
+#' @param dataset_ref Reference dataset
+#'
+#'   The variables specified for `by_vars` are expected. For each
+#'   observation of the specified dataset a new observation is added to the
+#'   input dataset.
 #'
 #' @param by_vars Variables to consider for generation of groupwise summary
 #'   records. Providing the names of variables in [exprs()] will create a
 #'   groupwise summary and generate summary records for the specified groups.
 #'
-#' @param filter Filter condition as logical expression to apply during
+#' @param filter
+#'
+#'  `r lifecycle::badge("deprecated")` Please use `filter_add` instead.
+#'
+#'   Filter condition as logical expression to apply during
 #'   summary calculation. By default, filtering expressions are computed within
 #'   `by_vars` as this will help when an aggregating, lagging, or ranking
 #'   function is involved.
@@ -29,6 +45,46 @@
 #'   + `filter = (dplyr::n() > 2)` will filter n count of `by_vars` greater
 #'   than 2.
 #'
+#' @param filter_add Filter condition as logical expression to apply during
+#'   summary calculation. By default, filtering expressions are computed within
+#'   `by_vars` as this will help when an aggregating, lagging, or ranking
+#'   function is involved.
+#'
+#'   For example,
+#'
+#'   + `filter_add = (AVAL > mean(AVAL, na.rm = TRUE))` will filter all `AVAL`
+#'   values greater than mean of `AVAL` with in `by_vars`.
+#'   + `filter_add = (dplyr::n() > 2)` will filter n count of `by_vars` greater
+#'   than 2.
+#'
+#' @param set_values_to Variables to be set
+#'
+#'   The specified variables are set to the specified values for the new
+#'   observations.
+#'
+#'   Set a list of variables to some specified value for the new records
+#'   + LHS refer to a variable.
+#'   + RHS refers to the values to set to the variable. This can be a string, a
+#'   symbol, a numeric value, an expression or NA. If summary functions are
+#'   used, the values are summarized by the variables specified for `by_vars`.
+#'
+#'   For example:
+#'   ```
+#'     set_values_to = exprs(
+#'       AVAL = sum(AVAL),
+#'       DTYPE = "AVERAGE",
+#'     )
+#'   ```
+#'
+#' @param missing_values Values for missing summary values
+#'
+#'   For observations of the reference dataset (`dataset_ref`) which do not have a
+#'   complete mapping defined by the summarization defined in `set_values_to`.  Only variables
+#'   specified for `set_values_to` can be specified for `missing_values`.
+#'
+#'   *Permitted Values*: named list of expressions, e.g.,
+#'   `exprs(AVAL = -9999)`
+#'
 #' @inheritParams get_summary_records
 #'
 #' @return A data frame with derived records appended to original dataset.
@@ -72,6 +128,7 @@
 #' # Summarize the average of the triplicate ECG interval values (AVAL)
 #' derive_summary_records(
 #'   adeg,
+#'   dataset_add = adeg,
 #'   by_vars = exprs(USUBJID, PARAM, AVISIT),
 #'   set_values_to = exprs(
 #'     AVAL = mean(AVAL, na.rm = TRUE),
@@ -83,6 +140,7 @@
 #' # Derive more than one summary variable
 #' derive_summary_records(
 #'   adeg,
+#'   dataset_add = adeg,
 #'   by_vars = exprs(USUBJID, PARAM, AVISIT),
 #'   set_values_to = exprs(
 #'     AVAL = mean(AVAL),
@@ -116,27 +174,36 @@
 #' # by group
 #' derive_summary_records(
 #'   adeg,
+#'   dataset_add = adeg,
 #'   by_vars = exprs(USUBJID, PARAM, AVISIT),
-#'   filter = n() > 2,
+#'   filter_add = n() > 2,
 #'   set_values_to = exprs(
 #'     AVAL = mean(AVAL, na.rm = TRUE),
 #'     DTYPE = "AVERAGE"
 #'   )
 #' ) %>%
 #'   arrange(USUBJID, AVISIT)
-derive_summary_records <- function(dataset,
+derive_summary_records <- function(dataset = NULL,
+                                   dataset_add,
+                                   dataset_ref = NULL,
                                    by_vars,
                                    filter = NULL,
+                                   filter_add = NULL,
                                    analysis_var,
                                    summary_fun,
-                                   set_values_to) {
+                                   set_values_to,
+                                   missing_values = NULL) {
   assert_vars(by_vars)
-  filter <- assert_filter_cond(enexpr(filter), optional = TRUE)
+  assert_data_frame(dataset, required_vars = by_vars, optional = TRUE)
+  assert_data_frame(dataset_add, required_vars = by_vars)
   assert_data_frame(
-    dataset,
-    required_vars = by_vars
+    dataset_ref,
+    required_vars = by_vars,
+    optional = TRUE
   )
+
   assert_varval_list(set_values_to)
+  assert_expr_list(missing_values, named = TRUE, optional = TRUE)
 
   if (!missing(analysis_var) || !missing(summary_fun)) {
     deprecate_warn(
@@ -149,14 +216,47 @@ derive_summary_records <- function(dataset,
     set_values_to <- exprs(!!analysis_var := {{ summary_fun }}(!!analysis_var), !!!set_values_to)
   }
 
-  # Summarise the analysis value and bind to the original dataset
-  bind_rows(
-    dataset,
-    get_summary_records(
-      dataset,
-      by_vars = by_vars,
-      filter = !!filter,
-      set_values_to = set_values_to
+  if (!missing(filter)) {
+    deprecate_warn(
+      "1.0.0",
+      I("derive_summary_records(filter = )"),
+      "derive_summary_records(filter_add = )"
     )
+    filter_add <- assert_filter_cond(enexpr(filter), optional = TRUE)
+  }
+  filter_add <- assert_filter_cond(enexpr(filter_add), optional = TRUE)
+
+  summary_records <- dataset_add %>%
+    group_by(!!!by_vars) %>%
+    filter_if(filter_add) %>%
+    summarise(!!!set_values_to) %>%
+    ungroup()
+
+  df_return <- bind_rows(
+    dataset,
+    summary_records
   )
+
+  if (!is.null(dataset_ref)) {
+    add_vars <- colnames(dataset_add)
+    ref_vars <- colnames(dataset_ref)
+
+    new_ref_obs <- anti_join(
+      select(dataset_ref, intersect(add_vars, ref_vars)),
+      select(summary_records, !!!by_vars),
+      by = map_chr(by_vars, as_name)
+    )
+
+    if (!is.null(missing_values)) {
+      new_ref_obs <- new_ref_obs %>%
+        mutate(!!!missing_values)
+    }
+
+    df_return <- bind_rows(
+      df_return,
+      new_ref_obs
+    )
+  }
+
+  df_return
 }