
remove chunks + merge_expression_srv #77

Merged: 21 commits merged into main on Jul 1, 2022

Conversation

mhallal1 (Contributor):

Closes #61.
Replaces #75 to have the same branch name as insightsengineering/teal#674.

The dependency on datasets and chunks has been removed from the data merge module in the following way:

  1. The new modules merge_expression_module and merge_expression_srv are the new versions of data_merge_module and data_merge_srv. They have an additional join_keys argument, and their datasets argument accepts a named list of reactive or non-reactive data.frames. So we have a separation between the old modules, which accept FilteredData, and the new modules, which accept a named list of reactive/non-reactive data.frames (see the input sketch after this list).
  2. merge_datasets is modified to use a list of reactive data.frames instead of FilteredData, and chunks has been completely removed from it. It now returns a list of expr, columns_source, key and filter_info. No data or chunks are returned anymore. Be aware that expr is now a list of calls instead of a character vector.
  3. get_dplyr_call and get_merge_call have been modified accordingly.
  4. data_merge_srv has been modified to reuse the same merge_datasets, get_dplyr_call and get_merge_call scripts. It defines data and join_keys internally, calls merge_expression_srv, and appends data and chunks to the returned list.
  5. The join_keys argument replaces the keys argument of data_extract_XXX. This avoids the confusion of having to use join_keys (in data_merge_XXX) and keys (in data_extract_XXX) in the same module. The keys corresponding to every dataset are extracted internally from the join_keys input.
  6. Tests are added and modified as needed.
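
A minimal sketch of the inputs the new modules expect; the ADSL/ADLB data sets and key columns below are illustrative only (they mirror the documented join_keys example quoted further down in this thread):

library(shiny)  # for reactive()

# Named list of reactive (or plain) data.frames that replaces the FilteredData object.
data_list <- list(
  ADSL = reactive(ADSL),
  ADLB = reactive(ADLB)
)

# Nested named list of join keys: join_keys[["ADSL"]][["ADLB"]] gives the
# columns used to join ADSL with ADLB.
join_keys <- list(
  ADSL = list(
    ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID"),
    ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID")
  ),
  ADLB = list(
    ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID", PARAMCD = "PARAMCD", AVISIT = "AVISIT"),
    ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID")
  )
)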

Please test with the example apps in merge_expression_module.R and data_merge_module.R.

The design of the apps now looks like this:

server = function(input, output, session) {
  # initialize chunks
  chunks_h <- teal.code::chunks_new()

  # push data into chunks
  # ...

  # call data_merge
  merged_data <- merge_expression_srv(
    selector_list = selector_list,
    datasets = data_list,
    join_keys = join_keys,
    merge_function = "dplyr::left_join"
  )

  # push the data_merge code into chunks
  ch_merge <- reactive({
    ch <- teal.code::chunks_deep_clone(chunks_h)
    for (chunk in merged_data()$expr) teal.code::chunks_push(chunks = ch, expression = chunk)
    ch$eval()
    ch
  })

  # get output from chunks
  # ...
}
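
For comparison, the existing data_merge_srv wrapper described in point 4 keeps the old interface and handles chunks internally. A rough sketch only; the exact argument set is assumed from the description above, so treat it as illustrative:

# Old-style call in an app still built on FilteredData: data_merge_srv derives
# data and join_keys internally, calls merge_expression_srv, and appends
# data and chunks to the returned list.
merged_data <- data_merge_srv(
  selector_list = selector_list,
  datasets = datasets,  # FilteredData
  merge_function = "dplyr::left_join"
)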

@mhallal1 added the core label on Jun 29, 2022
github-actions bot commented on Jun 29, 2022

Unit Tests Summary

1 file, 24 suites, 52s ⏱️
195 tests: 194 passed ✔️, 1 skipped 💤, 0 failed
674 runs: 673 passed ✔️, 1 skipped 💤, 0 failed

Results for commit 06e0c06.


github-actions bot commented on Jun 29, 2022

Code Coverage Summary

Filename                          Stmts    Miss  Cover    Missing
------------------------------  -------  ------  -------  ---------------------------------------------------------------------
R/all_choices.R                       1       0  100.00%
R/call_utils.R                      156     124  20.51%   14-23, 64, 66, 68, 107-431
R/check_selector.R                   31       0  100.00%
R/choices_labeled.R                 202      61  69.80%   60, 71, 76, 87, 103, 221-225, 229-234, 264-277, 390-391, 393, 425-473
R/choices_selected.R                 81      11  86.42%   201-229, 260
R/column_functions.R                  3       3  0.00%    13-16
R/data_extract_datanames.R           32       8  75.00%   9-13, 64-66
R/data_extract_filter_module.R       92      11  88.04%   75-82, 84, 87-88, 141
R/data_extract_module.R             250      60  76.00%   3, 121, 126, 143, 146-151, 153, 172-175, 203-249, 445, 450, 481
R/data_extract_read_module.R        122      13  89.34%   28, 32-35, 102-107, 116, 133
R/data_extract_select_module.R       32      18  43.75%   31-48
R/data_extract_single_module.R       53       2  96.23%   29, 42
R/data_extract_spec.R                32       0  100.00%
R/data_merge_module.R                53      10  81.13%   101-113
R/filter_spec.R                     186       1  99.46%   373
R/format_data_extract.R              16       1  93.75%   49
R/get_dplyr_call.R                  299       0  100.00%
R/get_merge_call.R                  285      30  89.47%   29-36, 47, 210-219, 372, 388-400
R/include_css_js.R                    5       0  100.00%
R/input_checks.R                     11       2  81.82%   18-19
R/merge_data_utils.R                  2       0  100.00%
R/merge_datasets.R                  143       6  95.80%   72, 229-233
R/merge_expression_module.R          42       0  100.00%
R/resolve_delayed.R                  16       0  100.00%
R/resolve.R                         114      44  61.40%   229-312
R/select_spec.R                      49       8  83.67%   149, 213-220
R/utils.R                            15      11  26.67%   26-39
R/zzz.R                               2       2  0.00%    2-3
TOTAL                              2325     426  81.68%

Results for commit: cb426c7

Minimum allowed coverage is 80%


R/data_merge_module.R: review comment (outdated, resolved)
@nikolas-burkoff (Contributor) left a comment:

A few comments from me in-line, mostly for work after this PR.

Comment on lines +216 to +224
#' join_keys <- list(
#'   ADSL = list(
#'     ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID"),
#'     ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID")
#'   ),
#'   ADLB = list(
#'     ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID", PARAMCD = "PARAMCD", AVISIT = "AVISIT"),
#'     ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID")
#'   )
Contributor:

So this issue, insightsengineering/teal.slice#53, is really important I think; we're going to end up creating duplication, i.e. helper functions for creating join_keys embedded inside teal.data::cdisc_data, the same inside teal.slice, and then also here.

key_list <- sapply(X = datasets$datanames(), simplify = FALSE, FUN = function(x) {
  datasets$get_keys(dataname = x)
})
join_keys_list <- datasets$get_join_keys()
Contributor:

If you have created filtered_data not from teal.data, then the primary keys probably won't be in the join_keys:

# won't have join_keys
init_filtered_data(list(ADSL = ..., keys = c("USUBJID", "STUDYID")))
# will do
init_filtered_data(list(ADSL = ..., keys = c("USUBJID", "STUDYID"), join_keys = join_keys(...)))

Contributor (author):

Issue created in teal.slice as discussed:
insightsengineering/teal.slice#58

@@ -429,6 +428,12 @@ data_extract_srv.list <- function(id, datasets, data_extract_spec, keys = NULL,
"data_extract_srv.list initialized with datasets: { paste(names(datasets), collapse = ', ') }."
)

# get keys out of join_keys
is_nested_list <- any(sapply(join_keys, is.list))
Contributor:

Hmm, so it feels weird to need this sort of thing. If we actually used a JoinKeys object throughout instead of nested lists (which, to be fair, are just as painful to specify), then we wouldn't need to care whether something is a nested list or not; we could just call join_keys$get().

Contributor:

I guess this comes back to how we are splitting things up. Users need to specify the keys for merging/joining data.frames together; we then need to use this information throughout the code base, and we probably need helper functions to create it in a simpler way for standard cases such as CDISC.

This suggests we need a (possibly very lightweight) object to standardize creation and usage. At the moment we have both JoinKeys and a named nested list, which I think is confusing for everyone: for users, as there are two distinct ways of creating the information (both about equally complex), and for us, as there are two different ways of accessing it. Should we use JoinKeys throughout? Or should we abandon it everywhere in favour of a nested list and manually extract what's needed from it each time?

This sort of comment probably applies to other things as well: you can create a TealDataset, or just create a data.frame with some attributes, etc.
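
For reference, the two representations under discussion look roughly like this. A sketch only, assuming teal.data's join_key()/join_keys() constructors and the JoinKeys$get() accessor mentioned above, with example ADSL/ADLB key columns:

# Nested named list: join_keys_list[["ADSL"]][["ADLB"]] gives the columns
# used to join ADSL with ADLB.
join_keys_list <- list(
  ADSL = list(ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID")),
  ADLB = list(ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID"))
)

# JoinKeys object holding the same information, queried via $get().
join_keys_obj <- teal.data::join_keys(
  teal.data::join_key("ADSL", "ADLB", c(STUDYID = "STUDYID", USUBJID = "USUBJID"))
)
join_keys_obj$get("ADSL", "ADLB")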

Contributor (author):

I have just simplified this step; there is no longer a check for a nested list. We accept only the nested join_keys list format.

Contributor:

Yes, I have created an issue in teal.data to simplify JoinKeys and to add wrappers which ease the creation of the join_keys list.

insightsengineering/teal.data#78

@@ -258,7 +258,7 @@ check_data_extract_spec_react <- function(datasets, data_extract) {
#' @param datasets (`FilteredData` or `list` of `reactive` or non-`reactive` `data.frame`)\cr
#' object containing data either in the form of [teal.slice::FilteredData] or as a list of `data.frame`.
#' When passing a list of non-reactive `data.frame`s, they are converted to reactive `data.frame`s internally.
#' When passing a list of reactive or non-reactive `data.frame`s, the argument `keys` is required also.
#' When passing a list of reactive or non-reactive `data.frame`s, the argument `join_keys` is required also.
Contributor:

In the future you could imagine join_keys to be reactive...

Comment on lines +228 to +231
"ADSL_FILTERED <- ADSL",
"ADLB_FILTERED <- ADLB",
"ANL_1 <- ADSL_FILTERED %>% dplyr::select(STUDYID, USUBJID, AGE)",
"ANL_2 <- ADLB_FILTERED %>% dplyr::select(STUDYID, USUBJID, AVAL, CHG)",
Contributor:

So in the longer term: should the merge expression actually contain the filter call? The advantage is that it contains everything needed to recreate the data at this point; the disadvantage is that if you have multiple data merges in the same module you'll end up with the filtering expressed multiple times, and this could get very confusing (or wrong) if we are removing "_FILTERED" etc., as the datasets coming into a module may or may not have been filtered.

So I wonder whether the module developer should be in charge of what to do with the "pure" merge expression: if they wish to append it to some chunk which contains the filtered-data expression then they can, and maybe for a second merge they wish to do something different.

The dplyr::select calls live with the data_extract_spec, don't they? So you could imagine the merge call not actually returning them either (with the data_extract_spec doing so, further decoupling things), but I suspect that would be harder.

Contributor (author):

Decoupling the steps inside merge_datasets is necessary to avoid repeating expressions. I guess this would be part of the initiative to decouple data_extract from the data_merge modules. @gogonzo, will this be in action soon?

Contributor:

@nikolas-burkoff we need to continue the discussion about filtering in data_extract_spec/data_merge and find a solution for this. I made a comment in my PR which adds more to this equation.

I think it is nuts that we need to depend on a hardcoded _FILTERED suffix in data_merge_expr, but we have no alternative at this moment. We should rather keep the same name as in the reactive list (the environment should be consistent with the code).

Let's re-initialize discussions about data_merge soon.

Contributor:

Yup, this was definitely not to be addressed in this PR...

@gogonzo (Contributor) left a comment:

Everything works with insightsengineering/teal#674
Former tests return the same output 👍
Please add the NEWS entry.

R/data_merge_module.R: review comment (outdated, resolved)
@mhallal1 mentioned this pull request on Jul 1, 2022 (3 tasks)
@nikolas-burkoff (Contributor) left a comment:

A few comments; the ANL one is probably the key one.

NEWS.md: review comment (outdated, resolved)
R/data_extract_module.R: review comment (outdated, resolved)
R/data_extract_module.R: review comment (outdated, resolved)
R/data_merge_module.R: review comment (outdated, resolved)
R/data_merge_module.R: review comment (outdated, resolved)
@mhallal1 merged commit 2140417 into main on Jul 1, 2022
@cicdguy deleted the 654_pass_reactive_data@main branch on June 9, 2023 15:01

Successfully merging this pull request may close these issues.

data_merge - remove dependency on FilteredData and chunks.