-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove chunks + merge_expression_srv #77
Conversation
Co-authored-by: Dawid Kałędkowski <[email protected]>
Co-authored-by: Pawel Rucki <[email protected]>
…nto 61_data_merge_dependency@main
Code Coverage Summary
Results for commit: cb426c7 Minimum allowed coverage is ♻️ This comment has been updated with latest results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments from me in-line - mostly for work after this PR
#' join_keys <- list( | ||
#' ADSL = list( | ||
#' ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID"), | ||
#' ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID") | ||
#' ), | ||
#' ADLB = list( | ||
#' ADLB = c(STUDYID = "STUDYID", USUBJID = "USUBJID", PARAMCD = "PARAMCD", AVISIT = "AVISIT"), | ||
#' ADSL = c(STUDYID = "STUDYID", USUBJID = "USUBJID") | ||
#' ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this issue insightsengineering/teal.slice#53 is really important I think, we're going to end up creating duplication i.e. helper functions for creating join_keys embedded inside teal.data::cdisc_data, the same inside teal.slice and then also here.
key_list <- sapply(X = datasets$datanames(), simplify = FALSE, FUN = function(x) { | ||
datasets$get_keys(dataname = x) | ||
}) | ||
join_keys_list <- datasets$get_join_keys() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have created filtered_data not from teal.data then the primary keys probably won't be in the join_keys
#won't have join_keys
init_filtered_data(list(ADSL = ..., keys = c("USUBJID", "STUDYID"))
# will do
init_filtered_data(list(ADSL = ..., keys = c("USUBJID", "STUDYID"), join_keys = join_keys(...))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue create in teal.slice
as discussed:
insightsengineering/teal.slice#58
R/data_extract_module.R
Outdated
@@ -429,6 +428,12 @@ data_extract_srv.list <- function(id, datasets, data_extract_spec, keys = NULL, | |||
"data_extract_srv.list initialized with datasets: { paste(names(datasets), collapse = ', ') }." | |||
) | |||
|
|||
# get keys out of join_keys | |||
is_nested_list <- any(sapply(join_keys, is.list)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm - so it feels weird to need this sort of thing - if we actually used a JoinKeys object throughout instead of nested lists (which to be fair are just as painful to specify) then we wouldn't need to care about if something is a nested list or not we could just call join_keys$get()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this comes back to how we are splitting things up. Users need to specify the keys for merging/joining data.frames together, we then need to use this information throughout the code base and we probably need helper functions to create this information in a simpler way for standard cases such as cdisc.
This suggests we need a (possibly very lightweight) object to standardize creation and usage. At the moment we have both JoinKeys and a named nested list, which I think is confusing for everyone - for users as there are two distinct ways for creating the information (tbh both about as equally complex) and for us - as there are two different ways for accessing the information. Should we use JoinKeys throughout? Or should we abandon them everywhere in favour of a nested list and manually extract what's needed each time from it.
This sort of comment probably applies to other things as well - you can create a TealDataset or just create a data.frame with some attributes etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just simplified this step, there is no more check for a nested list. We can accept a nested join_keys list format only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have created an issue in TealData to simplify the JoinKeys and to make wrappers which can ease up the creation of the join_keys list.
R/data_extract_module.R
Outdated
@@ -258,7 +258,7 @@ check_data_extract_spec_react <- function(datasets, data_extract) { | |||
#' @param datasets (`FilteredData` or `list` of `reactive` or non-`reactive` `data.frame`)\cr | |||
#' object containing data either in the form of [teal.slice::FilteredData] or as a list of `data.frame`. | |||
#' When passing a list of non-reactive `data.frame`s, they are converted to reactive `data.frame`s internally. | |||
#' When passing a list of reactive or non-reactive `data.frame`s, the argument `keys` is required also. | |||
#' When passing a list of reactive or non-reactive `data.frame`s, the argument `join_keys` is required also. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future you could imagine join_keys to be reactive...
"ADSL_FILTERED <- ADSL", | ||
"ADLB_FILTERED <- ADLB", | ||
"ANL_1 <- ADSL_FILTERED %>% dplyr::select(STUDYID, USUBJID, AGE)", | ||
"ANL_2 <- ADLB_FILTERED %>% dplyr::select(STUDYID, USUBJID, AVAL, CHG)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the longer term - should the merge expression actually contain the filter call - I guess the advantage is it contains everything needed to recreate the data at this point, the disadvantage is that if you have multiple data merges in the same module you'll end up with multiple expressions of the filtering and this could get very confusing/wrong if we are removing "_FILTERED" etc. as the datasets coming into a module may or may not have been filtered
So I wonder if the module developer should be in charge of what to do with the "pure" merge expression and if they wish to append it to some chunk which contains the filtered data expression then they can - and maybe for a second merge they wish to do something different
The dplyr::select stuff lives with the data_extract_spec doesn't it -> so you could imagine the merge call not actually returning them either (and the d-e-s doing so further decoupling things) - but I suspect that would be harder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decoupling the steps inside merge_datasets
is necessary to avoid repeating expressions. I guess this would be a part of decoupling data_extract
from data_merge
modules initiative. @gogonzo will this be in action soon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikolas-burkoff we need to continue discussion about filtering in data_extract_spec/data_merge and find the solution for this. I made a comment in my PR which adds more into this equation.
I think it is nuts that we need to depend on hardcoded _FILTERED in data_merge_expr but we have no alternative at this moment. We should rather keep the same name as in the reactive list (environment should be consisten with the code).
Let's re-initialize discussions about data_merge soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, this was definitely not to be addressed in this PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything works with insightsengineering/teal#674
Former tests return the same output 👍
Please add the NEWS entry
Co-authored-by: Dawid Kałędkowski <[email protected]>
…o 654_pass_reactive_data@main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments - the ANL one is probably the key one
closes #61
replaces #75 to have the same branch name as insightsengineering/teal#674
Dependency on
datasets
andchunks
has been removed from data merge module is such a way:merge_expression_module
andmerge_expression_srv
which are the new versions ofdata_merge_module
anddata_merge_srv
. The new modules have an additional argumentjoin_keys
and the argumentdatasets
accepts a list of reactive or non-reactive data.frames. So we have a separation between the old modules that acceptFilteredData
and the new modules that accept a named list of reactive/non-reactive data.frames.Merge_datasets
is modified to use a list of reactive data.frames instead ofFilteredData
and chunks was completely removed from it. It now returns a list ofexpr
,columns_source
,key
andfilter_info
. Nodata
orchunks
are returned anymore. Be aware thatexpr
now is a list of calls instead of character.get_dplyr_call
andget_merge_call
have been modified accordingly.data_merge_srv
has been modified in a way to use the samemerge_datasets
,get_dplyr_call
andget_merge_call
scripts. It definesdata
andjoin_keys
internally and callsmerge_expression_srv
where it appendsdata
andchunks
to the returned list.join_keys
argument replaceskeys
argument ofdata_extract_XXX
. This is to avoid the confusion of having to usejoin_keys
(data_merge_XXX) andkeys
(data_extract_XXX) in the same module. We extract internally the keys corresponding to every dataset from thejoin_keys
input.Please test with the example apps in
merge_expression_module.R
anddata_merge_module.R
.The design of the apps would be like this now: