Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to customize merge keys #29

Open
gogonzo opened this issue Sep 28, 2021 · 0 comments
Open

Allow to customize merge keys #29

gogonzo opened this issue Sep 28, 2021 · 0 comments

Comments

@gogonzo
Copy link
Contributor

gogonzo commented Sep 28, 2021

Problem description

Design of data_extract/data_merge is as follows. Consider columns taken from datasets, filtered separately and merged all together.
image

Above can be express in the following code

library(scda)
library(dplyr)
ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte

ANL1 <- ADTTE %>% filter(PARAMCD == "CRSD") %>% select(STUDYID, USUBJID, AVAL)
ANL2 <- ADSL %>% select(STUDYID, USUBJID, AGE)
ANL <- left_join(ANL1, ANL2, by = c("STUDYID", "USUBJID"))

# STUDYID USUBJID                 AVAL   AGE
#      <chr>   <chr>              <dbl> <int>
# 1 AB12345 AB12345-BRA-1-id-105  1274.     38
# 2 AB12345 AB12345-BRA-1-id-134   545.     47
# 3 AB12345 AB12345-BRA-1-id-141    77.2    35
# 4 AB12345 AB12345-BRA-1-id-236   336.     32
# 5 AB12345 AB12345-BRA-1-id-265   270.     25

Merge module supports multiple scenarios and keys management is the "key" here. Consider following scenario where we have a two selectors to compare specific AVAL level from the same dataset across different factor values - for example AVAL increase between two visits

ADLB <- synthetic_cdisc_data("latest")$adlb

keys <- c("STUDYID", "USUBJID", "PARAMCD", "AVISITN")

ANL1 <- ADLB %>% 
 filter(PARAMCD == "CRP" & AVISITN == "-1") %>% 
 select(keys, -PARAMCD, -AVISITN, AVAL1 = AVAL)
 
ANL2 <- ADLB %>% 
 filter(PARAMCD == "CRP" & AVISITN == "0") %>% 
 select(keys, -PARAMCD, -AVISITN, AVAL2 = AVAL)

ANL <- left_join(ANL1, AN2, by = c("STUDYID", "PARAMCD"))

In above example, PARAMCD and AVISITN in both selectors have just one value which means that it's no longer the key for these ANLs as they were filtered-out (does not distinguish the rows). The reason to drop these keys is that it wouldn't be possible if by = c("USUBJID", "STUDYID", "PARAMCD", "AVISITN") because they have a different values and it's not possible to merge them together.

For scda data we have no problem with this as we have a static data and keys are relevant for our test data. Problems starts when SMEs create their own apps which are using real data. The keys used to be different than default teal keys, but the keys follow GDSR requirements. For example ADLB keys are STUDYID USUBJID PARAMCD BASETYPE AVISITN ATPTN DTYPE ADTM LBSEQ ASPID vs teal STUDYID USUBJID PARAMCD AVISIT. SMEs require the change in the keys to fit GDSR standards which complicates the design little bit.

If we go back to the last example which is using ADLB. Extending set of keys will result in joining by keys which have not been filtered-out. This means that ANLs will be joined by "STUDYID", "USUBJID", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID" if we put only PARAMCD and AVISITN as a filter_spec.

ADLB <- synthetic_cdisc_data("latest")$adlb
keys <- c("STUDYID", "USUBJID", "PARAMCD", "AVISITN", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID")

ANL1 <- ADLB %>% 
  filter(PARAMCD == "CRP" & AVISITN == "-1") %>% 
  select(keys, -PARAMCD, -AVISITN, AVAL1 = AVAL)

ANL2 <- ADLB %>% 
  filter(PARAMCD == "CRP" & AVISITN == "0") %>% 
  select(keys, -PARAMCD, -AVISITN, AVAL2 = AVAL)

ANL <- inner_join(ANL1, ANL2, by =  c("STUDYID", "USUBJID", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID"))

Changing the keys only will result in:

  1. Current apps which are created with PARAMCD and/or AVISIT filter_spec might fail because output ANL will be merged on a columns which have a different values across the subjects. One can run above code and see that the result of inner_join is empty data.frame. Also our problem is that GDSR keys are not the same what we think they are. Some of the keys in the datasets does not make any row unique adding them - which means that for us they are just values. For example AVISITN is a number of a patient visit while ADTM can be a data of the visit. This means that adding ADTM to the keys does not make any row unique, because STUDYID, USUBJID, PARAMCD and AVISITN make them unique already.
ADLB %>%
  group_by(STUDYID, USUBJID, PARAMCD, AVISIT) %>%
  summarize(n = n()) %>%
  .$n

Above is not the case of our sample datasets only, but might occur also in real data see the issue.

  1. In case of merging selectors from the same dataset we need to add filter_spec for all of them with drop_keys = TRUE to avoid cartesian join.

Because keys/data_merge have been problematic since introduction, we need to simplify the way to handle them before reaching wider users base.

Proposition

Change the keys in teal according to GDSR and implement merge module to visualize the join keys.
See a [good example]](https://help.tableau.com/current/pro/desktop/en-us/joining_tables.htm)
image

Otherwise if we want to keep tea.devel as is we need to fix the documentation to visualize multiple scenarios like:

  • multiple single datasets
  • reshaping
  • mixing datasets
  • drop_keys = TRUE consequences
    Current documentation in my opinion is not sufficient.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant