Allow to customize merge keys #29

gogonzo · 2021-09-28T10:21:44Z

Problem description

Design of data_extract/data_merge is as follows. Consider columns taken from datasets, filtered separately and merged all together.

Above can be express in the following code

library(scda)
library(dplyr)
ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte

ANL1 <- ADTTE %>% filter(PARAMCD == "CRSD") %>% select(STUDYID, USUBJID, AVAL)
ANL2 <- ADSL %>% select(STUDYID, USUBJID, AGE)
ANL <- left_join(ANL1, ANL2, by = c("STUDYID", "USUBJID"))

# STUDYID USUBJID                 AVAL   AGE
#      <chr>   <chr>              <dbl> <int>
# 1 AB12345 AB12345-BRA-1-id-105  1274.     38
# 2 AB12345 AB12345-BRA-1-id-134   545.     47
# 3 AB12345 AB12345-BRA-1-id-141    77.2    35
# 4 AB12345 AB12345-BRA-1-id-236   336.     32
# 5 AB12345 AB12345-BRA-1-id-265   270.     25

Merge module supports multiple scenarios and keys management is the "key" here. Consider following scenario where we have a two selectors to compare specific AVAL level from the same dataset across different factor values - for example AVAL increase between two visits

ADLB <- synthetic_cdisc_data("latest")$adlb

keys <- c("STUDYID", "USUBJID", "PARAMCD", "AVISITN")

ANL1 <- ADLB %>% 
 filter(PARAMCD == "CRP" & AVISITN == "-1") %>% 
 select(keys, -PARAMCD, -AVISITN, AVAL1 = AVAL)
 
ANL2 <- ADLB %>% 
 filter(PARAMCD == "CRP" & AVISITN == "0") %>% 
 select(keys, -PARAMCD, -AVISITN, AVAL2 = AVAL)

ANL <- left_join(ANL1, AN2, by = c("STUDYID", "PARAMCD"))

In above example, PARAMCD and AVISITN in both selectors have just one value which means that it's no longer the key for these ANLs as they were filtered-out (does not distinguish the rows). The reason to drop these keys is that it wouldn't be possible if by = c("USUBJID", "STUDYID", "PARAMCD", "AVISITN") because they have a different values and it's not possible to merge them together.

For scda data we have no problem with this as we have a static data and keys are relevant for our test data. Problems starts when SMEs create their own apps which are using real data. The keys used to be different than default teal keys, but the keys follow GDSR requirements. For example ADLB keys are STUDYID USUBJID PARAMCD BASETYPE AVISITN ATPTN DTYPE ADTM LBSEQ ASPID vs teal STUDYID USUBJID PARAMCD AVISIT. SMEs require the change in the keys to fit GDSR standards which complicates the design little bit.

If we go back to the last example which is using ADLB. Extending set of keys will result in joining by keys which have not been filtered-out. This means that ANLs will be joined by "STUDYID", "USUBJID", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID" if we put only PARAMCD and AVISITN as a filter_spec.

ADLB <- synthetic_cdisc_data("latest")$adlb
keys <- c("STUDYID", "USUBJID", "PARAMCD", "AVISITN", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID")

ANL1 <- ADLB %>% 
  filter(PARAMCD == "CRP" & AVISITN == "-1") %>% 
  select(keys, -PARAMCD, -AVISITN, AVAL1 = AVAL)

ANL2 <- ADLB %>% 
  filter(PARAMCD == "CRP" & AVISITN == "0") %>% 
  select(keys, -PARAMCD, -AVISITN, AVAL2 = AVAL)

ANL <- inner_join(ANL1, ANL2, by =  c("STUDYID", "USUBJID", "ATPTN", "DTYPE", "ADTM", "LBSEQ", "ASPID"))

Changing the keys only will result in:

Current apps which are created with PARAMCD and/or AVISIT filter_spec might fail because output ANL will be merged on a columns which have a different values across the subjects. One can run above code and see that the result of inner_join is empty data.frame. Also our problem is that GDSR keys are not the same what we think they are. Some of the keys in the datasets does not make any row unique adding them - which means that for us they are just values. For example AVISITN is a number of a patient visit while ADTM can be a data of the visit. This means that adding ADTM to the keys does not make any row unique, because STUDYID, USUBJID, PARAMCD and AVISITN make them unique already.

ADLB %>%
  group_by(STUDYID, USUBJID, PARAMCD, AVISIT) %>%
  summarize(n = n()) %>%
  .$n

Above is not the case of our sample datasets only, but might occur also in real data see the issue.

In case of merging selectors from the same dataset we need to add filter_spec for all of them with drop_keys = TRUE to avoid cartesian join.

Because keys/data_merge have been problematic since introduction, we need to simplify the way to handle them before reaching wider users base.

Proposition

Change the keys in teal according to GDSR and implement merge module to visualize the join keys.
See a [good example]](https://help.tableau.com/current/pro/desktop/en-us/joining_tables.htm)

Otherwise if we want to keep tea.devel as is we need to fix the documentation to visualize multiple scenarios like:

multiple single datasets
reshaping
mixing datasets
drop_keys = TRUE consequences
Current documentation in my opinion is not sufficient.

The text was updated successfully, but these errors were encountered:

gogonzo added the discussion label Sep 28, 2021

gogonzo mentioned this issue Sep 28, 2021

cdisc_datasets primary keys updated insightsengineering/teal#256

Closed

gogonzo transferred this issue from another repository Apr 8, 2022

gogonzo mentioned this issue Jun 2, 2022

Design data extract and data merge insightsengineering/NEST-roadmap#36

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to customize merge keys #29

Allow to customize merge keys #29

gogonzo commented Sep 28, 2021 •

edited

Loading

Allow to customize merge keys #29

Allow to customize merge keys #29

Comments

gogonzo commented Sep 28, 2021 • edited Loading

Problem description

Proposition

gogonzo commented Sep 28, 2021 •

edited

Loading