Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider migration to dm #8

Open
Tracked by #32
gogonzo opened this issue Feb 14, 2022 · 4 comments
Open
Tracked by #32

Consider migration to dm #8

gogonzo opened this issue Feb 14, 2022 · 4 comments

Comments

@gogonzo
Copy link
Contributor

gogonzo commented Feb 14, 2022

Blocked by insightsengineering/teal#628 - we need more insights about how chevron and teal fits together.

  • Research CDISCData, CDISCFilteredData to store dm object to manage relationships between data.frames object.
  • Research how MAE can be stored
  • Can dm be primary data object passed to the teal_module?

Some initial research

  1. Easy initialization of the relational data
library(dm)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |>
  dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal::get_cdisc_keys("ADSL"), ref_table = ADSL)
  1. Get keys
# get primary and foreign keys
dm |> dm_get_all_pks()
dm |> dm_get_all_fks()
  1. Check metadata
    dm throws an error if the primary keys are duplicated
# check if keys are unique
dm |> dm_examine_constraints()
dm(ADSL, ADTTE, ADRS, IRIS) |>
  dm_add_pk(ADSL, teal::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal::get_cdisc_keys("ADSL")) |>
  dm_examine_constraints() 
  1. dm holds data in list, so the data is copied by reference
# same objects
identical(dm[["ADSL"]], ADSL) 
  1. Filters can be added to metadata with the dm_filter
# apply filter
dm |>
  dm_filter(ADSL, SEX == "M") |>
  dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
  dm_filter(ADRS, AVISIT %in% c("BASELINE", "SCREENING", "CYCLE2", "DAY1"))
# filter and select
dm |>
  dm_filter(ADSL, SEX == "M") |>
  dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
  dm_apply_filters() |>
  dm_select(ADSL, USUBJID, STUDYID, AGE) |>
  dm_flatten_to_tbl(ADRS) |>
  nrow()
  1. Joining is performed automatically by dm
# join tables
joined <- dm |>
  dm_filter(ADSL, SEX == "M") |>
    dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
    dm_apply_filters() |>
    dm_select(ADSL, USUBJID, STUDYID, AGE) |>
    dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
    dm_join_to_tbl(ADSL, ADRS, join = left_join)
  1. dm object is mutable
library(dm)
library(scda)


ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))



dm <- dm(ADSL, ADTTE, ADRS, IRIS) 
dm2 <- dm |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)

lobstr::obj_addr(dm)
lobstr::obj_addr(dm2)
  1. dm operations preserve attributes
dm_filtered <- dm |>
  dm_filter(ADSL, SEX == "M") |>
  dm_filter(ADRS, PARAMCD %in% c("BESRSPI", "INVET")) |>
  dm_apply_filters() |>
  dm_select(ADSL, USUBJID, STUDYID, AGE) |>
  dm_select(ADRS, USUBJID, STUDYID, PARAMCD, AVISIT, AGE, SEX, AVAL) |>
  dm_select_tbl(ADSL, ADRS) |>
  dm_flatten_to_tbl(ADRS, ADSL)


attr(joined, "label")
formatters::var_labels(joined)

Issues:

  • dm renames duplicated the columns automatically so we might have a problem with difference between input column names and output column names. input$select_from_adsl and input$select_from_adrs won't be found in colnames(data)

  • won't support same dataset selectors by default. Following will not possible
    Not needed since we are going to fix data_merge Design data extract and data merge NEST-roadmap#36

    ANL1 <- ADRS %>% filter(PARAMCD == "PARAM1") %>% select(<keys without PARAMCD>, PARAM1 = AVAL) 
    ANL2 <- ADRS %>% filter(PARAMCD == "PARAM2") %>% select(<keys without, PARAMCD>, PARAM2 = AVAL)
    ANL <- merge(ANL1, ANL2, by = c(<keys without PARAMCD>))

    Maybe we should reconsider this problematic option and just to do this:

    ANL <- ADRS %>% filter(PARAMCD %in% c("PARAM1", "PARAM2")_) %>% select(<keys>, AVAL)
@kpagacz
Copy link
Contributor

kpagacz commented May 19, 2022

Instead of replacing what we currently have with dm let's think about how we can support dm in the filter panel. I recommend first looking into how to make filter panel work with dm and then go to the data models. It goes to what I think should be the API of a teal module (so raw data passed independently of the filter state with a possibility of the filtering state being applied to the data inside the module).

@gogonzo
Copy link
Contributor Author

gogonzo commented May 19, 2022

@kpagacz Do you think about having FilteredData.dm? Could you elaborate more how would you expect this to look like?

@gogonzo
Copy link
Contributor Author

gogonzo commented May 25, 2022

Having dm in the filter panel could be a obstacle to have independent datasets reactivity. dm is one object so every change in the object (filter in any dataset) will trigger reactivity of the whole object, as we can't separate each dataset from the reactivity map.
See the example - change the filter in IRIS and ADSL-related output will wait for computation. Currently in the filter panel only child/parents datasets reactively-depended, other are unconnected.

library(shiny)
library(dm)
library(scda)


ADSL <- synthetic_cdisc_data("latest")$adsl
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADRS <- synthetic_cdisc_data("latest")$adrs
IRIS <- iris
IRIS$id <- seq_len(nrow(IRIS))

dm <- dm(ADSL, ADTTE, ADRS, IRIS) |>
  dm_add_pk(ADSL, teal.data::get_cdisc_keys("ADSL")) |>
  dm_add_pk(ADTTE, teal.data::get_cdisc_keys("ADTTE")) |>
  dm_add_pk(ADRS, teal.data::get_cdisc_keys("ADRS")) |>
  dm_add_pk(IRIS, "id") |>
  dm_add_fk(table = ADTTE, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL) |>
  dm_add_fk(table = ADRS, columns = teal.data::get_cdisc_keys("ADSL"), ref_table = ADSL)


shinyApp(
  ui = fluidPage(
    sidebarPanel(
      selectInput("Species", "Select species", choices = unique(iris$Species), selected = "Setosa"),
      verbatimTextOutput("ADSL_rows"),
      verbatimTextOutput("iris_rows")
    )
  ),
  server = function(input, output, session) {
    filtered_data <- eventReactive(input$Species, {
      dm_filter(dm, IRIS, Species %in% input$Species)
    })

    filtered_adsl <- reactive(filtered_data()[["ADSL"]])
    filtered_iris <- reactive(filtered_data()[["IRIS"]])

    output$ADSL_rows <- renderText({
      print("ADSL triggered")
      Sys.sleep(2)
      nrow(filtered_adsl())
    })

    output$iris_rows <- renderText({
      print("IRIS triggered")
      Sys.sleep(2)
      nrow(filtered_iris())
    })
  }
)

@gogonzo
Copy link
Contributor Author

gogonzo commented May 25, 2022

I have an opinion that dm can fix some of our problems but will introduce other:

  1. FilterPanel - We can't build any reactivity between datasets which will trigger re-computation of all reactives regardless of datasets they are using. Specifically, in our case it means that each module within app will be affected by the change in any dataset. We might consider though usage of the same datasets in all modules, which is not a good idea as users requested filter-panel display being module specific.
    UPDATE: Above is not necessary true - bindCache can actually handle if particular datasets changed.

  2. If we allow "choices_selected" from multiple datasets, then dm wouldn't help us so much with the merge - when one selects duplicated variable names from different datasets then dm renames them automatically, making them hard to track in the merged dataset (unless our merge_datasets module will provide this information about renamed variables)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants