Develop a function to generate --SEQ. #15

Closed
houtel opened this issue Oct 30, 2023 · 19 comments · Fixed by #53
Assignees
Labels
enhancement New feature or request

Comments

@houtel
Collaborator

houtel commented Oct 30, 2023

Feature Idea

Purpose
Generate --SEQ when the set of columns which define the natural key for a domain and an initial value for each USUBJID (defaulted to 1) are provided by parameter.

Functionality
The feature should generate --SEQ for a given domain using the following algorithm. Note that this algorithm assumes that any split domains are combined into a single data frame prior to generating the --SEQ column.

  1. Sort the domain by the provided natural key. If the keys do not identify distinct rows, produce a warning.
  2. Set --SEQ to the initial value (provided by parameter) for the first record having a given USUBJID. The column name for --SEQ is determined by concatenating the value of the DOMAIN column with “SEQ” (e.g. VSSEQ when DOMAIN = “VS”).
  3. Increment --SEQ by 1 for each successive record for the given USUBJID.
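
The three steps above can be sketched in base R roughly as follows. This is a minimal illustration, not the package implementation: the function name `generate_seq_sketch()` and its arguments are assumptions made for this example, and the input is assumed to carry `DOMAIN` and `USUBJID` columns.

```r
# Hypothetical helper; name and signature are assumptions for illustration.
generate_seq_sketch <- function(dat, key_vars, init_val = 1L) {
  stopifnot(
    all(key_vars %in% names(dat)),
    all(c("DOMAIN", "USUBJID") %in% names(dat))
  )

  # Step 1: sort by the natural key; warn if the key is not unique.
  dat <- dat[do.call(order, unname(as.list(dat[key_vars]))), , drop = FALSE]
  if (anyDuplicated(dat[key_vars]) > 0L) {
    warning("Key variables do not identify distinct rows.")
  }

  # Step 2: the column name is the DOMAIN value followed by "SEQ".
  seq_var <- paste0(dat$DOMAIN[1], "SEQ")

  # Steps 2-3: start at init_val and add 1 per record within each USUBJID.
  within_subject <- ave(seq_len(nrow(dat)), dat$USUBJID, FUN = seq_along)
  dat[[seq_var]] <- within_subject + as.integer(init_val) - 1L

  rownames(dat) <- NULL
  dat
}

ds <- data.frame(
  DOMAIN = "DS",
  USUBJID = c("B", "A", "A"),
  DSTERM = c("X", "Z", "Y"),
  stringsAsFactors = FALSE
)
out <- generate_seq_sketch(ds, key_vars = c("USUBJID", "DSTERM"))
print(out)
```

Here the sequence restarts at `init_val` for each subject because `ave()` numbers the rows within each `USUBJID` group after sorting.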

Relevant Input

Data frame containing all domain columns except for the --SEQ column and a vector containing the natural key columns for the domain.

Relevant Output

Data frame containing all domain columns including the --SEQ column.

Reproducible Example/Pseudo Code

generate_seq(tar_dat, tar_var = "xxSEQ", key_vars = c("USUBJID", "XXCAT", "XXSCAT", "XXTERM"), init_val = 1)
Example:
generate_seq(ds, "dsseq", key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))

NOTE: The CRAN package {sdtmval} contains an assign_SEQ function, which could be used for this purpose unless we want to minimize dependencies.

@houtel houtel added the enhancement New feature or request label Oct 30, 2023
@houtel houtel moved this to In Progress in sdtm.oak R package Oct 30, 2023
@houtel houtel changed the title from "Develop a function to generation --SEQ." to "Develop a function to generate --SEQ." Oct 30, 2023
@venkatamaguluri

Comments from Venkata Maguluri (Pfizer)

  1. The starting value need not always be 1, so we give the end user control over it.
  2. Assume the data has been sorted by the key elements before this function is called.
  3. Ensure that the sort order does not change during SEQ assignment.

@edgar-manukyan
Collaborator

In roak we warn the user if the keys do not identify distinct rows. Shall we build such functionality here as well? We can help them identify those duplicate rows.
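
One way to surface the offending rows, sketched in base R. The helper name `find_duplicate_keys()` is hypothetical and is not taken from roak; it simply returns every row whose key combination occurs more than once.

```r
# Hypothetical helper: return all rows that share a key with another row.
find_duplicate_keys <- function(dat, key_vars) {
  keys <- dat[key_vars]
  # Flag both the first occurrence and the later repeats of each key.
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  dat[is_dup, , drop = FALSE]
}

vs <- data.frame(
  USUBJID = c("S1", "S1", "S1", "S2"),
  VSTESTCD = c("TEMP", "DIABP", "TEMP", "TEMP"),
  VSORRES = c("36.6", "80", "36.9", "37.0"),
  stringsAsFactors = FALSE
)
dups <- find_duplicate_keys(vs, c("USUBJID", "VSTESTCD"))
print(dups)
```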

@edgar-manukyan
Collaborator

We can try to determine whether the data is already sorted.
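
A possible check, sketched in base R. The helper name `is_sorted_by()` is an assumption for this example. It relies on `order()` leaving tied rows in their original order, so an already sorted data frame yields the identity permutation.

```r
# Hypothetical helper: TRUE if `dat` is already ordered by `key_vars`.
is_sorted_by <- function(dat, key_vars) {
  ord <- do.call(order, unname(as.list(dat[key_vars])))
  identical(ord, seq_len(nrow(dat)))
}

sorted_ds <- data.frame(
  USUBJID = c("A", "A", "B"),
  DSTERM = c("X", "Y", "X"),
  stringsAsFactors = FALSE
)
unsorted_ds <- sorted_ds[c(2, 1, 3), ]

print(is_sorted_by(sorted_ds, c("USUBJID", "DSTERM")))
print(is_sorted_by(unsorted_ds, c("USUBJID", "DSTERM")))
```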

@edgar-manukyan
Collaborator

edgar-manukyan commented Nov 1, 2023

As a separate step, in roak we convert all empty string values "" into NA_character_ in all raw datasets before any of the functions start manipulating the data.
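
A minimal base R sketch of that kind of preprocessing step. The helper name `blank_to_na()` is hypothetical and this is not the roak code; it only touches character columns.

```r
# Hypothetical helper: replace "" with NA_character_ in character columns.
blank_to_na <- function(dat) {
  is_chr <- vapply(dat, is.character, logical(1))
  for (nm in names(dat)[is_chr]) {
    x <- dat[[nm]]
    x[!is.na(x) & x == ""] <- NA_character_
    dat[[nm]] <- x
  }
  dat
}

raw <- data.frame(
  DSTERM = c("", "COMPLETED", NA),
  VISITNUM = c(1, 2, 3),
  stringsAsFactors = FALSE
)
clean <- blank_to_na(raw)
print(clean)
```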

@ynsec37

ynsec37 commented Dec 8, 2023

Dear developer,

It would be nice if the function worked the tidyeval way, so that quotation marks are not needed for data frame variables, e.g. generate_seq(ds_temp, key_vars = c(USUBJID, DSCAT, DSSCAT, DSTERM)). There is also a similar function in admiral: admiral::derive_var_obs_number.
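
To illustrate the unquoted style, here is a base R approximation that uses `substitute()` rather than tidyeval itself. The helper name `capture_key_vars()` is hypothetical; a real tidyeval interface would more likely build on rlang.

```r
# Hypothetical helper: capture unquoted column names without evaluating them.
capture_key_vars <- function(key_vars) {
  # substitute() returns the unevaluated call, e.g. c(USUBJID, DSCAT);
  # all.vars() extracts the symbol names, skipping the function name `c`.
  all.vars(substitute(key_vars))
}

keys <- capture_key_vars(c(USUBJID, DSCAT, DSSCAT, DSTERM))
print(keys)
```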

@ramiromagno
Collaborator

Hi Adam (@galachad):

I see that this issue is assigned to you, but it hasn't seen any activity for a long while. Are you still working on this?

@rammprasad
Collaborator

Reference function in {roak} - https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R

@ramiromagno ramiromagno self-assigned this Apr 10, 2024
@ramiromagno
Collaborator

Hi @rammprasad and @edgar-manukyan:

Could you provide a few input and output data set examples for me to test my implementation?

@ramiromagno ramiromagno linked a pull request May 15, 2024 that will close this issue
14 tasks
@edgar-manukyan
Collaborator

edgar-manukyan commented May 15, 2024

@ramiromagno, please find the domain_key_variables.csv, which determines how the domain needs to be sorted (ideally into unique rows) before the --SEQ variable is derived.

Test 1

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID, ~VSTESTCD,             ~VSDTC, ~VSTPTNUM,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",   "DIABP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",    "TEMP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",   "DIABP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",    "TEMP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",   "DIABP",       "2020-09-20",        NA,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",    "TEMP",       "2020-09-20",        NA
  )
  result <- oak_derive_seq(ds_in)

  expect_equal(result$VSSEQ,
               c(1L, 2L, 3L, 4L, 1L, 2L))

Test 2

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID,
    "ABC123",    "ZZ",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",
  )

  expect_error(
    oak_derive_seq(ds_in),
    paste(
      "ZZ domain keys must be in the domain_key_variables.csv",
      "Please update the file and use oak_load_study_config().",
      sep = "\n"
    )
  )

Test 3

  ds_in <- tibble::tribble(
    ~STUDYID,      ~RSUBJID,    ~SCTESTCD, ~DOMAIN,     ~SREL,           ~SCCAT,
    "ABC123",  "ABC123-210",   "LVSBJIND",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",   "EDULEVEL",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",     "TMSPPT",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",    "CAREDUR",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",   "LVSBJIND",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-212",    "JOBCLAS",  "APSC",  "SPOUSE", "CAREGIVERSTUDY"
  )

  result <- oak_derive_seq(ds_in)

  expect_equal(result$SCSEQ,
               c(1L, 2L, 3L, 1L, 2L, 1L))

@ramiromagno
Collaborator

Thanks @edgar-manukyan!

In test 3, ds_in does not contain all key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

@edgar-manukyan
Collaborator

edgar-manukyan commented May 15, 2024

Thanks @edgar-manukyan!

In test 3, ds_in does not contain all key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

Awesome observation, @ramiromagno. This is testing a so-called associated persons domain, as I see in roak: https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L42

@ramiromagno
Collaborator

I see, sorry for the oversight!

BTW: Just one more question: is the domain_key_variables.csv comprehensive?

@ramiromagno
Collaborator

I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

@edgar-manukyan
Collaborator

I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

Interestingly roak just ignores them and you should ask Ram about this :) https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L80

@ramiromagno
Collaborator

I see. Could it be that not all keys are mandatory? There might be a few that are optional, and in that case it could be fine to sort only with what is available...? @rammprasad help please! :)
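
If sorting on only the available keys turns out to be acceptable, the idea could be sketched like this in base R. The helper name `available_keys()` and the key list in the example are assumptions for illustration; they are not taken from roak or from domain_key_variables.csv.

```r
# Hypothetical helper: keep only the configured keys present in the data.
available_keys <- function(dat, key_vars) {
  missing_keys <- setdiff(key_vars, names(dat))
  if (length(missing_keys) > 0L) {
    message(
      "Ignoring keys not found in the data: ",
      paste(missing_keys, collapse = ", ")
    )
  }
  # intersect() preserves the order given in key_vars.
  intersect(key_vars, names(dat))
}

ap <- data.frame(
  STUDYID = "ABC123",
  RSUBJID = c("ABC123-210", "ABC123-211"),
  SCTESTCD = c("EDULEVEL", "CAREDUR"),
  stringsAsFactors = FALSE
)
# Illustrative key list only.
used <- available_keys(ap, c("STUDYID", "APID", "RSUBJID", "SCTESTCD"))
print(used)
```

Emitting a message for the skipped keys keeps the behaviour visible to the user instead of silently ignoring them.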

@edgar-manukyan
Collaborator

edgar-manukyan commented May 15, 2024

I see, sorry for the oversight!

BTW: Just one more question: is the domain_key_variables.csv comprehensive?

No worries, you are picking up SDTM concepts so quickly. After three years, I still feel dizzy about it. The attached file was used for the tests. This one, domain_key_variables (2).csv, is more comprehensive, though as Ram said it is dynamic and study teams will change it based on their setup. That's the reason why they call it a configuration file.

@ramiromagno
Collaborator

Thank you @edgar-manukyan, that really helps! You're the best. I thought that the set of variables used for sorting were the actual keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

@armenic

armenic commented May 15, 2024

Thank you @edgar-manukyan, that really helps! You're the best. I thought that the set of variables used for sorting were the actual keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

They are supposed to be keys that uniquely identify the rows, and we even warn users if we notice that they don't.

@ramiromagno
Collaborator

Thanks @edgar-manukyan. I've updated the PR according to your feedback so far. But we will have to wait for @rammprasad's feedback on these other corner cases.

@ramiromagno ramiromagno moved this from In Progress to In review in sdtm.oak R package May 29, 2024
@ramiromagno ramiromagno moved this from In review to Done in sdtm.oak R package Jun 17, 2024

9 participants