Develop a function to generate --SEQ. #15

houtel · 2023-10-30T01:09:36Z

Feature Idea

Purpose
Generate --SEQ when the set of columns which define the natural key for a domain and an initial value for each USUBJID (defaulted to 1) are provided by parameter.

Functionality
The feature should generate --SEQ for a given domain using the following algorithm. Note that this algorithm assumes that any split domains are combined into a single data frame prior to generating the --SEQ column.

Sort the domain by the provided natural key. If the keys do not identify distinct rows, produce a warning.
Set --SEQ to the initial value (provided by parameter) for the first record having a given USUBJID. The column name for --SEQ is determined by concatenating the value of the DOMAIN column with “SEQ” (e.g. VSSEQ when DOMAIN = “VS”).
Increment --SEQ by 1 for each successive record for the given USUBJID.
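
A minimal base R sketch of the three steps above, for discussion only. The name generate_seq() and its arguments follow the pseudo code in this issue and are hypothetical; this is not an existing sdtm.oak function, and it assumes the data frame holds a single DOMAIN value.

```r
# Hypothetical sketch of the proposed algorithm; not an existing sdtm.oak API.
generate_seq <- function(tar_dat, key_vars, init_val = 1L) {
  stopifnot(
    is.data.frame(tar_dat),
    all(c("DOMAIN", "USUBJID", key_vars) %in% names(tar_dat))
  )

  # Step 1: warn if the natural key does not identify distinct rows, then sort.
  if (anyDuplicated(tar_dat[key_vars]) > 0L) {
    warning("The natural key does not identify distinct rows.", call. = FALSE)
  }
  ord <- do.call(order, unname(as.list(tar_dat[key_vars])))
  out <- tar_dat[ord, , drop = FALSE]
  rownames(out) <- NULL

  # The column name is the DOMAIN value followed by "SEQ" (assumes one domain).
  seq_var <- paste0(unique(out$DOMAIN), "SEQ")
  stopifnot(length(seq_var) == 1L)

  # Steps 2 and 3: start at init_val within each USUBJID, then increment by 1.
  out[[seq_var]] <- as.integer(init_val) - 1L +
    ave(seq_len(nrow(out)), out$USUBJID, FUN = seq_along)
  out
}

ds <- data.frame(
  DOMAIN  = "DS",
  USUBJID = c("S-02", "S-01", "S-01"),
  DSTERM  = c("COMPLETED", "RANDOMIZED", "COMPLETED"),
  stringsAsFactors = FALSE
)
res <- generate_seq(ds, key_vars = c("USUBJID", "DSTERM"))
res$DSSEQ  # 1 2 1
```

Sorting inside the function keeps the sketch self-contained; whether the real function should sort or only assume sorted input is still open in this thread.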

Relevant Input

Data frame containing all domain columns except for the --SEQ column and a vector containing the natural key columns for the domain.

Relevant Output

Data frame containing all domain columns including the --SEQ column.

Reproducible Example/Pseudo Code

generate_seq(tar_dat, tar_var = "xxSEQ", key_vars = c("USUBJID", "XXCAT", "XXSCAT", "XXTERM"), init_val = 1)
Example:
generate_seq(ds, tar_var = "DSSEQ", key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))

NOTE: CRAN package {sdtmval} contains assign_SEQ function which could be utilized for this purpose unless we want to minimize the dependency.

venkatamaguluri · 2023-10-30T18:30:56Z

Comments from Venkata Maguluri (Pfizer)

The starting value will not necessarily be 1 every time, so we give the end user control over it.
Assume the data has been sorted by the key elements before this function is called.
Ensure that the sort order does not change during SEQ assignment.

edgar-manukyan · 2023-11-01T16:14:46Z

In roak we warn the user if the keys do not identify distinct rows. Shall we build such functionality here as well? We could also help users identify the duplicate rows.
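
One way to surface the duplicate rows could look like the base R sketch below. The helper name find_key_duplicates() is hypothetical and is not taken from roak or sdtm.oak; it returns every row whose key is shared with at least one other row.

```r
# Hypothetical helper; not part of roak or sdtm.oak.
find_key_duplicates <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), all(key_vars %in% names(dat)))
  key <- dat[key_vars]
  # Flag every member of a duplicated group, not only the second and later rows.
  is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  if (any(is_dup)) {
    warning(sum(is_dup), " rows share a key with at least one other row.",
            call. = FALSE)
  }
  dat[is_dup, , drop = FALSE]
}

vs <- data.frame(
  USUBJID  = c("A", "A", "B"),
  VSTESTCD = c("TEMP", "TEMP", "TEMP"),
  VSORRES  = c("36.5", "36.9", "37.0"),
  stringsAsFactors = FALSE
)
dups <- suppressWarnings(find_key_duplicates(vs, c("USUBJID", "VSTESTCD")))
```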

edgar-manukyan · 2023-11-01T16:16:35Z

We can try to determine whether the data is already sorted.
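
A possible check, sketched in base R under the assumption that "sorted" means sorted ascending by the key columns with NA values last. The helper name is_sorted_by() is hypothetical.

```r
# Hypothetical helper; checks whether `dat` is already sorted by `key_vars`.
is_sorted_by <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), all(key_vars %in% names(dat)))
  # order() leaves unresolved ties in their original order, so data that is
  # already sorted yields the identity permutation.
  ord <- do.call(order, unname(as.list(dat[key_vars])))
  identical(ord, seq_len(nrow(dat)))
}

sorted_df <- data.frame(
  USUBJID  = c("A", "A", "B"),
  VSTESTCD = c("DIABP", "TEMP", "DIABP"),
  stringsAsFactors = FALSE
)
unsorted_df <- sorted_df[c(2, 1, 3), ]
```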

edgar-manukyan · 2023-11-01T16:23:37Z

As a separate step, in roak we convert all empty string values "" into NA_character_ in all raw datasets before any of the functions start manipulating the data.
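
For illustration, that pre-processing step could be sketched in base R as follows. The function name blank_to_na() is hypothetical and this is not the roak implementation; it only touches character columns.

```r
# Hypothetical helper; converts "" to NA_character_ in character columns only.
blank_to_na <- function(dat) {
  stopifnot(is.data.frame(dat))
  for (nm in names(dat)) {
    col <- dat[[nm]]
    if (is.character(col)) {
      # %in% never returns NA, so existing NA values are left untouched.
      col[col %in% ""] <- NA_character_
      dat[[nm]] <- col
    }
  }
  dat
}

raw <- data.frame(
  USUBJID = c("S-01", ""),
  AGE     = c(34, NA),
  stringsAsFactors = FALSE
)
clean <- blank_to_na(raw)
```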

ynsec37 · 2023-12-08T10:08:59Z

Dear developer,

Could the function work the tidyeval way, so that quotation marks are not needed for data frame variables, e.g. generate_seq(ds_temp, key_vars = c(USUBJID, DSCAT, DSSCAT, DSTERM))? There is also a similar function in admiral: admiral::derive_var_obs_number.
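
To show what an unquoted interface involves, here is a small sketch that uses base R non-standard evaluation rather than tidyeval proper (a tidyeval version would rely on rlang instead). The helper name key_names() is hypothetical.

```r
# Hypothetical helper; turns an unquoted c(...) of column names into strings.
key_names <- function(key_vars) {
  # substitute() captures the expression c(USUBJID, DSCAT, ...) unevaluated;
  # all.vars() then extracts the variable names it mentions (not `c` itself).
  all.vars(substitute(key_vars))
}

keys <- key_names(c(USUBJID, DSCAT, DSSCAT, DSTERM))
```

The resulting character vector could then be passed to a function that expects quoted key names.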

ramiromagno · 2024-04-06T00:06:12Z

Hi Adam (@galachad):

I see that this issue is assigned to you, but it hasn't seen any activity for a long while. Are you still working on this?

rammprasad · 2024-04-10T16:08:25Z

Reference function in {roak} - https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R

ramiromagno · 2024-05-15T00:49:59Z

Hi @rammprasad and @edgar-manukyan:

Could you provide a few input and output data set examples for me to test my implementation?

edgar-manukyan · 2024-05-15T18:03:28Z

@ramiromagno, please find the domain_key_variables.csv, which determines how the domain needs to be sorted (ideally into unique rows) before the --SEQ variable is derived.

Test 1

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID, ~VSTESTCD,             ~VSDTC, ~VSTPTNUM,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",   "DIABP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",    "TEMP", "2020-09-01T13:31",        NA,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",   "DIABP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-375", "/F:VTLS2-D:9795533-R:2",    "TEMP", "2020-09-28T11:00",         2,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",   "DIABP",       "2020-09-20",        NA,
    "ABC123",    "VS",  "ABC123-376", "/F:VTLS1-D:9795591-R:1",    "TEMP",       "2020-09-20",        NA
  )
  result <- oak_derive_seq(ds_in)

  expect_equal(result$VSSEQ,
               c(1L, 2L, 3L, 4L, 1L, 2L))

Test 2

  ds_in <- tibble::tribble(
    ~STUDYID, ~DOMAIN,      ~USUBJID,                  ~VSSPID,
    "ABC123",    "ZZ",  "ABC123-375", "/F:VTLS1-D:9795532-R:2",
  )

  expect_error(
    oak_derive_seq(ds_in),
    paste(
      "ZZ domain keys must be in the domain_key_variables.csv",
      "Please update the file and use oak_load_study_config().",
      sep = "\n"
    )
  )

Test 3

  ds_in <- tibble::tribble(
    ~STUDYID,      ~RSUBJID,    ~SCTESTCD, ~DOMAIN,     ~SREL,           ~SCCAT,
    "ABC123",  "ABC123-210",   "LVSBJIND",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",   "EDULEVEL",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-210",     "TMSPPT",  "APSC",  "FRIEND", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",    "CAREDUR",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-211",   "LVSBJIND",  "APSC", "SIBLING", "CAREGIVERSTUDY",
    "ABC123",  "ABC123-212",    "JOBCLAS",  "APSC",  "SPOUSE", "CAREGIVERSTUDY"
  )

  result <- oak_derive_seq(ds_in)

  expect_equal(result$SCSEQ,
               c(1L, 2L, 3L, 1L, 2L, 1L))

ramiromagno · 2024-05-15T19:01:50Z

Thanks @edgar-manukyan!

In test 3, ds_in does not contain all the key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

edgar-manukyan · 2024-05-15T19:27:07Z

> Thanks @edgar-manukyan!
>
> In test 3, ds_in does not contain all the key variables. According to the file domain_key_variables.csv, the variables USUBJID, SCSPID, SCTESTCD and VISITNUM should also be there, shouldn't they? How could the function oak_derive_seq() work in that case?

Awesome observation, @ramiromagno. This test covers a so-called associated persons domain, which I see handled in roak here: https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L42

ramiromagno · 2024-05-15T19:34:48Z

I see, sorry for the oversight!

BTW: Just one more question: is the domain_key_variables.csv comprehensive?

ramiromagno · 2024-05-15T19:41:19Z

I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

edgar-manukyan · 2024-05-15T19:49:17Z

> I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in ds_in?

Interestingly, roak just ignores them, and you should ask Ram about this :) https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L80

ramiromagno · 2024-05-15T19:52:48Z

I see. Could it be that not all keys are mandatory? There might be a few that are optional, and in that case it could be fine to sort only with what is available...? @rammprasad help please! :)
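
If optional keys were allowed, sorting "only with what is available" might be sketched as below. This is an assumption for discussion, not a description of what roak does; the helper name available_keys() is hypothetical, and the column names are taken from Test 3.

```r
# Hypothetical helper; keeps only the configured keys present in the data set.
available_keys <- function(dat, key_vars) {
  stopifnot(is.data.frame(dat), is.character(key_vars))
  missing_keys <- setdiff(key_vars, names(dat))
  if (length(missing_keys) > 0L) {
    message(
      "Key variables not in the data set and ignored: ",
      paste(missing_keys, collapse = ", ")
    )
  }
  intersect(key_vars, names(dat))
}

apsc <- data.frame(
  STUDYID  = "ABC123",
  RSUBJID  = c("ABC123-210", "ABC123-211"),
  SCTESTCD = c("LVSBJIND", "CAREDUR"),
  DOMAIN   = "APSC",
  stringsAsFactors = FALSE
)
keys_used <- suppressMessages(
  available_keys(apsc, c("USUBJID", "SCSPID", "SCTESTCD", "VISITNUM"))
)
```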

edgar-manukyan · 2024-05-15T19:53:45Z

> I see, sorry for the oversight!
>
> BTW: Just one more question: is the domain_key_variables.csv comprehensive?

No worries, you are picking up SDTM concepts so quickly. After three years, I still feel dizzy about it. The attached file was used for the tests. This one, domain_key_variables (2).csv, is more comprehensive, though as Ram said it is dynamic and study teams will change it based on their setup. That's the reason why they call it a configuration file.

ramiromagno · 2024-05-15T20:05:11Z

Thank you @edgar-manukyan, that really helps! You're the best. I thought the set of variables used for sorting was the actual set of keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

armenic · 2024-05-15T20:15:32Z

> Thank you @edgar-manukyan, that really helps! You're the best. I thought the set of variables used for sorting was the actual set of keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?

They are supposed to be keys that uniquely identify the rows, and we even warn users if we notice that they don't.

ramiromagno · 2024-05-15T20:39:19Z

Thanks @edgar-manukyan. I've updated the PR according to your feedback so far. But we will have to wait for @rammprasad's feedback on these other corner cases.

houtel added the enhancement New feature or request label Oct 30, 2023

houtel added this to sdtm.oak R package Oct 30, 2023

houtel moved this to In Progress in sdtm.oak R package Oct 30, 2023

houtel changed the title ~~Develop a function to generation --SEQ.~~ Develop a function to generate --SEQ. Oct 30, 2023

edgar-manukyan assigned venkatamaguluri and tataphani and unassigned venkatamaguluri Nov 1, 2023

galachad assigned galachad and unassigned tataphani Nov 29, 2023

ramiromagno self-assigned this Apr 10, 2024

ramiromagno linked a pull request May 15, 2024 that will close this issue

0015 derive seq #53

Merged

14 tasks

ramiromagno moved this from In Progress to In review in sdtm.oak R package May 29, 2024

ramiromagno closed this as completed in #53 May 30, 2024

ramiromagno moved this from In review to Done in sdtm.oak R package Jun 17, 2024
