Develop a function to generate --SEQ. #15
Comments
Comments from Venkata Maguluri (Pfizer):
In roak we warn the user if the keys do not identify distinct rows. Shall we build such functionality here as well? We could help users identify those duplicate rows.
We can also try to determine whether the data is already sorted.
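A minimal base R sketch of the two checks suggested above: listing the rows whose key values are not unique, and testing whether the data is already sorted by its keys. The helper names `find_duplicate_keys()` and `is_sorted_by()` are hypothetical, not roak functions, and the small `vs` data frame is made up for illustration.

```r
# Hypothetical helper: return every row whose key columns are not unique.
find_duplicate_keys <- function(data, key_vars) {
  keys <- data[, key_vars, drop = FALSE]
  # Flag all members of a duplicated group, not only the later copies.
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  data[is_dup, , drop = FALSE]
}

# Hypothetical helper: TRUE if `data` is already ordered by `key_vars`.
is_sorted_by <- function(data, key_vars) {
  # unname() keeps column names from being matched to order()'s own arguments.
  ord <- do.call(order, unname(as.list(data[, key_vars, drop = FALSE])))
  identical(ord, seq_len(nrow(data)))
}

# Illustrative data only: rows 1 and 2 share the same key values.
vs <- data.frame(
  USUBJID = c("ABC123-375", "ABC123-375", "ABC123-376"),
  VSTESTCD = c("DIABP", "DIABP", "TEMP"),
  stringsAsFactors = FALSE
)
dups <- find_duplicate_keys(vs, c("USUBJID", "VSTESTCD"))
already_sorted <- is_sorted_by(vs, c("USUBJID", "VSTESTCD"))
```

A caller could emit a warning when `nrow(dups) > 0` and show `dups` to the user, and skip the sort when `already_sorted` is `TRUE`.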
As a separate step, in roak we convert all empty string values "" into NA_character_ in all raw datasets before any of the functions start manipulating the data.
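The empty-string conversion described above could look roughly like the following base R sketch. The function name `blank_to_na()` is an assumption made for this example, and this is not the roak implementation.

```r
# Hypothetical helper: replace "" with NA_character_ in character columns only.
blank_to_na <- function(data) {
  is_chr <- vapply(data, is.character, logical(1))
  for (col in names(data)[is_chr]) {
    x <- data[[col]]
    # Leave existing NA values untouched; only exact empty strings change.
    x[!is.na(x) & x == ""] <- NA_character_
    data[[col]] <- x
  }
  data
}

# Illustrative raw data: one empty USUBJID and a numeric column left as-is.
raw <- data.frame(
  USUBJID = c("ABC123-375", ""),
  VSTPTNUM = c(NA, 2),
  stringsAsFactors = FALSE
)
clean <- blank_to_na(raw)
```

Running this once on each raw dataset, before any sorting or key checks, means later steps only have to deal with one kind of missing value.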
Dear developer, will the function work the tidyeval way, so that quotation marks are not needed for data frame variables?
Hi Adam (@galachad): I see that this issue is assigned to you, but it hasn't seen any activity for a long while. Are you still working on this?
Reference function in {roak} - https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R
Hi @rammprasad and @edgar-manukyan: Could you provide a few input and output data set examples for me to test my implementation?
@ramiromagno, please find the domain_key_variables.csv, which determines how the domain needs to be sorted (ideally into unique rows) before the --SEQ variable gets derived.
Test 1
ds_in <- tibble::tribble(
~STUDYID, ~DOMAIN, ~USUBJID, ~VSSPID, ~VSTESTCD, ~VSDTC, ~VSTPTNUM,
"ABC123", "VS", "ABC123-375", "/F:VTLS1-D:9795532-R:2", "DIABP", "2020-09-01T13:31", NA,
"ABC123", "VS", "ABC123-375", "/F:VTLS1-D:9795532-R:2", "TEMP", "2020-09-01T13:31", NA,
"ABC123", "VS", "ABC123-375", "/F:VTLS2-D:9795533-R:2", "DIABP", "2020-09-28T11:00", 2,
"ABC123", "VS", "ABC123-375", "/F:VTLS2-D:9795533-R:2", "TEMP", "2020-09-28T11:00", 2,
"ABC123", "VS", "ABC123-376", "/F:VTLS1-D:9795591-R:1", "DIABP", "2020-09-20", NA,
"ABC123", "VS", "ABC123-376", "/F:VTLS1-D:9795591-R:1", "TEMP", "2020-09-20", NA
)
result <- oak_derive_seq(ds_in)
expect_equal(result$VSSEQ,
c(1L, 2L, 3L, 4L, 1L, 2L))
Test 2
ds_in <- tibble::tribble(
~STUDYID, ~DOMAIN, ~USUBJID, ~VSSPID,
"ABC123", "ZZ", "ABC123-375", "/F:VTLS1-D:9795532-R:2",
)
expect_error(
oak_derive_seq(ds_in),
paste(
"ZZ domain keys must be in the domain_key_variables.csv",
"Please update the file and use oak_load_study_config().",
sep = "\n"
)
)
Test 3
ds_in <- tibble::tribble(
~STUDYID, ~RSUBJID, ~SCTESTCD, ~DOMAIN, ~SREL, ~SCCAT,
"ABC123", "ABC123-210", "LVSBJIND", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-210", "EDULEVEL", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-210", "TMSPPT", "APSC", "FRIEND", "CAREGIVERSTUDY",
"ABC123", "ABC123-211", "CAREDUR", "APSC", "SIBLING", "CAREGIVERSTUDY",
"ABC123", "ABC123-211", "LVSBJIND", "APSC", "SIBLING", "CAREGIVERSTUDY",
"ABC123", "ABC123-212", "JOBCLAS", "APSC", "SPOUSE", "CAREGIVERSTUDY"
)
result <- oak_derive_seq(ds_in)
expect_equal(result$SCSEQ,
c(1L, 2L, 3L, 1L, 2L, 1L))
Thanks @edgar-manukyan! In test 3,
Awesome observation, @ramiromagno. This is testing a so-called associated persons domain, and I see in roak https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L42
I see, sorry for the oversight! BTW, just one more question: is the domain_key_variables.csv comprehensive?
I'm sorry if I am overlooking something here again, but if the domain is APSC, shouldn't the column APID be there in
Interestingly, roak just ignores them, and you should ask Ram about this :) https://github.com/pharmaverse/roak_pilot/blob/main/R/oak_derive_seq.R#L80
I see. Could it be that not all keys are mandatory? There might be a few that are optional, and in that case it could be fine to sort only with what is available...? @rammprasad help please! :)
No worries, you are picking up SDTM concepts so quickly. After three years, I still feel dizzy about it. The attached file
Thank you @edgar-manukyan, that really helps! You're the best. I thought the set of variables used for sorting was the actual set of keys that define a record in a specific SDTM domain data set. Isn't this set in stone in the standard?
They are supposed to be keys that uniquely identify the rows, and we even warn users if we notice that they don't.
Thanks @edgar-manukyan. I've updated the PR according to your feedback so far. But we will have to wait for @rammprasad's feedback on these other corner cases.
Feature Idea
Purpose
Generate --SEQ given, as parameters, the set of columns that define the natural key for a domain and an initial value for each USUBJID (default 1).
Functionality
The feature should generate --SEQ for a given domain using the following algorithm. Note that this algorithm assumes that any split domains are combined into a single data frame prior to generating the --SEQ column.
Relevant Input
Data frame containing all domain columns except for the --SEQ column and a vector containing the natural key columns for the domain.
Relevant Output
Data frame containing all domain columns including the --SEQ column.
Reproducible Example/Pseudo Code
generate_seq(tar_dat, tar_var = "xxSEQ", key_vars = c("USUBJID", "XXCAT", "XXSCAT", "XXTERM"), init_val = 1)
Example:
generate_seq(ds, "dsseq", key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))
NOTE: The CRAN package {sdtmval} contains an assign_SEQ function that could be used for this purpose, unless we want to minimize dependencies.
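For illustration, here is one way the pseudo code could be fleshed out in base R. This is a sketch under assumptions, not an agreed design: the `subject_var` argument is hypothetical (the proposal only names `tar_dat`, `tar_var`, `key_vars` and `init_val`), rows are sorted by subject first and then by the key variables, and the `ds` data frame is made up.

```r
# Sketch of generate_seq(): sort by subject and keys, then number rows per subject.
# `subject_var` is a hypothetical extra argument, not part of the proposal above.
generate_seq <- function(tar_dat, tar_var, key_vars, init_val = 1L,
                         subject_var = "USUBJID") {
  stopifnot(all(c(subject_var, key_vars) %in% names(tar_dat)))
  sort_cols <- unique(c(subject_var, key_vars))
  # unname() keeps column names from being matched to order()'s own arguments.
  ord <- do.call(order, unname(as.list(tar_dat[, sort_cols, drop = FALSE])))
  out <- tar_dat[ord, , drop = FALSE]
  # Running row count within each subject: 1, 2, ... per USUBJID.
  idx <- ave(seq_len(nrow(out)), out[[subject_var]], FUN = seq_along)
  out[[tar_var]] <- as.integer(idx + init_val - 1L)
  rownames(out) <- NULL
  out
}

# Illustrative data only, deliberately unsorted.
ds <- data.frame(
  USUBJID = c("ABC123-376", "ABC123-375", "ABC123-375"),
  DSCAT = c("DISPOSITION EVENT", "PROTOCOL MILESTONE", "DISPOSITION EVENT"),
  DSSCAT = c("STUDY", "STUDY", "STUDY"),
  DSTERM = c("COMPLETED", "INFORMED CONSENT OBTAINED", "COMPLETED"),
  stringsAsFactors = FALSE
)
res <- generate_seq(ds, "dsseq",
                    key_vars = c("USUBJID", "DSCAT", "DSSCAT", "DSTERM"))
```

The sketch takes quoted column names, as in the proposal. It does not check that the keys identify unique rows, so tied rows keep their input order and still receive distinct sequence numbers.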