WIP: General Issue: glossary #37

Open
kamilsi opened this issue Jan 31, 2024 · 1 comment

@kamilsi
Collaborator

kamilsi commented Jan 31, 2024

Background Information

Based on discussions on Slack and in #30, we want to start a glossary of key terms for the project.

Definition of Done

No response

@ramiromagno
Collaborator

Hi @kamilsi, @rammprasad, @edgar-manukyan:

May I make a request? Can we start by trying to clarify these terms in the glossary?

  • dataset
  • domain
  • raw as in raw_dataset (as used in hardcode_no_ct())
  • target as in target_dataset (as used in hardcode_no_ct())
  • topic

I am aware of https://www.cdisc.org/kb/articles/domain-vs-dataset-whats-difference. However, I think the distinction still warrants clarification.

Take the case of the domain definition.

Domain: A collection of logically related observations with a common, specific topic that are normally collected for all subjects in a clinical investigation.

If we were to apply the concept of domain to R's iris data set, then I guess we could call it a domain in the sense that the iris data frame is a collection of related observations with a common topic, i.e. flower measurements (sepal and petal dimensions), not plant species, right? So even if we split the iris data into two data frames, the set of the two data frames would still be that one domain whose topic is flower measurements, right? So one domain is typically materialized as one dataset, but it need not be. Real-life examples would help here.
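
To make the iris thought experiment concrete, here is a minimal R sketch. It assumes only the iris data frame that ships with base R; the split by species and the names setosa, others and recombined are arbitrary choices for illustration and do not come from the package or from CDISC.

```r
# iris ships with base R (datasets package), so nothing needs installing.
data(iris)

# Split the single data frame into two, here by species (an arbitrary split).
setosa <- iris[iris$Species == "setosa", ]
others <- iris[iris$Species != "setosa", ]

# Both pieces keep the same columns, i.e. the same kind of observation.
stopifnot(identical(names(setosa), names(others)))

# Stacking the pieces recovers every row of the original data frame.
recombined <- rbind(setosa, others)
stopifnot(nrow(recombined) == nrow(iris))
```

Under the reading above, setosa and others would be two datasets that together materialize one and the same domain.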

Then, the CDISC definition of dataset:

A collection of structured data in a single file.

This is, perhaps, not so obvious either, to start with because of the reference to a file. I am guessing that the original intention was to refer to the implementation on a computer, be it a file, an object in memory, a database, etc. Right? It feels like the idea is that the domain definition corresponds to the conceptual notion of a data set, and that the dataset is the actual instantiation of that concept on a computer.

Regarding topic, can we say that it equates to the concept of observational unit in tidy data?

Projects
Status: Product Backlog
3 participants