[Feature Request]: Consider `set_datanames` to support negative selection. #1380

gogonzo · 2024-10-15T12:06:39Z

Feature description

Imagine following

data <- teal_data() |> within({
  iris <- iris
  mtcars <- mtcars
  dummy_1 <- 1
  dummy_2 <- 2
})

modules(
  example_module() |> set_datanames("mtcars"),
  example_module()
) |> set_datanames(-dummy_1, -dummy2)

Thanks to above, there will be an alternative solution to exclude datanames without using dot-prefix.

Code of Conduct

I agree to follow this project's Code of Conduct.

Contribution Guidelines

I agree to follow this project's Contribution Guidelines.

Security Policy

I agree to follow this project's Security Policy.

donyunardi · 2024-10-17T14:39:36Z

Acceptance Criteria

Enhance set_datanames function to allow enable user to define negative selection when setting datanames
Analyze and Define clear condition on what happens to datanames when negative selection is applied to different scenarios (i.e. all, specific datasets selected, and transformer)

gogonzo · 2024-12-11T07:44:23Z

I'm not a fan of any solution involving set_datanames. I don't consider set_datanames having value added as it is possible to set datanames in the module constructor. See below:

# generic solution which was always possible
modules(
  tm_data_table(datasets_selected = c("ds1", "ds2", ...)),
  tm_variable_browser(datasets_selected = c("ds1", "ds2", ...)),
  module_with_datanames_set(x = choices_selected, y = choices_selected) # not alterable
)

# I don't see the point of this
modules(
  tm_data_table() |> set_datanames(c("ds1", "ds2", ...)),
  tm_variable_browser()  |> set_datanames(c("ds1", "ds2", ...)),
  module_with_datanames_set(x = choices_selected, y = choices_selected) # not alterable
)

# this is the only real use case 
modules(
  tm_data_table(),
  tm_variable_browser(),
  module_with_datanames_set(x = choices_selected, y = choices_selected) # not alterable
)  |> set_datanames(c("ds1", "ds2", ...))

set_datanames won't ever be a method comparable with tidyselect what we have imagined could happen. The biggest limitation for module$datanames is that when it is set WE MUSTN'T change it, as those $datanames are most probably crucial for module to work ($datanames are derivatives of the data_extract/choices_selected information. Therefore, set_datanames is limited only to modules which have datanames = "all" and others are ignored.
On individual level, each of those modules which initially have datanames = "all" (tm_data_table, tm_variable_browser) have an option to set custom datanames as I've shown in the example above. What set_datanames provides is that one can set_datanames of modules() collection but applied only to those which doesn't have datanames already set.

Ok, I admit modules() |> set_datanames() adds some value, but are gains exceeding costs?

can be only applied to modules which have datanames = "all"
Problem of the precedence which might be misleading for the user. In the example below "individual" set_datanames have always precedence over "modules()", which might cause some redundancy.

modules(
  tm_data_table() |> set_datanames("ds1", "ds2"),
  tm_variable_browser() |> set_datanames("ds1", "ds2")
) |> set_datanames("ds1", "ds2", "ds3") # redundant

This is another can of worms. When I've created the issue I thought that set_datanames could be similar to dplyr::select in some sense. Circumstances differ as select() |> select() |> select() limits a list of columns in each evaluation, while set_datanames() |> set_datanames() resets the "selection". Here is example:

mtcars |>
  select(-mpg) |> # all except mpg 
  select(cyl, disp, hp, drat) |> # only: cyl, disp, hp, drat
  select(-cyl, -disp) # dp, drat

For set_datanames api could be confusing. First, because only datanames = "all" could be changed which makes following impossible:

module() |> set_datanames(-a) |> set_datanames(-b, -c) |> set_datanames(d, e)

With @averissimo we considered that maybe module shouldn't automatically inherit parent (modules) datanames so that this could be possible:

modules(
  label = "m1",
  modules(
    label = "m1.1",
    module(label = "m1.1.1) |> set_datanames(-c)
  ) |> set_datanames(-b)
) |> set_datanames(-a)

# can be translated to:
m1: all except a
  m1.1: all (from m1) except b
    m1.1.1: all (from m1.1) except c => all except a, b, c

Above would require following changes:

Extend modules to have $datanames attribute (didn't have before)
Don't set parent datanames when using set_datanames. These datanames should be set and kept on the level where they have been assigned and "inheritance" mechanism should be triggered when calling a module.

Summary: I don't like the idea of set_datanames at all. I believe this is unclear functionality which adds even more to the "datanames" problem. My recommendation is to remove set_datanames at all and await end-user comments if they are really need it.

gogonzo · 2024-12-13T10:35:10Z

Another reason why I am against having set_datanames. It is not clear which modules have datanames = "all" unless app-developer uses print.teal_module. App developer might not be aware of the fact that tm_xyz has some datanames property, because it is set "privately" via module call in tm_xyz constructor. App developer when calling tm_xyz doesn't have a direct control over datanames property because it is dependent how the tm_xyz has been developed. This is why tm_data_table and tm_variable_browser have datasets_selected argument for extra datanames control. Therefore I think datanames is a property which should handled in most of the cases by the module constructor so app-developer doesn't need to know anything about this.

m7pr · 2024-12-13T14:16:50Z

I completely agree with your assessment of set_datanames. As you've clearly demonstrated, its functionality introduces more complexity than value, especially given the redundancy and confusion it can create.

The ability to set datanames directly in the module constructor (as shown in your first example) already provides a clean and intuitive solution for the majority of cases. The set_datanames method adds unnecessary ambiguity, particularly when it comes to precedence. Your point about redundancy is especially valid—having both individual set_datanames and a global modules() |> set_datanames() makes the behavior difficult to predict and prone to errors.

Furthermore, the comparison with dplyr::select highlights a critical limitation: while select() can progressively refine a selection, set_datanames merely overwrites previous settings. This inconsistency makes the API unintuitive and less versatile for users who might expect similar behavior.

Your proposed example of non-inheriting datanames at different levels within nested modules is intriguing and could potentially solve some issues. However, this would require significant changes, such as adding $datanames attributes to modules and redefining the inheritance mechanism. While this is worth exploring, it might add even more complexity in its current form.

In summary, I share your concerns. set_datanames seems like a solution in search of a problem. It introduces additional overhead, unclear precedence rules, and functionality that users might not actually need. Removing it and waiting for end-user feedback would be a prudent approach. If real-world use cases emerge where this is necessary, the design can be revisited with clearer requirements in mind.

gogonzo · 2024-12-13T14:37:25Z

Your proposed example of non-inheriting datanames at different levels within nested modules is intriguing and could potentially solve some issues. However, this would require significant changes, such as adding $datanames attributes to modules and redefining the inheritance mechanism. While this is worth exploring, it might add even more complexity in its current form.

There is some complexity but still there will be a confusion which you've highlighted in previous section about dplyr::select - for example when positive selection overwrites parent settings, while negative selection substracts from the previous session. Proposed concept is so confusing because there is set, remove but no append. What I mean is set_datanames("a") sets independently datanames, while set_datanames("-a") removes (from something). Theoretically we need set_datanames("+a") to justify "-a". Too much!

donyunardi · 2024-12-16T07:57:00Z

For what I recall, set_datanames was designed to conveniently limit the data displayed in the module when multiple datasets passed to the module (or all datasets). For instance, user may want to show only the ANL dataset after merging multiple datasets.

# The module displays only ANL
my_module(datanames = "all", transformator = transformator_where_ANL_is_created) |> set_datanames("ANL")

I agree that the user can set datanames = "ANL" during the module construction and before the transformation that creates ANL:

# The module displays only ANL but shows a warning
my_module(datanames = "ANL", transformator = transformator_where_ANL_is_created)

The user will receive a warning that ANL does not exist, but the module will still execute.

Questions

If we no longer support set_datanames, users will need to define datanames during the module construction. How will this affect existing modules? Do all our current modules already have a datanames (or equivalent) argument, or will we need to add the datanames argument to all teal module functions?
Why do we have to align set_datanames with the dplyr::select design? Can we instead train users to understand how to use this helper function and its limitations?

llrs-roche · 2024-12-16T10:51:50Z

I've been reading the comments and went to check the feature, I'll digress a little bit to provide some background as a "new user" trying to understand the feature until I come back to the current discussion. Skip to under the title if you wish.

On NEWS it is described as "Introduced a function set_datanames() to change a datanames of the teal_module.": So it is currently being describing as a renaming (unless one already knows what datanames is). I would propose to clarify on the NEWS the purpose of the function. As this is quite different from "limit the data displayed in the module when multiple datasets passed to the module (or all datasets)" rationale provided above.

While checking the current examples of set_datanames this one stood out:

> module()
|- module
|  |- Datasets         : all
|  |- Properties:
|  |  |- Bookmarkable  : FALSE
|  |  L- Reportable    : FALSE
|  |- UI Arguments     : 
|  |- Server Arguments : 
|  L- Transformators       : 
# change modules' datanames
set_datanames(
  modules(
    module(datanames = "all"),
    module(datanames = "a")
  ),
  "b"
)
> module(datanames = "all")
|- module
|  |- Datasets         : all
|  |- Properties:
|  |  |- Bookmarkable  : FALSE
|  |  L- Reportable    : FALSE
|  |- UI Arguments     : 
|  |- Server Arguments : 
|  L- Transformators       : 
> set_datanames(
+     modules(
+         module(datanames = "all"),
+         module(datanames = "a")
+     ),
+     "b"
+ )
[WARN] 2024-12-16 10:36:17.4508 pid:23260 token:[] teal Not possible to modify datanames of the module module. set_datanames() can only change datanames if it was set to "all".
TEAL ROOT
  |- module
  |  |- Datasets         : b
  |  |- Properties:
  |  |  |- Bookmarkable  : FALSE
  |  |  L- Reportable    : FALSE
  |  |- UI Arguments     : 
  |  |- Server Arguments : 
  |  L- Transformators       : 
  L- module
     |- Datasets         : a
     |- Properties:
     |  |- Bookmarkable  : FALSE
     |  L- Reportable    : FALSE
     |- UI Arguments     : 
     |- Server Arguments : 
     L- Transformators       :

The function currently only applies to the first module, not all modules?

The name might also be confusing: set_datanames seems like could be named differently and be more helpful, like use_datasets. While the name on the module creation might be okay, but having the parameter and the function be named more similar might help users.

From what I understood from the comments this function (why it is not a method?) should be run at a later stage instead of initialization like (validation?). It would be great if the datanames could be check with the data (qenv?) parameter dynamically. This way no warning should display before the required datasets are created.

About set_datanames/datanames discussion

Utility
I think it only makes sense to use negative selection if it is faster/shorter than using positive selection. Adding negative selection requires knowing which datasets are available on a given module. My limited testing and knowledge haven't shown up a way to retrieve that easily from a module (no datasets() or names() function returned this data and even the printing function just prints the ones selected via datanames as examples above show).

Using set_datanames requires being familiar with the objects the module creates.
Using the datanames parameter of the module requires being the module developer. On this case if one is already the module developer they could remove data with a call to rm() inside the qenv/teal.data (if using it)

Notation
Mixing standard selection with NSE notation might confuse users: "-dataset" doesn't look a good approach to me.
If there's a way to retrieve the data names available (not just those selected) on the module it might be faster to use something like (set_datanames(module1, setdiff(names(module1), "a"))
NSE notation could be used easier if the datanames are possible to load dynamically from the qenv environment. Then probably something like tidyselect helpers could be used (here is how to use it)

Answers to questions:

I'm not sure, checking some modules from tmg:
tm_data_table has datasets_selected = character(0),, tm_front_page has tables = list() for the same effect. tm_missing_data has a parent_dataname but nothing else relevant. tm_variable_browser has datasets_selected = character(0). So it seems some sort of apply this module to only this dataset already exists for some modules.
I think we can explain to users that it is different from select() even if we use the same functionality.

gogonzo · 2024-12-17T12:11:35Z

For what I recall, set_datanames was designed to conveniently limit the data displayed in the module when multiple datasets passed to the module (or all datasets)

@donyunardi - set_datanames was created only to limit datanames of the modules with property datanames = "all". set_datanames works only when module$datanames = "all", when datanames is set explicitly there is no way to change it.

The user will receive a warning that ANL does not exist, but the module will still execute.

@donyunardi same would happen if you do example_module(datanames = "all") |> set_datanames("ANL"). It means set_datanames doesn't give any special rescue in these situation.

If we no longer support set_datanames, users will need to define datanames during the module construction. How will this affect existing modules? Do all our current modules already have a datanames (or equivalent) argument, or will we need to add the datanames argument to all teal module functions?

@donyunardi no it wouldn't be necessary. Most of our modules have datanames set via data_extract_spec or choices_selected in tm_xyz. For these modules set_datanames function won't make any effect as they have datanames != "all". The only use case for set_datanames was following:

app has datanames d1, d2, d3, d4
some module has datanames = "all" (for example tm_data_table) but app developer wants only d1 and d2 to be displayed in this module. There are two possible options at this moment:
1. tm_data_table(datasets_selected = c("d1", "d2")) (this option was here since forever)
2. tm_data_table() |> set_datanames(c("d1", "d2")) (i think this is unnecessary)

Why do we have to align set_datanames with the dplyr::select design? Can we instead train users to understand how to use this helper function and its limitations?

@donyunardi it was the first thought of mine that we could have a nice API for datanames handling which could be easy to understand by the community. As I've explained in previous comments tidyselect and set_datanames are equivalent.

gogonzo · 2024-12-17T12:28:51Z

The function currently only applies to the first module, not all modules?

Yes, because the second module has datanames != "all". When datanames are set, it means that nobody else should be able to change them, because modules are build to work with specific datasets and if they are not provided (if they would be limited) then there would be failure. This is why set_datanames applies only when there are unspecific datasets provided ("all"). set_datanames has unfortunate name, I admit. use_datanames is not good neither as it should be something like specify_unspecified_datanames

It would be great if the datanames could be check with the data (qenv?) parameter dynamically. This way no warning should display before the required datasets are created.

This is what we do. In srv_teal we check if the objects hold in data are sufficient for module$datanames. It is not possible to check it before application starts as modules and data are separate arguments of teal::init, they are two different objects.

Adding negative selection requires knowing which datasets are available on a given module

@llrs-roche theoretically no. If one doesn't need dataset names x in some module then it won't be provided in both cases - if x is in data or not. It would be just a matter of error, like: "can't remove object "x" for a module - continue without stop"

averissimo · 2024-12-17T13:26:26Z

Handling this function as a each call being areplacement of datanames seems that it is just too complex and will create more problems that don't fully justify it.

If we want to keep set_datanames, I think it should only be applied to individual module and in one swoop we remove all the tree-like problems.

Alternatively/complementary we could move closer to dplyr::select by adding select_datanames (deprecating set_datanames)
- It progressively limits the selection, instead of re-setting each time

If I'm not mistaken, this would remove any complexity with the order of operation.

Some toy examples with select_datanames

#> "all" = c("a", "b", "c", "d")
modules(
  label = "m1",
  modules(
    label = "m1.1",
    module(label = "m1.1.1") |> set_datanames(-c)
  ) |> set_datanames(-b)
) |> set_datanames(-a)

#> can be translated to:
#> m1: b, c, d
#>   m1.1: c, d
#>     m1.1.1: d

#> "all" = c("a", "b", "c", "d")
modules(
  label = "m1",
  module(label = "m1.1") |> select_datanames(a, b)
  module(label = "m1.2") |> select_datanames(-c)
) |> select_datanames(a, c, d)

#> can be translated to:
#> m1: a, c
#>   m1.1: a
#>   m1.2: a, d

As a path forward, I'm thinking that we could start by limiting set_datanames to a single module and allow for negative selection.

Step 2. would be to contact stakeholders to see if there's an appetite for select_datanames

donyunardi no it wouldn't be necessary. Most of our modules have datanames set via data_extract_spec or choices_selected in tm_xyz

This is the "law" for any approach we have here, there would be a question of whether there should be a warning (if directly applied)

And the same with argument-specified datanames (this could be discussed, but I think it should behave the same way as the data_extract_spec/choices_selected modules

gogonzo · 2024-12-17T14:13:11Z

If we want to keep set_datanames, I think it should only be applied to individual module and in one swoop we remove all the tree-like problems.

That's a good observation. But stil what is the value added of this utility? To change datanames using set_datanames would be equivalently simple as using datanames argument in the module constructor:

# abstract scenario
modules(
  module() |> set_datanames(a, b),
  module(datanames = c("a", "b"))
)

# more realistic scenario
modules(
  tm_data_table() |> set_datanames(a, b),
  tm_data_table(datasets_selected = c("a", "b"))

)

m7pr · 2024-12-18T11:52:36Z

set_datanames brings too much confusion without adding enough value to justify its existence. When comparing the two approaches side by side - using set_datanames versus directly specifying datanames in the constructor - it is clear that the second option is just as simple, more explicit, and avoids unnecessary complexity.

The problem with set_datanames is not just the redundancy but also the ambiguity it introduces. As pointed out earlier, the function works only when datanames = "all", and this limitation makes it unpredictable for users. They need to know when they can use set_datanames and when they cannot, which defeats the purpose of having a helper utility in the first place.

For example, this:

tm_data_table() |> set_datanames(c("a", "b"))

is functionally no different from this:

tm_data_table(datanames = c("a", "b"))

Why add another layer of abstraction when the same result can be achieved with more clarity using an existing argument? For users, the constructor-based approach is simpler to understand and consistent across all modules.

m7pr · 2024-12-18T12:26:02Z

@averissimo about keeping set_datanames but applying it to individual modules: yes, this simplifies the tree-like issues and order-of-operation problems. But still, I think we are solving a problem that doesn't need to exist. If datanames can always be specified during module creation, why add another way to do the same thing?

The progressive selection idea with select_datanames is interesting, but I think this makes things even more complex. It might resemble dplyr::select but at the cost of making the datanames API harder to follow. Users would need to remember rules about inheritance, precedence, and whether they are appending, subtracting, or resetting values. This is not what users want from a simple helper function.

To summarize: set_datanames adds unnecessary complexity and redundancy. Removing it won’t break anything, as the same functionality exists with datanames arguments in constructors. Instead of patching the issues with set_datanames, we should focus on simplifying the API and sticking to what works. If end-users really miss this functionality, they will let us know, but I suspect they won’t.

donyunardi · 2024-12-19T18:41:13Z

After our discussion, the team has decided to remove the set_datanames functionality to provide a simpler solution for users, as outlined here and here. Since this feature has not been released yet, we will remove it directly without deprecation.

However, this change requires us to review all teal modules that currently use datanames = "all" to ensure there is a way for users to control the datanames in the module's constructor. Otherwise, with our updated concept of datanames = "all" as defined in this PR, all objects in the teal_data object will become visible, which may not be desirable.

While most modules include an argument (or equivalent to it) to configure datanames, some do not:

We will close this issue and open a new issue to handle the removal of set_datanames feature and address other teal modules.

gogonzo added the enhancement New feature or request label Oct 15, 2024

This was referenced Oct 15, 2024

[Feature Request]: Don't use @datanames anymore insightsengineering/teal.data#333

Closed

teal framework Release insightsengineering/nestdevs-tasks#66

Open

donyunardi added the core label Oct 17, 2024

gogonzo self-assigned this Dec 9, 2024

gogonzo mentioned this issue Dec 10, 2024

negative datanames selection #1426

Closed

gogonzo mentioned this issue Dec 12, 2024

1380 negative datanames2 #1428

Draft

vedhav mentioned this issue Dec 13, 2024

Update Vignette due to datanames updates insightsengineering/teal.data#338

Open

gogonzo mentioned this issue Dec 19, 2024

Datanames in vignettes #1434

Open

donyunardi closed this as completed Dec 19, 2024

donyunardi mentioned this issue Dec 19, 2024

Remove set_datanames functionality #1436

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request]: Consider `set_datanames` to support negative selection. #1380

[Feature Request]: Consider `set_datanames` to support negative selection. #1380

gogonzo commented Oct 15, 2024

donyunardi commented Oct 17, 2024 •

edited

Loading

gogonzo commented Dec 11, 2024 •

edited

Loading

gogonzo commented Dec 13, 2024

m7pr commented Dec 13, 2024

gogonzo commented Dec 13, 2024

donyunardi commented Dec 16, 2024 •

edited

Loading

llrs-roche commented Dec 16, 2024

gogonzo commented Dec 17, 2024 •

edited by donyunardi

Loading

gogonzo commented Dec 17, 2024 •

edited by donyunardi

Loading

averissimo commented Dec 17, 2024 •

edited

Loading

gogonzo commented Dec 17, 2024

m7pr commented Dec 18, 2024

m7pr commented Dec 18, 2024

donyunardi commented Dec 19, 2024

[Feature Request]: Consider set_datanames to support negative selection. #1380

[Feature Request]: Consider set_datanames to support negative selection. #1380

Comments

gogonzo commented Oct 15, 2024

Feature description

Code of Conduct

Contribution Guidelines

Security Policy

donyunardi commented Oct 17, 2024 • edited Loading

Acceptance Criteria

gogonzo commented Dec 11, 2024 • edited Loading

gogonzo commented Dec 13, 2024

m7pr commented Dec 13, 2024

gogonzo commented Dec 13, 2024

donyunardi commented Dec 16, 2024 • edited Loading

Questions

llrs-roche commented Dec 16, 2024

About set_datanames/datanames discussion

gogonzo commented Dec 17, 2024 • edited by donyunardi Loading

gogonzo commented Dec 17, 2024 • edited by donyunardi Loading

averissimo commented Dec 17, 2024 • edited Loading

gogonzo commented Dec 17, 2024

m7pr commented Dec 18, 2024

m7pr commented Dec 18, 2024

donyunardi commented Dec 19, 2024

[Feature Request]: Consider `set_datanames` to support negative selection. #1380

[Feature Request]: Consider `set_datanames` to support negative selection. #1380

donyunardi commented Oct 17, 2024 •

edited

Loading

gogonzo commented Dec 11, 2024 •

edited

Loading

donyunardi commented Dec 16, 2024 •

edited

Loading

gogonzo commented Dec 17, 2024 •

edited by donyunardi

Loading

gogonzo commented Dec 17, 2024 •

edited by donyunardi

Loading

averissimo commented Dec 17, 2024 •

edited

Loading