
WiP: Datasets reworked #61

Merged
merged 15 commits into main from galip_datasets on Jun 28, 2024
Conversation

gumityolcu
Collaborator

utils.datasets.toy_datasets has been created; it currently includes base, label_poisoning, sample_perturbation, and label_grouping.

Base Class
The most general class. One thing needs explaining:
the parameters p and subset_idx both determine which datapoints will be affected by sample_fn and label_fn.

subset_idx:

  • Can be an int: then it is the id of the class to be affected.
  • Can be a list or tensor: then it is treated as the ids of the samples to be affected.
  • Can be None, which means "affect the whole dataset".

p: determines the probability with which each datapoint filtered by subset_idx will be affected. This is computed during initialization, so if a datapoint is affected, it is always affected.

So, for example, for grouping labels we pass subset_idx=None and p=1.0 to the base class (affects all datapoints with certainty).

For label poisoning we pass subset_idx=None and p=some value, which affects a randomly selected subset of the training data.

For sample perturbation (changing x in whatever way you want), I left these two open to user choice. For CleverHans, Backdoor, or Shortcut detection, we will pass subset_idx = an integer (a certain class) and p=some value, which affects a random subset of the images from a single class.
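The selection rule described above could be sketched roughly like this. This is a minimal, hypothetical stand-in, not the actual base class; the function name and parameters are illustrative:

```python
import random

def select_affected_indices(labels, subset_idx=None, p=1.0, seed=42):
    """Pick which sample indices sample_fn/label_fn will apply to.

    subset_idx: int -> all samples of that class are candidates;
    list -> exactly those sample indices are candidates;
    None -> every sample is a candidate.
    p: each candidate is kept with probability p, decided once at init.
    """
    rng = random.Random(seed)
    if subset_idx is None:
        candidates = range(len(labels))
    elif isinstance(subset_idx, int):
        candidates = [i for i, y in enumerate(labels) if y == subset_idx]
    else:
        candidates = list(subset_idx)
    # p=1.0 keeps every candidate; the coin flips happen here, once,
    # so an affected datapoint stays affected on every __getitem__.
    return [i for i in candidates if p >= 1.0 or rng.random() <= p]

# Label grouping:   subset_idx=None, p=1.0  -> everything is affected.
# Label poisoning:  subset_idx=None, p<1.0  -> a random subset of all data.
# Backdoor et al.:  subset_idx=7,    p<1.0  -> a random subset of class 7.
```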

No tests, no guarantees, work in progress

Owner

@dilyabareeva dilyabareeva left a comment


Hi @gumityolcu, great work, left a few comments!

For the future, I think we need to think about logical connections between sample and label transformations. For example, we flip the label, with a certain probability, if a sample is transformed.
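One possible shape for such a coupling, purely as a hypothetical sketch (none of these names exist in this PR):

```python
import random

def coupled_transform(x, y, sample_fn, n_classes, p_flip=0.5, rng=None):
    # Hypothetical: the sample is always transformed, and whenever it is,
    # the label is also flipped to a different class with probability p_flip.
    rng = rng or random.Random(42)
    x = sample_fn(x)
    if rng.random() <= p_flip:
        y = rng.choice([c for c in range(n_classes) if c != y])
    return x, y
```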

src/utils/toy_datasets/base.py
p: float = 1.0,
seed: int = 42,
device: str = "cpu",
sample_fn: Optional[Union[Callable, str]] = None,
Owner

something for future work: replace str with some Literals
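One way such a change could look, as a hypothetical sketch (the registry and its keys are made up for illustration; they are not the PR's actual function names):

```python
from typing import Callable, Literal, Optional, Union

# Hypothetical registry of named sample functions.
SAMPLE_FNS: dict = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
}
SampleFnName = Literal["identity", "negate"]

def resolve_sample_fn(
    sample_fn: Optional[Union[Callable, SampleFnName]] = None,
) -> Callable:
    # A Literal narrows the accepted strings to known registry keys,
    # so typos are caught by the type checker instead of at runtime.
    if sample_fn is None:
        return lambda x: x
    if callable(sample_fn):
        return sample_fn
    return SAMPLE_FNS[sample_fn]
```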

Owner

@dilyabareeva dilyabareeva left a comment


Hi @gumityolcu, looks good, added a couple of comments here and there.

subset_idx=None, # apply with certainty, to all datapoints
cls_idx=None,
)
self.n_classes = n_classes
Owner

I've changed some logic in my current PR. I can incorporate it here

src/utils/datasets/transformed_datasets/base.py
for i in range(self.__len__()):
x, y = dataset[i]
perturb_sample = (self.cls_idx is None) or (y == self.cls_idx)
p_condition = (random.random() <= self.p) if self.p < 1.0 else True
Owner

Do the papers actually require transforming p% of the data, or doing it this way?

Collaborator Author

In some cases we need to poison a random subset, or watermark a random subset of a class. In other cases, the child class does not have a p parameter and passes p=1.0 to the base class.
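For context, the two readings being discussed differ in a subtle way; a hypothetical side-by-side sketch:

```python
import random

def bernoulli_subset(n, p, seed=42):
    # What the loop above does: each sample is included independently
    # with probability p, so only approximately p * n samples are affected.
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() <= p]

def exact_fraction_subset(n, p, seed=42):
    # Alternative reading of "transform p% of the data": draw exactly
    # round(p * n) samples uniformly at random, without replacement.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), round(p * n)))
```

For large datasets the two distributions are close, but only the second guarantees the affected fraction exactly.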

if self.seed is not None:
random.seed(self.seed)
self.samples_to_perturb = []
for i in range(self.__len__()):
Owner

[nit-picking] list comprehensions are generally faster and nicer-looking than loops

Collaborator Author

I disagree that it would look better in this particular case: there are just too many components to put on one line. However, if it is actually faster, then that should probably be the priority, because we are looping over a whole training dataset.

I cannot commit the single-line version: Black (version 24.0.0) reformats it into a form that flake8 (6.0.0) does not accept. I am commenting it out for your hands to solve 🙏🏼
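For reference, a self-contained sketch of the loop alongside one equivalent comprehension (the dataset, cls_idx, and p here are made up to make the snippet runnable):

```python
import random

# Stand-in dataset of (x, y) pairs; cls_idx/p mimic the attributes above.
dataset = [([0.0], 0), ([1.0], 1), ([2.0], 1), ([3.0], 0), ([4.0], 1)]
cls_idx = 1
p = 0.5

random.seed(42)
samples_to_perturb = []
for i in range(len(dataset)):
    x, y = dataset[i]
    perturb_sample = (cls_idx is None) or (y == cls_idx)
    p_condition = (random.random() <= p) if p < 1.0 else True
    if perturb_sample and p_condition:
        samples_to_perturb.append(i)

# The comprehension keeps the p check first so the RNG is consumed exactly
# once per sample, matching the loop's draws for the same seed.
random.seed(42)
samples_to_perturb_lc = [
    i
    for i, (_, y) in enumerate(dataset)
    if (p >= 1.0 or random.random() <= p) and (cls_idx is None or y == cls_idx)
]
```

Whether Black's multi-line layout of the comprehension then satisfies flake8 depends on the configured line-length limits, so the loop may still be the pragmatic choice here.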

@gumityolcu gumityolcu merged commit d20d9b2 into main Jun 28, 2024
2 checks passed
@gumityolcu
Collaborator Author

This was a great read. Learned a lot about python. When are you publishing this? 🤓

@dilyabareeva
Owner

This was a great read. Learned a lot about python. When are you publishing this? 🤓

I don't know exactly what you mean, @gumityolcu, but I will just assume you are being sarcastic 😄

I wasn't quite done with this yet: we now have Grouped Datasets twice and Subclass Detection twice in main, and there were some conflicts between the new and old versions. Will push an update soon.

@dilyabareeva dilyabareeva deleted the galip_datasets branch June 29, 2024 05:19
dilyabareeva added a commit that referenced this pull request Jun 29, 2024
Resolving left-over merge conflicts from PR #61
@dilyabareeva dilyabareeva mentioned this pull request Jun 29, 2024