
WiP: Datasets reworked #61

Merged
merged 15 commits into main from galip_datasets on Jun 28, 2024
Conversation

gumityolcu
Collaborator

utils.datasets.toy_datasets has been created; it currently includes base, label_poisoning, sample_perturbation, and label_grouping.

Base Class
The most general class. One thing needs explaining:
the parameters p and subset_idx both determine which datapoints will be affected by sample_fn and label_fn.

subset_idx:

  • Can be an int: then it is the id of the class to be affected.
  • Can be a list or tensor: then it is treated as the ids of the samples to be affected.
  • Can be None, which means "affect the whole dataset".

p: determines the probability with which each datapoint filtered by subset_idx will be affected. This is computed during initialization, so if a datapoint is affected, it is always affected.

So, for example, for grouping labels we pass subset_idx=None and p=1.0 to the base class (affects all datapoints with certainty).

For label poisoning we pass subset_idx=None and p=some value, which affects a randomly selected subset of the training data.

For sample perturbation (changing x in whatever way you want), I left these two open to user choice. For CleverHans, Backdoor, or Shortcut detection, we will pass subset_idx = an integer (a certain class) and p=some value, which affects a random subset of the images from a single class.
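The selection rule described above could be sketched roughly like this. This is a minimal, hypothetical stand-in, not the actual base class; the function name and parameters are illustrative:

```python
import random

def select_affected_indices(labels, subset_idx=None, p=1.0, seed=42):
    """Pick which sample indices sample_fn/label_fn will apply to.

    subset_idx: int -> all samples of that class are candidates;
    list -> exactly those sample indices are candidates;
    None -> every sample is a candidate.
    p: each candidate is kept with probability p, decided once at init.
    """
    rng = random.Random(seed)
    if subset_idx is None:
        candidates = range(len(labels))
    elif isinstance(subset_idx, int):
        candidates = [i for i, y in enumerate(labels) if y == subset_idx]
    else:
        candidates = list(subset_idx)
    # p=1.0 keeps every candidate; the coin flips happen here, once,
    # so an affected datapoint stays affected on every __getitem__.
    return [i for i in candidates if p >= 1.0 or rng.random() <= p]

# Label grouping:   subset_idx=None, p=1.0  -> everything is affected.
# Label poisoning:  subset_idx=None, p<1.0  -> a random subset of all data.
# Backdoor et al.:  subset_idx=7,    p<1.0  -> a random subset of class 7.
```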

No tests, no guarantees, work in progress

Owner

@dilyabareeva dilyabareeva left a comment


Hi @gumityolcu, great work, left a few comments!

For the future, I think we need to think about logical connections between sample and label transformations. For example, we flip the label, with a certain probability, if a sample is transformed.
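One possible shape for such a coupling, purely as a hypothetical sketch (none of these names exist in this PR):

```python
import random

def coupled_transform(x, y, sample_fn, n_classes, p_flip=0.5, rng=None):
    # Hypothetical: the sample is always transformed, and whenever it is,
    # the label is also flipped to a different class with probability p_flip.
    rng = rng or random.Random(42)
    x = sample_fn(x)
    if rng.random() <= p_flip:
        y = rng.choice([c for c in range(n_classes) if c != y])
    return x, y
```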

src/utils/toy_datasets/base.py
p: float = 1.0,
seed: int = 42,
device: str = "cpu",
sample_fn: Optional[Union[Callable, str]] = None,
Owner

something for future work: replace str with some Literals
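One way such a change could look, as a hypothetical sketch (the registry and its keys are made up for illustration; they are not the PR's actual function names):

```python
from typing import Callable, Literal, Optional, Union

# Hypothetical registry of named sample functions.
SAMPLE_FNS: dict = {
    "identity": lambda x: x,
    "negate": lambda x: -x,
}
SampleFnName = Literal["identity", "negate"]

def resolve_sample_fn(
    sample_fn: Optional[Union[Callable, SampleFnName]] = None,
) -> Callable:
    # A Literal narrows the accepted strings to known registry keys,
    # so typos are caught by the type checker instead of at runtime.
    if sample_fn is None:
        return lambda x: x
    if callable(sample_fn):
        return sample_fn
    return SAMPLE_FNS[sample_fn]
```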

Owner

@dilyabareeva dilyabareeva left a comment


Hi @gumityolcu, looks good, added a couple of comments here and there.

subset_idx=None, # apply with certainty, to all datapoints
cls_idx=None,
)
self.n_classes = n_classes
Owner

I've changed some logic in my current PR. I can incorporate it here

src/utils/datasets/transformed_datasets/base.py
for i in range(self.__len__()):
x, y = dataset[i]
perturb_sample = (self.cls_idx is None) or (y == self.cls_idx)
p_condition = (random.random() <= self.p) if self.p < 1.0 else True
Owner

Do the papers actually require transforming p% of the data, or doing it this way?

Collaborator Author

In some cases we need to poison a random subset, or watermark a random subset of a class. In other cases, the child class does not have a p parameter and passes p=1.0 to the base class.
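For context, the two readings being discussed differ in a subtle way; a hypothetical side-by-side sketch:

```python
import random

def bernoulli_subset(n, p, seed=42):
    # What the loop above does: each sample is included independently
    # with probability p, so only approximately p * n samples are affected.
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() <= p]

def exact_fraction_subset(n, p, seed=42):
    # Alternative reading of "transform p% of the data": draw exactly
    # round(p * n) samples uniformly at random, without replacement.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), round(p * n)))
```

For large datasets the two distributions are close, but only the second guarantees the affected fraction exactly.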

if self.seed is not None:
random.seed(self.seed)
self.samples_to_perturb = []
for i in range(self.__len__()):
Owner

[nit-picking] list comprehensions are generally faster and nicer-looking than loops

Collaborator Author

I disagree that it would look better in this particular case: there are just too many components to put on one line. However, if it is actually faster, then that should probably be the priority, because we are looping over a whole training dataset.

I cannot commit the single-line version: Black (version 24.0.0) reformats it into a form that flake8 (6.0.0) does not accept. I am commenting it out for your hands to solve 🙏🏼
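For reference, a self-contained sketch of the loop alongside one equivalent comprehension (the dataset, cls_idx, and p here are made up to make the snippet runnable):

```python
import random

# Stand-in dataset of (x, y) pairs; cls_idx/p mimic the attributes above.
dataset = [([0.0], 0), ([1.0], 1), ([2.0], 1), ([3.0], 0), ([4.0], 1)]
cls_idx = 1
p = 0.5

random.seed(42)
samples_to_perturb = []
for i in range(len(dataset)):
    x, y = dataset[i]
    perturb_sample = (cls_idx is None) or (y == cls_idx)
    p_condition = (random.random() <= p) if p < 1.0 else True
    if perturb_sample and p_condition:
        samples_to_perturb.append(i)

# The comprehension keeps the p check first so the RNG is consumed exactly
# once per sample, matching the loop's draws for the same seed.
random.seed(42)
samples_to_perturb_lc = [
    i
    for i, (_, y) in enumerate(dataset)
    if (p >= 1.0 or random.random() <= p) and (cls_idx is None or y == cls_idx)
]
```

Whether Black's multi-line layout of the comprehension then satisfies flake8 depends on the configured line-length limits, so the loop may still be the pragmatic choice here.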

@gumityolcu gumityolcu merged commit d20d9b2 into main Jun 28, 2024
2 checks passed
@gumityolcu
Collaborator Author

This was a great read. Learned a lot about python. When are you publishing this? 🤓

@dilyabareeva
Owner

This was a great read. Learned a lot about python. When are you publishing this? 🤓

I don't know exactly what you mean, @gumityolcu, but I will just assume you are being sarcastic 😄

I wasn't quite done with this yet: we now have Grouped Datasets twice and Subclass Detection twice in main, and there were some conflicts between the new and old versions. Will push an update soon.

@dilyabareeva dilyabareeva deleted the galip_datasets branch June 29, 2024 05:19
dilyabareeva added a commit that referenced this pull request Jun 29, 2024
Resolving left-over merge conflicts from PR #61
@dilyabareeva dilyabareeva mentioned this pull request Jun 29, 2024