WiP: Datasets reworked #61
Conversation
Hi @gumityolcu, great work, left a few comments!
For the future, I think we need to think about logical connections between sample and label transformations. For example, we flip the label with a certain probability if a sample is transformed.
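One possible shape for that coupling, as a rough sketch (the function name and the flip rule below are invented for illustration, not part of this PR):

```python
import random


def coupled_transform(x, y, sample_fn, n_classes, p_flip=0.3):
    """If the sample is perturbed, also flip its label with probability p_flip."""
    x = sample_fn(x)
    if random.random() <= p_flip:
        # reassign to a random class other than the original one
        y = random.choice([c for c in range(n_classes) if c != y])
    return x, y
```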
src/utils/toy_datasets/base.py (Outdated)
p: float = 1.0,
seed: int = 42,
device: str = "cpu",
sample_fn: Optional[Union[Callable, str]] = None,
something for future work: replace str with some Literals
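A minimal sketch of that suggestion, assuming the string option is meant to select from a small set of known transform names (the names below are invented for illustration):

```python
from typing import Callable, Literal, Optional, Union

# Hypothetical accepted names; the real ones would come from whatever
# registry of sample functions the toy datasets define.
SampleFnLiteral = Literal["flip_horizontal", "add_noise", "mask_corner"]


def make_dataset(sample_fn: Optional[Union[Callable, SampleFnLiteral]] = None):
    """Stand-in constructor showing the narrowed annotation."""
    return sample_fn
```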
Hi @gumityolcu, looks good, added a couple of comments here and there.
    subset_idx=None,  # apply with certainty, to all datapoints
    cls_idx=None,
)
self.n_classes = n_classes
I've changed some logic in my current PR. I can incorporate it here.
for i in range(self.__len__()):
    x, y = dataset[i]
    perturb_sample = (self.cls_idx is None) or (y == self.cls_idx)
    p_condition = (random.random() <= self.p) if self.p < 1.0 else True
Do the papers actually require transforming p% of the data, or do they do it this way?
In some cases, we need to poison a random subset, or we need to watermark a random subset of a class. In other cases, the child class does not have a p parameter and passes p=1.0 to the base class.
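For reference on that question, the two options give slightly different guarantees; a sketch of both, with hypothetical helper names:

```python
import random


def bernoulli_subset(indices, p, seed=42):
    """Per-sample approach (as in the loop above): keep each index
    independently with probability p, so the perturbed fraction is p
    only in expectation."""
    rng = random.Random(seed)
    return [i for i in indices if rng.random() <= p]


def exact_fraction_subset(indices, p, seed=42):
    """Alternative: perturb exactly round(p * len(indices)) datapoints."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(indices), round(p * len(indices))))
```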
if self.seed is not None:
    random.seed(self.seed)
self.samples_to_perturb = []
for i in range(self.__len__()):
[nit-picking] List comprehensions are generally faster and nicer-looking than loops.
I disagree that it would look better for this particular case. It's just too many components to put on one line. However, if it is actually faster, then that should probably be the priority, because we are looping over a whole train dataset.
I cannot commit the single-line version: Black (version 24.0.0) reformats it into a form that flake8 (6.0.0) does not accept. I am commenting it out for your hands to solve 🙏🏼
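For what it's worth, a multi-line formulation along these lines should keep both Black and flake8 happy (a sketch of the loop above, untested against the PR's exact code):

```python
# inside __init__, replacing the loop over the dataset:
self.samples_to_perturb = [
    i
    for i in range(self.__len__())
    if ((self.cls_idx is None) or (dataset[i][1] == self.cls_idx))
    and ((random.random() <= self.p) if self.p < 1.0 else True)
]
```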
# Conflicts:
#	src/downstream_tasks/subclass_identification.py
#	src/utils/datasets/group_label_dataset.py
#	tests/utils/test_grouped_label_dataset.py
This was a great read. Learned a lot about Python. When are you publishing this? 🤓
I don't know exactly what you mean, @gumityolcu, but I will just assume that you are being sarcastic 😄 I wasn't quite done with this yet: we now have Grouped Datasets twice and Subclass Detection twice in main, and there were some conflicts between the new and the old versions. Will push an update soon.
Resolving left-over merge conflicts from PR #61
utils.datasets.toy_datasets is created; it currently includes base, label_poisoning, sample_perturbation, and label_grouping.
Base Class
The most general class. One thing needs explaining:
The parameters p and subset_idx together determine which datapoints will be affected by sample_fn and label_fn:
subset_idx: if set to a class index, only datapoints of that class are candidates for perturbation; if None, all datapoints are candidates.
p: determines the probability with which each datapoint filtered by subset_idx will be affected. This is decided during initialization, and if a datapoint is affected, it is always affected.
So, for example, for grouping labels we give subset_idx=None and p=1.0 to the base class (affects all datapoints with certainty).
For label poisoning we give subset_idx=None and p=some value. This affects a randomly selected subset of the training data.
For sample perturbation (changing x in whatever way you want), I left these two open to user choice. For CleverHans, Backdoor, or Shortcut detection, we will give subset_idx = integer (a certain class) and p=some value. This affects a random subset of the images from a single class. The three cases are summarised in the sketch below.
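An illustrative mapping of the three cases onto the base-class arguments (the concrete values 0.3 and class 5 are made up):

```python
# How each use case described above configures the base class.
configs = {
    "label_grouping": {"p": 1.0, "subset_idx": None},    # all datapoints, with certainty
    "label_poisoning": {"p": 0.3, "subset_idx": None},   # random subset of all datapoints
    "shortcut_detection": {"p": 0.3, "subset_idx": 5},   # random subset of a single class
}
```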
No tests, no guarantees, work in progress