
Introduce functionality for chunking and breaking IID experiments #20

Merged
merged 56 commits into main on Jan 8, 2025

Conversation

mcw92
Member

@mcw92 mcw92 commented Dec 9, 2024

This PR introduces functionality for the chunking and breaking IID experiments. In particular, the evaluation has been extended to calculate and save local and global confusion matrices in order to enable calculation of arbitrary metrics for the breaking IID experiments.

The following changes have been made:

  • Make the synthetic data generation consistent throughout the code. This means that in the serial case, the dataset generated with generate_and_distribute_synthetic_dataset without local or global imbalances equals the completely balanced dataset generated with make_classification_dataset when using the same random state. This ensures comparability of the strong scaling experiment series with and without chunking, as the same datasets are created when passing the same random state.
  • Fix passing additional keyword arguments in both train_parallel_on_synthetic_data and train_parallel_on_balanced_synthetic_data; this was completely missing in the former. In addition, the argument parser lacked some of the keyword arguments of sklearn's make_classification and train_test_split, which are used under the hood.
  • Introduce job script generation scripts for both chunking and breaking IID experiments.
  • Add calculation and saving of local and global confusion matrices, including tests.
  • Add evaluation from checkpoints for breaking IID experiments.
  • Refactor train module into train_serial and train_parallel.
  • Remove MacOS from build matrix used for the tests as this caused problems with tests hanging randomly forever for different Python versions in different test runs (see Re-introduce complete build matrix in tests #19).
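The local/global confusion matrix scheme from the list above can be illustrated serially. The following is a hypothetical sketch of the idea, not the specialcouscous implementation: each rank computes a confusion matrix on its shard of the shared test set, and the element-wise sum of all local matrices (an Allreduce in the parallel code) yields the global matrix. Passing the full label set via sklearn's labels= parameter keeps all local matrices the same shape even when a shard is missing some classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical serial illustration: two "ranks", each holding a shard of the
# shared test set. Passing the full label set via `labels=` keeps every local
# matrix the same shape even if a shard is missing some classes entirely
# (here, shard 0 never sees class 2).
n_classes = 3
labels = np.arange(n_classes)

y_true_shards = [np.array([0, 0, 1]), np.array([1, 2, 2])]
y_pred_shards = [np.array([0, 1, 1]), np.array([1, 2, 1])]

local_cms = [
    confusion_matrix(y_t, y_p, labels=labels)
    for y_t, y_p in zip(y_true_shards, y_pred_shards)
]

# In the parallel code, this element-wise sum would be an Allreduce over ranks.
global_cm = np.sum(local_cms, axis=0)

# The summed matrix equals the confusion matrix of the full test set.
reference = confusion_matrix(
    np.concatenate(y_true_shards), np.concatenate(y_pred_shards), labels=labels
)
assert (global_cm == reference).all()
```

Since summation is associative and commutative, the sum of local matrices is exactly the global matrix, so arbitrary metrics can later be derived from the saved matrices without re-running inference.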

The plotting scripts are still kind of messy with many code redundancies. I will fix this in the future. For now, I would like to prioritize the things required to run the experiments. If the PR is too messy, please just tell me 🙈.

Notes to self:

  • sklearn's RandomForestClassifier internally uses weighted voting in its predict() method, i.e., the predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. The predicted class is thus the one with the highest mean probability estimate across the trees. As the DistributedRandomForest class in specialcouscous only implements plain voting, I also implemented plain voting for calculating the local confusion matrices instead of using predict() in order to ensure consistency and comparability.
  • A problem with the confusion matrix might occur when the local data does not contain all classes, e.g., for extremely imbalanced datasets or data partitionings. However, I am not sure about this.
  • As building a globally shared model turned out to be infeasible for most of our use cases / experiments, the functionality for calculating the confusion matrix and also evaluating breaking IID experiments from checkpoints mainly focuses on the case where the global model is not shared but distributed. That is why a shared test set is required in all our experiments.
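As a standalone sketch of the voting difference noted above (an illustration, not code from this PR): plain voting lets each tree cast one unweighted vote, while sklearn's predict() averages the per-tree probability estimates before taking the argmax.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Plain (hard) voting: each tree casts one unweighted vote per sample.
# The individual trees predict encoded class indices, which coincide with
# the labels 0/1 generated by make_classification here.
tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])
votes = np.stack(
    [(tree_preds == c).sum(axis=0) for c in range(forest.n_classes_)]
)
hard_vote_pred = votes.argmax(axis=0)

# sklearn's predict() instead averages the per-tree probability estimates
# (soft voting) and takes the argmax; the two schemes can disagree on
# borderline samples.
soft_vote_pred = forest.predict(X)
```

On clearly separable samples the two schemes usually agree; they diverge when a minority of trees is very confident, which is exactly why mixing predict()-based local matrices with plain-voting global predictions would break comparability.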

@mcw92 mcw92 added the enhancement New feature or request label Dec 9, 2024
@mcw92 mcw92 requested a review from fluegelk December 9, 2024 12:11
@mcw92 mcw92 self-assigned this Dec 9, 2024
@mcw92 mcw92 marked this pull request as ready for review January 7, 2025 07:09
@mcw92 mcw92 linked an issue Jan 7, 2025 that may be closed by this pull request
Contributor

@fluegelk fluegelk left a comment


Looks mostly good to me. I left a few minor comments which should be easy to fix.

I did not go through the plotting scripts in scripts/analysis in detail; a lot of it can probably be deduplicated when we address issue #12.

specialcouscous/train/train_parallel.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
random_state: int | np.random.RandomState = 0,
checkpoint_path: str | pathlib.Path = pathlib.Path("./"),
checkpoint_path: str | pathlib.Path = pathlib.Path("../"),
checkpoint_uid: str = "",
random_state_model: int | None = None,
Contributor

random_state_model necessary in this function (see details in comment above)?

Member Author
See comment above.

specialcouscous/train/train_serial.py Outdated Show resolved Hide resolved
scripts/experiments/generate_breaking_iid_job_scripts.py Outdated Show resolved Hide resolved
scripts/analysis/strong_scaling.py Outdated Show resolved Hide resolved
@mcw92 mcw92 merged commit a38d3c1 into main Jan 8, 2025
4 checks passed
@mcw92 mcw92 deleted the feature/breaking_iid branch January 8, 2025 10:03
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Re-evaluate breaking-IID experiments from checkpoints
3 participants