
Introduce functionality for chunking and breaking IID experiments #20

Merged
merged 56 commits into main on Jan 8, 2025

Conversation

mcw92
Member

@mcw92 mcw92 commented Dec 9, 2024

This PR introduces functionality for the chunking and breaking IID experiments. In particular, the evaluation has been extended to calculate and save local and global confusion matrices in order to enable calculation of arbitrary metrics for the breaking IID experiments.

The following changes have been made:

  • Make the synthetic data generation consistent throughout the code. This means that in the serial case, the dataset generated with generate_and_distribute_synthetic_dataset without local or global imbalances equals the completely balanced dataset generated with make_classification_dataset when using the same random state. This ensures comparability of the strong scaling experiment series with and without chunking, as the same datasets are created when passing the same random state.
  • Fix passing additional keyword arguments in both train_parallel_on_synthetic_data and train_parallel_on_balanced_synthetic_data; this was completely missing in the former. In addition, the argument parser lacked some of the keyword arguments of sklearn's make_classification and train_test_split, which are used under the hood.
  • Introduce job script generation scripts for both chunking and breaking IID experiments.
  • Add calculation and saving of local and global confusion matrices, including tests.
  • Add evaluation from checkpoints for breaking IID experiments.
  • Refactor train module into train_serial and train_parallel.
  • Remove MacOS from build matrix used for the tests as this caused problems with tests hanging randomly forever for different Python versions in different test runs (see Re-introduce complete build matrix in tests #19).
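The local/global confusion matrix scheme from the list above can be illustrated serially. The following is a hypothetical sketch of the idea, not the specialcouscous implementation: each rank computes a confusion matrix on its shard of the shared test set, and the element-wise sum of all local matrices (an Allreduce in the parallel code) yields the global matrix. Passing the full label set via sklearn's labels= parameter keeps all local matrices the same shape even when a shard is missing some classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical serial illustration: two "ranks", each holding a shard of the
# shared test set. Passing the full label set via `labels=` keeps every local
# matrix the same shape even if a shard is missing some classes entirely
# (here, shard 0 never sees class 2).
n_classes = 3
labels = np.arange(n_classes)

y_true_shards = [np.array([0, 0, 1]), np.array([1, 2, 2])]
y_pred_shards = [np.array([0, 1, 1]), np.array([1, 2, 1])]

local_cms = [
    confusion_matrix(y_t, y_p, labels=labels)
    for y_t, y_p in zip(y_true_shards, y_pred_shards)
]

# In the parallel code, this element-wise sum would be an Allreduce over ranks.
global_cm = np.sum(local_cms, axis=0)

# The summed matrix equals the confusion matrix of the full test set.
reference = confusion_matrix(
    np.concatenate(y_true_shards), np.concatenate(y_pred_shards), labels=labels
)
assert (global_cm == reference).all()
```

Since summation is associative and commutative, the sum of local matrices is exactly the global matrix, so arbitrary metrics can later be derived from the saved matrices without re-running inference.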

The plotting scripts are still kind of messy with many code redundancies. I will fix this in the future. For now, I would like to prioritize the things required to run the experiments. If the PR is too messy, please just tell me 🙈.

Notes to self:

  • sklearn's RandomForestClassifier internally uses weighted voting in its predict() method, i.e., the predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. The predicted class is thus the one with the highest mean probability estimate across the trees. As the DistributedRandomForest class in specialcouscous only implements plain voting, I also implemented plain voting for calculating the local confusion matrices instead of using predict() in order to ensure consistency and comparability.
  • A problem with the confusion matrix might occur when the local data does not contain all classes, e.g., for extremely imbalanced datasets or data partitionings. However, I am not sure about this.
  • As building a globally shared model turned out to be infeasible for most of our use cases / experiments, the functionality for calculating the confusion matrix and also evaluating breaking IID experiments from checkpoints mainly focuses on the case where the global model is not shared but distributed. That is why a shared test set is required in all our experiments.
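As a standalone sketch of the voting difference noted above (an illustration, not code from this PR): plain voting lets each tree cast one unweighted vote, while sklearn's predict() averages the per-tree probability estimates before taking the argmax.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Plain (hard) voting: each tree casts one unweighted vote per sample.
# The individual trees predict encoded class indices, which coincide with
# the labels 0/1 generated by make_classification here.
tree_preds = np.array([tree.predict(X) for tree in forest.estimators_])
votes = np.stack(
    [(tree_preds == c).sum(axis=0) for c in range(forest.n_classes_)]
)
hard_vote_pred = votes.argmax(axis=0)

# sklearn's predict() instead averages the per-tree probability estimates
# (soft voting) and takes the argmax; the two schemes can disagree on
# borderline samples.
soft_vote_pred = forest.predict(X)
```

On clearly separable samples the two schemes usually agree; they diverge when a minority of trees is very confident, which is exactly why mixing predict()-based local matrices with plain-voting global predictions would break comparability.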

@mcw92 mcw92 added the enhancement New feature or request label Dec 9, 2024
@mcw92 mcw92 requested a review from fluegelk December 9, 2024 12:11
@mcw92 mcw92 self-assigned this Dec 9, 2024
@mcw92 mcw92 marked this pull request as ready for review January 7, 2025 07:09
@mcw92 mcw92 linked an issue Jan 7, 2025 that may be closed by this pull request
Contributor

@fluegelk fluegelk left a comment


Looks mostly good to me. I left a few minor comments which should be easy to fix.

I did not go through the plotting scripts in scripts/analysis in detail; a lot of it can probably be deduplicated when we address issue #12.

specialcouscous/train/train_parallel.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
specialcouscous/synthetic_classification_data.py Outdated Show resolved Hide resolved
random_state: int | np.random.RandomState = 0,
checkpoint_path: str | pathlib.Path = pathlib.Path("./"),
checkpoint_path: str | pathlib.Path = pathlib.Path("../"),
checkpoint_uid: str = "",
random_state_model: int | None = None,
Contributor

random_state_model necessary in this function (see details in comment above)?

Member Author
See comment above.

specialcouscous/train/train_serial.py Outdated Show resolved Hide resolved
scripts/experiments/generate_breaking_iid_job_scripts.py Outdated Show resolved Hide resolved
scripts/analysis/strong_scaling.py Outdated Show resolved Hide resolved
@mcw92 mcw92 merged commit a38d3c1 into main Jan 8, 2025
4 checks passed
@mcw92 mcw92 deleted the feature/breaking_iid branch January 8, 2025 10:03
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Re-evaluate breaking-IID experiments from checkpoints
3 participants