Batched metrics #351

Conversation

davor10105 (Contributor)

Description

  • Implementing true batch processing for the available metrics
  • This follows the previously raised issue.
  • The improvement from batched processing seems to hover around a 12.6x speed-up on CPU (see image below):

[image: speed-up comparison plot]

  • Further testing needs to be done on GPU (I do not have access to one at the moment :/)

Implemented changes

  • The following changes were made:
    • Removed the evaluate_instance method from the PixelFlipping, Monotonicity, MonotonicityCorrelation, FaithfulnessCorrelation and FaithfulnessEstimate classes and replaced the existing evaluate_batch methods with their "true" batch implementations
    • Added a batched parameter to the correlation_spearman, correlation_pearson and correlation_kendall_tau similarity functions to support batch processing, added a batched parameter to get_baseline_dict to support batched baseline creation, and similarly added the same parameter to calculate_auc

Implementation validity

  • To verify that the batched implementation and the for-loop implementation return the same (or close) results, each metric was invoked over the same sample 30 times (using different seeds) with both the batched and the for-loop (unbatched) implementation. First, a simple np.allclose check was made, and PixelFlipping, Monotonicity and FaithfulnessEstimate were verified as valid. MonotonicityCorrelation and FaithfulnessCorrelation did not pass this test, as they include stochastic elements in their calculations. To verify their validity, a two-sample t-test was applied over the 30 runs for each sample of the respective implementations (a sketch of this check appears after this list). The resulting p-values can be seen below:
pixel_flipping is VALID (all close)
monotonicity is VALID (all close)
monotonicity_correlation is INVALID (t-test) (p > 0.05 elements: 91.67%)
p-values
 [0.04856062        inf 0.71486889 0.15810997 0.526356   0.74277285
 0.07734966        inf 0.79329449 0.25064761 0.63218201 0.04654986
        inf        inf 0.84467497 0.11548366 0.27559421 0.29898369
 0.65902698        inf 0.30107825 0.6182171  0.9439362  0.26287838]
faithfulness_estimate is VALID (all close)
faithfulness_correlation is INVALID (t-test) (p > 0.05 elements: 83.33%)
p-values
 [0.20491799 0.01031497 0.83503175 0.2578373  0.30944439 0.87939701
 0.47807545 0.05502236 0.7996986  0.02895154 0.02735908 0.90580175
 0.14312554 0.25870476 0.09197358 0.08645692 0.50516479 0.68482126
 0.09873465 0.64258948 0.35872379 0.94694292 0.03611762 0.78976033]
  • The mean metric values for 92% and 83% of the batch samples showed no statistically significant difference between the two approaches, respectively. The reason for the lack of a 100% match is unclear. To further investigate, I conducted an additional experiment comparing the scores from two runs (each repeated 30 times) of the old loop-style implementation. Despite being identical implementations, the t-test revealed instances where the means of the two groups differed, specifically in the case of FaithfulnessCorrelation (see results below):
monotonicity_correlation is VALID (t-test)
faithfulness_correlation is INVALID (t-test) (p > 0.05 elements: 91.67%)
p-values
 [0.81952674 0.22507615 0.33738925 0.90864344 0.81490267 0.20901662
 0.81953659 0.17859758 0.62796515 0.04876216 0.13012812 0.58632642
 0.84153836 0.17535798 0.52076956 0.84998221 0.57439399 0.1188701
 0.53963641 0.77368691 0.68040962 0.9938426  0.00461874 0.07818067]
  • So the reason might just be that this metric is very unstable for some examples, but I am not sure. I do not think my batched implementation is wrong, since it does produce results that can be statistically verified in the majority of cases, which would be difficult to do "by chance".
  • Validation and visualization scripts are attached here - testing_utils.zip. The batched_tests directory contains the batch_implementation_verification.py script, which runs the validation tests using a copy of the repo (in the quantus directory, also contained within the zip file) that has both the batched and the old implementation versions. Results of the runs mentioned above can be found in results.pickle. This file is used by test_visualization.py to display the box visualization and check the validity of the batch implementation as described above.
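
For concreteness, a minimal sketch of the kind of check described above, assuming score arrays of shape (n_runs, batch_size) collected from both implementations (the function and variable names here are illustrative, not the actual script in testing_utils.zip):

import numpy as np
from scipy import stats

def compare_implementations(scores_batched, scores_loop, alpha=0.05, atol=1e-8):
    """Compare metric scores of shape (n_runs, batch_size) from the two implementations."""
    # Deterministic metrics should match directly.
    if np.allclose(scores_batched, scores_loop, atol=atol):
        return "VALID (all close)", None
    # Stochastic metrics: per-sample two-sample t-test across the repeated runs.
    p_values = np.array([
        stats.ttest_ind(scores_batched[:, i], scores_loop[:, i]).pvalue
        for i in range(scores_batched.shape[1])
    ])
    frac = np.mean(p_values > alpha)
    verdict = "VALID (t-test)" if frac == 1.0 else f"INVALID (t-test) (p > {alpha} elements: {frac:.2%})"
    return verdict, p_values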

Minimum acceptance criteria

  • Implementing batch processing for all other metrics and supporting functions
    @annahedstroem

@@ -1015,6 +1034,8 @@ def calculate_auc(values: np.array, dx: int = 1):
np.ndarray
Definite integral of values.
"""
if batched:
return np.trapz(values, dx=dx, axis=1)
return np.trapz(np.array(values), dx=dx)
Collaborator:

I believe this could be simplified to something like:

axis = 1 if batched else None
return np.trapz(np.asarray(values), dx=dx, axis=axis)

Contributor Author:

Simplified in the latest commit.
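
For reference, the simplified helper presumably ends up along the lines of the sketch below (not the exact committed code; note that np.trapz does not accept axis=None, so the unbatched 1-D case falls back to the default last axis):

import numpy as np

def calculate_auc(values: np.ndarray, dx: int = 1, batched: bool = False) -> np.ndarray:
    """Definite integral of values via the trapezoidal rule.

    When batched, values has shape (batch_size, n_steps) and each sample is
    integrated along axis 1; otherwise a single 1-D curve is integrated.
    """
    axis = 1 if batched else -1  # -1 covers the 1-D case, since np.trapz has no axis=None
    return np.trapz(np.asarray(values), dx=dx, axis=axis)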

# Indices
indices = params["indices"]

if isinstance(expected, (int, float)):
Collaborator:

The expected value is provided by the pytest.mark.parametrize, and its type is known beforehand. Why do we need this check?

@@ -30,6 +30,11 @@ def input_zeros_2d_3ch_flattened():
return np.zeros(shape=(3, 224, 224)).flatten()


@pytest.fixture
Collaborator:

Is this fixture used only in one place?
If that's the case, please inline it.

x_batch_shape = x_batch.shape
for perturbation_step_index in range(n_perturbations):
# Perturb input by indices of attributions.
a_ix = a_indices[
Collaborator:

a_ix is an array with shape (batch_size, n_features*n_perturbations), right?
I'd suggest we create a view with shape (batch_size, n_features, n_perturbations).
Then we can index each step with [...,perturbation_step_index] instead of manually calculating offsets into the array

Contributor Author:

a_indices is an array with shape (batch_size, n_features) and the resulting a_ixs are of shape (batch_size, self.features_in_step). I believe calculating the offsets manually here is the only option.
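
To make the offset logic concrete, a small sketch of the per-step slicing over a_indices of shape (batch_size, n_features) (the shapes and names are assumptions for illustration, not the exact PR code):

import numpy as np

batch_size, n_features, features_in_step = 4, 16, 4
rng = np.random.default_rng(0)
# Feature indices sorted by descending attribution, one row per sample.
a_indices = np.argsort(-rng.random((batch_size, n_features)), axis=1)

n_perturbations = n_features // features_in_step
for perturbation_step_index in range(n_perturbations):
    # Slice the next features_in_step indices for every sample in the batch.
    a_ix = a_indices[
        :,
        perturbation_step_index * features_in_step:
        (perturbation_step_index + 1) * features_in_step,
    ]
    # a_ix has shape (batch_size, features_in_step) and indexes the flattened inputs.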

@@ -118,6 +118,58 @@ def baseline_replacement_by_indices(
return arr_perturbed


def batch_baseline_replacement_by_indices(
Collaborator:

import numpy.typing as npt

def batch_baseline_replacement_by_indices(
    arr: np.ndarray,
    indices: np.ndarray,
    perturb_baseline: npt.ArrayLike,
    **kwargs,
) -> np.ndarray:
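
For readers without the diff open, a minimal sketch of what such a batched replacement can look like, assuming flattened inputs (this is an illustration, not the exact function in the PR):

import numpy as np
import numpy.typing as npt

def batch_baseline_replacement_by_indices(
    arr: np.ndarray,                  # shape (batch_size, n_features), flattened inputs
    indices: np.ndarray,              # shape (batch_size, k), per-sample indices to replace
    perturb_baseline: npt.ArrayLike,  # scalar or array broadcastable to indices
    **kwargs,
) -> np.ndarray:
    arr_perturbed = arr.copy()
    # Replace the selected features of every sample with the baseline value(s).
    np.put_along_axis(arr_perturbed, indices, np.asarray(perturb_baseline), axis=1)
    return arr_perturbed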


# Predict on input.
x_input = model.shape_input(
x_batch, x_batch.shape, channel_first=True, batched=True
Collaborator:

afaik channel_first is a model parameter, so we should not hardcode it. @annahedstroem could you please help us with that one 🙃

Contributor Author:

This was hardcoded in the original implementation as well. Is that a bug?

# Randomly mask by subset size.
a_ix = np.stack(
[
np.random.choice(n_features, self.subset_size, replace=False)
Collaborator:

Should we maybe add a fixed PRNG seed for reproducibility?
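
For illustration, a seeded generator could be threaded through the sampling, e.g. (a sketch with assumed variable values, not the PR's code):

import numpy as np

batch_size, n_features, subset_size = 8, 224 * 224, 128

rng = np.random.default_rng(seed=42)  # a fixed seed makes the random mask reproducible
a_ix = np.stack(
    [rng.choice(n_features, size=subset_size, replace=False) for _ in range(batch_size)],
    axis=0,
)  # shape (batch_size, subset_size)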

pred_deltas = np.stack(pred_deltas, axis=1)
att_sums = np.stack(att_sums, axis=1)

similarity = self.similarity_func(a=att_sums, b=pred_deltas, batched=True)
Collaborator:

Isn't batch_baseline_replacement_by_indices always batched? Why do we need the batched=True argument?

@annahedstroem have you ever used a different similarity_func here?

Contributor Author:

Here, the batched=True argument goes into a similarity function (for example correlation_pearson), not into batch_baseline_replacement_by_indices. Similarity functions can be either batched or unbatched (at the moment at least), so this argument is needed here.
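
As an illustration of that dual behaviour, a batched-capable Pearson correlation might look roughly like the sketch below (the PR's actual helper may differ; the manual row-wise formula avoids relying on scipy's axis support, which older scipy versions lack):

import numpy as np
import scipy.stats

def correlation_pearson(a: np.ndarray, b: np.ndarray, batched: bool = False, **kwargs):
    """Pearson correlation; computed row-wise when batched (a, b of shape (batch_size, n))."""
    if batched:
        assert a.ndim == 2 and b.ndim == 2, "Batched arrays must be 2D"
        a_c = a - a.mean(axis=1, keepdims=True)
        b_c = b - b.mean(axis=1, keepdims=True)
        num = (a_c * b_c).sum(axis=1)
        den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
        return num / den
    return scipy.stats.pearsonr(a, b)[0]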

return_shape=(
batch_size,
n_features,
), # TODO. Double-check this over using = (1,).
Collaborator:

this TODO would need a bit more detail

Contributor Author:

This is a relic of the past implementation that I accidentally left in the new one as well. I have deleted the TODO in the latest commit.

if batched:
assert len(a.shape) == 2 and len(b.shape) == 2, "Batched arrays must be 2D"
# No support for axis currently, so just iterating over the batch
return np.array([scipy.stats.kendalltau(a_i, b_i)[0] for a_i, b_i in zip(a, b)])
Collaborator:

Maybe we could use np.vectorize (https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) for this one?

Member:

It could be used, but I also like the simplicity of @davor10105's suggestion!
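
For completeness, the np.vectorize variant being referred to could look like this (a sketch; np.vectorize still loops in Python under the hood, so it mainly trades the explicit comprehension for a gufunc-style signature):

import numpy as np
import scipy.stats

# Two 1-D rows in, one scalar correlation out.
_kendall_rowwise = np.vectorize(
    lambda a_i, b_i: scipy.stats.kendalltau(a_i, b_i)[0],
    signature="(n),(n)->()",
)

def correlation_kendall_tau_batched(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    assert a.ndim == 2 and b.ndim == 2, "Batched arrays must be 2D"
    return _kendall_rowwise(a, b)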

@codecov-commenter

Codecov Report

Attention: Patch coverage is 95.88235% with 7 lines in your changes missing coverage. Please review.

Project coverage is 91.29%. Comparing base (6857561) to head (4f44510).
Report is 16 commits behind head on main.

Files Patch % Lines
quantus/helpers/utils.py 72.72% 3 Missing ⚠️
...s/metrics/faithfulness/faithfulness_correlation.py 96.15% 1 Missing ⚠️
...ntus/metrics/faithfulness/faithfulness_estimate.py 96.55% 1 Missing ⚠️
quantus/metrics/faithfulness/monotonicity.py 95.45% 1 Missing ⚠️
quantus/metrics/faithfulness/pixel_flipping.py 96.15% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
+ Coverage   91.15%   91.29%   +0.13%     
==========================================
  Files          66       66              
  Lines        3925     4010      +85     
==========================================
+ Hits         3578     3661      +83     
- Misses        347      349       +2     


@@ -14,7 +14,9 @@
import skimage


def correlation_spearman(a: np.array, b: np.array, **kwargs) -> float:
def correlation_spearman(
Member:

super!

annahedstroem (Member) left a comment

Really great work @davor10105, looking forward to our chat.

@@ -139,7 +139,7 @@ def __init__(

# Save metric-specific attributes.
if perturb_func is None:
perturb_func = baseline_replacement_by_indices
Member:

Let's discuss: where in the code should we make it explicit to the user that they can no longer use any perturb function other than batch_baseline_replacement_by_indices.

@annahedstroem (Member)

Do we know why most of the python checks are failing? Thanks

@davor10105 (Contributor Author)

Do we know why most of the python checks are failing? Thanks

It seems that the installed versions of scipy on Python 3.8 and 3.9 do not have the axis parameter in the pearsonr correlation. As for 3.11, I am not sure; the test fails early on fixture loading, and I will look into it.

@davor10105 (Contributor Author)

@leanderweber Hey Leander, I have been working on implementing batched versions of all the metrics present in Quantus and have encountered two questions that @annahedstroem said you might be able to answer:

  1. The first question pertains to the perturbation patch filtering done in the region perturbation and selectivity metrics (link to relevant lines of the region perturbation metric). I am wondering why the patching is done by convolving over the image with a stride of 1, which entails overlapping patches, instead of the patches being non-overlapping to begin with (using a step of size patch_size). The way it is done right now, there might be a situation where two patches with high attributions are spaced at most patch_size - 1 apart; in that case, the gap between them (which might be highly relevant as well) will never get perturbed, since it is discarded by this filtering process, as every patch within this gap overlaps the already selected perturbation area in at least one pixel (see image below, hopefully it makes this a bit clearer).
    [image: illustration of the gap between two selected patches that never gets perturbed]
    I've tried looking through the relevant papers and found nothing regarding this filtering process, but I might just be missing something here. I have only found this visualization in the Montavon paper for selectivity, which suggests that the patching was done in this non-overlapping fashion, as all 8 patches are arranged on a grid, but I cannot be sure:
    [image: visualization from the Montavon et al. selectivity paper showing patches arranged on a grid]
  2. The second question is probably related to the one above: why is the padding size in all metrics that utilize patches set to patch_size - 1? For example, if the patch size is 5 and the image is 24 x 24, it would make more sense to use a padding of just 1, padding the image to a size divisible by patch_size (a small sketch of this arithmetic follows at the end of this message).

I appreciate your time and look forward to your clarification on these points.
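
A small sketch of the minimal-padding arithmetic mentioned in point 2 (the helper name and body are illustrative; the PR later adds a get_padding_size utility for this):

def minimal_padding(dim: int, patch_size: int) -> int:
    """Smallest padding that makes dim divisible by patch_size."""
    return (-dim) % patch_size

# e.g. a 24 x 24 image with patch_size = 5 needs a total padding of 1 (to 25),
# rather than the patch_size - 1 = 4 used previously.
assert minimal_padding(24, 5) == 1
assert minimal_padding(25, 5) == 0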

@leanderweber (Collaborator) commented Oct 9, 2024

Hi @davor10105,
thank you for working on the batched metrics!

Regarding your questions above:

  1. Filtering of overlapping patches.
    The reason for the current implementation is to avoid fitting a fixed grid to the image (i.e., using a step of size patch_size), which would limit the possible locations of the patches, especially during the first few perturbation steps. With a fixed grid, the first few removed patches may not correspond to the ones with the highest attribution sum among all possible patches and may thus confound the resulting curves.
    However, you are right, in the current implementation there may be gaps that are never considered, which is also not really desired behavior. I am not sure there is a good solution for this. Off the top of my head, there are three possibilities:
    (1) Not removing overlapping patches. This will cause a lot of additional runtime though.
    (2) Instead of removing overlapping patches, shrinking remaining patches by the overlap and recomputing attribution sums. However, I am not sure if a varying patch size like this is desired.
    (3) Using a grid, and maybe aligning it to the most relevant patch of all possible patches? This will just move the grid issue to the remaining patches though.
    Out of all of these, I think I would prefer option (2), but maybe we can brainstorm a better solution?

  2. Padding.
    I think just padding the image to the size divisible by patch_size, as you suggested, should work as well!

I hope I could clarify these points :)

@davor10105 (Contributor Author)

Hey @leanderweber,
Thank you for your clarifications and suggestions!

I understand your concerns, but I don’t see a significant issue with using a fixed grid. While it’s true that the "highest attribution" patch may not perfectly align with the grid, the concept of a patch is something we've defined arbitrarily. In my view, maintaining a consistent patch structure across the dataset and methods is essential to ensure equal testing conditions for all attribution methods.

If we adopt option (1) or (2), each perturbation step could potentially affect a different number of pixels, making some steps less impactful than others. Additionally, if a user specifies a certain number of patches, they won’t know what percentage of the input will ultimately be perturbed. This could lead to significant variations in the final perturbed area across images, complicating result comparison.

I do agree with your point about option (3); it wouldn’t make much of a difference.

Looking at the paper again, there is mention of a predefined grid being used.
[image: excerpt from the paper mentioning a predefined grid]
There is also another repo implementing the metric here, and a fixed grid is used there as well.

Looking forward to your opinion on this!

@annahedstroem (Member)

@davor10105 I think your solution is sound and would be very happy with your suggested update. @leanderweber let me know if you object!

@leanderweber, if you have any time today or tomorrow for a general view of this PR, I would be grateful for a second pair of eyes!

otherwise, I'll try to go for a merge tomorrow :)

annahedstroem (Member) left a comment

Ready for merge.

Just a final question @davor10105, where are the test results for the remaining faithfulness metrics (they are not included in the METRICS_PAIRS in the testing_utils/batched_tests/batch_implementation_verification.py)?
I don't find them in the results.pickle file, let me know!


@leanderweber (Collaborator) commented Oct 14, 2024

Hi @davor10105 @annahedstroem,

sorry for the late reply!
I think this issue comes down to the interpretation of the metric and the goals for implementing it. A grid is efficient and easy to implement, but I would expect quite a bit of sensitivity to the chosen patch size, as well as large variation between attribution methods and datasets, resulting in quite unequal testing conditions. The reason, as I understand the metric, is as follows:

Region Perturbation implicitly puts more emphasis on the patches that are removed earlier. I.e., for a faithful attribution method and MORF order, there is this assumption (1) that the first removed patches will lead to the largest change in model output, since measured attribution faithfulness is related to the resulting curve. At the same time, we also assume (2) that more relevance = more change in model output, so if the attribution is faithful, a patch with a larger sum of relevance should lead to a larger change.
Now with a grid, the patch borders are arbitrarily chosen. That is, there is no guarantee that the patch(es) with the largest sum of relevance are even included in the set of grid patches. As a consequence, other, less relevant, patches may be chosen first for removal. Worst case, this could flip the resulting curve completely, see the (exaggerated) example below.

[image: regionperturbation_grid_failure — exaggerated example of a fixed grid flipping the resulting curve]

However, as you stated, there are several drawbacks to the current implementation as well. After thinking about it, the "correct" way to implement this may be to not remove overlapping patches, and instead consider all possible patches. Potentially recomputing attribution sums after each patch removal? Not sure about that last one.
In any case, if the goal is to remove all patches instead of a fixed number, this way of implementing it will result in a horribly slow runtime.

Maybe we could evaluate how much variability across patch sizes and datasets is introduced in practice when using a grid? E.g., using MetaQuantus? We can also set up a meeting to discuss this in the discord, if you want.

@davor10105 (Contributor Author)

Hey @annahedstroem, sorry for the delay; here is the updated speed-up visualization (average speed-up of approximately 25x):
[image: updated speed-up comparison plot]
The code and results.pickle to reproduce the results above can be found here - testing_utils.zip

Moving onto the actual results, here is the output of the implementation validation procedure:

pixel_flipping is VALID (all close)

monotonicity is VALID (all close)

monotonicity_correlation is INVALID (t-test) (p > 0.05 elements: 90.62%)
p-values
 [0.46592632        inf 0.01480677 0.22114211 0.68538209 0.25543686
 0.72118409        inf 0.86727984 0.05209713 0.08634006 0.12402893
        inf        inf 0.06867183 0.34054199 0.43720713 0.44546378
 0.9270527  0.84403708 0.84130063 0.00290434 0.6873556  0.6398928
 0.04232199 0.43276764 0.9286894         inf 0.30899273 0.40053449
 0.87341356        inf]

faithfulness_estimate is VALID (all close)

faithfulness_correlation is INVALID (t-test) (p > 0.05 elements: 96.88%)
p-values
 [0.34509066 0.93410044 0.85030513 0.12177978 0.79621781 0.4108037
 0.91519351 0.53137183 0.61857571 0.18013535 0.74150682 0.63116584
 0.01188433 0.97485647 0.40405789 0.52355945 0.35866354 0.8957526
 0.98630125 0.06767416 0.48845927 0.77983595 0.88528296 0.63985627
 0.74551271 0.34419242 0.67536439 0.88045068 0.11668878 0.15323135
 0.88838945 0.27389627]

skipping infidelity

irof is INVALID (t-test) (p > 0.05 elements: 0.0%)
p-values
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]

skipping region_perturbation

road is INVALID (t-test) (p > 0.05 elements: 90.62%)
p-values
 [0.15371222 0.52671989 0.37387944 0.0098352  0.13315418 1.
 0.46552173 1.         0.09965008 0.7519292  0.56141593 0.14530726
 2.         1.         0.69171128 0.05489559 1.         0.40442216
 0.32146438 0.53376254 1.         0.72323525 0.40347314 0.14844789
 0.49881312 0.19038966 1.                nan 0.60546431 0.0027649
 0.60546431 1.        ]

skipping selectivity

sensitivity_n is VALID (all close)

sufficiency is VALID (all close)

pixel_flipping, monotonicity, faithfulness_estimate, sensitivity_n and sufficiency are all valid.

region_perturbation, selectivity and infidelity were skipped during the validation process. region_perturbation and selectivity were skipped due to the dynamic / fixed grid discussion, so their results are expected to differ from the old implementation. Regarding infidelity, I forgot to mention it on our last call, but I have found a potential issue with the old implementation. Looking at the paper, infidelity is defined as follows (focusing on the multiplication of the difference to the baseline with the explanation; notice that both are 1D arrays):
[image: infidelity definition from the paper]
But if you look at the current implementation, you will notice an np.dot call with two 3D tensors passed to the method here, which performs batched matrix multiplication instead and yields incorrect results (see the sketch below). I have resolved this in my implementation and thus skipped the validation for infidelity, since the results differ a lot.
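
For 3-D arrays, np.dot contracts the last axis of the first tensor with the second-to-last axis of the second, pairing samples across the batch rather than computing the per-sample element-wise product summed over features that the definition calls for. A hedged illustration with assumed shapes (not the actual tensors from the old implementation):

import numpy as np

rng = np.random.default_rng(0)
baseline_diff = rng.normal(size=(8, 1, 196))  # assumed shape: (batch, 1, n_features)
explanation = rng.normal(size=(8, 1, 196))

# np.dot over two 3-D tensors: result has shape (8, 1, 8, 1), mixing samples across the batch.
wrong = np.dot(baseline_diff, explanation.transpose(0, 2, 1))

# What the formula requires: a per-sample inner product over the feature axis.
right = np.sum(baseline_diff * explanation, axis=-1)  # shape (8, 1)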

Regarding the remaining metrics, both monotonicity_correlation and faithfulness_correlation produced results similar to the initial validation run, despite increasing the number of samples from 10 to 20, and I have still not found any reason to believe that something is wrong with the implementation. For the road metric, around 90% of the means are consistent with the previous implementation. irof is more puzzling: it does not involve stochasticity, yet the results differ from the old implementation. I verified that identical segmentation masks are being used in both implementations, so the source of the discrepancy is unclear. While the results for the 32 examples are generally close between the two implementations, there are still noticeable differences:

batched
(array([62.33, 59.36, 55.95, 74.21, 70.45, 64.07, 71.58, 56.16, 69.05,
        60.83, 56.82, 58.46, 52.69, 56.74, 70.22, 63.09, 56.95, 73.75,
        68.86, 69.74, 63.04, 72.56, 72.54, 71.71, 65.84, 73.52, 68.38,
        47.94, 52.36, 69.33, 59.6 , 67.94]),

unbatched
 array([61.57, 57.4 , 56.38, 74.86, 70.66, 61.37, 71.02, 52.1 , 66.04,
        61.47, 54.75, 56.98, 50.48, 56.14, 67.64, 63.08, 57.31, 73.22,
        72.3 , 69.25, 62.86, 69.5 , 72.08, 72.81, 65.34, 72.21, 71.32,
        48.59, 55.04, 70.29, 59.57, 64.83]))

I would appreciate any guidance you have on this. If you have any additional questions, please ask!

Finally, @leanderweber, I agree that the arbitrary choice of a particular fixed grid changes the resulting curve, but it does so for every evaluated attribution method. In my opinion, the goal of the metric is not to provide a particular "optimal" score, but a score obtained on a level playing field between different attribution methods so that they can be fairly compared. In other words, the score of a single attribution method does not really matter; what matters is the ranking between the scores of a set of attribution methods. For the scores to be comparable, the metric has to perform consistent steps, and that seems tricky to do if the grid is not fixed. Furthermore, a better attribution map should outperform a worse one, no matter the actual underlying grid.
If you think that this is something that could be assessed by using MetaQuantus, I would love to hear more of your thoughts (discord meeting sounds good)!

@annahedstroem (Member) commented Oct 17, 2024

Thanks @davor10105 for this elaborate discussion on the remaining results!

First, the discrepancy in the IROF method could possibly be due to the underlying segmentation methods felzenszwalb and slic, which both use Gaussian smoothing and can thus introduce stochasticity, so I would accept this discrepancy. Similar reasoning applies to ROAD, which relies on Gaussian noise, so we accept that discrepancy as well. The same logic applies to FaithfulnessCorrelation: I am not surprised by its varied results, as it includes random sampling. Also, if MonotonicityCorrelation uses "uniform" or "random" as a perturbation type, this can explain the discrepancy as well. Please check what the perturb_baseline choice is; then we will accept these changes too.

Second, is there a possibility to keep both alternatives for patching logic for RegionPerturbation so that the user can choose a method they prefer ["patch_by_size", "patch_by_magnitude"] or something? Is this an agreeable solution? @leanderweber @davor10105

(Thanks also @leanderweber for all your input on the reasoning/thought behind the implementation! Very appreciated.)

Also, thank you for highlighting the bug in the Infidelity metric and for fixing it!

As a final request before merging, can you make a short list of changes, separated by:

  • Batch ...
  • Bug fixes ... <briefly list the changes, e.g. Infidelity>
  • Misc ... <briefly list the utils changes, e.g. choice of patching in RegionPerturbation>

So that we can add it to the release notes and thus track back any discrepancies to that?

Million thanks again @davor10105, really awesome work!

@davor10105 (Contributor Author)

@annahedstroem Thank you for the feedback! I agree that giving users the option to choose between different patching procedures is a solid approach, and I'll make sure to incorporate it in an upcoming commit.

Here’s the requested list of changes:

Batch

Introduced batched processing to the following metrics:

  • FaithfulnessCorrelation
  • FaithfulnessEstimate
  • Infidelity
  • IROF
  • Monotonicity
  • MonotonicityCorrelation
  • PixelFlipping
  • RegionPerturbation
  • ROAD
  • Selectivity
  • SensitivityN
  • Sufficiency

Bug fixes

  • Resolved a bug in the Infidelity metric where np.dot was incorrectly applied to two 3D tensors, causing batched matrix multiplication instead of the intended element-wise multiplication, as defined by the metric.

Misc

  • Support for batched processing to loss_func (mse), norm_func (fro_norm, l2_norm, linf_norm), similarity_func (correlation_spearman, correlation_pearson, correlation_kendall_tau, distance_euclidean, distance_manhattan, distance_chebyshev, ssim)
  • Batched perturbation by indices (batch_baseline_replacement_by_indices) and batched perturbation by binary mask (baseline_replacement_by_mask)
  • get_baseline_dict support for batched inputs
  • Changed the reshaping logic (shape_input) for batched inputs
  • Added get_block_indices to utils, which splits the input image into non-overlapping patches of a certain size, used in RegionPerturbation, Selectivity and Infidelity (a rough sketch of this patching follows below)
  • Changed the padding size from patch_size - 1 to the minimum required to split the image into same-sized non-overlapping patches (get_padding_size)
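
As a rough illustration of the non-overlapping patching mentioned above, for a single-channel 2-D case (the real get_block_indices signature in the PR may differ; this is only a sketch):

import numpy as np

def block_indices_2d(height: int, width: int, patch_size: int) -> np.ndarray:
    """Flattened pixel indices of each non-overlapping patch of an (already padded)
    height x width image; returns shape (n_patches, patch_size * patch_size)."""
    assert height % patch_size == 0 and width % patch_size == 0
    rows = np.arange(height).reshape(height // patch_size, patch_size)
    cols = np.arange(width).reshape(width // patch_size, patch_size)
    blocks = []
    for r in rows:
        for c in cols:
            rr, cc = np.meshgrid(r, c, indexing="ij")
            blocks.append((rr * width + cc).ravel())
    return np.stack(blocks, axis=0)

# e.g. a padded 25 x 25 image with patch_size = 5 yields 25 patches of 25 pixels each.
assert block_indices_2d(25, 25, 5).shape == (25, 25)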

annahedstroem (Member) merged commit 53f97c2 into understandable-machine-intelligence-lab:main on Nov 9, 2024 (6 checks passed).