Batched metrics #351
Conversation
quantus/helpers/utils.py
Outdated
@@ -1015,6 +1034,8 @@ def calculate_auc(values: np.array, dx: int = 1):
    np.ndarray
        Definite integral of values.
    """
    if batched:
        return np.trapz(values, dx=dx, axis=1)
    return np.trapz(np.array(values), dx=dx)
I believe this could be simplified to something like:
axis = 1 if batched else None
return np.trapz(np.asarray(values), dx=dx, axis=axis)
Simplified in the latest commit.
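For reference, a minimal sketch of the simplified helper, assuming the `batched` keyword and signature shown in the diff; note that `np.trapz` expects an integer axis, so the sketch falls back to the default last axis for the unbatched 1-D case:

```python
import numpy as np

def calculate_auc(values: np.ndarray, dx: int = 1, batched: bool = False) -> np.ndarray:
    # Sketch only: integrate along the per-sample axis when batched,
    # otherwise along the last (and only) axis of a single 1-D array of scores.
    axis = 1 if batched else -1
    return np.trapz(np.asarray(values), dx=dx, axis=axis)
```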
# Indices
indices = params["indices"]

if isinstance(expected, (int, float)):
The `expected` value is provided by the `pytest.mark.parametrize`, and its type is known beforehand. Why do we need this check?
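To illustrate the reviewer's point, here is a small, purely hypothetical parametrized test (names and values invented): each parametrized case fixes the type of `expected`, so the branch taken by the `isinstance` check is already determined by the test data.

```python
import numpy as np
import pytest

@pytest.mark.parametrize("expected", [0.5, np.array([0.25, 0.75])])
def test_example(expected):
    # The type of `expected` is known per test case from the parametrize decorator.
    if isinstance(expected, (int, float)):
        assert expected == pytest.approx(0.5)
    else:
        assert expected.shape == (2,)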
@@ -30,6 +30,11 @@ def input_zeros_2d_3ch_flattened():
    return np.zeros(shape=(3, 224, 224)).flatten()


@pytest.fixture
Is this fixture used only in one place?
If that's the case, please inline it.
x_batch_shape = x_batch.shape
for perturbation_step_index in range(n_perturbations):
    # Perturb input by indices of attributions.
    a_ix = a_indices[
`a_ix` is an array with shape `(batch_size, n_features*n_perturbations)`, right? I'd suggest we create a view with shape `(batch_size, n_features, n_perturbations)`. Then we can index each step with `[..., perturbation_step_index]` instead of manually calculating offsets into the array.
`a_indices` is an array with shape `(batch_size, n_features)` and the resulting `a_ix`s are of shape `(batch_size, self.features_in_step)`. I believe calculating the offsets manually here is the only option.
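A hypothetical standalone helper illustrating the offset slicing being discussed (names and shapes taken from the comments above, not the actual Quantus code):

```python
import numpy as np

def step_indices(a_indices: np.ndarray, step: int, features_in_step: int) -> np.ndarray:
    """Return the feature indices perturbed at a given step.

    a_indices has shape (batch_size, n_features), columns ordered by attribution;
    the result has shape (batch_size, features_in_step).
    """
    return a_indices[:, step * features_in_step : (step + 1) * features_in_step]
```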
@@ -118,6 +118,58 @@ def baseline_replacement_by_indices(
    return arr_perturbed


def batch_baseline_replacement_by_indices(
import numpy.typing as npt


def batch_baseline_replacement_by_indices(
    arr: np.ndarray,
    indices: np.ndarray,
    perturb_baseline: npt.ArrayLike,
    **kwargs,
) -> np.ndarray:
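For illustration only, a minimal sketch of what a batched baseline replacement along the last axis could look like with this signature; this is an assumed behavior, not the actual (much longer) Quantus implementation:

```python
import numpy as np
import numpy.typing as npt

def batch_baseline_replacement_by_indices(
    arr: np.ndarray,          # (batch_size, n_features)
    indices: np.ndarray,      # (batch_size, k) integer positions to replace, per sample
    perturb_baseline: npt.ArrayLike,
    **kwargs,
) -> np.ndarray:
    # Sketch: copy the batch and write the (broadcast) baseline value
    # into the indexed positions of every sample at once.
    arr_perturbed = arr.copy()
    baseline = np.broadcast_to(np.asarray(perturb_baseline), indices.shape)
    np.put_along_axis(arr_perturbed, indices, baseline, axis=-1)
    return arr_perturbed
```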
||
# Predict on input. | ||
x_input = model.shape_input( | ||
x_batch, x_batch.shape, channel_first=True, batched=True |
AFAIK `channel_first` is a model parameter, so we should not hardcode it. @annahedstroem could you please help us on that one 🙃
This was hardcoded in the original implementation as well. Is that a bug?
# Randomly mask by subset size.
a_ix = np.stack(
    [
        np.random.choice(n_features, self.subset_size, replace=False)
Should we maybe add a fixed PRNG seed for reproducibility?
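If a fixed seed were added, one option (variable names and the seed value below are illustrative only) would be to thread a seeded `Generator` through the sampling:

```python
import numpy as np

batch_size, n_features, subset_size = 8, 3 * 224 * 224, 224
rng = np.random.default_rng(42)  # arbitrary fixed seed for reproducibility
a_ix = np.stack(
    [rng.choice(n_features, size=subset_size, replace=False) for _ in range(batch_size)]
)  # shape: (batch_size, subset_size)
```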
pred_deltas = np.stack(pred_deltas, axis=1)
att_sums = np.stack(att_sums, axis=1)

similarity = self.similarity_func(a=att_sums, b=pred_deltas, batched=True)
Isn't `batch_baseline_replacement_by_indices` always batched? Why do we need the `batched=True` argument? @annahedstroem have you ever used a different `similarity_func` here?
Here, the `batched=True` argument goes into a similarity function (for example `correlation_pearson`), not `batch_baseline_replacement_by_indices`. Similarity functions can be batched or not batched (at the moment at least), so this argument is needed here.
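As a sketch of the pattern being described (not the exact Quantus code), a similarity function with such a `batched` flag might look like this:

```python
import numpy as np
import scipy.stats

def correlation_pearson(a: np.ndarray, b: np.ndarray, batched: bool = False, **kwargs):
    if batched:
        # a, b have shape (batch_size, n); return one correlation per sample.
        return np.array([scipy.stats.pearsonr(a_i, b_i)[0] for a_i, b_i in zip(a, b)])
    return scipy.stats.pearsonr(a, b)[0]
```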
return_shape=(
    batch_size,
    n_features,
),  # TODO. Double-check this over using = (1,).
This `TODO` would need a bit more detail.
This is a relic of the past implementation; I accidentally left it in the new one as well. I have deleted the TODO in the latest commit.
if batched:
    assert len(a.shape) == 2 and len(b.shape) == 2, "Batched arrays must be 2D"
    # No support for axis currently, so just iterating over the batch
    return np.array([scipy.stats.kendalltau(a_i, b_i)[0] for a_i, b_i in zip(a, b)])
Maybe we could use np.vectorize (https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) for this one?
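A hedged sketch of the `np.vectorize` variant; note that `np.vectorize` is a convenience wrapper around a Python-level loop, so it would not be faster than the list comprehension above:

```python
import numpy as np
import scipy.stats

# Generalized vectorization over the batch dimension via a core signature.
kendall_tau_batched = np.vectorize(
    lambda a_i, b_i: scipy.stats.kendalltau(a_i, b_i)[0],
    signature="(n),(n)->()",
)
# For a, b of shape (batch_size, n_features) this returns shape (batch_size,).
```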
Could be used, but I also like the simplicity of @davor10105's suggestion!
Codecov Report

Attention: Patch coverage is

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
+ Coverage   91.15%   91.29%   +0.13%
==========================================
  Files          66       66
  Lines        3925     4010      +85
==========================================
+ Hits         3578     3661      +83
- Misses        347      349       +2

☔ View full report in Codecov by Sentry.
quantus/functions/similarity_func.py
Outdated
@@ -14,7 +14,9 @@
import skimage


-def correlation_spearman(a: np.array, b: np.array, **kwargs) -> float:
+def correlation_spearman(
super!
Really great work @davor10105, looking forward to our chat.
@@ -139,7 +139,7 @@ def __init__(

    # Save metric-specific attributes.
    if perturb_func is None:
        perturb_func = baseline_replacement_by_indices
Let's discuss: where in the code should we make it explicit for the user that they can no longer use any perturb function other than `batch_baseline_replacement_by_indices`?
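One possible place, sketched here as a hypothetical helper (the import path and the exact warning behavior are assumptions, not the merged code), would be a guard right where the default is resolved:

```python
import warnings

def resolve_perturb_func(perturb_func=None):
    # Hypothetical guard: fall back to the batched default, and warn loudly
    # if a user passes any other (non-batched) perturbation function.
    from quantus.functions.perturb_func import batch_baseline_replacement_by_indices

    if perturb_func is None:
        return batch_baseline_replacement_by_indices
    if perturb_func is not batch_baseline_replacement_by_indices:
        warnings.warn(
            "This batched metric currently assumes batch_baseline_replacement_by_indices "
            "as its perturb_func; other perturbation functions may not behave as expected.",
            UserWarning,
        )
    return perturb_func
```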
Do we know why most of the Python checks are failing? Thanks
It seems that the installed versions of
@leanderweber Hey Leander, I have been working on implementing batched versions of all the metrics present in Quantus and have encountered two questions that @annahedstroem said you might be able to answer:
I appreciate your time and look forward to your clarification on these points.
Hi @davor10105,
Regarding your questions above:
I hope I could clarify these points :)
Hey @leanderweber, I understand your concerns, but I don't see a significant issue with using a fixed grid. While it's true that the "highest attribution" patch may not perfectly align with the grid, the concept of a patch is something we've defined arbitrarily. In my view, maintaining a consistent patch structure across the dataset and methods is essential to ensure equal testing conditions for all attribution methods.

If we adopt option (1) or (2), each perturbation step could potentially affect a different number of pixels, making some steps less impactful than others. Additionally, if a user specifies a certain number of patches, they won't know what percentage of the input will ultimately be perturbed. This could lead to significant variations in the final perturbed area across images, complicating result comparison.

I do agree with your point about option (3); it wouldn't make much of a difference. Looking at the paper again, there is mention of a predefined grid being used. Looking forward to your opinion on this!
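To make the "fixed grid" notion concrete, here is a small illustrative helper (function name and the example patch size are assumptions, not Quantus code): a predefined, non-overlapping grid is laid over the input once, and every attribution method is evaluated against the same patches.

```python
from itertools import product

def fixed_patch_grid(height: int, width: int, patch_size: int):
    """Top-left coordinates of non-overlapping patch_size x patch_size patches."""
    ys = range(0, height - patch_size + 1, patch_size)
    xs = range(0, width - patch_size + 1, patch_size)
    return list(product(ys, xs))

# Example: a 224x224 input with 16x16 patches gives a fixed 14x14 grid of 196 patches.
coords = fixed_patch_grid(224, 224, 16)
assert len(coords) == 196
```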
@davor10105 I think your solution is sound and would be very happy with your suggested update. @leanderweber let me know if you object! @leanderweber if you have any time today or tomorrow for a general view of this PR, I would be grateful to have your second pair of eyes! Otherwise, I'll try to go for a merge tomorrow :)
Ready for merge.
Just a final question @davor10105: where are the test results for the remaining faithfulness metrics (they are not included in the `METRICS_PAIRS` in `testing_utils/batched_tests/batch_implementation_verification.py`)? I don't find them in the `results.pickle` file, let me know!
N/A
Hi @davor10105 @annahedstroem, sorry for the late reply!

Region Perturbation implicitly puts more emphasis on the patches that are removed earlier. I.e., for a faithful attribution method and MORF order, there is this assumption (1) that the first removed patches will lead to the largest change in model output, since measured attribution faithfulness is related to the resulting curve. At the same time, we also assume (2) that more relevance = more change in model output, so if the attribution is faithful, a patch with a larger sum of relevance should lead to a larger change.

However, as you stated, there are several drawbacks to the current implementation as well. After thinking about it, the "correct" way to implement this may be to not remove overlapping patches, and instead consider all possible patches. Potentially recomputing attribution sums after each patch removal? Not sure about that last one.

Maybe we could evaluate how much variability across patch sizes and datasets is introduced in practice when using a grid? E.g., using MetaQuantus? We can also set up a meeting to discuss this in the discord, if you want.
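For readers unfamiliar with the MORF intuition described above, here is a toy sketch (all names, the `predict` callable, and the patch representation are assumptions, not the Quantus Region Perturbation implementation): patches are removed in decreasing order of summed relevance and the model output is recorded after each removal, so a faithful attribution produces a curve that drops steeply at the start.

```python
import numpy as np

def morf_curve(predict, x, patch_coords, relevance_per_patch, baseline=0.0):
    """Toy MORF curve: remove patches most-relevant-first and track the model output."""
    order = np.argsort(relevance_per_patch)[::-1]      # most relevant patch first
    x_perturbed = x.copy()
    scores = [predict(x_perturbed)]                    # score before any removal
    for p in order:
        y0, x0, size = patch_coords[p]
        x_perturbed[..., y0 : y0 + size, x0 : x0 + size] = baseline
        scores.append(predict(x_perturbed))
    return np.asarray(scores)                          # steeper early drop = more faithful (MORF)
```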
Hey @annahedstroem, sorry for the delay, here is the updated speed-up visualization (average speed-up being approximately 25x):

Moving on to the actual results, here is the output of the implementation validation procedure:
`pixel_flipping`, `monotonicity`, `faithfulness_estimate`, `sensitivity_n` and `sufficiency` are all valid. `region_perturbation`, `selectivity` and `infidelity` were skipped during the validation process. `region_perturbation` and `selectivity` were skipped due to the dynamic / fixed grid discussion, and therefore the results are expected to differ from the old implementation.

Regarding `infidelity`, I forgot to mention it on our last call, but I have found a potential issue with the old implementation. Looking at the paper, infidelity is defined as (focusing on the difference to baseline and explanation multiplication, notice that both are 1D arrays):

Regarding the remaining metrics, both `monotonicity_correlation` and `faithfulness_correlation` produced results similar to the initial validation run, despite increasing the number of samples from 10 to 20. However, I have still not found any reason to believe that something is wrong with the implementation.

For the `road` metric, around 90% of the means are consistent with the previous implementation. However, `irof` shows some variation, which is surprising as it does not involve stochasticity, yet the results differ from the old implementation. I verified that identical segmentation masks are being used in both implementations, so the source of the discrepancy is unclear. While the results for the 32 examples are generally close between the two implementations, there are still noticeable differences:
I would appreciate any guidance you have on this. If you have any additional questions, please ask!

Finally, @leanderweber, I agree that the arbitrary choice of a particular fixed grid changes the resulting curve, but it does so for every evaluated attribution method. In my opinion, the goal of the metric is not to provide a particular "optimal" score, but a score that is obtained on a level playing field between different attribution methods so that they can be fairly compared. In other words, the score of a single attribution method does not really matter; what matters is the ranking between the scores of a set of attribution methods. For the scores to be comparable, the metric has to perform consistent steps, and that seems tricky to do if the grid is not fixed. Furthermore, a better attribution map should outperform a worse one, no matter the actual underlying grid.
Thanks @davor10105 for this elaborate discussion on the remaining results!

First, the discrepancy in the

Second, is there a possibility to keep both alternatives for patching logic for

(Thanks also @leanderweber for all your input on the reasoning/thought behind the implementation! V appreciated.)

Also, thank you for highlighting the bug in the

As a final request before merging, can you make a short list of changes, separated by:
So that we can add it to the release notes and thus track back any discrepancies to that? Million thanks again @davor10105, really awesome work!
@annahedstroem Thank you for the feedback! I agree that giving users the option to choose between different patching procedures is a solid approach, and I'll make sure to incorporate it in an upcoming commit.

Here's the requested list of changes:

Batch

Introduced batched processing to the following metrics:
Bug fixes
Misc
Merged commit 53f97c2 into understandable-machine-intelligence-lab:main
Description
Implemented changes
- `evaluate_instance` method in the `PixelFlipping`, `Monotonicity`, `MonotonicityCorrelation`, `FaithfulnessCorrelation` and `FaithfulnessEstimate` classes: replaced the existing `evaluate_batch` methods with their "true" batch implementation
- Added a `batched` parameter to the `correlation_spearman`, `correlation_pearson` and `correlation_kendall_tau` similarity functions to support batch processing
- Added a `batched` parameter to `get_baseline_dict` to support batched baseline creation, and similarly added the same parameter to `calculate_auc`
Implementation validity
- An `np.allclose` check was made and `PixelFlipping`, `Monotonicity` and `FaithfulnessEstimate` were verified as valid. `MonotonicityCorrelation` and `FaithfulnessCorrelation` did not pass this test, as they include stochastic elements in their calculations. To verify their validity, a two-way t-test was utilized over the 30 runs for each sample of the respective implementations. The resulting p-values can be seen below:
- The `batched_tests` directory contains the `batch_implementation_verification.py` script, which runs the validation tests utilizing a copy of the repo (in the `quantus` directory also contained within the zip file) that has both the batched and the old implementation versions. Results of the runs mentioned above can be found in `results.pickle`. This file is used by `test_visualization.py` to display the box visualization and check the validity of the batch implementation as described above.

Minimum acceptance criteria
@annahedstroem