
Implement Med-Halt Tests for Robust Model Evaluation #1166

Open

chakravarthik27 (Collaborator) opened this issue Jan 16, 2025 · 0 comments
Issue Description:
To enhance the robustness of model evaluation in Langtest, we propose implementing a suite of Med-Halt tests. These tests challenge the model's reliability and confidence by introducing variations in the ground truth, adding distractors, and probing how the model handles unexpected inputs.

Proposed Tests:

  1. False Confidence Test (FCT):

    • Present the model with a deliberately false answer and ask it to evaluate it.
    • Measure whether the model overconfidently endorses the incorrect answer or correctly rejects it.
  2. None of the Above Test (NOTA):

    • Introduce a "None of the Above" (NOTA) option to the question set.
    • Evaluate if the model can correctly identify when none of the provided answers are valid.
  3. Fake Questions Test (FQT):

    • Randomly select questions from the dataset and modify them so they no longer match their original context.
    • Assess whether the model recognizes and handles these altered or out-of-context questions.
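To make the three proposed perturbations concrete, here is a minimal sketch of how each could transform a multiple-choice sample. All function names and the sample/dict structure are illustrative assumptions for this issue, not Langtest's actual API:

```python
import random

def false_confidence(sample: dict, rng: random.Random) -> dict:
    """FCT sketch: pick a deliberately wrong option and present it
    as the proposed answer for the model to evaluate."""
    wrong = [o for o in sample["options"] if o != sample["answer"]]
    return {**sample, "proposed_answer": rng.choice(wrong)}

def none_of_the_above(sample: dict) -> dict:
    """NOTA sketch: remove the correct option and append a
    'None of the Above' choice, which becomes the expected answer."""
    options = [o for o in sample["options"] if o != sample["answer"]]
    options.append("None of the Above")
    return {**sample, "options": options, "answer": "None of the Above"}

def fake_question(sample: dict, rng: random.Random, pool: list) -> dict:
    """FQT sketch: swap in a randomly chosen out-of-context question
    while keeping the original options, so none of them is correct."""
    return {**sample, "question": rng.choice(pool)}

# Illustrative sample and usage (not real dataset content):
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": ["Vitamin A", "Vitamin C", "Vitamin D"],
    "answer": "Vitamin C",
}
rng = random.Random(0)
fct = false_confidence(sample, rng)
nota = none_of_the_above(sample)
fqt = fake_question(sample, rng, pool=["What color is the sky on Mars?"])
```

The key design point is that each perturbation yields a new sample with a known expected behavior (reject the proposed answer, choose NOTA, flag the mismatched question), which is what makes the tests scorable.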

Benefits:

  • Improves the model's reliability in handling diverse scenarios.
  • Tests the model's ability to detect and reject incorrect or altered inputs.
  • Enhances the overall robustness and trustworthiness of the evaluation process.

Acceptance Criteria:

  • False Confidence Test (FCT) is implemented and validated.
  • None of the Above Test (NOTA) functionality is added, with appropriate evaluation metrics.
  • Fake Questions Test (FQT) is integrated and tested for effectiveness.
  • Unit tests are added to verify the correctness of each Med-Halt test.
  • Documentation is updated to describe the new test functionalities and their usage.
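For the "appropriate evaluation metrics" criterion, one candidate is a pointwise score that rewards correct answers and penalizes confidently wrong ones; the Med-HALT paper uses a similar +1 / −0.25 scheme. The function name and weights below are a hedged sketch, and the defaults chosen for Langtest would be a separate design decision:

```python
def pointwise_score(predictions, references, reward=1.0, penalty=-0.25):
    """Average per-sample score: `reward` for each correct prediction,
    `penalty` for each incorrect one. Penalizing wrong answers makes
    abstaining (e.g., choosing NOTA) preferable to guessing."""
    if not predictions:
        return 0.0
    total = sum(reward if p == r else penalty
                for p, r in zip(predictions, references))
    return total / len(predictions)

# Illustrative usage: two correct answers and one wrong one.
preds = ["Vitamin C", "None of the Above", "Vitamin A"]
refs  = ["Vitamin C", "None of the Above", "Vitamin D"]
score = pointwise_score(preds, refs)  # (1 + 1 - 0.25) / 3
```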
@chakravarthik27 chakravarthik27 self-assigned this Jan 17, 2025
@chakravarthik27 chakravarthik27 added this to the 2.6.0 milestone Jan 17, 2025