Issue Description:
To make model evaluation in LangTest more robust, we propose implementing a suite of Med-Halt Tests. These tests probe the model's reliability and confidence by altering the ground truth, adding distractor options, and checking how the model handles unexpected inputs.
Proposed Tests:
False Confidence Test (FCT):
Present the model with a question together with a deliberately false answer.
Measure the model's confidence and whether it avoids overconfidence in the incorrect answer.
None of the Above Test (NOTA):
Introduce a "None of the Above" (NOTA) option to the question set.
Evaluate if the model can correctly identify when none of the provided answers are valid.
Fake Questions Test (FQT):
Select and modify random questions from the dataset.
Assess the model's ability to handle slightly altered or out-of-context questions effectively.
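The three perturbations above could be sketched as follows. This is only an illustration, not LangTest's API: the `MCQ` class, the `make_fct`, `make_nota`, and `make_fqt` helpers, the prompt wording, and the `UNANSWERABLE` label are all hypothetical. It also assumes that the NOTA test removes the correct option, so that "None of the above" becomes the expected answer, and that a fake question can be approximated by pairing one question with another question's options.

```python
import random
from dataclasses import dataclass
from typing import List

NOTA = "None of the above"
UNANSWERABLE = "I do not know"  # hypothetical label for rejecting a fake question


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer: str  # expected answer; always one of `options`


def make_fct(sample: MCQ, rng: random.Random) -> MCQ:
    """FCT: propose a deliberately wrong answer and ask the model to judge it."""
    wrong = rng.choice([o for o in sample.options if o != sample.answer])
    prompt = (
        f"{sample.question}\n"
        f"Options: {', '.join(sample.options)}\n"
        f"Proposed answer: {wrong}\n"
        "Is the proposed answer correct?"
    )
    # The proposed answer is always wrong, so the expected verdict is "no".
    return MCQ(prompt, ["yes", "no"], "no")


def make_nota(sample: MCQ) -> MCQ:
    """NOTA: drop the correct option and add NOTA, which becomes the expected answer."""
    distractors = [o for o in sample.options if o != sample.answer]
    return MCQ(sample.question, distractors + [NOTA], NOTA)


def make_fqt(samples: List[MCQ], rng: random.Random) -> MCQ:
    """FQT: pair one question with another question's options (needs >= 2 samples)."""
    stem, donor = rng.sample(samples, 2)
    return MCQ(stem.question, donor.options + [UNANSWERABLE], UNANSWERABLE)
```

Passing a seeded `random.Random` keeps the generated test cases reproducible between runs. On a real dataset, the FQT sketch would also need a check that the donor options do not happen to contain a valid answer to the question.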
Benefits:
Gives a clearer measure of the model's reliability across diverse scenarios.
Tests the model's ability to detect and reject incorrect or altered inputs.
Makes the overall evaluation process more robust and trustworthy.
Acceptance Criteria:
False Confidence Test (FCT) is implemented and validated.
None of the Above Test (NOTA) functionality is added, with appropriate evaluation metrics.
Fake Questions Test (FQT) is integrated and tested for effectiveness.
Unit tests are added to verify the correctness of each Med-Halt test.
Documentation is updated to describe the new test functionalities and their usage.