
Implement Med-Halt Tests for Robust Model Evaluation #1166

Open

chakravarthik27 (Collaborator) opened this issue Jan 16, 2025 · 0 comments
Issue Description:
To enhance the robustness of model evaluation in Langtest, we propose implementing a suite of Med-Halt tests. These tests challenge the model's reliability and confidence by introducing variations in the ground truth, adding distractors, and probing how the model handles unexpected inputs.

Proposed Tests:

  1. False Confidence Test (FCT):

    • Present the model with a deliberately false answer and ask it to evaluate it.
    • Measure whether the model overconfidently endorses the incorrect answer or correctly rejects it.
  2. None of the Above Test (NOTA):

    • Introduce a "None of the Above" (NOTA) option to the question set.
    • Evaluate if the model can correctly identify when none of the provided answers are valid.
  3. Fake Questions Test (FQT):

    • Randomly select questions from the dataset and modify them so they no longer match their original context.
    • Assess whether the model recognizes and handles these altered or out-of-context questions.
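To make the three proposed perturbations concrete, here is a minimal sketch of how each could transform a multiple-choice sample. All function names and the sample/dict structure are illustrative assumptions for this issue, not Langtest's actual API:

```python
import random

def false_confidence(sample: dict, rng: random.Random) -> dict:
    """FCT sketch: pick a deliberately wrong option and present it
    as the proposed answer for the model to evaluate."""
    wrong = [o for o in sample["options"] if o != sample["answer"]]
    return {**sample, "proposed_answer": rng.choice(wrong)}

def none_of_the_above(sample: dict) -> dict:
    """NOTA sketch: remove the correct option and append a
    'None of the Above' choice, which becomes the expected answer."""
    options = [o for o in sample["options"] if o != sample["answer"]]
    options.append("None of the Above")
    return {**sample, "options": options, "answer": "None of the Above"}

def fake_question(sample: dict, rng: random.Random, pool: list) -> dict:
    """FQT sketch: swap in a randomly chosen out-of-context question
    while keeping the original options, so none of them is correct."""
    return {**sample, "question": rng.choice(pool)}

# Illustrative sample and usage (not real dataset content):
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": ["Vitamin A", "Vitamin C", "Vitamin D"],
    "answer": "Vitamin C",
}
rng = random.Random(0)
fct = false_confidence(sample, rng)
nota = none_of_the_above(sample)
fqt = fake_question(sample, rng, pool=["What color is the sky on Mars?"])
```

The key design point is that each perturbation yields a new sample with a known expected behavior (reject the proposed answer, choose NOTA, flag the mismatched question), which is what makes the tests scorable.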

Benefits:

  • Improves the model's reliability in handling diverse scenarios.
  • Tests the model's ability to detect and reject incorrect or altered inputs.
  • Enhances the overall robustness and trustworthiness of the evaluation process.

Acceptance Criteria:

  • False Confidence Test (FCT) is implemented and validated.
  • None of the Above Test (NOTA) functionality is added, with appropriate evaluation metrics.
  • Fake Questions Test (FQT) is integrated and tested for effectiveness.
  • Unit tests are added to verify the correctness of each Med-Halt test.
  • Documentation is updated to describe the new test functionalities and their usage.
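For the "appropriate evaluation metrics" criterion, one candidate is a pointwise score that rewards correct answers and penalizes confidently wrong ones; the Med-HALT paper uses a similar +1 / −0.25 scheme. The function name and weights below are a hedged sketch, and the defaults chosen for Langtest would be a separate design decision:

```python
def pointwise_score(predictions, references, reward=1.0, penalty=-0.25):
    """Average per-sample score: `reward` for each correct prediction,
    `penalty` for each incorrect one. Penalizing wrong answers makes
    abstaining (e.g., choosing NOTA) preferable to guessing."""
    if not predictions:
        return 0.0
    total = sum(reward if p == r else penalty
                for p, r in zip(predictions, references))
    return total / len(predictions)

# Illustrative usage: two correct answers and one wrong one.
preds = ["Vitamin C", "None of the Above", "Vitamin A"]
refs  = ["Vitamin C", "None of the Above", "Vitamin D"]
score = pointwise_score(preds, refs)  # (1 + 1 - 0.25) / 3
```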
@chakravarthik27 chakravarthik27 self-assigned this Jan 17, 2025
@chakravarthik27 chakravarthik27 added this to the 2.6.0 milestone Jan 17, 2025