John Snow Labs LangTest 1.5.0: Gender Stereotype Analysis with Wino-Bias, Enhancement with Legal-Support, Legal-Summarization (Multi-LexSum Dataset), Factuality & Negation-Sensitivity Tests, Updated Gender Classifier, and Streamlined Bug Resolutions for Better User Experience. #779
ArshaanNazir
announced in
Announcements
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
📢 Overview
LangTest 1.5.0 Release by John Snow Labs 🚀: Debuting the Wino-Bias Test to scrutinize gender role stereotypes and unveiling an expanded suite with the Legal-Support, Legal-Summarization (based on the Multi-LexSum dataset), Factuality, and Negation-Sensitivity evaluations. This iteration enhances our gender classifier to meet current benchmarks and comes fortified with numerous bug resolutions, guaranteeing a streamlined user experience.
A heartfelt thank you to our unwavering community for consistently fueling our journey with their invaluable feedback, questions, and suggestions 🎉
Make sure to give the project a star right here ⭐
🔥 New Features & Enhancements
🐛 Bug Fixes
🔥 New Features
Adding support for wino-bias test
This test is specifically designed for Hugging Face fill-mask models like BERT, RoBERTa-base, and similar models. Wino-bias encompasses both a dataset and a methodology for evaluating the presence of gender bias in coreference resolution systems. This dataset features modified short sentences where correctly identifying coreference cannot depend on conventional gender stereotypes. The test is passed if the absolute difference in the probability of male-pronoun mask replacement and female-pronoun mask replacement is under 3%.
➤ Notebook Link:
➤ How the test looks ?
Adding support for legal-support test
The LegalSupport dataset evaluates fine-grained reverse entailment. Each sample consists of a text passage making a legal claim, and two case summaries. Each summary describes a legal conclusion reached by a different court. The task is to determine which case (i.e. legal conclusion) most forcefully and directly supports the legal claim in the passage. The construction of this benchmark leverages annotations derived from a legal taxonomy expliciting different levels of entailment (e.g. "directly supports" vs "indirectly supports"). As such, the benchmark tests a model's ability to reason regarding the strength of support a particular case summary provides.
➤ Notebook Link:
➤ How the test looks ?
Adding support for factuality test
The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments.
Test Objective
The primary goal of the Factuality Test is to assess how well LLMs can identify the factual accuracy of summary sentences. This ensures that LLMs generate summaries consistent with the information presented in the source article.
Data Source
For this test, we utilize the Factual-Summary-Pairs dataset, which is sourced from the following GitHub repository: Factual-Summary-Pairs Dataset.
Methodology
Our test methodology draws inspiration from a reference article titled "LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper".
Bias Identification
We identify bias in the responses based on specific patterns:
Accuracy Assessment
Accuracy is assessed by examining the "pass" column. If "pass" is marked as True, it indicates a correct response. Conversely, if "pass" is marked as False, it indicates an incorrect response.
➤ Notebook Link:
➤ How the test looks ?
Adding support for negation sensitivity test
In this evaluation, we investigate how a model responds to negations introduced into input text. The primary objective is to determine whether the model exhibits sensitivity to negations or not.
Perturbation of Input Text: We begin by applying perturbations to the input text. Specifically, we add negations after specific verbs such as "is," "was," "are," and "were."
Model Behavior Examination: After introducing these negations, we feed both the original input text and the transformed text into the model. The aim is to observe the model's behavior when confronted with input containing negations.
Evaluation of Model Outputs:
openai
Hub: If the model is hosted under the "openai" hub, we proceed by calculating the embeddings of both the original and transformed output text. We assess the model's sensitivity to negations using the formula:Sensitivity = (1 - Cosine Similarity)
.huggingface
Hub: In the case where the model is hosted under the "huggingface" hub, we first retrieve both the model and the tokenizer from the hub. Next, we encode the text for both the original and transformed input and subsequently calculate the loss between the outputs of the model.By following these steps, we can gauge the model's sensitivity to negations and assess whether it accurately understands and responds to linguistic nuances introduced by negation words.
➤ Notebook Link:
➤ How the test looks ?
We have used threshold of (-0.1,0.1) . If the eval_score falls within this threshold range, it indicates that the model is failing to properly handle negations, implying insensitivity to linguistic nuances introduced by negation words.
Adding support for legal-summarization test
MultiLexSum
Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
Dataset Summary
The Multi-LexSum dataset consists of legal case summaries. The aim is for the model to thoroughly examine the given context and, upon understanding its content, produce a concise summary that captures the essential themes and key details.
➤ Notebook Link:
➤ How the test looks ?
The default threshold value is 0.50. If the eval_score is higher than threshold, then the "pass" will be as true.
❤️ Community support
#langtest
channelWe would love to have you join the mission 👉 open an issue, a PR, or give us some feedback on features you'd like to see! 🙌
♻️ Changelog
What's Changed
Full Changelog: 1.4.0...1.5.0
This discussion was created from the release John Snow Labs LangTest 1.5.0: Gender Stereotype Analysis with Wino-Bias, Enhancement with Legal-Support, Legal-Summarization (Multi-LexSum Dataset), Factuality & Negation-Sensitivity Tests, Updated Gender Classifier, and Streamlined Bug Resolutions for Better User Experience..
Beta Was this translation helpful? Give feedback.
All reactions