NIAH (Needle In A Haystack) was designed to evaluate how well a model can locate specific information buried within long texts, simulating real-world use cases such as RAG (retrieval-augmented generation).
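As an illustration of the setup, a NIAH test hides a "needle" sentence at a chosen depth inside a long distractor document and asks the model to retrieve it. The sketch below is purely illustrative, not the official harness; the filler text, needle, and depth parameter are invented here:

```python
# Minimal needle-in-a-haystack prompt builder (illustrative only).
def build_niah_prompt(haystack: str, needle: str, depth: float, question: str) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    of the distractor text, then append the retrieval question."""
    pos = int(len(haystack) * depth)
    document = haystack[:pos] + " " + needle + " " + haystack[pos:]
    return (
        "Read the document below and answer the question.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Question: {question}\nAnswer:"
    )

filler = "The committee met on Tuesday to discuss routine matters. " * 2000
needle = "The secret passphrase is 'blue-harvest-42'."
prompt = build_niah_prompt(filler, needle, depth=0.35,
                           question="What is the secret passphrase?")
# Scoring then checks whether the model's reply contains 'blue-harvest-42'.
```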
A QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2,556 queries, with the evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
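To make the structure concrete, a single query record in such a multi-hop, metadata-aware dataset might look roughly like this. The field names and content are hypothetical, chosen only to show evidence spread across documents plus a metadata constraint:

```python
# Illustrative (hypothetical) record: evidence for one query spans several documents,
# and the query itself constrains document metadata.
example_query = {
    "query": "Which vendor that published a report after June 2023 raised its revenue forecast?",
    "answer": "Acme Analytics",
    "evidence": [
        {"doc_id": "doc_017", "source": "Acme Analytics", "published": "2023-07-12",
         "snippet": "Acme Analytics raised its full-year revenue forecast..."},
        {"doc_id": "doc_042", "source": "Newswire", "published": "2023-08-03",
         "snippet": "...the July report from Acme Analytics confirmed the revised outlook."},
    ],
    "metadata_filters": {"published_after": "2023-06-30"},
}
```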
A crowdsourced platform where LLMs hold randomised conversations that human users rate on factors like fluency, helpfulness, and consistency. Users have real conversations with two anonymous chatbots and vote on which response is superior. This approach mirrors how LLMs are used in the real world, giving insight into which models excel in conversation.
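Leaderboards built from pairwise votes of this kind are typically derived with an Elo-style (or Bradley-Terry) rating update. A minimal sketch of the Elo update rule; the K-factor and starting ratings are arbitrary illustrative choices:

```python
# Minimal Elo update from a single head-to-head vote (illustrative parameters).
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins the vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # A gains what B loses
```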
A dataset of challenging questions designed for multi-turn conversations. LLMs are graded (often by other, even more powerful LLMs) on the quality and relevance of their answers. The focus here is less on casual chat and more on a chatbot's ability to provide informative responses in potentially complex scenarios.
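A common way to implement this "LLM-as-judge" grading is to prompt a strong model with the question and answer and ask for a numeric rating. The template and parsing below are a hypothetical sketch, not the benchmark's actual judge prompt:

```python
import re

# Hypothetical judge template; real benchmarks use carefully tuned prompts.
JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale of 1 to 10 for helpfulness, relevance, and accuracy.
End your reply with a line of the form: Rating: [[X]]

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def parse_rating(judge_reply: str):
    """Extract the numeric rating from the judge model's reply."""
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", judge_reply)
    return int(match.group(1)) if match else None

# judge_reply = call_judge_model(JUDGE_TEMPLATE.format(question=q, answer=a))  # hypothetical call
print(parse_rating("The answer is concise and correct. Rating: [[9]]"))  # -> 9
```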
Multi-task Language Understanding is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans.
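In practice, a few-shot evaluation of this kind concatenates a handful of solved multiple-choice questions before the test question and checks which answer letter the model produces. A minimal sketch; the example questions are invented for illustration:

```python
# Build a few-shot multiple-choice prompt in the usual A/B/C/D style.
def format_question(question, choices, answer=None):
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

few_shot_examples = [
    ("What is the boiling point of water at sea level in Celsius?",
     ["90", "100", "110", "120"], "B"),
]
test_question = ("Which planet is closest to the Sun?",
                 ["Venus", "Mercury", "Earth", "Mars"])

prompt = "\n\n".join(
    [format_question(q, c, a) for q, c, a in few_shot_examples]
    + [format_question(*test_question)]
)
# The model's next answer letter ("A"-"D") is compared against the gold letter.
```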
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks.
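Running a model against MTEB is typically a few lines with the `mteb` package and a `sentence-transformers` model. The sketch below follows the package's commonly documented usage; the exact API and the chosen model/task names may differ across versions:

```python
# Evaluate a sentence-embedding model on one MTEB task.
# Requires: pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # any embedding model works
evaluation = MTEB(tasks=["Banking77Classification"])      # pick one or more task names
results = evaluation.run(model, output_folder="results")  # scores are also written to disk
```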
GLUE (General Language Understanding Evaluation) was an early but groundbreaking benchmark suite.
SuperGLUE emerged as a response to LLMs quickly outperforming the original GLUE tasks. These benchmarks include tasks like the following (see the loading sketch after the list):
- Natural Language Inference: Does one sentence imply another?
- Sentiment Analysis: Is the attitude in a piece of text positive or negative?
- Coreference Resolution: Identifying which words in a text refer to the same thing.
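These tasks are easy to inspect directly; for example, the Hugging Face `datasets` library exposes GLUE tasks by name. The specific task chosen here (RTE, a natural language inference task) is just an example:

```python
# Peek at one GLUE task (RTE: does the premise entail the hypothesis?).
# Requires: pip install datasets
from datasets import load_dataset

rte = load_dataset("glue", "rte")   # other configs include "sst2", "mnli", "cola", ...
example = rte["train"][0]
print(example["sentence1"])         # premise
print(example["sentence2"])         # hypothesis
print(example["label"])             # 0 = entailment, 1 = not entailment
```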
Questions that require understanding complete paragraphs, for example by adding, counting, or sorting values spread across multiple sentences.
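A toy version of the kind of discrete reasoning involved (the passage and numbers are invented): the answer requires pulling values from different sentences and combining them arithmetically.

```python
# Passage: "The home team scored 14 points in the first quarter. ...
#           They added 10 more in the second quarter before halftime."
# Question: "How many points did the home team score in the first half?"
first_quarter, second_quarter = 14, 10
answer = first_quarter + second_quarter   # the values live in different sentences
print(answer)                             # 24
```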
A long-context needle-in-a-haystack benchmark for LLMs.
Beyond the Imitation Game: a collaborative benchmark for measuring and extrapolating the capabilities of language models.
ARC confronts LLMs with a collection of complex, multi-part science questions (grade-school level). LLMs need to apply scientific knowledge, understand cause-and-effect relationships, and solve problems step-by-step to successfully tackle these challenges.
An acronym for “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations”, this benchmark focuses on commonsense reasoning. To really challenge LLMs, the benchmark includes deceptively realistic wrong answers generated by "Adversarial Filtering," making the task harder for models that over-rely on word probabilities.
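Commonsense completion benchmarks like this are usually scored by asking which ending the model assigns the highest (length-normalised) probability to. A minimal sketch with a small open model (GPT-2 is used here purely as an example, and the code makes the simplifying assumption that the ending tokenises independently of the context):

```python
# Score each candidate ending by the length-normalised log-likelihood the model
# assigns to it, conditioned on the context; the highest-scoring ending wins.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(context: str, ending: str) -> float:
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.size(0)), targets]
    ending_len = full_ids.size(1) - ctx_ids.size(1)   # assumes a clean token boundary
    return token_lp[-ending_len:].mean().item()       # length-normalised

context = "She put the kettle on the stove and"
endings = [" waited for the water to boil.", " flew it to the moon."]
best = max(range(len(endings)), key=lambda i: ending_logprob(context, endings[i]))
print(endings[best])
```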
Multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 74% accuracy.
Grade school mathematics problems, translated into ten languages, including underrepresented languages like Bengali and Swahili.
Middle school and high school mathematics problems.
This is based on a limericks dataset published in 2021.
GSM8K consists of 8.5K high-quality grade school math problems created by human problem writers, segmented into 7.5K training problems and 1K test problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem.
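GSM8K reference solutions spell out the intermediate steps and end with a `#### <number>` line, so evaluation usually reduces to extracting and comparing that final number. A sketch using the Hugging Face copy of the dataset (the dataset name and config are as commonly published; treat them as an assumption):

```python
# Requires: pip install datasets
import re
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # "main" config: question / answer columns
sample = gsm8k["test"][0]

def final_answer(solution: str) -> str:
    """GSM8K reference solutions end with a line like '#### 42'."""
    return re.search(r"####\s*(-?[\d,\.]+)", solution).group(1).replace(",", "")

print(sample["question"])
print(final_answer(sample["answer"]))   # compare against the number extracted from the model's output
```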
Code for the paper "Evaluating Large Language Models Trained on Code", which introduced the HumanEval benchmark of hand-written Python programming problems.
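The paper evaluates functional correctness with the pass@k metric: generate n samples per problem, count the c that pass the unit tests, and compute an unbiased estimate of the probability that at least one of k samples passes. A sketch of the standard numerically stable estimator described in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 12 of them pass the tests.
print(pass_at_k(n=200, c=12, k=1))    # 0.06
print(pass_at_k(n=200, c=12, k=10))   # noticeably higher
```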
Short for “Mostly Basic Python Programming”, MBPP is a dataset of roughly 1,000 Python coding problems designed for beginner-level programmers. This benchmark tests an LLM's grasp of core programming concepts and its ability to translate instructions into functional code. MBPP problems comprise three integral components: task descriptions, correct code solutions, and test cases to verify the LLM's output.
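To make the three components concrete, here is what an MBPP-style entry looks like and how its test cases verify a candidate solution. The specific problem is invented for illustration, and the field names mirror how the dataset is commonly distributed:

```python
# An MBPP-style record: natural-language task, reference solution, and assert-based tests.
problem = {
    "text": "Write a function to return the sum of the squares of a list of integers.",
    "code": "def sum_of_squares(nums):\n    return sum(x * x for x in nums)",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
    ],
}

# Verify a candidate solution (here, the reference code itself) against the tests.
namespace = {}
exec(problem["code"], namespace)   # real harnesses run this in a sandbox
for test in problem["test_list"]:
    exec(test, namespace)          # raises AssertionError on failure
print("all tests passed")
```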
Short for “Software Engineering Benchmark”, SWE-bench is a comprehensive benchmark designed to evaluate LLMs on their ability to tackle real-world software issues sourced from GitHub. This benchmark tests an LLM's proficiency in understanding and resolving software problems by requiring it to generate patches for issues described in the context of actual codebases. Notably, SWE-bench was used to compare the performance of Devin, the AI software engineer, with that of assisted foundational LLMs.
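Each SWE-bench instance pairs a GitHub issue with the repository state it was filed against, plus the gold fix and the tests that must pass. A sketch of inspecting one instance via the Hugging Face copy of the dataset; the dataset name and field names are as commonly published and should be treated as assumptions:

```python
# Requires: pip install datasets
from datasets import load_dataset

swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
instance = swe[0]

print(instance["repo"])                      # GitHub repository the issue comes from
print(instance["base_commit"])               # commit the generated patch must apply to
print(instance["problem_statement"][:300])   # the issue text the model sees
# The model must emit a diff; the harness applies it and runs the repo's tests
# (the gold fix is in instance["patch"], the added tests in instance["test_patch"]).
```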