NIAH (Needle In A Haystack) was designed to evaluate how well a model can locate specific information buried within long texts, simulating real-world use cases such as RAG (retrieval-augmented generation).
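As an illustration of the setup, a NIAH test hides a "needle" sentence at a chosen depth inside a long distractor document and asks the model to retrieve it. The sketch below is purely illustrative, not the official harness; the filler text, needle, and depth parameter are invented here:

```python
# Minimal needle-in-a-haystack prompt builder (illustrative only).
def build_niah_prompt(haystack: str, needle: str, depth: float, question: str) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    of the distractor text, then append the retrieval question."""
    pos = int(len(haystack) * depth)
    document = haystack[:pos] + " " + needle + " " + haystack[pos:]
    return (
        "Read the document below and answer the question.\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Question: {question}\nAnswer:"
    )

filler = "The committee met on Tuesday to discuss routine matters. " * 2000
needle = "The secret passphrase is 'blue-harvest-42'."
prompt = build_niah_prompt(filler, needle, depth=0.35,
                           question="What is the secret passphrase?")
# Scoring then checks whether the model's reply contains 'blue-harvest-42'.
```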
A QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2,556 queries, with the evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
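To make the structure concrete, a single query record in such a multi-hop, metadata-aware dataset might look roughly like this. The field names and content are hypothetical, chosen only to show evidence spread across documents plus a metadata constraint:

```python
# Illustrative (hypothetical) record: evidence for one query spans several documents,
# and the query itself constrains document metadata.
example_query = {
    "query": "Which vendor that published a report after June 2023 raised its revenue forecast?",
    "answer": "Acme Analytics",
    "evidence": [
        {"doc_id": "doc_017", "source": "Acme Analytics", "published": "2023-07-12",
         "snippet": "Acme Analytics raised its full-year revenue forecast..."},
        {"doc_id": "doc_042", "source": "Newswire", "published": "2023-08-03",
         "snippet": "...the July report from Acme Analytics confirmed the revised outlook."},
    ],
    "metadata_filters": {"published_after": "2023-06-30"},
}
```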
A crowdsourced platform where LLMs hold randomised conversations that human users rate on factors like fluency, helpfulness, and consistency. Users have real conversations with two anonymous chatbots and vote on which response is superior. This approach mirrors how LLMs are used in the real world, giving insight into which models excel in conversation.
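Leaderboards built from pairwise votes of this kind are typically derived with an Elo-style (or Bradley-Terry) rating update. A minimal sketch of the Elo update rule; the K-factor and starting ratings are arbitrary illustrative choices:

```python
# Minimal Elo update from a single head-to-head vote (illustrative parameters).
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins the vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # A gains what B loses
```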
A dataset of challenging questions designed for multi-turn conversations. LLMs are graded (often by other, even more powerful LLMs) on the quality and relevance of their answers. The focus here is less on casual chat and more on a chatbot's ability to provide informative responses in potentially complex scenarios.
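A common way to implement this "LLM-as-judge" grading is to prompt a strong model with the question and answer and ask for a numeric rating. The template and parsing below are a hypothetical sketch, not the benchmark's actual judge prompt:

```python
import re

# Hypothetical judge template; real benchmarks use carefully tuned prompts.
JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale of 1 to 10 for helpfulness, relevance, and accuracy.
End your reply with a line of the form: Rating: [[X]]

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def parse_rating(judge_reply: str):
    """Extract the numeric rating from the judge model's reply."""
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", judge_reply)
    return int(match.group(1)) if match else None

# judge_reply = call_judge_model(JUDGE_TEMPLATE.format(question=q, answer=a))  # hypothetical call
print(parse_rating("The answer is concise and correct. Rating: [[9]]"))  # -> 9
```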
Multi-task Language Understanding is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans.
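In practice, a few-shot evaluation of this kind concatenates a handful of solved multiple-choice questions before the test question and checks which answer letter the model produces. A minimal sketch; the example questions are invented for illustration:

```python
# Build a few-shot multiple-choice prompt in the usual A/B/C/D style.
def format_question(question, choices, answer=None):
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

few_shot_examples = [
    ("What is the boiling point of water at sea level in Celsius?",
     ["90", "100", "110", "120"], "B"),
]
test_question = ("Which planet is closest to the Sun?",
                 ["Venus", "Mercury", "Earth", "Mars"])

prompt = "\n\n".join(
    [format_question(q, c, a) for q, c, a in few_shot_examples]
    + [format_question(*test_question)]
)
# The model's next answer letter ("A"-"D") is compared against the gold letter.
```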
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks.
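Running a model against MTEB is typically a few lines with the `mteb` package and a `sentence-transformers` model. The sketch below follows the package's commonly documented usage; the exact API and the chosen model/task names may differ across versions:

```python
# Evaluate a sentence-embedding model on one MTEB task.
# Requires: pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")           # any embedding model works
evaluation = MTEB(tasks=["Banking77Classification"])      # pick one or more task names
results = evaluation.run(model, output_folder="results")  # scores are also written to disk
```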
GLUE (General Language Understanding Evaluation) was an early but groundbreaking benchmark suite.
SuperGLUE emerged as a response to LLMs quickly outperforming the original GLUE tasks. These benchmarks include tasks like the following (see the loading sketch after the list):
- Natural Language Inference: Does one sentence imply another?
- Sentiment Analysis: Is the attitude in a piece of text positive or negative?
- Coreference Resolution: Identifying which words in a text refer to the same thing.
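These tasks are easy to inspect directly; for example, the Hugging Face `datasets` library exposes GLUE tasks by name. The specific task chosen here (RTE, a natural language inference task) is just an example:

```python
# Peek at one GLUE task (RTE: does the premise entail the hypothesis?).
# Requires: pip install datasets
from datasets import load_dataset

rte = load_dataset("glue", "rte")   # other configs include "sst2", "mnli", "cola", ...
example = rte["train"][0]
print(example["sentence1"])         # premise
print(example["sentence2"])         # hypothesis
print(example["label"])             # 0 = entailment, 1 = not entailment
```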
Questions that require understanding complete paragraphs, for example by adding, counting, or sorting values spread across multiple sentences.
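A toy version of the kind of discrete reasoning involved (the passage and numbers are invented): the answer requires pulling values from different sentences and combining them arithmetically.

```python
# Passage: "The home team scored 14 points in the first quarter. ...
#           They added 10 more in the second quarter before halftime."
# Question: "How many points did the home team score in the first half?"
first_quarter, second_quarter = 14, 10
answer = first_quarter + second_quarter   # the values live in different sentences
print(answer)                             # 24
```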
A long-context needle-in-a-haystack benchmark for LLMs.
Beyond the Imitation Game: a collaborative benchmark for measuring and extrapolating the capabilities of language models.
ARC confronts LLMs with a collection of complex, multi-part science questions (grade-school level). LLMs need to apply scientific knowledge, understand cause-and-effect relationships, and solve problems step-by-step to successfully tackle these challenges.
An acronym for “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations”, this benchmark focuses on commonsense reasoning. To really challenge LLMs, the benchmark includes deceptively realistic wrong answers generated by "Adversarial Filtering," making the task harder for models that over-rely on word probabilities.
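Commonsense completion benchmarks like this are usually scored by asking which ending the model assigns the highest (length-normalised) probability to. A minimal sketch with a small open model (GPT-2 is used here purely as an example, and the code makes the simplifying assumption that the ending tokenises independently of the context):

```python
# Score each candidate ending by the length-normalised log-likelihood the model
# assigns to it, conditioned on the context; the highest-scoring ending wins.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def ending_logprob(context: str, ending: str) -> float:
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.size(0)), targets]
    ending_len = full_ids.size(1) - ctx_ids.size(1)   # assumes a clean token boundary
    return token_lp[-ending_len:].mean().item()       # length-normalised

context = "She put the kettle on the stove and"
endings = [" waited for the water to boil.", " flew it to the moon."]
best = max(range(len(endings)), key=lambda i: ending_logprob(context, endings[i]))
print(endings[best])
```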
Multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 74% accuracy.
Grade school mathematics problems, translated into ten languages, including underrepresented languages like Bengali and Swahili.
Middle school and high school mathematics problems.
This is based on a limericks dataset published in 2021.
GSM8K consists of 8.5K high-quality grade school math problems created by human problem writers, segmented into 7.5K training problems and 1K test problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer. A bright middle school student should be able to solve every problem.
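GSM8K reference solutions spell out the intermediate steps and end with a `#### <number>` line, so evaluation usually reduces to extracting and comparing that final number. A sketch using the Hugging Face copy of the dataset (the dataset name and config are as commonly published; treat them as an assumption):

```python
# Requires: pip install datasets
import re
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")   # "main" config: question / answer columns
sample = gsm8k["test"][0]

def final_answer(solution: str) -> str:
    """GSM8K reference solutions end with a line like '#### 42'."""
    return re.search(r"####\s*(-?[\d,\.]+)", solution).group(1).replace(",", "")

print(sample["question"])
print(final_answer(sample["answer"]))   # compare against the number extracted from the model's output
```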
Code for the paper "Evaluating Large Language Models Trained on Code", which introduced the HumanEval benchmark of hand-written Python programming problems.
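The paper evaluates functional correctness with the pass@k metric: generate n samples per problem, count the c that pass the unit tests, and compute an unbiased estimate of the probability that at least one of k samples passes. A sketch of the standard numerically stable estimator described in the paper:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 12 of them pass the tests.
print(pass_at_k(n=200, c=12, k=1))    # 0.06
print(pass_at_k(n=200, c=12, k=10))   # noticeably higher
```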
Short for “Mostly Basic Python Programming”, MBPP is a dataset of roughly 1,000 Python coding problems designed for beginner-level programmers. This benchmark tests an LLM's grasp of core programming concepts and its ability to translate instructions into functional code. MBPP problems comprise three integral components: task descriptions, correct code solutions, and test cases to verify the LLM's output.
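To make the three components concrete, here is what an MBPP-style entry looks like and how its test cases verify a candidate solution. The specific problem is invented for illustration, and the field names mirror how the dataset is commonly distributed:

```python
# An MBPP-style record: natural-language task, reference solution, and assert-based tests.
problem = {
    "text": "Write a function to return the sum of the squares of a list of integers.",
    "code": "def sum_of_squares(nums):\n    return sum(x * x for x in nums)",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
    ],
}

# Verify a candidate solution (here, the reference code itself) against the tests.
namespace = {}
exec(problem["code"], namespace)   # real harnesses run this in a sandbox
for test in problem["test_list"]:
    exec(test, namespace)          # raises AssertionError on failure
print("all tests passed")
```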
Short for “Software Engineering Benchmark”, SWE-bench is a comprehensive benchmark designed to evaluate LLMs on their ability to tackle real-world software issues sourced from GitHub. This benchmark tests an LLM's proficiency in understanding and resolving software problems by requiring it to generate patches for issues described in the context of actual codebases. Notably, SWE-bench was used to compare the performance of Devin, the AI software engineer, with that of assisted foundational LLMs.
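Each SWE-bench instance pairs a GitHub issue with the repository state it was filed against, plus the gold fix and the tests that must pass. A sketch of inspecting one instance via the Hugging Face copy of the dataset; the dataset name and field names are as commonly published and should be treated as assumptions:

```python
# Requires: pip install datasets
from datasets import load_dataset

swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
instance = swe[0]

print(instance["repo"])                      # GitHub repository the issue comes from
print(instance["base_commit"])               # commit the generated patch must apply to
print(instance["problem_statement"][:300])   # the issue text the model sees
# The model must emit a diff; the harness applies it and runs the repo's tests
# (the gold fix is in instance["patch"], the added tests in instance["test_patch"]).
```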