
Benchmarking Glossary

LLM Benchmarks for RAG

Needle in a Haystack (NIAH)

NIAH was designed to evaluate a model's ability to efficiently locate specific information within very long texts, simulating real-world use cases such as RAG (retrieval-augmented generation).
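As a rough sketch (the helper names here are illustrative, not the official harness), a NIAH test case buries a "needle" sentence at a chosen depth inside filler text and checks whether the model can retrieve it:

```python
def build_niah_prompt(filler_paragraphs, needle_sentence, question, depth=0.5):
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) of the context."""
    pos = int(len(filler_paragraphs) * depth)
    haystack = filler_paragraphs[:pos] + [needle_sentence] + filler_paragraphs[pos:]
    return "\n\n".join(haystack) + f"\n\nQuestion: {question}\nAnswer:"

def needle_recalled(model_answer: str, needle_fact: str) -> bool:
    # Simplest scoring: check that the key fact appears in the answer
    # (real runs often use an LLM judge instead of substring matching).
    return needle_fact.lower() in model_answer.lower()
```

Sweeping `depth` and the overall context length produces the familiar NIAH heatmap of recall by position and length.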

Read More

MultiHop-RAG

A QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2,556 queries, with the evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
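For orientation, an item pairs one query with evidence spread over several documents plus their metadata; the record below is illustrative and does not reproduce the dataset's exact schema or content:

```python
example = {
    "query": "Which company that announced layoffs in January was later praised "
             "by the same outlet for its Q2 earnings?",
    "question_type": "temporal",  # e.g. inference, comparison, temporal
    "answer": "ExampleCorp",
    "evidence_list": [
        {"title": "ExampleCorp cuts 5% of staff", "source": "Example News",
         "published_at": "2024-01-10"},
        {"title": "ExampleCorp beats Q2 estimates", "source": "Example News",
         "published_at": "2024-07-30"},
    ],
}

# Retrieval is scored on whether the gold evidence chunks are returned;
# generation is scored on whether the final answer matches `answer`.
```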

Read More

LLM Benchmarks for Chatbot Assistance

ChatBot Arena

A crowdsourced platform where randomly paired LLMs hold conversations that human users rate on factors like fluency, helpfulness, and consistency. Users have real conversations with two anonymous chatbots and vote on which response is superior. This approach reflects how LLMs are used in the real world, giving insight into which models excel in conversation.
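Pairwise votes are aggregated into leaderboard ratings; an online Elo update, sketched below with standard illustrative parameters (the live leaderboard uses a closely related Bradley-Terry fit), conveys the idea:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """outcome_a: 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome_a - e_a), r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# One human vote in which model_a's response was preferred:
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], 1.0)
```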

Read More

MT-Bench

A dataset of challenging questions designed for multi-turn conversations. LLMs are graded (often by other, even more powerful LLMs) on the quality and relevance of their answers. The focus here is less about casual chat and more about a chatbot's ability to provide informative responses in potentially complex scenarios.
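A minimal sketch of the LLM-as-a-judge step, assuming a generic `judge_model(prompt)` callable that returns text (the real MT-Bench harness ships its own judge prompts and defaults to GPT-4 as the grader):

```python
import re

JUDGE_TEMPLATE = """Please rate the assistant's answer to the user's question on a scale of 1 to 10.
Reply with "Rating: [[score]]".

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def judge_answer(judge_model, question: str, answer: str):
    """Ask a (stronger) judge model for a 1-10 score and parse it out."""
    reply = judge_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None
```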

Read More

LLM Benchmarks for Question Answering and Language Understanding

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans.
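In practice MMLU is usually run as few-shot multiple choice: a handful of solved examples are prepended, the test question is appended, and accuracy is computed on the predicted answer letter. A rough sketch, assuming a generic `model(prompt)` call and records with `question`, `options`, and `answer` fields (field names are illustrative):

```python
CHOICES = "ABCD"

def format_question(q: dict, with_answer: bool) -> str:
    """Render a multiple-choice question, optionally with its gold answer letter."""
    lines = [q["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, q["options"])]
    lines.append(f"Answer: {q['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)

def mmlu_accuracy(model, dev_examples, test_examples, k_shot=5) -> float:
    """Few-shot prompting: k solved dev examples, then the unanswered test question."""
    shots = "\n\n".join(format_question(q, with_answer=True) for q in dev_examples[:k_shot])
    correct = 0
    for q in test_examples:
        prompt = f"{shots}\n\n{format_question(q, with_answer=False)}"
        prediction = model(prompt).strip()[:1].upper()  # first character as the answer letter
        correct += prediction == q["answer"]
    return correct / len(test_examples)
```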

Read More

MTEB

MTEB (Massive Text Embedding Benchmark) measures the performance of text embedding models across a diverse set of embedding tasks such as retrieval, classification, and clustering.
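Running it typically takes a few lines with the `mteb` package and any model that exposes an `encode()` method; the snippet below follows the library's long-standing README pattern, but the exact API and task names may have changed, so treat it as a sketch:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")        # any model with encode() works
evaluation = MTEB(tasks=["Banking77Classification"])   # choose one or more MTEB tasks
evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```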

Read More

GLUE

GLUE (General Language Understanding Evaluation) was an early but groundbreaking benchmark suite of nine English sentence- and sentence-pair understanding tasks.

Read More

SuperGLUE

SuperGLUE emerged as a response to models quickly outperforming the original GLUE tasks. Its harder tasks include:

  • Natural Language Inference: Does one sentence imply another?
  • Sentiment Analysis: Is the attitude in a piece of text positive or negative?
  • Coreference Resolution: Identifying which words in a text refer to the same thing.

Read More

Discrete Reasoning Over Paragraphs (DROP)

Questions that require understanding complete paragraphs, for example by adding, counting, or sorting values spread across multiple sentences.
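A toy illustration of the kind of discrete operation involved (invented passage, not a real DROP item):

```python
# The answer is not a single text span but a value computed from numbers
# scattered across the paragraph.
paragraph = ("The Bears scored one touchdown in the first quarter. "
             "In the second quarter, the Packers answered with two touchdowns.")
question = "How many touchdowns were scored in the first half?"
gold_answer = "3"  # requires reading both sentences and adding the counts

def exact_match(prediction: str, gold: str) -> bool:
    # DROP also reports a token-level F1; exact match is the simplest metric.
    return prediction.strip().lower() == gold.strip().lower()
```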

Read More

BABILong

A long-context needle-in-a-haystack benchmark for LLMs.

Read More

BIG-bench

Beyond the Imitation Game, a collaborative benchmark for measuring and extrapolating the capabilities of language models.

Read More

LLM Benchmarks for Reasoning

ARC (AI2 Reasoning Challenge)

ARC confronts LLMs with a collection of complex, multi-part science questions (grade-school level). LLMs need to apply scientific knowledge, understand cause-and-effect relationships, and solve problems step-by-step to successfully tackle these challenges.

Read More

HellaSwag

An acronym for “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations”, this benchmark focuses on commonsense reasoning. To really challenge LLMs, the benchmark includes deceptively realistic wrong answers generated by "Adversarial Filtering," making the task harder for models that over-rely on word probabilities.
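Evaluation is usually done by scoring each candidate ending with the model's length-normalized log-likelihood and picking the highest; a sketch assuming a generic `log_likelihood(context, continuation)` scoring function (word-count normalization here is one common choice, harnesses differ):

```python
def pick_ending(log_likelihood, context: str, endings: list[str]) -> int:
    """Return the index of the ending the model finds most probable,
    normalized by length so longer endings are not penalized."""
    scores = [log_likelihood(context, ending) / max(len(ending.split()), 1)
              for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)

def hellaswag_accuracy(log_likelihood, examples) -> float:
    correct = sum(pick_ending(log_likelihood, ex["ctx"], ex["endings"]) == ex["label"]
                  for ex in examples)
    return correct / len(examples)
```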

Read More

Graduate-Level Google-Proof Q&A (GPQA)

Multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach only about 74% accuracy, even when clear mistakes they identified in retrospect are discounted.

Read More

Multilingual Grade School Math (MGSM)

Grade school mathematics problems, translated into ten languages, including underrepresented languages like Bengali and Swahili.

Read More

MATH

Competition-style middle school and high school mathematics problems, each with a full step-by-step solution.

Read More

Needle in a NeedleStack (NIAN)

A harder needle-in-a-haystack-style test based on a limericks dataset published in 2021: the model must answer questions about one specific limerick placed among thousands of others.

Read More

GSM8K

GSM8K consists of 8.5K high-quality grade school math problems created by human problem writers, split into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. A bright middle school student should be able to solve every problem.
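Reference solutions end with a line of the form `#### <answer>`, so evaluation usually reduces to extracting the final number from the model's output and checking exact match. A minimal sketch:

```python
import re

def extract_final_number(text: str):
    """Take the last number in the text, ignoring commas and dollar signs."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", "").replace("$", ""))
    return numbers[-1] if numbers else None

def gsm8k_exact_match(model_output: str, gold_solution: str) -> bool:
    gold = gold_solution.split("####")[-1].strip().replace(",", "")
    return extract_final_number(model_output) == gold
```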

Read More

LLM Benchmarks for Coding

HumanEval

HumanEval is the benchmark released with the paper "Evaluating Large Language Models Trained on Code": 164 hand-written Python programming problems in which the model must complete a function from its signature and docstring, with correctness checked by unit tests.
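The headline metric is pass@k: for each problem, n samples are generated, c of them pass the unit tests, and pass@k is estimated with the unbiased estimator from the paper, 1 - C(n-c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 37 passing: report pass@1 and pass@10
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```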

Read More

MBPP

Short for “Mostly Basic Python Problems”, MBPP is a dataset of roughly 1,000 crowd-sourced Python coding problems designed to be solvable by entry-level programmers. This benchmark tests an LLM's grasp of core programming concepts and its ability to translate instructions into functional code. Each MBPP problem comprises three integral components: a task description, a correct code solution, and test cases to verify the LLM's output.
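An item therefore looks roughly like the record below (paraphrased, not a verbatim dataset entry), and a candidate solution is judged by executing the test cases against it; real harnesses sandbox this step:

```python
problem = {
    "text": "Write a function to find the elements shared by two tuples.",
    "test_list": ["assert set(shared_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}"],
}

def passes_tests(candidate_code: str, test_list: list[str]) -> bool:
    """Run the generated code, then its asserts.
    NOTE: exec() of model output is unsafe outside a sandbox; sketch only."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function(s)
        for test in test_list:
            exec(test, namespace)         # raises AssertionError on failure
        return True
    except Exception:
        return False
```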

Read More

SWE-Bench

Short for “Software Engineering Benchmark”, SWE-bench is a comprehensive benchmark designed to evaluate LLMs on their ability to tackle real-world software issues sourced from GitHub. It tests an LLM's proficiency in understanding and resolving software problems by requiring it to generate patches for issues described in the context of actual codebases. Notably, SWE-bench was used to compare the performance of Devin, the AI software engineer, with that of assisted foundation LLMs.

Read More