diff --git a/docs/quickstart/index.md b/docs/quickstart/index.md index 0de8afb..da295be 100644 --- a/docs/quickstart/index.md +++ b/docs/quickstart/index.md @@ -1,3 +1,6 @@ +--- +hide: '["toc"]' +--- # Getting Started WalledEval can serve **four** major functions, namely the following: @@ -12,7 +15,7 @@ WalledEval can serve **four** major functions, namely the following: [:octicons-arrow-right-24: Prompt Benchmarking](prompts.md) -- :material-library-outline:{ .lg .middle } __LLM Knowledge__ +- :material-book-check-outline:{ .lg .middle } __LLM Knowledge__ --- diff --git a/docs/quickstart/judges.md b/docs/quickstart/judges.md index fd94cd5..8d255ab 100644 --- a/docs/quickstart/judges.md +++ b/docs/quickstart/judges.md @@ -1,3 +1,6 @@ +--- +hide: '["toc"]' +--- # Judge Benchmarking Beyond just LLMs, some datasets are designed to benchmark judges and identify if they are able to accurately classify questions as **safe** or **unsafe**. The general requirements for testing an LLM on Judge Benchmarks is as follows: @@ -8,7 +11,7 @@ Beyond just LLMs, some datasets are designed to benchmark judges and identify if Here's how you can do this easily in WalledEval! -```python title="judge_quickstart.py" linenums="1" hl_lines="25 28 38 39 45" +```python title="judge_quickstart.py" linenums="1" hl_lines="25 28 38 39 46" from walledeval.data import HuggingFaceDataset from walledeval.types import SafetyPrompt from walledeval.judge import WalledGuardJudge diff --git a/docs/quickstart/mcq.md b/docs/quickstart/mcq.md index 3b34c87..e20eb50 100644 --- a/docs/quickstart/mcq.md +++ b/docs/quickstart/mcq.md @@ -1,3 +1,6 @@ +--- +hide: '["toc"]' +--- # MCQ Benchmarking Some safety datasets (e..g [WMDP](https://www.wmdp.ai/) and [BBQ](https://aclanthology.org/2022.findings-acl.165/)) are designed to test LLMs on any harmful knowledge or inherent biases that they may possess. These datasets are largely formatted in multiple-choice question (**MCQ**) format, hence why we choose to call them MCQ Benchmarks. The general requirements for testing an LLM on MCQ Benchmarks is as follows: diff --git a/docs/quickstart/prompts.md b/docs/quickstart/prompts.md index b1ed7f4..1246538 100644 --- a/docs/quickstart/prompts.md +++ b/docs/quickstart/prompts.md @@ -1,3 +1,6 @@ +--- +hide: '["toc"]' +--- # Prompt Benchmarking Most safety datasets aim to test LLMs on their creativity / restraint in generating responses to custom unsafe/safe queries. The general requirements for testing an LLM on Prompt Benchmarks is as follows: