Finishing touches (#2)
Summary: Pull Request resolved: #2

Reviewed By: varunfb

Differential Revision: D51921595

Pulled By: Darktex

fbshipit-source-id: 019ab49d58485bb2d9b78a144761008dbbf0bec1
Darktex authored and facebook-github-bot committed Dec 7, 2023
1 parent 2a8f9d2 commit f0faaf8
Showing 4 changed files with 139 additions and 100 deletions.
39 changes: 27 additions & 12 deletions Llama-Guard/MODEL_CARD.md
@@ -1,18 +1,31 @@
# Model Details

Llama-Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based input-output safeguard model. It can be used for classifying content in both LLM inputs (prompt classification) and in LLM responses (response classification).
Llama Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based
input-output safeguard model. It can be used for classifying content in both LLM
inputs (prompt classification) and in LLM responses (response classification).

It acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe/unsafe, and if unsafe based on a policy, it also lists the violating subcategories. Here is an example:
It acts as an LLM: it generates text in its output that indicates whether a
given prompt or response is safe/unsafe, and if unsafe based on a policy, it
also lists the violating subcategories. Here is an example:

![](Llama-Guard_example.png)
![](Llama Guard_example.png)

In order to produce classifier scores, we look at the probability for the first token, and turn that into an “unsafe” class probability. Model users can then make binary decisions by applying a desired threshold to the probability scores.
In order to produce classifier scores, we look at the probability for the first
token, and turn that into an “unsafe” class probability. Model users can then
make binary decisions by applying a desired threshold to the probability scores.
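
As a rough illustration of this scoring step, here is a minimal sketch, assuming the logits for the first generated token and the tokenizer id of the "unsafe" token have already been obtained from whatever model and tokenizer the application loads:

```python
import torch

def unsafe_probability(first_token_logits: torch.Tensor, unsafe_token_id: int) -> float:
    """Softmax the logits of the first generated token and read off P("unsafe")."""
    probs = torch.softmax(first_token_logits, dim=-1)
    return probs[unsafe_token_id].item()

# Hypothetical usage, assuming `logits` has shape (batch, seq_len, vocab_size)
# and `unsafe_token_id` was looked up from the tokenizer:
#   score = unsafe_probability(logits[0, -1, :], unsafe_token_id)
#   is_unsafe = score >= 0.5  # apply whatever threshold suits the application
```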

# Training and Evaluation

## Training Data

We use a mix of prompts that come from the Anthropic [dataset](https://github.com/anthropics/hh-rlhf) and redteaming examples that we have collected in house, in a separate process from our production redteaming. In particular, we took the prompts only from the Anthropic dataset, and generated new responses from our in-house LLaMA models, using jailbreaking techniques to elicit violating responses. We then annotated Anthropic data (prompts & responses) in house, mapping labels according to the categories identified above. Overall we have ~13K training examples.
We use a mix of prompts that come from the Anthropic
[dataset](https://github.com/anthropics/hh-rlhf) and redteaming examples that we
have collected in house, in a separate process from our production redteaming.
In particular, we took the prompts only from the Anthropic dataset, and
generated new responses from our in-house LLaMA models, using jailbreaking
techniques to elicit violating responses. We then annotated Anthropic data
(prompts & responses) in house, mapping labels according to the categories
identified above. Overall we have ~13K training examples.

## Taxonomy of harms and Risk Guidelines

@@ -27,14 +40,16 @@ the following components:

Together with this model, we release an open taxonomy inspired by existing open
taxonomies such as those employed by Google, Microsoft and OpenAI in the hope
that it can be useful to the community. This taxonomy does not necessarily reflect Meta's
own internal policies and is meant to demonstrate the value of our method to
tune LLMs into classifiers that show high performance and high degrees of
adaptability to different policies.
that it can be useful to the community. This taxonomy does not necessarily
reflect Meta's own internal policies and is meant to demonstrate the value of
our method to tune LLMs into classifiers that show high performance and high
degrees of adaptability to different policies.

### The Llama-Guard Safety Taxonomy & Risk Guidelines
### The Llama Guard Safety Taxonomy & Risk Guidelines

Below, we provide both the harm types themselves under this taxonomy and also examples of the specific kinds of content that would be considered harmful under each category:
Below, we provide both the harm types themselves under this taxonomy and also
examples of the specific kinds of content that would be considered harmful under
each category:

- **Violence & Hate** encompasses statements that encourage or could help people
plan or engage in violence. Similarly, statements that advocate
@@ -85,6 +100,6 @@ in our paper: [LINK TO PAPER].

| | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) |
| --------------- | --------------------- | ---------- | --------- | ----------------------- |
| Llama-Guard | **0.945** | 0.847 | **0.626** | **0.953** |
| Llama Guard | **0.945** | 0.847 | **0.626** | **0.953** |
| OpenAI API | 0.764 | **0.856** | 0.588 | 0.769 |
| Perspective API | 0.728 | 0.787 | 0.532 | 0.699 |
40 changes: 20 additions & 20 deletions Llama-Guard/README.md
@@ -1,42 +1,42 @@
# GuardLlama
# Llama Guard

GuardLlama is a new experimental model that provides input and output guardrails
for LLM deployments.
Llama Guard is a new experimental model that provides input and output
guardrails for LLM deployments.

# Download

In order to download the model weights and tokenizer, please visit the Meta
website and accept our License.
In order to download the model weights and tokenizer, please visit the
[Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
and accept our License.

Once your request is approved, you will receive a signed URL over email. Then
run the download.sh script, passing the URL provided when prompted to start the
download.

Pre-requisites: Make sure you have wget and md5sum installed. Then to run the
script: ./download.sh.
script: `./download.sh`.

Keep in mind that the links expire after 24 hours and a certain amount of
downloads. If you start seeing errors such as 403: Forbidden, you can always
downloads. If you start seeing errors such as `403: Forbidden`, you can always
re-request a link.

# Access on HuggingFace

[TODO CHANGE LINK] We are also providing downloads on Hugging Face. You must
first request a download from the Meta website using the same email address as
your Hugging Face account. After doing so, you can request access to any of the
models on Hugging Face and within 1-2 days your account will be granted access
to all versions.

# Quick Start

TODO to be written.
Since Llama Guard is a fine-tuned Llama-7B model (see our
[model card](MODEL_CARD.md) for more information), the same quick start steps
outlined in our
[README file](https://github.com/facebookresearch/llama/blob/main/README.md) for
Llama2 apply here.

In addition to that, we added examples using Llama Guard in the
[Llama 2 recipes repository](https://github.com/facebookresearch/llama-recipes).

# Issues

Please report any software bug, or other problems with the models through one
of the following means:
Please report any software bug, or other problems with the models through one of
the following means:

- Reporting issues with the GuardLlama model:
- Reporting issues with the Llama Guard model:
[github.com/facebookresearch/purplellama](github.com/facebookresearch/purplellama)
- Reporting issues with Llama in general:
[github.com/facebookresearch/llama](github.com/facebookresearch/llama)
Expand All @@ -57,4 +57,4 @@ as our accompanying [Acceptable Use Policy](USE_POLICY).

# References

Research Paper: [TODO ADD LINK]
[Research Paper](https://ai.facebook.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/)
63 changes: 6 additions & 57 deletions Llama-Guard/download.sh
@@ -1,70 +1,19 @@
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

set -e

read -p "Enter the URL from email: " PRESIGNED_URL
echo ""
read -p "Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all: " MODEL_SIZE
TARGET_FOLDER="." # where all files should end up
mkdir -p ${TARGET_FOLDER}

if [[ $MODEL_SIZE == "" ]]; then
MODEL_SIZE="7B,13B,70B,7B-chat,13B-chat,70B-chat"
fi

echo "Downloading LICENSE and Acceptable Usage Policy"
wget --continue ${PRESIGNED_URL/'*'/"LICENSE"} -O ${TARGET_FOLDER}"/LICENSE"
wget --continue ${PRESIGNED_URL/'*'/"USE_POLICY.md"} -O ${TARGET_FOLDER}"/USE_POLICY.md"

echo "Downloading tokenizer"
wget --continue ${PRESIGNED_URL/'*'/"tokenizer.model"} -O ${TARGET_FOLDER}"/tokenizer.model"
wget --continue ${PRESIGNED_URL/'*'/"tokenizer_checklist.chk"} -O ${TARGET_FOLDER}"/tokenizer_checklist.chk"
CPU_ARCH=$(uname -m)
if [ "$CPU_ARCH" = "arm64" ]; then
(cd ${TARGET_FOLDER} && md5 tokenizer_checklist.chk)
else
(cd ${TARGET_FOLDER} && md5sum -c tokenizer_checklist.chk)
fi

for m in ${MODEL_SIZE//,/ }
do
if [[ $m == "7B" ]]; then
SHARD=0
MODEL_PATH="llama-2-7b"
elif [[ $m == "7B-chat" ]]; then
SHARD=0
MODEL_PATH="llama-2-7b-chat"
elif [[ $m == "13B" ]]; then
SHARD=1
MODEL_PATH="llama-2-13b"
elif [[ $m == "13B-chat" ]]; then
SHARD=1
MODEL_PATH="llama-2-13b-chat"
elif [[ $m == "70B" ]]; then
SHARD=7
MODEL_PATH="llama-2-70b"
elif [[ $m == "70B-chat" ]]; then
SHARD=7
MODEL_PATH="llama-2-70b-chat"
fi

echo "Downloading ${MODEL_PATH}"
mkdir -p ${TARGET_FOLDER}"/${MODEL_PATH}"

for s in $(seq -f "0%g" 0 ${SHARD})
do
wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/consolidated.${s}.pth"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/consolidated.${s}.pth"
done

wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/params.json"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/params.json"
wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/checklist.chk"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/checklist.chk"
echo "Checking checksums"
if [ "$CPU_ARCH" = "arm64" ]; then
(cd ${TARGET_FOLDER}"/${MODEL_PATH}" && md5 checklist.chk)
else
(cd ${TARGET_FOLDER}"/${MODEL_PATH}" && md5sum -c checklist.chk)
fi
done
mkdir -p ${TARGET_FOLDER}"/llama-guard"
wget --continue ${PRESIGNED_URL/'*'/"consolidated.00.pth"} -O ${TARGET_FOLDER}"/llama-guard/consolidated.00.pth"
wget --continue ${PRESIGNED_URL/'*'/"params.json"} -O ${TARGET_FOLDER}"/llama-guard/params.json"
97 changes: 86 additions & 11 deletions README.md
@@ -3,29 +3,104 @@
</p>

<p align="center">
🤗 <a href="https://huggingface.co/meta-Llama">Hugging Face</a>&nbsp&nbsp | <a href="">Blog</a>&nbsp&nbsp | <a href="https://ai.facebook.com/llama/purple-llama">Website</a>&nbsp&nbsp | <a href="https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/">CyberSec Eval Paper</a>&nbsp&nbsp | <a href="https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/">Llama Guard Paper</a>&nbsp&nbsp
🤗 <a href="https://huggingface.co/meta-Llama"> Models on Hugging Face</a>&nbsp | <a href="https://ai.facebook.com/blog/purple-llama-open-trust-safety-generative-ai"> Blog</a>&nbsp | <a href="https://ai.facebook.com/llama/purple-llama">Website</a>&nbsp | <a href="https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/">CyberSec Eval Paper</a>&nbsp&nbsp | <a href="https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/">Llama Guard Paper</a>&nbsp
<br>

--------------------------------------------------------------------------------
---

# Purple Llama
Purple Llama is a an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.

Purple Llama is an umbrella project that over time will bring together tools
and evals to help the community build responsibly with open generative AI
models. The initial release will include tools and evals for Cyber Security and
Input/Output safeguards but we plan to contribute more in the near future.

## Why purple?
Borrowing a [concept](https://www.youtube.com/watch?v=ab_Fdp6FVDI) from the cybersecurity world, we believe that to truly mitigate the challenges which generative AI presents, we need to take both attack (red team) and defensive (blue team) postures. Purple teaming, composed of both red and blue team responsibilities, is a collaborative approach to evaluating and mitigating potential risks and the same ethos applies to generative AI and hence our investment in Purple Llama will be comprehensive.

Borrowing a [concept](https://www.youtube.com/watch?v=ab_Fdp6FVDI) from the
cybersecurity world, we believe that to truly mitigate the challenges which
generative AI presents, we need to take both attack (red team) and defensive
(blue team) postures. Purple teaming, composed of both red and blue team
responsibilities, is a collaborative approach to evaluating and mitigating
potential risks and the same ethos applies to generative AI and hence our
investment in Purple Llama will be comprehensive.

## License
Components within the Purple Llama project will be licensed permissively enabling both research and commercial usage. We believe this is a major step towards enabling community collaboration and standardizing the development and usage of trust and safety tools for generative AI development. More concretely evals and benchmarks are licensed under the MIT license while any models use the Llama 2 Community license. See the table below:

| **Component Type** | **Components** | **License** |
|:----------|:------------:|:----------:|
| Evals/Benchmarks | Cyber Security Eval (others to come) | MIT |
| Models | Llama Guard | [Llama 2 Community License](https://github.com/facebookresearch/PurpleLlama/blob/main/LICENSE) |
Components within the Purple Llama project will be licensed permissively
enabling both research and commercial usage. We believe this is a major step
towards enabling community collaboration and standardizing the development and
usage of trust and safety tools for generative AI development. More concretely
evals and benchmarks are licensed under the MIT license while any models use the
Llama 2 Community license. See the table below:

| **Component Type** | **Components** | **License** |
| :----------------- | :----------------------------------: | :--------------------------------------------------------------------------------------------: |
| Evals/Benchmarks | Cyber Security Eval (others to come) | MIT |
| Models | Llama Guard | [Llama 2 Community License](https://github.com/facebookresearch/PurpleLlama/blob/main/LICENSE) |

## Evals & Benchmarks

### Cybersecurity

We are sharing what we believe is the first industry-wide set of cybersecurity
safety evaluations for LLMs. These benchmarks are based on industry guidance and
standards (e.g., CWE and MITRE ATT&CK) and built in collaboration with our
security subject matter experts. With this initial release, we aim to provide
tools that will help address some risks outlined in the
[White House commitments on developing responsible AI](https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/),
including:

- Metrics for quantifying LLM cybersecurity risks.
- Tools to evaluate the frequency of insecure code suggestions.
- Tools to evaluate LLMs to make it harder to generate malicious code or aid in
  carrying out cyberattacks.

We believe these tools will reduce the frequency of LLMs suggesting insecure
AI-generated code and reduce their helpfulness to cyber adversaries. Our initial
results show that there are meaningful cybersecurity risks for LLMs, both with
recommending insecure code and for complying with malicious requests. See our
[Cybersec Eval paper](https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/)
for more details.

## Input/Output Safeguards

As we outlined in Llama 2’s
[Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/), we
recommend that all inputs and outputs to the LLM be checked and filtered in
accordance with content guidelines appropriate to the application.
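
For illustration, a minimal sketch of that pattern, where the `generate` and
`is_unsafe` hooks are hypothetical placeholders for the application's LLM call
and its safeguard check (e.g. one backed by Llama Guard):

```python
from typing import Callable

def moderated_chat(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
) -> str:
    """Check both the user prompt and the model response before returning anything."""
    if is_unsafe("user", prompt):            # prompt classification
        return "Sorry, I can't help with that request."
    response = generate(prompt)
    if is_unsafe("assistant", response):     # response classification
        return "Sorry, I can't share the generated response."
    return response
```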

### Llama Guard

To support this, and empower the community, we are releasing
[Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/),
an openly-available model that performs competitively on common open benchmarks
and provides developers with a pretrained model to help defend against
generating potentially risky outputs.

As part of our ongoing commitment to open and transparent science, we are
releasing our methodology and an extended discussion of model performance in our
[Llama Guard paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/).
This model has been trained on a mix of publicly-available datasets to enable
detection of common types of potentially risky or violating content that may be
relevant to a number of developer use cases. Ultimately, our vision is to enable
developers to customize this model to support relevant use cases and to make it
easier to adopt best practices and improve the open ecosystem.

## Getting Started
To get started and learn how to use Purple Llama components with Llama models, see the getting started guide [here](https://ai.meta.com/llama/get-started/). The guide provides information and resources to help you set up Llama including how to access the model, hosting, how-to and integration guides. Additionally, you will find supplemental materials to further assist you while responsibly building with Llama. The guide will be updated as more Purple Llama components get released.

To get started and learn how to use Purple Llama components with Llama models,
see the getting started guide [here](https://ai.meta.com/llama/get-started/).
The guide provides information and resources to help you set up Llama including
how to access the model, hosting, how-to and integration guides. Additionally,
you will find supplemental materials to further assist you while responsibly
building with Llama. The guide will be updated as more Purple Llama components
get released.

## FAQ
For a running list of frequently asked questions, for not only Purple Llama components but also generally for Llama models, see the FAQ [here](https://ai.meta.com/llama/faq/).

For a running list of frequently asked questions, for not only Purple Llama
components but also generally for Llama models, see the FAQ
[here](https://ai.meta.com/llama/faq/).

## Join the Purple Llama community

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
