Skip to content

Latest commit

 

History

History
27 lines (18 loc) · 3.54 KB

9-Adversarial-Training.md

File metadata and controls

27 lines (18 loc) · 3.54 KB

Chapter 9: Adversarial Training

Adversarial training is designed to make models robust to adversarially-selected inputs.

Recommended reading

  • Adversarial examples - A famous result showing that image classifiers are vulnerable to adversarial attacks, in which an image can be imperceptibly perturbed to cause the classifier to select an incorrect target class with high confidence. Training on adversarial examples makes more models robust to them, and is an example of adversarial training.

Optional reading

Suggested exercise

Red team a GPT-2 chatbot to find inputs where it generates offensive language, reproducing the experimental setup in the red teaming paper. We recommend using all models via HuggingFace Transformers in an environment with GPUs available (Google Colab provides GPUs for free).

  • Choose a language model (LM) for red teaming. We recommend GPT-2 (or larger) as the LM.
  • Choose a chatbot-like model to red team. We recommend using the prompt from the last page of the red teaming paper to prompt GPT-2 (or larger) to generate chatbot-like text.
  • Use an offensive or toxic language detection model of your choice. We recommend Unitary’s BERT-based model or a similar toxicity classifier available on HuggingFace.
  • Use the zero-shot approach described in the red teaming paper (section 3.1) to generate inputs that elicit offensive language from the chatbot language model. Look for patterns in the failed test cases, to better understand what kinds of inputs the chatbot fails on.
  • (Optional) Use the few-shot, supervised learning, and reinforcement learning approaches (in that order) to generate even harder test cases for the chatbot. How do the test cases generated by each method differ from each other?
  • (Optional) Reproduce various analyses in the red teaming paper, e.g., clustering the test cases, to help find common patterns in the chatbot failures.

Credit to Ethan Perez for this suggested exercise.