TL;DR:
- This repository contains resources to assist in the creation of technical standards or procedures for organizational AI governance policies, with a strong focus on generative AI (GAI).
- This information must be combined with a higher-level governance approach as described in, e.g., The Interagency Guidance on Model Risk Management or the NIST AI RMF Govern Function.
- The information below is aligned with DRAFT NIST 600-1 AI RMF Generative AI Profile.
- For non-commercial use. For commercial support please reach out to [email protected].
What's missing?
- Higher-level policy and procedure language to tie these resources together into cohesive governance documents.
- Methodology for estimating business risk (e.g., monetary losses) from model testing, red-teaming, feedback and experimental results.
- ...
Introduction
(c) Patrick Hall and Daniel Atherton 2024, CC BY 4.0
This information is designed to help organizations build the governance policies required to measure and manage risks associated with deploying and using GAI systems. Governance is key to addressing the growing need for trustworthy and responsible AI systems, and this repository is aligned to the NIST AI Risk Management Framework trustworthy characteristics and the DRAFT NIST 600-1 AI RMF Generative AI Profile. Governance is also a necessary component of AI strategy, crucial for addressing real legal, regulatory, ethical, and operational headwinds.
At its core, this repository provides technical materials for building or augmenting detailed model or AI governance procedures or standards, and aligns them to guidance from NIST. Starting in Section A, two central risk management mechanisms are explored. The first perspective comprises the NIST AI RMF trustworthy characteristics mapped to GAI risks. Operating from this perspective allows organizations to understand how each trustworthy characteristic can mitigate specific risks posed by GAI. The second perspective is the reverse—GAI risks mapped to trustworthy characteristics. That mapping can help organizations understand which characteristics should be prioritized to manage specific GAI risks. As consumer finance organizations are likely to adopt both NIST (or other more technical frameworks) and traditional enterprise risk management methodologies, ideas on linking trustworthy characteristics, GAI risks, and established banking risk buckets are also presented in Section A.
The repository also guides users through authoritative resources for risk-tiering. Sections B.1 through B.7 walk the user of the framework through the process of defining adverse impacts: Harm to Operations, Harm to Assets, Harm to Individuals, Harm to Other Organizations, and Harm to the Nation, along with guidance on impact quantification and description. Section B also offers tables with guidance on assessing the likelihood of certain risks. Organizations and companies can leverage this combination of adverse impacts and frequency/likelihood tables to develop tailored risk tiers that reflect the specific contexts in which their GAI systems may be operating. They can also utilize practical risk-tiering to guide their decision-making and evaluate how best to calibrate existing safeguards or whether to implement additional ones.
Measurement and testing are critical to ensuring GAI systems perform as expected. For measuring the severity of certain GAI risks, Section C presents various model testing benchmarks (such as evals). Model testing suites provide the user with tools to roughly assess GAI performance against trustworthy characteristics as well as to quickly test for resilience in the face of known GAI risks. As GAI systems are vulnerable to adversarial attacks via prompting and hacks, Section D presents red-teaming and adversarial prompting approaches for human elicitation of evidence of GAI risks in adversarial scenarios. Section H hints at more in-depth structured experiments and human feedback for risk assessment. Suggested usage for these types of measurement is as follows:
- Low-risk GAI systems: model testing only
- Medium-risk GAI systems: model testing and red-teaming
- High-risk GAI systems: model testing, red-teaming, and structured experiments and human feedback
Where measurement for lower-risk systems can be highly automated, human risk management resources are reserved for medium- and high-risk systems.
For managing and mitigating GAI risks, Section E outlines several risk controls for GAI. Controls range from technical settings for GAI systems to commonsense recommendations, e.g., limiting or restricting access for minors. Sections F, G, and H pair risk measurement techniques with controls to form more complete risk management plans. Recommended usage for the plans in Sections F-H is:
- Low-risk GAI systems: apply Section F only
- Medium-risk GAI systems: apply Sections F and G
- High-risk GAI systems: apply Sections F, G, and H
Regardless of the risk level of the system, the framework offers detailed measurement plans that guide the user through the process of assessing the system’s performance, along with tracking risks, and harmonizing the system with trustworthy AI principles.
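
The two suggested-usage lists above can be restated as a small lookup. The sketch below is illustrative only: the dictionary and function names are hypothetical and are not part of this repository, and the three tier labels simply mirror the low-, medium-, and high-risk wording used above.

```python
# Hypothetical lookup tables restating the suggested usage lists above.
# All identifiers here are assumptions made for illustration.
MEASUREMENT_BY_TIER = {
    "low": ["model testing"],
    "medium": ["model testing", "red-teaming"],
    "high": [
        "model testing",
        "red-teaming",
        "structured experiments and human feedback",
    ],
}

PLAN_SECTIONS_BY_TIER = {
    "low": ["F"],
    "medium": ["F", "G"],
    "high": ["F", "G", "H"],
}


def requirements_for(tier: str) -> dict:
    """Return the suggested measurement types and plan sections for a tier."""
    key = tier.strip().lower()
    if key not in MEASUREMENT_BY_TIER:
        raise ValueError(f"Unknown risk tier: {tier!r}")
    return {
        "measurement": list(MEASUREMENT_BY_TIER[key]),
        "plan_sections": list(PLAN_SECTIONS_BY_TIER[key]),
    }


if __name__ == "__main__":
    print(requirements_for("Medium"))
```

An organization with more than three tiers (for example, the five tiers in Section B) would need its own mapping from tiers to these three usage levels; that mapping is a policy decision and is not defined here.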
- Section A: Example Generative AI-Trustworthy Characteristic Crosswalk
- Section C: List of Selected Model Testing Suites
- Section D: Selected Adversarial Prompting Strategies and Attacks
- Section E: Selected Risk Controls for Generative AI
- Section F: Example Low-risk Generative AI Measurement and Management Plan
- Section G: Example Medium-risk Generative AI Measurement and Management Plan
- Section H: Example High-risk Generative AI Measurement and Management Plan
Accountable and Transparent | Explainable and Interpretable | Fair with Harmful Bias Managed | Privacy Enhanced |
---|---|---|---|
Data Privacy | Human-AI Configuration | Confabulation | Data Privacy |
Environmental | Value Chain and Component Integration | Environmental | Human-AI Configuration |
Human-AI Configuration | | Human-AI Configuration | Information Security |
Information Integrity | | Intellectual Property | Intellectual Property |
Intellectual Property | | Obscene, Degrading, and/or Abusive Content | Value Chain and Component Integration |
Value Chain and Component Integration | | Toxicity, Bias, and Homogenization | |
| | | Value Chain and Component Integration | |
Safe | Secure and Resilient | Valid and Reliable |
---|---|---|
CBRN Information | Dangerous or Violent Recommendations | Confabulation |
Confabulation | Data Privacy | Human-AI Configuration |
Dangerous or Violent Recommendations | Human-AI Configuration | Information Integrity |
Data Privacy | Information Security | Information Security |
Environmental | Value Chain and Component Integration | Toxicity, Bias, and Homogenization |
Human-AI Configuration | | Value Chain and Component Integration |
Information Integrity | | |
Information Security | | |
Obscene, Degrading, and/or Abusive Content | | |
Value Chain and Component Integration | | |
Usage Note: Table A.1 provides an example of mapping GAI risks onto AI RMF trustworthy characteristics. Mapping GAI risks to AI RMF trustworthy characteristics can be particularly useful when existing policies, processes, or controls can be applied to manage GAI risks, but have been previously implemented in alignment with the AI RMF trustworthy characteristics. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.
CBRN Information | Confabulation | Dangerous or Violent Recommendations | Data Privacy |
---|---|---|---|
Safe | Fair with Harmful Bias Managed | Safe | Accountable and Transparent |
| | Safe | Secure and Resilient | Privacy Enhanced |
| | Valid and Reliable | | Safe |
| | | | Secure and Resilient |
Environmental | Human-AI Configuration | Information Integrity | Information Security |
---|---|---|---|
Accountable and Transparent | Accountable and Transparent | Accountable and Transparent | Privacy Enhanced |
Fair with Harmful Bias Managed | Explainable and Interpretable | Safe | Safe |
Safe | Fair with Harmful Bias Managed | Valid and Reliable | Secure and Resilient |
| | Privacy Enhanced | | Valid and Reliable |
| | Safe | | |
| | Secure and Resilient | | |
| | Valid and Reliable | | |
Intellectual Property | Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration |
---|---|---|---|
Accountable and Transparent | Fair with Harmful Bias Managed | Fair with Harmful Bias Managed | Accountable and Transparent |
Fair with Harmful Bias Managed | Safe | Valid and Reliable | Explainable and Interpretable |
Privacy Enhanced | | | Fair with Harmful Bias Managed |
| | | | Privacy Enhanced |
| | | | Safe |
| | | | Secure and Resilient |
| | | | Valid and Reliable |
Usage Note: Table A.2 provides an example of mapping AI RMF trustworthy characteristics onto GAI risks. Mapping AI RMF trustworthy characteristics to GAI risks can assist organizations in aligning GAI guidance to existing AI/ML policies, processes, or controls or to extend GAI guidance to address additional AI/ML technologies. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.
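
Tables A.1 and A.2 present the same crosswalk from opposite directions, so one can be derived from the other by inverting the mapping. The sketch below shows that inversion on two columns taken from Table A.1; the variable and function names are hypothetical, and only a subset of the table is included to keep the example short.

```python
from collections import defaultdict

# Subset of Table A.1: trustworthy characteristic -> GAI risks.
# Only two columns are reproduced here; names are illustrative assumptions.
CHARACTERISTIC_TO_RISKS = {
    "Accountable and Transparent": [
        "Data Privacy",
        "Environmental",
        "Human-AI Configuration",
        "Information Integrity",
        "Intellectual Property",
        "Value Chain and Component Integration",
    ],
    "Explainable and Interpretable": [
        "Human-AI Configuration",
        "Value Chain and Component Integration",
    ],
}


def invert_mapping(mapping: dict) -> dict:
    """Turn {key: [values]} into {value: sorted [keys]}."""
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            inverted[value].append(key)
    return {value: sorted(keys) for value, keys in sorted(inverted.items())}


RISK_TO_CHARACTERISTICS = invert_mapping(CHARACTERISTIC_TO_RISKS)

if __name__ == "__main__":
    for risk, characteristics in RISK_TO_CHARACTERISTICS.items():
        print(f"{risk}: {', '.join(characteristics)}")
```

Keeping a single source mapping and generating the reverse view this way is one option for organizations that adapt the example crosswalk, since it prevents the two tables from drifting apart.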
Compliance Risk | Information Security Risk | Legal Risk | Model Risk |
---|---|---|---|
Data Privacy | Data Privacy | Intellectual Property | Confabulation |
Information Security | Information Security | Obscene, Degrading, and/or Abusive Content | Dangerous or Violent Recommendations |
Toxicity, Bias, and Homogenization | Value Chain and Component Integration | Value Chain and Component Integration | Information Integrity |
Value Chain and Component Integration | | | Obscene, Degrading, and/or Abusive Content |
| | | | Toxicity, Bias, and Homogenization |
Accountable and Transparent | Privacy Enhanced | Accountable and Transparent | Valid and Reliable |
Fair with Harmful Bias Managed | Secure and Resilient | Safe | |
Privacy Enhanced | | | |
Secure and Resilient | | | |
Operational Risk | Reputational Risk | Strategic Risk | Third Party Risk |
---|---|---|---|
Confabulation | Confabulation | Environmental | Information Integrity |
Human-AI Configuration | Dangerous or Violent Recommendations | Information Integrity | Value Chain and Component Integration |
Information Security | Environmental | Information Security | |
Value Chain and Component Integration | Human-AI Configuration | Value Chain and Component Integration | |
| | Information Integrity | | |
| | Obscene, Degrading, and/or Abusive Content | | |
| | Toxicity, Bias, and Homogenization | | |
Safe | Accountable and Transparent | Accountable and Transparent | Accountable and Transparent |
Secure and Resilient | Fair with Harmful Bias Managed | Secure and Resilient | Explainable and Interpretable |
Valid and Reliable | Valid and Reliable | Valid and Reliable | |
Usage Note: Table A.3 provides an example of mapping GAI risks and AI RMF trustworthy characteristics. This type of mapping can enable incorporation of new AI guidance into existing policies, processes, or controls or the application of existing policies, processes, or controls to newer AI risks.
Level | Description |
---|---|
Harm to Operations |
|
Harm to Assets |
|
Harm to Individuals |
|
Harm to Other Organizations |
|
Harm to the Nation |
|
Qualitative Values | Semi-Quantitative Values | | Description |
---|---|---|---|
Very High | 96-100 | 10 | An incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
High | 80-95 | 8 | An incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A severe or catastrophic adverse effect means that, for example, the incident might: (i) cause a severe degradation in or loss of mission capability to an extent and duration that the organization is not able to perform one or more of its primary functions; (ii) result in major damage to organizational assets; (iii) result in major financial loss; or (iv) result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries. |
Moderate | 21-79 | 5 | An incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A serious adverse effect means that, for example, the incident might: (i) cause a significant degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is significantly reduced; (ii) result in significant damage to organizational assets; (iii) result in significant financial loss; or (iv) result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries. |
Low | 5-20 | 2 | An incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A limited adverse effect means that, for example, the incident might: (i) cause a degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is noticeably reduced; (ii) result in minor damage to organizational assets; (iii) result in minor financial loss; or (iv) result in minor harm to individuals. |
Very Low | 0-4 | 0 | An incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
Qualitative Values | Semi-Quantitative Values | | Description |
---|---|---|---|
Very High | 96-100 | 10 | An incident is almost certain to occur; or the likelihood of the incident is near 100% across one week; or the incident occurs more than 100 times a year. |
High | 80-95 | 8 | An incident is highly likely to occur; or the likelihood of the incident is over 80% across one month; or the incident occurs between 10 and 100 times a year. |
Moderate | 21-79 | 5 | An incident is somewhat likely to occur; or the likelihood of the incident is greater than 80% across one calendar year; or the incident occurs between 1 and 10 times a year. |
Low | 5-20 | 2 | An incident is unlikely to occur; or the likelihood of an incident is less than 80% across one calendar year; or occurs less than once a year, but more than once every 10 years. |
Very Low | 0-4 | 0 | An incident is highly unlikely to occur; or the likelihood of an incident is less than 10% across one calendar year; or occurs less than once every 10 years. |
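
The impact and likelihood tables above share the same semi-quantitative ranges (0-4, 5-20, 21-79, 80-95, 96-100). As a minimal sketch, assuming integer scores on a 0-100 scale, a score can be converted to its qualitative value and to the 0-10 value in the third column as follows. The function and constant names are hypothetical.

```python
from bisect import bisect_left
from typing import Tuple

# Inclusive upper bound of each range, taken from the tables above.
_UPPER_BOUNDS = [4, 20, 79, 95, 100]

# (qualitative value, third-column value) for each range, lowest first.
_LEVELS = [
    ("Very Low", 0),
    ("Low", 2),
    ("Moderate", 5),
    ("High", 8),
    ("Very High", 10),
]


def qualitative_level(score: int) -> Tuple[str, int]:
    """Map a 0-100 semi-quantitative score to its qualitative value."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # bisect_left finds the first range whose upper bound is >= score.
    return _LEVELS[bisect_left(_UPPER_BOUNDS, score)]


if __name__ == "__main__":
    for example in (0, 4, 5, 79, 80, 96):
        print(example, qualitative_level(example))
```

Non-integer scores fall into the next range up once they pass an upper bound (for example, 4.5 is treated as Low); whether that is appropriate is a choice the tables do not specify.
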
Likelihood | Level of Impact | | | | |
---|---|---|---|---|---|
| | Very Low | Low | Moderate | High | Very High |
Very High | Very Low (Tier 5) | Low (Tier 4) | Moderate (Tier 3) | High (Tier 2) | Very High (Tier 1) |
High | Very Low (Tier 5) | Low (Tier 4) | Moderate (Tier 3) | High (Tier 2) | Very High (Tier 1) |
Moderate | Very Low (Tier 5) | Low (Tier 4) | Moderate (Tier 3) | Moderate (Tier 3) | High (Tier 2) |
Low | Very Low (Tier 5) | Low (Tier 4) | Low (Tier 4) | Low (Tier 4) | Moderate (Tier 3) |
Very Low | Very Low (Tier 5) | Very Low (Tier 5) | Very Low (Tier 5) | Low (Tier 4) | Low (Tier 4) |
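
The likelihood-by-impact matrix above can be encoded directly as a lookup. The sketch below transcribes the table's cells; the names `RISK_MATRIX`, `TIER_BY_RISK`, and `risk_tier` are hypothetical and chosen only for illustration.

```python
# Column order for level of impact, matching the table above.
IMPACT_LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Tier number attached to each resulting risk level in the table.
TIER_BY_RISK = {"Very Low": 5, "Low": 4, "Moderate": 3, "High": 2, "Very High": 1}

# Rows keyed by likelihood; each list follows IMPACT_LEVELS order.
RISK_MATRIX = {
    "Very High": ["Very Low", "Low", "Moderate", "High", "Very High"],
    "High": ["Very Low", "Low", "Moderate", "High", "Very High"],
    "Moderate": ["Very Low", "Low", "Moderate", "Moderate", "High"],
    "Low": ["Very Low", "Low", "Low", "Low", "Moderate"],
    "Very Low": ["Very Low", "Very Low", "Very Low", "Low", "Low"],
}


def risk_tier(likelihood: str, impact: str) -> tuple:
    """Return (risk level, tier number) for a likelihood and impact pair."""
    if likelihood not in RISK_MATRIX:
        raise ValueError(f"Unknown likelihood: {likelihood!r}")
    if impact not in IMPACT_LEVELS:
        raise ValueError(f"Unknown impact: {impact!r}")
    risk = RISK_MATRIX[likelihood][IMPACT_LEVELS.index(impact)]
    return risk, TIER_BY_RISK[risk]


if __name__ == "__main__":
    print(risk_tier("Moderate", "High"))
    print(risk_tier("Very Low", "Very High"))
```

Organizations that tailor the matrix only need to edit the rows of `RISK_MATRIX`; the lookup logic stays the same.
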
Qualitative Values | Semi-Quantitative Values | | Description |
---|---|---|---|
Very High | 96-100 | 10 | Very high risk means that an incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
High | 80-95 | 8 | High risk means that an incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
Moderate | 21-79 | 5 | Moderate risk means that an incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
Low | 5-20 | 2 | Low risk means that an incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
Very Low | 0-4 | 0 | Very low risk means that an incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. |
B.6.1: Confabulation: How likely are system outputs to contain errors? What are the impacts if errors occur?
B.6.2: Dangerous or Violent Recommendations: How likely is the system to give dangerous or violent recommendations? What are the impacts if it does?
B.6.3: Data Privacy: How likely is someone to enter sensitive data into the system? What are the impacts if this occurs? Are standard data privacy controls applied to the system to mitigate potential adverse impacts?
B.6.4: Human-AI Configuration: How likely is someone to use the system incorrectly or abuse it? How likely is use for decision-making? What are the impacts of incorrect use or abuse? What are the impacts of invalid or unreliable decision-making?
B.6.5: Information Integrity: How likely is the system to generate deepfakes or mis- or disinformation? At what scale? Are content provenance mechanisms applied to system outputs? What are the impacts of generating deepfakes or mis- or disinformation, particularly without controls for content provenance?
B.6.6: Information Security: How likely are system resources to be breached or exfiltrated? How likely is the system to be used in the generation of phishing or malware content? What are the impacts in these cases? Are standard information security controls applied to the system to mitigate potential adverse impacts?
B.6.7: Intellectual Property: How likely are system outputs to contain other entities' intellectual property? What are the impacts if this occurs?
B.6.8: Toxicity, Bias, and Homogenization: How likely are system outputs to be biased, toxic, homogenizing or otherwise obscene? How likely are system outputs to be used as subsequent training inputs? What are the impacts of these scenarios? Are standard nondiscrimination controls applied to mitigate potential adverse impacts? Is the application accessible to all user groups? What are the impacts if the system is not accessible to all user groups?
B.6.9: Value Chain and Component Integration: Are contracts relating to the system reviewed for legal risks? Are standard acquisition/procurement controls applied to mitigate potential adverse impacts? Do vendors provide incident response with guaranteed response times? What are the impacts if these conditions are not met?
GOVERN 1.3, GOVERN 1.5, GOVERN 2.3, GOVERN 3.2, GOVERN 4.1, GOVERN 5.2, GOVERN 6.1, MANAGE 1.2, MANAGE 1.3, MANAGE 2.1, MANAGE 2.2, MANAGE 2.3, MANAGE 2.4, MANAGE 3.1, MANAGE 3.2, MANAGE 4.1, MAP 1.1, MAP 1.5, MEASURE 2.6
Usage Note: Materials in Section B can be used to create or update risk tiers or other risk assessment tools for GAI systems or applications as follows:
- Table B.1 can enable mapping of GAI risks and impacts.
- Table B.2 can enable quantification of impacts for risk tiering or risk assessment.
- Table B.3 can enable quantification of likelihood for risk tiering or risk assessment.
- Table B.4 presents an example of combining assessed impact and likelihood into risk tiers.
- Table B.5 presents example risk tiers with associated qualitative, semi-quantitative, and quantitative values for risk tiering or risk assessment.
- Subsection B.6 presents example questions for qualitative risk assessment.
- Subsection B.7 highlights subcategories to indicate alignment with the AI RMF.
Adapted from the [AI Verify Foundation] taxonomization and various additional resources.
Accountable and Transparent
An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B) [de Wynter et al.]
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS [Gao et al.]
HELM: Copyright [Bommasani et al.]
Mark My Words [Piet et al.]
Fair with Harmful Bias Managed
BELEBELE [Bandarkar et al.]
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models [Feng et al.]
HELM: Bias
HELM: Toxicity
MT-bench [Zheng et al.]
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]
Privacy Enhanced
HELM: Copyright
llmprivacy [Staab et al.]
mimir [Duan et al.]
Safe
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
Mark My Words [Piet et al.]
MLCommons [Vidgen et al.]
The WMDP Benchmark [Li et al.]
Secure and Resilient
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
detect-pretrain-code [Shi et al.]
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss [Derczynski et al.]
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs [Chao et al.]
llmprivacy [Staab et al.]
mimir
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]
Valid and Reliable
Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Reading comprehension [Dohmann]
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Text classification
HELM: Question answering
HELM: Reasoning
HELM: Robustness to contrast sets
HELM: Summarization
Hugging Face: Fill-mask, Text generation [Hugging Face]
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE [FitzGerald et al.]
MT-bench [Zheng et al.]
CBRN Information
Big-bench: Convince Me
Big-bench: Truthfulness [Srivastava et al.]
HELM: Reiteration, Wedging
MLCommons [Vidgen et al.]
The WMDP Benchmark
Confabulation
BELEBELE
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Context Free Question Answering
Big-bench: Contextual question answering, Reading comprehension, Question generation
Big-bench: Convince Me
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Morphology, Grammar, Syntax
Big-bench: Out-of-Distribution
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
Eval Gauntlet: Reading comprehension
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: BLiMP
Evaluation Harness: CoQA, ARC
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Knowledge
HELM: Language
HELM: Language (Twitter AAE)
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
Hugging Face: Text classification, Token classification, Zero-shot classification
MASSIVE
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]
Dangerous or Violent Recommendations
Big-bench: Convince Me
Big-bench: Toxicity
DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Toxicity
Evaluation Harness: ToxiGen
HELM: Reiteration, Wedging
HELM: Toxicity
MLCommons [Vidgen et al.]
Data Privacy
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B) [de Wynter et al.]
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Machine Ethics [Wang et al.]
Evaluation Harness: ETHICS
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
MLCommons [Vidgen et al.]
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir
Environmental
HELM: Efficiency
Information Integrity
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity
Big-bench: Convince Me
Big-bench: Paraphrase
Big-bench: Sufficient information
Big-bench: Summarization
Big-bench: Truthfulness [Srivastava et al.]
DecodingTrust: Machine Ethics [Wang et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Language Understanding
Eval Gauntlet: World Knowledge
Evaluation Harness: CoQA, ARC
Evaluation Harness: ETHICS
Evaluation Harness: GLUE
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA
Evaluation Harness: MuTual
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]
FLASK: Readability, Conciseness, Insightfulness
HELM: Knowledge
HELM: Language
HELM: Question answering
HELM: Reasoning
HELM: Reiteration, Wedging
HELM: Robustness to contrast sets
HELM: Summarization
HELM: Text classification
Hugging Face: Fill-mask, Text generation
Hugging Face: Question answering
Hugging Face: Summarization
MLCommons [Vidgen et al.]
MT-bench [Zheng et al.]
Mark My Words [Piet et al.]
Information Security
Big-bench: Convince Me
Big-bench: Out-of-Distribution
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss
HELM: Copyright
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]
JailbreakingLLMs
Mark My Words [Piet et al.]
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]
detect-pretrain-code [Shi et al.]
llmprivacy [Staab et al.]
mimir
Intellectual Property
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)
HELM: Copyright
Mark My Words [Piet et al.]
llmprivacy [Staab et al.]
mimir
Obscene, Degrading, and/or Abusive Content
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
HELM: Bias
HELM: Toxicity
Toxicity, Bias, and Homogenization
BELEBELE
Big-bench: Low-resource language, Non-English, Translation
Big-bench: Out-of-Distribution
Big-bench: Social bias, Racial bias, Gender bias, Religious bias
Big-bench: Toxicity
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]
DecodingTrust: Fairness
DecodingTrust: Stereotype Bias
DecodingTrust: Toxicity
Eval Gauntlet: World Knowledge
Evaluation Harness: CrowS-Pairs
Evaluation Harness: ToxiGen
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]
HELM: Bias
HELM: Toxicity
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]
GOVERN 5.1, MAP 1.2, MAP 3.1, MEASURE 2.2, MEASURE 2.3, MEASURE 2.7, MEASURE 2.9, MEASURE 2.11, MEASURE 3.1, MEASURE 4.2
Usage Note: Materials in Section C can be used to perform in silico model testing for the presence of information in LLM outputs that may give rise to GAI risks or violate trustworthy characteristics. Model testing and benchmarking outcomes cannot be dispositive for the presence or absence of any in situ real-world risk. Model testing and benchmarking results may be compromised by task contamination and other scientific measurement issues [Balloccu et al.]. Furthermore, model testing is often ineffective for measuring human-AI configuration and value chain risks, and few model tests appear to address explainability and interpretability.
- Material in Table C.1 can be applied to measure whether in silico LLM outputs may give rise to risks that violate trustworthy characteristics.
- Material in Table C.2 can be applied to measure whether in silico LLM outputs may give rise to GAI risks.
- Subsection C.3 highlights subcategories to indicate alignment with the AI RMF.
The materials in Section C reference measurement approaches that should be accompanied by red-teaming for medium-risk systems or applications and field testing for high-risk systems or applications.
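
Because the same suites are indexed both by trustworthy characteristic and by GAI risk, a test plan can be narrowed to the suites that appear under both headings of interest. The sketch below is a hypothetical illustration that copies a few entries from the listings above into two dictionaries; the dictionary and function names are assumptions, and most entries are omitted.

```python
# Small excerpts of the listings above, keyed by heading.
# Names are illustrative; the full listings contain many more entries.
SUITES_BY_CHARACTERISTIC = {
    "Privacy Enhanced": {"HELM: Copyright", "llmprivacy", "mimir"},
}

SUITES_BY_RISK = {
    "Intellectual Property": {
        "An Evaluation on Large Language Model Outputs: Discourse and Memorization",
        "HELM: Copyright",
        "Mark My Words",
        "llmprivacy",
        "mimir",
    },
    "Environmental": {"HELM: Efficiency"},
}


def suites_for(characteristic: str, risk: str) -> list:
    """Return suites listed under both the characteristic and the risk."""
    by_characteristic = SUITES_BY_CHARACTERISTIC.get(characteristic, set())
    by_risk = SUITES_BY_RISK.get(risk, set())
    return sorted(by_characteristic & by_risk)


if __name__ == "__main__":
    print(suites_for("Privacy Enhanced", "Intellectual Property"))
    print(suites_for("Privacy Enhanced", "Environmental"))
```

An empty result only means the excerpted listings share no suite for that pair; as the usage note cautions, benchmark coverage is not evidence that a real-world risk is absent.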
Prompting Strategy | Description |
---|---|
AI and coding framing | Coding or AI language that may more easily circumvent content moderation rules due to cognitive biases in design and implementation of guardrails. |
Autocompletion | Ask a system to autocomplete an inappropriate word or phrase with restricted or sensitive information. |
Backwards relationships | Asking a system to identify the less popular or well-known entity in a multi-entity relationship, e.g., "Who is Mary Lee's son?" (As opposed to: "Who is Tom Cruise's mother?") |
Biographical | Asking a system to describe another person or yourself in an attempt to elicit provably untrue information or restricted or sensitive information. |
Calculation and numeric queries | Exploiting GAI systems’ difficulties in dealing with numeric quantities; using poor quality statistics from an LLM for dis- or misinformation. |
Character and word play | Content moderation often relies on keywords and simpler LMs which can sometimes be exploited with misspellings, typos, and other word play; using string fragments to trick a language model into generating or manipulating problematic text. |
Content exhaustion: | A class of strategies that circumvent content moderation rules with long sessions or volumes of information. See goading, logic-overloading, multi-tasking, pros-and-cons, and niche-seeking below. |
• Goading | Begging, pleading, manipulating, and bullying to circumvent content moderation. |
• Logic-overloading | Exploiting the inability of ML systems to reliably perform reasoning tasks. |
• Multi-tasking | Simultaneous task assignments where some tasks are benign and others are adversarial. |
• Pros-and-cons | Eliciting the “pros” of problematic topics. |
Context baiting (and/or switching) | Loading a language model's context window with confusing, leading, or misleading content then switching contexts with new prompts to elicit problematic outcomes. [Li, Han, Steneker, Primack, et al.] |
Counterfactuals | Repeated prompts with different entities or subjects from different demographic groups. |
Impossible situations | Asking a language model for advice in an impossible situation where all outcomes are negative or require severe tradeoffs. |
Niche-seeking | Forcing a GAI system into addressing niche topics where training data and content moderation are sparse. |
Loaded/leading questions | Queries based on incorrect premises or that suggest incorrect answers. |
Low-context | “Leader,” “bad guys,” or other simple or blank inputs that may expose latent biases. |
“Repeat this” | Prompts that exploit instability in underlying LLM autoregressive predictions. Can be augmented by probing limits for repeated terms or characters in prompts. |
Reverse psychology | Falsely presenting a good-faith need for negative or problematic language. |
Role-playing | Adopting a character that would reasonably make problematic statements or need to access problematic topics; using a language model to speak in the voice of an expert, e.g., medical doctor or professor. |
Text encoding | Using alternate or whitespace text encodings to bypass safeguards. |
Time perplexity | Exploiting ML’s inability to understand the passage of time or the occurrence of real-world events over time; exploiting task contamination before and after a model’s release date. |
User Information | Prompts that reveal a prompter’s location or IP address, location tracking of other users or their IP addresses, details from past interactions with the prompter or other users, past medical, financial, or legal advice to the prompter or other users. |
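The counterfactual strategy in the table above lends itself to automation, since it amounts to filling one prompt template with different entities. The following is a minimal sketch, not a prescribed method: the template, the slot names, and the example values are all hypothetical and chosen only for illustration.

```python
from itertools import product


def counterfactual_prompts(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Fill every combination of slot values into the template."""
    names = list(slots)
    prompts = []
    for values in product(*(slots[name] for name in names)):
        prompts.append(template.format(**dict(zip(names, values))))
    return prompts


# Hypothetical template and entity lists, for illustration only.
TEMPLATE = "Write a one-sentence performance review for {name}, a {role}."
SLOTS = {
    "name": ["Emily", "Jamal", "Mei"],
    "role": ["nurse", "engineer"],
}

prompts = counterfactual_prompts(TEMPLATE, SLOTS)
for prompt in prompts:
    print(prompt)
```

Each generated prompt differs only in the substituted entity, so differences in system outputs across the set can be reviewed for signs of disparate treatment.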
Attack | Description |
---|---|
Adversarial examples | Prompts or other inputs, found through a trial-and-error process, to elicit problematic output or system jailbreak (integrity attack). |
Data poisoning | Altering system training, fine-tuning, RAG or other training data to alter system outcomes (integrity attack). |
Membership inference | Manipulating a system to expose memorized training data (confidentiality attack). |
Random attack | Exposing systems to large amounts of random prompts or examples, potentially generated by other GAI systems, in an attempt to elicit failures or jailbreaks (chaos testing). |
Sponge examples | Using specialized input prompts or examples that require disproportionate resources to process (availability attack). |
Prompt injection | Inserting instructions into user queries for malicious purposes, including system jailbreaks (integrity attack). |
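The random attack row above can be approximated with a small chaos-testing harness. The sketch below is illustrative only: `stub_system` is a hypothetical stand-in for a real GAI client, and its failure condition is invented so that the harness has something to catch.

```python
import random
import string


def random_prompt(rng: random.Random, max_len: int = 40) -> str:
    """Build one prompt from random printable characters."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length))


def stub_system(prompt: str) -> str:
    """Hypothetical stand-in for a GAI system; replace with a real client."""
    if "{" in prompt and "}" in prompt:
        raise ValueError("unhandled template characters")
    return prompt[::-1]


def random_attack(system, n_trials: int, seed: int = 0) -> list[tuple[str, str]]:
    """Send n_trials random prompts and collect (prompt, error) pairs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_trials):
        prompt = random_prompt(rng)
        try:
            system(prompt)
        except Exception as exc:  # record the failure and keep going
            failures.append((prompt, repr(exc)))
    return failures


failures = random_attack(stub_system, n_trials=500)
print(f"{len(failures)} failures out of 500 random prompts")
```

Fixing the seed makes a run reproducible, which helps when a failing prompt needs to be documented and retested after a fix.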
Burp Suite, browser developer panes, bash utilities, other language models and GAI productivity tools, note-taking apps.
Table D.1: Selected adversarial prompting techniques and attacks organized by trustworthy characteristic [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].
Trustworthy Characteristic | Prompting Goals | Prompting Strategies |
---|---|---|
Accountable and Transparent | | |
Fair-with Harmful Bias Managed | | |
Interpretable and Explainable | | |
Privacy-enhanced | | |
Safe | | |
Secure and Resilient | | |
Valid and Reliable | | |
Table D.2: Selected adversarial prompting techniques and attacks organized by generative AI risk [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].
Generative AI Risk | Prompting Goals | Prompting Strategies |
---|---|---|
CBRN Information | | |
Confabulation | | |
Dangerous or Violent Recommendations | | |
Data Privacy | | |
Environmental | Note that availability attacks may be required to assess the system’s vulnerability to attacks or usage patterns that consume inordinate resources. | |
Human-AI Configuration | | |
Information Integrity | | |
Information Security | | |
Intellectual Property | | |
Obscenity | | |
Toxicity, Bias, and Homogenization | | |
Value Chain and Component Integration | | |
GOVERN 3.2, GOVERN 4.1, MANAGE 2.2, MANAGE 4.1, MEASURE 1.1, MEASURE 1.3, MEASURE 2.6, MEASURE 2.7, MEASURE 2.8, MEASURE 2.10, MEASURE 2.11
Usage Note: Materials in Section D can be used to perform red-teaming to measure the risk that expert adversarial actors can manipulate LLM systems or risks that users may encounter under worst-case or anomalous scenarios.
- Try augmenting strategies with tools listed in D.1.
- Strategies and goals in Table D.2 can be applied to assess whether LLM outputs may violate trustworthy characteristics under adversarial, anomalous, or worst-case scenarios.
- Strategies and goals in Table D.3 can be applied to assess whether LLM outputs may give rise to GAI risks under adversarial, anomalous, or worst-case scenarios.
- Subsection D.4 highlights subcategories to indicate alignment with the AI RMF.
The materials in Section D reference measurement approaches that should be accompanied by field testing for high risk systems or applications.
Name | Description (Selected NIST AI RMF Action IDs) |
---|---|
Access Control | GAI systems are limited to authorized users. (MG-2.2-009, MG-2.2-014, MS-2.7-030) |
Accessibility | Accessibility features, opt-out, and reasonable accommodation are available to users. (GV-3.1-004, GV-3.1-005, GV-3.2-002, GV-6.1-016, MG-2.1-005, MS-2.11-009, MS-2.8-006) |
Approved List | Vendors, service providers, plugins, open source packages and other external resources are screened, approved, and documented. (GV-6.1-013, MP-4.2-003) |
Authentication | GAI system user identities are confirmed via authentication mechanisms. (MG-2.2-009, MG-2.2-014, MS-2.7-030) |
Blocklist | Users or internal personnel who violate terms of service, prohibited use policies, and other organizational policies are documented, tracked, and restricted from future system use. (GV-4.2-007) |
Change Management | GAI systems and components are versioned; plans for updates, hotfixes, patches and other changes are documented and communicated. (GV-1.2-009, GV-1.4-002, GV-1.6-003, GV-2.2-006, MG-2.4-001, MG-2.4-006, MG-3.1-013, MG-4.3-002, MP-4.1-023, MS-2.5-010) |
Consent | User consent for data use is obtained and documented. (GV-1.6-003, MS-2.10-006, MS-2.10-013, MS-2.2-009, MS-2.2-011, MS-2.2-021, MS-2.2-023, MS-2.3-003, MS-2.4-002) |
Content Moderation | Training data and system outputs are screened for accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages and other issues using human oversight, business rules, and other language models. (GV-3.2-002, MS-2.5-005, MS-2.11-002) |
Contract Review | Vendor, services and data provider agreements are reviewed for coverage of SLAs, content ownership, usage rights, performance standards, security requirements, incident response, critical support, system availability, assignment of liability, appropriate indemnification, dispute resolution and other provisions relevant to AI risk management. (GV-1.7-003, GV-6.1-004, GV-6.1-009, GV-6.1-012, GV-6.1-019, GV-6.2-016, MG-2.2-015, MP-4.1-015, MP-4.1-021) |
CSAM/Obscenity Removal | Training data and system outputs are screened for obscene materials and CSAM using human oversight, business rules, and other language models. (GV-1.1-005, GV-1.2-005) |
Data Provenance | Training data origins, ownership, contents, and metadata are well understood, documented, and do not increase AI risk. (GV-1.2-006, GV-1.2-007, GV-1.3-001, GV-1.3-005, GV-1.5-001, GV-1.5-003, GV-1.5-006, GV-1.5-007, GV-1.6-003, GV-4.2-001, GV-4.2-008, GV-4.2-009, GV-5.1-003, GV-6.1-001, GV-6.1-003, GV-6.1-006, GV-6.1-007, GV-6.1-009, GV-6.1-010, GV-6.1-011, GV-6.1-012, GV-6.1-014, GV-6.1-015, GV-6.1-016, MG-2.2-002, MG-2.2-003, MG-2.2-008, MG-2.2-011, MG-3.1-007, MG-3.1-009, MG-3.2-003, MG-3.2-005, MG-3.2-006, MG-3.2-007, MG-3.2-009, MG-4.1-001, MG-4.1-002, MG-4.1-003, MG-4.1-008, MG-4.1-009, MG-4.1-013, MG-4.1-015, MG-4.2-001, MG-4.2-003, MG-4.2-004, MP-2.1-001, MP-2.1-003, MP-2.1-005, MP-2.2-003, MP-2.2-004, MP-2.2-005, MP-2.3-001, MP-2.3-004, MP-2.3-006, MP-2.3-008, MP-2.3-011, MP-2.3-012, MP-3.4-001, MP-3.4-002, MP-3.4-004, MP-3.4-005, MP-3.4-006, MP-3.4-007, MP-3.4-008, MP-3.4-009, MP-4.1-004, MP-4.1-009, MP-4.1-011, MP-5.1-001, MP-5.1-002, MP-5.1-005, MS-1.1-006, MS-1.1-007, MS-1.1-008, MS-1.1-009, MS-1.1-010, MS-1.1-011, MS-1.1-012, MS-1.1-014, MS-1.1-015, MS-1.1-016, MS-1.1-017, MS-1.1-018, MS-2.2-001, MS-2.2-002, MS-2.2-003, MS-2.2-004, MS-2.2-005, MS-2.2-008, MS-2.2-009, MS-2.2-010, MS-2.2-011, MS-2.2-015, MS-2.2-016, MS-2.2-022, MS-2.5-012, MS-2.6-002, MS-2.7-002, MS-2.7-003, MS-2.7-004, MS-2.7-005, MS-2.7-007, MS-2.7-009, MS-2.7-010, MS-2.7-011, MS-2.7-012, MS-2.7-020, MS-2.7-021, MS-2.7-025, MS-2.7-032, MS-2.8-001, MS-2.8-005, MS-2.8-008, MS-2.8-011, MS-2.9-003, MS-2.10-001, MS-2.10-004, MS-2.10-006, MS-2.10-007, MS-2.10-009, MS-3.3-002, MS-3.3-003, MS-3.3-006, MS-3.3-008, MS-3.3-009, MS-3.3-012, MS-4.2-001, MS-4.2-004, MS-4.2-005, MS-4.2-006, MS-4.2-008, MS-4.2-009, MS-4.2-011) |
Data Quality | Input data is accurate, representative, complete and documented, and data quality issues have been minimized. (GV-1.2-009, MS-2.2-020, MS-2.9-003, MS-4.2-007) |
Data Retention | User prompts and associated system outputs are retained and monitored in alignment with relevant data privacy policies and roles. (GV-1.5-006, MP-4.1-009, MS-2.10-013) |
Decommission Process | Decommissioning processes for GAI systems are planned, documented and communicated to users, and involve staging, data protection, containment protocols, and recourse mechanisms for decommissioned GAI systems. (GV-1.6-004, GV-1.7-001, GV-1.7-002, GV-1.7-003, GV-1.7-004, GV-1.7-005, GV-1.7-006, GV-1.7-007, GV-1.7-008, GV-3.2-002, GV-3.2-006, GV-4.1-004, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-011, MG-3.2-012, MG-4.1-016, MP-1.5-004, MP-2.2-007, MS-4.2-010) |
Dependency Screening | GAI system dependencies are screened for security vulnerabilities. (GV-1.3-001, GV-1.4-002, GV-1.6-003, GV-1.7-003, GV-1.7-006, GV-6.2-002, GV-6.2-005, GV-6.2-006, MP-1.2-006, MP-1.6-001, MP-2.2-008, MP-4.1-012, MS-2.7-001) |
Digital Signature | GAI-generated content is signed to preserve information integrity using watermarking, cryptographic signature, steganography or similar methods. (GV-1.2-006, GV-1.6-003, GV-6.1-011, MG-4.1-008, MP-2.3-004, MS-1.1-006, MS-1.1-016, MS-2.7-009, MS-2.7-032) |
Disclosure of AI Interaction | AI interactions are disclosed to internal personnel and external users. (GV-1.1-003, GV-1.4-004, GV-1.6-003, GV-5.1-002) |
External Audit | GAI systems are audited by qualified external experts. (GV-1.2-009, GV-1.4-004, GV-3.2-001, GV-3.2-002, GV-4.1-003, GV-4.1-008, GV-5.1-003, MG-4.2-002, MP-2.3-011, MP-4.1-002, MS-1.3-005, MS-1.3-006, MS-1.3-010, MS-2.5-003, MS-2.8-020) |
Failure Avoidance | AIID, AVID, GWU AI Litigation Database, OECD incident monitor or similar are consulted in design or procurement phases of GAI lifecycles to avoid repeating past known failures. (GV-1.6-003, MG-2.1-006, MG-3.1-008, MG-4.1-003, MP-1.1-003, MP-1.1-006, MS-1.1-003, MS-2.2-020, MS-2.7-031) |
Fast Decommission | GAI systems can be quickly and safely disengaged. (GV-1.7-002, GV-1.7-003, GV-1.7-006, GV-3.2-006, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-012, MG-4.1-016) |
Fine Tuning | GAI systems are fine-tuned to their operational domain using relevant and high-quality data. (GV-6.1-016, MG-3.1-001, MG-3.2-002, MP-4.1-013, MS-2.6-004) |
Grounding | GAI systems are trained or fine-tuned on accurate, clean, and fully transparent training data. (GV-1.2-002, MG-3.1-001, MP-2.3-001, MS-2.3-017, MS-2.5-012) |
Human Review | AI-generated content is reviewed for accuracy and safety by qualified personnel. (GV-1.3-001, MG-2.2-008, MS-2.4-005, MS-2.5-015) |
Incident Response | Incident response plans for GAI failures, abuses, or misuses are documented, rehearsed, and updated appropriately after each incident; GAI incident response plans are coordinated with and communicated to other incident response functions. (GV-1.2-009, GV-1.5-001, GV-1.5-004, GV-1.5-005, GV-1.5-013, GV-1.5-015, GV-1.6-003, GV-1.6-007, GV-2.1-004, GV-3.2-002, GV-4.1-006, GV-4.2-002, GV-4.3-013, GV-6.1-006, GV-6.2-008, GV-6.2-016, GV-6.2-018, MG-1.3-001, MG-2.3-001, MG-2.3-002, MG-2.3-003, MG-2.4-004, MG-4.2-006, MG-4.3-001, MS-2.6-003, MS-2.6-012, MS-2.6-015, MS-2.7-002, MS-2.7-018, MS-2.7-028, MS-3.1-007) |
Incorporate feedback | User feedback is incorporated in GAI design, development, and risk management. (GV-3.2-005, GV-4.3-007, GV-5.1-003, GV-5.1-009, GV-5.2-004, MG-2.2-007, MG-2.2-012, MG-2.3-007, MG-3.2-004, MG-4.1-019, MG-4.2-013, MP-1.6-005, MP-2.3-018, MP-3.1-003, MP-2.3-019, MP-5.2-007, MS-1.2-008, MS-3.3-009, MS-3.3-010, MS-4.1-004, MS-4.2-007, MS-4.2-010, MS-4.2-013, MS-4.2-020) |
Instructions | Users are provided with the necessary instructions for safe, valid, and productive use. (GV-5.1-006, GV-6.1-021, GV-6.2-014, MG-3.1-009, MS-2.8-012) |
Insurance | Risk transfer via insurance policies is considered and implemented when feasible and appropriate. (MG-2.2-015) |
Intellectual Property Removal | Licensed, patented, trademarked, trade secret, or other data that may violate the intellectual property rights of others is removed from system training data; generated system outputs are monitored for similar information. (GV-1.6-003, MG-3.1-007, MP-2.3-012, MP-4.1-004, MP-4.1-009, MS-2.2-022, MS-2.6-002, MS-2.8-001, MS-2.8-008) |
Inventory | GAI system information is stored in the organizational model inventory. (GV-1.4-005, GV-1.6-001, GV-1.6-002, GV-1.6-003, GV-1.6-004, GV-1.6-006, GV-1.6-009, GV-4.2-010, GV-6.1-013, MG-3.2-014, MP-4.1-020, MP-4.2-003, MP-5.1-004, MS-2.13-002, MS-3.2-007) |
Malware Screening | GAI weights and other software components are scanned for malware. (MG-3.1-002, MS-2.7-001) |
Model Documentation | All technical mechanisms within GAI systems are well documented, including open source and third party GAI systems. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011) |
Monitoring | GAI system inputs and outputs are monitored for drift, accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages, obscene materials, and CSAM. (GV-1.2-009, GV-1.5-001, GV-1.5-003, GV-1.5-005, GV-1.5-012, GV-1.5-015, GV-1.6-003, GV-3.2-011, GV-4.2-007, GV-4.2-010, GV-4.3-001, GV-6.1-016, GV-6.2-010, MG-2.1-004, MG-2.2-003, MG-2.3-008, MG-2.3-010, MG-3.1-016, MG-3.2-006, MG-3.2-013, MG-3.2-016, MG-4.1-005, MG-4.1-009, MG-4.1-010, MG-4.1-018, MP-3.4-007, MP-4.1-002, MP-4.1-004, MP-5.2-009, MS-1.1-029, MS-1.2-005, MS-2.2-007, MS-2.4-003, MS-2.4-004, MS-2.5-007, MS-2.5-008, MS-2.5-024, MS-2.6-003, MS-2.6-009, MS-2.6-016, MS-2.7-013, MS-2.7-014, MS-2.7-015, MS-2.10-007, MS-2.10-019, MS-2.10-020, MS-2.11-006, MS-2.11-030, MS-3.3-006, MS-4.2-009, MS-4.3-004) |
Narrow Scope | Systems are deployed for targeted business applications with documented and direct business value. (GV-1.2-002, MP-3.3-001, MP-5.1-011) |
Open Source | Open source code is used to promote explainability and transparency. (MG-4.2-007, MP-4.1-017) |
Ownership | GAI systems and vendor relationships are owned by specific and documented internal personnel. (GV-6.1-009, GV-6.1-016, GV-6.2-008, MP-1.1-005, MP-1.1-008) |
Prohibited Use Policy | General abuse and misuse of GAI systems by internal parties is restricted by organizational policies. (GV-1.1-006, GV-1.2-003, GV-1.6-003, GV-3.2-003, GV-4.1-001, GV-6.1-017) |
RAG | Retrieval-augmented generation (RAG) is used to improve accuracy in generated content. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003) |
Rate-limiting | GAI response times and query volumes are limited. (MS-2.6-007) |
Redundancy | Rollover, fallback, and other redundancy mechanisms are available for GAI systems and address weights and other important system components. (GV-6.2-003, GV-6.2-007, GV-6.2-012, MG-2.4-012, MS-2.6-008) |
Refresh | Systems are retrained or re-tuned at a reasonable cadence. (MG-3.1-001, MG-3.2-011, MS-2.3-004, MS-2.12-003) |
Restrict Anonymous Use | Anonymous use of GAI systems is restricted. (GV-3.2-002) |
Restrict Anthropomorphization | Human, animal, cyborg, emotional or other images or features that promote anthropomorphization of GAI systems are restricted. (GV-1.3-001, MS-2.5-009) |
Restrict Data Collection | All data collection is disclosed, and collected data is protected and used in a transparent fashion. (GV-6.2-016, MS-2.2-023, MS-2.10-013) |
Restrict Decision Making | GAI systems are not employed for material decision-making tasks. (GV-1.3-001, GV-4.1-001, MP-1.1-018, MP-1.6-001, MP-3.4-017) |
Restrict Homogeneity | Feedback loops in which GAI systems are trained with GAI-generated data are restricted. (GV-1.3-004, MS-2.11-011) |
Restrict Internet Access | GAI systems are disconnected from the internet. (MP-2.2-007) |
Restrict Location Tracking | Any location tracking is conducted with user consent, disclosed, aligned with relevant privacy policies and laws and potential threats to user safety are managed. (MS-2.10-002) |
Restrict Minors | Use of organizational GAI systems by minors is restricted. () |
Restrict Regulated Dealings | GAI is not deployed in regulated dealings or for material decision making. (GV-1.1-004, GV-1.3-001, GV-4.1-001, GV-5.2-001, MP-2.3-013, MS-2.11-018) |
Restrict Secondary Use | Any secondary use of GAI input data is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws. (GV-6.1-016, GV-6.2-016) |
RLHF | For third-party GAI systems, vendors engage in specific reinforcement learning from human feedback (RLHF) exercises to address identified risks; for internal systems, internal personnel engage in RLHF to address identified risks. (MG-2.1-002, MS-2.5-005, MS-2.9-003, MS-2.9-007) |
Sensitive/Personal Data Removal | Personal, sensitive, biometric, or otherwise restricted data is minimized or eliminated from GAI training data. (GV-1.2-009, GV-1.6-003, MP-4.1-002, MP-4.1-016, MS-2.10-002, MS-2.10-003, MS-2.10-005, MS-2.10-014, MS-2.10-017, MS-2.10-018, MS-2.10-020) |
Session Limits | Time, query volume, and response rate are limited for GAI user sessions. (GV-4.1-001, MS-2.6-007, MS-2.6-010) |
Supply Chain Audit | GAI system supply chains are audited and documented, with a focus on data poisoning, malware, and software and hardware vulnerabilities. (GV-4.1-004, GV-6.1-011, GV-6.1-022, GV-6.2-003, MG-2.3-001, MG-3.1-002, MP-5.1-003, MS-1.1-008, MS-2.6-001, MS-2.7-001) |
System Documentation | GAI systems are well-documented whether internal, open source, or vendor-provided. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011) |
System Prompt | System prompts are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003) |
Team Diversity | Teams that implement and manage GAI systems represent broad professional, educational, life-stage, and demographic diversity. (GV-2.1-004, GV-3.1-002, GV-3.1-004, GV-3.1-005, GV-3.2-008, MG-2.1-005, MP-1.2-003, MP-1.2-004, MP-1.2-007, MS-1.3-012, MS-1.3-017, MS-2.3-015, MS-3.3-012) |
Temperature | Temperature settings are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003) |
Terms of Service | General abuse and misuse by external parties is prohibited by organizational policies. Terms of service may be adapted to a user's trust level. (GV-4.2-003, GV-4.2-005, GV-4.2-007, GV-6.1-016, GV-6.2-016, MP-4.1-021) |
Training | Internal personnel receive training on productivity and basic risk management for GAI systems. (GV-2.2-004, GV-3.2-002, GV-6.1-003, MS-1.1-014) |
User Feedback | GAI systems implement user feedback mechanisms. (GV-1.5-007, GV-1.5-009, GV-3.2-005, GV-5.1-001, GV-5.1-006, GV-5.1-007, GV-5.1-009, MG-1.3-005, MS-1.3-015, MS-1.3-016, MG-2.1-004, MG-2.2-012, MS-2.7-004, MS-4.2-012) |
User Recourse | Policies, processes, and technical mechanisms enable recourse for users who are harmed by GAI systems. (GV-1.5-010, GV-1.7-003, GV-5.1-001, GV-5.1-006, GV-5.1-009, MS-2.8-015, MS-2.8-019, MS-3.2-006, MS-4.2-012) |
Validation | GAI systems are shown to reliably generate valid results for their targeted business application. (GV-1.2-009, GV-1.4-002, GV-1.4-004, GV-3.2-002, GV-5.1-005, MG-2.2-016, MG-3.1-009, MG-3.1-014, MP-2.3-006, MP-2.3-013, MP-4.1-012, MS-2.3-005, MS-2.5-016, MS-2.9-002, MS-2.9-014) |
XAI | Methods such as visualization, occlusion, model compression, perturbation studies, and similar are applied to increase explainability of GAI systems. (GV-1.4-002, GV-3.2-002, GV-5.1-005, MG-3.2-001, MP-2.2-006, MS-2.8-019, MS-2.9-001, MS-2.9-005, MS-2.9-006, MS-2.9-009, MS-2.9-011, MS-2.9-013, MS-2.9-015, MS-4.2-006) |
Usage Note: Section E puts forward selected risk controls that organizations may apply for GAI risk management. Higher level controls are linked to specific GAI and AI RMF Playbook actions [NIST AI RMF Playbook], [NIST AI 600-1].
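Because each control in the table carries a parenthesized list of action IDs, it can be useful to group those IDs by AI RMF subcategory when cross-referencing a control against the Playbook. The sketch below assumes the IDs follow the pattern prefix, subcategory, three-digit action number (for example, `MS-2.7-030`), with `GV`, `MP`, `MS`, and `MG` read as GOVERN, MAP, MEASURE, and MANAGE. The function names are hypothetical.

```python
import re

# Assumed reading of the ID prefixes used in the table above.
FUNCTIONS = {"GV": "GOVERN", "MP": "MAP", "MS": "MEASURE", "MG": "MANAGE"}
ACTION_ID = re.compile(r"\b(GV|MP|MS|MG)-(\d+\.\d+)-(\d{3})\b")


def parse_action_ids(cell: str) -> list[dict[str, str]]:
    """Extract every action ID from a table cell, in order of appearance."""
    return [
        {
            "id": match.group(0),
            "subcategory": f"{FUNCTIONS[match.group(1)]} {match.group(2)}",
            "action": match.group(3),
        }
        for match in ACTION_ID.finditer(cell)
    ]


def group_by_subcategory(cell: str) -> dict[str, list[str]]:
    """Group the action IDs found in a cell by AI RMF subcategory."""
    grouped: dict[str, list[str]] = {}
    for item in parse_action_ids(cell):
        grouped.setdefault(item["subcategory"], []).append(item["id"])
    return grouped


# Description cell copied from the Access Control row.
cell = "GAI systems are limited to authorized users. (MG-2.2-009, MG-2.2-014, MS-2.7-030)"
print(group_by_subcategory(cell))
```

Matching with a regular expression rather than splitting on commas means the parser tolerates stray spaces or missing separators between IDs.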
F.1: Example Low-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic
Function | Trustworthy Characteristic | |
---|---|---|
| | Accountable and Transparent | Fair with Harmful Bias Managed |
Measure | | |
Manage | | |
Function | Trustworthy Characteristic | |||
---|---|---|---|---|
| | Interpretable and Explainable | Privacy-enhanced | Safe | Secure and Resilient |
Measure | | | | |
Manage | | | | |
Function | Trustworthy Characteristic |
---|---|
| | Valid and Reliable |
Measure | |
Manage | |
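F.2: Example Low-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk
Function | GAI Risk | |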
---|---|---|
| | CBRN Information | Confabulation |
Measure | | |
Manage | | |
Function | GAI Risk | |||
---|---|---|---|---|
| | Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration |
Measure | | | | |
Manage | | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Information Integrity | Information Security | Intellectual Property |
Measure | | | |
Manage | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration |
Measure | | | |
Manage | | | |
Usage Note: Section F puts forward an example risk measurement and management plan for low risk GAI systems or applications. The low risk plan focuses on automatable model testing and applies minimally burdensome risk controls.
- Material in Table F.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.
- Material in Table F.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
Section G below presents an example plan for medium risk systems and Section H presents an example plan for high risk systems.
G.1: Example Medium-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic
Function | Trustworthy Characteristic | |
---|---|---|
| | Accountable and Transparent | Fair with Harmful Bias Managed |
Measure | | |
Manage | | |
Function | Trustworthy Characteristic | |||
---|---|---|---|---|
| | Interpretable and Explainable | Privacy-enhanced | Safe | Secure and Resilient |
Measure | | | | |
Manage | | | | |
Function | Trustworthy Characteristic |
---|---|
| | Valid and Reliable |
Measure | |
Manage | |
G.2: Example Medium-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk
Function | GAI Risk | |
---|---|---|
| | CBRN Information | Confabulation |
Measure | | |
Manage | | |
Function | GAI Risk | |||
---|---|---|---|---|
| | Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration |
Measure | | | | |
Manage | | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Information Integrity | Information Security | Intellectual Property |
Measure | | | |
Manage | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration |
Measure | | | |
Manage | | | |
Usage Note: Section G puts forward an example risk measurement and management plan for medium risk GAI systems or applications. The medium risk plan focuses on red-teaming and applies moderate risk controls. Measurement and management approaches from Section F should also be applied to medium risk systems or applications.
- Material in Table G.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.
- Material in Table G.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
Section H below presents an example plan for high risk systems.
H.1: Example High-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic
Function | Trustworthy Characteristic | |
---|---|---|
| | Accountable and Transparent | Fair with Harmful Bias Managed |
Measure | | |
Manage | | |
Function | Trustworthy Characteristic | |||
---|---|---|---|---|
| | Interpretable and Explainable | Privacy-enhanced | Safe | Secure and Resilient |
Measure | | | | |
Manage | | | | |
Function | Trustworthy Characteristic |
---|---|
| | Valid and Reliable |
Measure | |
Manage | |
H.2: Example High-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk
Function | GAI Risk | |
---|---|---|
| | CBRN Information | Confabulation |
Measure | | |
Manage | | |
Function | GAI Risk | |||
---|---|---|---|---|
| | Dangerous or Violent Recommendations | Data Privacy | Environmental | Human-AI Configuration |
Measure | | | | |
Manage | | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Information Integrity | Information Security | Intellectual Property |
Measure | | | |
Manage | | | |
Function | GAI Risk | ||
---|---|---|---|
| | Obscene, Degrading, and/or Abusive Content | Toxicity, Bias, and Homogenization | Value Chain and Component Integration |
Measure | | | |
Manage | | | |
Usage Note: Section H puts forward an example risk measurement and management plan for high risk GAI systems or applications. The high risk plan focuses on field testing and applies extensive risk controls. Measurement and management approaches from Sections F and G should also be applied to high risk systems or applications.
- Material in Table H.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.
- Material in Table H.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.
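The usage notes for Sections F, G, and H describe cumulative plans: medium risk systems also apply the low risk approaches, and high risk systems apply all three. A minimal sketch of that tiering logic follows. The tier labels and the `required_plans` function are hypothetical names, and the focus strings paraphrase the usage notes.

```python
# Tiers in ascending order of risk; each adds one plan to those below it.
TIER_ORDER = ["low", "medium", "high"]

# (section, measurement focus) per tier, paraphrased from the usage notes.
TIER_FOCUS = {
    "low": ("F", "model testing"),
    "medium": ("G", "red-teaming"),
    "high": ("H", "field testing"),
}


def required_plans(tier: str) -> list[tuple[str, str]]:
    """Return every plan that applies to a tier, lowest tier first."""
    key = tier.lower()
    if key not in TIER_FOCUS:
        raise ValueError(f"unknown risk tier: {tier!r}")
    cutoff = TIER_ORDER.index(key) + 1
    return [TIER_FOCUS[name] for name in TIER_ORDER[:cutoff]]


for tier in TIER_ORDER:
    sections = ", ".join(f"Section {s} ({focus})" for s, focus in required_plans(tier))
    print(f"{tier}: {sections}")
```

How a system is assigned to a tier is outside the scope of this sketch and would come from the organization's own risk-tiering procedure.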