A curated list of LLM Interpretability related material.
- Tutorial
- History
- Code
- Survey
- Video
- Paper & Blog
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability [Neel Nanda's blog]
- Mechanistic Interpretability Quickstart Guide [Neel Nanda's blog]
- ARENA Mechanistic Interpretability Tutorials by Callum McDougall [website]
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction by Neel Nanda [AlignmentForum]
- Transformer-specific Interpretability [EACL 2023 Tutorial]
- Mechanistic? [BlackBoxNLP workshop at EMNLP 2024]
- This paper explores the multiple definitions and uses of "mechanistic interpretability," tracing its evolution in NLP research and revealing a critical divide within the interpretability community.
- TransformerLens [github]
- A library for mechanistic interpretability of GPT-style language models (see the usage sketch after this list)
- CircuitsVis [github]
- Mechanistic Interpretability visualizations
- baukit [github]
- Contains methods for tracing and editing internal activations in a network (see the sketch after this list)
- transformer-debugger [github]
- Transformer Debugger (TDB), developed by OpenAI's Superalignment team, supports investigation of specific behaviors in small language models by combining automated interpretability techniques with sparse autoencoders.
- pyvene [github]
- Supports customizable interventions on a range of different PyTorch modules
- Supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters.
- ViT-Prisma [github]
- An open-source mechanistic interpretability library for vision and multimodal models.
- pyreft [github]
- A powerful, parameter-efficient, and interpretable fine-tuning method (ReFT)
- SAELens [github]
- Training and analyzing sparse autoencoders on Language Models
- mamba interpretability [github]
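
As a concrete illustration of the TransformerLens entry above, here is a minimal usage sketch: load a small model and cache its internal activations for inspection. The model name (`gpt2`), the prompt, and layer 5 are arbitrary choices for the example, not anything prescribed by the library.

```python
# Minimal TransformerLens sketch: load a small model and cache activations.
# (Illustrative only; "gpt2", the prompt, and layer 5 are arbitrary choices.)
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Residual stream after block 5 and the attention pattern of layer 5
resid_post_5 = cache["resid_post", 5]   # shape: [batch, seq, d_model]
attn_pattern_5 = cache["pattern", 5]    # shape: [batch, n_heads, seq, seq]

# Top prediction at the final position
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode(next_token))
```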
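
And a similarly minimal sketch for baukit's `Trace` context manager, which captures the output of one module during a forward pass. The Hugging Face model and the module path `transformer.h.5.mlp` are assumptions chosen to match GPT-2's module naming; substitute your own model and layer.

```python
# Minimal baukit sketch: capture the output of one module during a forward pass.
# (Illustrative; the model and the module name "transformer.h.5.mlp" are
# assumptions that match GPT-2's Hugging Face module naming.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from baukit import Trace

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")

with torch.no_grad(), Trace(model, "transformer.h.5.mlp") as t:
    model(**inputs)

print(t.output.shape)  # hidden states produced by block 5's MLP
```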
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks [SaTML 2023] [arxiv 2207]
- Neuron-level Interpretation of Deep NLP Models: A Survey [TACL 2022]
- Explainability for Large Language Models: A Survey [TIST 2024] [arxiv 2309]
- Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability [arxiv 2402]
- Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era [arxiv 2403]
- Mechanistic Interpretability for AI Safety -- A Review [arxiv 2404]
- A Primer on the Inner Workings of Transformer-based Language Models [arxiv 2405]
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models [arxiv 2407]
- Internal Consistency and Self-Feedback in Large Language Models: A Survey [arxiv 2407]
- The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability [arxiv 2408]
- Attention Heads of Large Language Models: A Survey [arxiv 2409] [github]
Note: These alignment surveys discuss the relationship between interpretability and LLM alignment.
- Large Language Model Alignment: A Survey [arxiv 2309]
- AI Alignment: A Comprehensive Survey [arxiv 2310] [github] [website]
- Neel Nanda's Channel [YouTube]
- Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability [YouTube]
- Concrete Open Problems in Mechanistic Interpretability: Neel Nanda at SERI MATS [YouTube]
- BlackboxNLP's Channel [YouTube]
- ICML 2024 Workshop on Mechanistic Interpretability [openreview]
- Transformer Circuits Thread [blog]
- BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP [workshop]
- AI Alignment Forum [forum]
- LessWrong [forum]
- Neel Nanda [blog] [google scholar]
- Mor Geva [google scholar]
- David Bau [google scholar]
- Jacob Steinhardt [google scholar]
- Yonatan Belinkov [google scholar]
- A mathematical framework for transformer circuits [blog]
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models [arxiv]
- interpreting GPT: the logit lens [Lesswrong 2020]
- Analyzing Transformers in Embedding Space [ACL 2023]
- Eliciting Latent Predictions from Transformers with the Tuned Lens [arxiv 2303]
- An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l [arxiv 2310]
- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State [CoNLL 2023]
- SelfIE: Self-Interpretation of Large Language Model Embeddings [arxiv 2403]
- InversionView: A General-Purpose Method for Reading Information from Neural Activations [ICML 2024 MI Workshop]
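
The projection methods above (logit lens, tuned lens, Future Lens, and relatives) share one basic move: decode an intermediate residual-stream state through the model's final normalization and unembedding to see what it "already predicts". A minimal logit-lens sketch with TransformerLens follows; the model and layer are arbitrary, and learned-translator variants such as the tuned lens add components not shown here.

```python
# Logit-lens sketch: decode an intermediate residual stream through the
# final LayerNorm and the unembedding matrix. (Illustrative; layer 6 of
# GPT-2 small is an arbitrary choice.)
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The capital of France is"
_, cache = model.run_with_cache(prompt)

layer = 6
resid = cache["resid_post", layer]            # [batch, seq, d_model]
normed = model.ln_final(resid)                # final normalization, as in the logit lens
early_logits = normed @ model.W_U + model.b_U # project into vocabulary space

top_token = early_logits[0, -1].argmax().item()
print(f"Layer {layer} already predicts:", model.tokenizer.decode(top_token))
```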
- Enhancing Neural Network Transparency through Representation Analysis [arxiv 2310] [openreview]
- Analyzing And Editing Inner Mechanisms of Backdoored Language Models [arxiv 2303]
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations [arxiv 2303]
- Localizing Model Behavior with Path Patching [arxiv 2304]
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [NeurIPS 2023]
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods [ICLR 2024]
- Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching [ICLR 2024]
- A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments [arxiv 2401]
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks [arxiv 2402]
- How to use and interpret activation patching [arxiv 2404]
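
Most of the patching papers above build on the same primitive: run a corrupted prompt while overwriting one internal activation with its value from a clean run, then measure how much of the clean behaviour is restored. Below is a minimal residual-stream patching sketch with TransformerLens; the prompts, layer, and position are arbitrary illustrations (in practice clean and corrupted prompts are chosen to tokenize to the same length).

```python
# Activation-patching sketch: patch one residual-stream position from a clean
# run into a corrupted run and check how much the answer logit recovers.
# (Illustrative; prompts, layer, and position are arbitrary choices.)
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "The Eiffel Tower is in the city of"
corrupt_prompt = "The Colosseum is in the city of"
answer_token = model.to_single_token(" Paris")

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

_, clean_cache = model.run_with_cache(clean_tokens)

layer, position = 8, 4  # which residual-stream slot to patch (arbitrary)
hook_name = utils.get_act_name("resid_pre", layer)

def patch_resid(resid, hook):
    # Overwrite the corrupted activation at one position with the clean one.
    resid[:, position, :] = clean_cache[hook_name][:, position, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)
corrupt_logits = model(corrupt_tokens)

print("answer logit, corrupted:", corrupt_logits[0, -1, answer_token].item())
print("answer logit, patched:  ", patched_logits[0, -1, answer_token].item())
```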
- Towards Automated Circuit Discovery for Mechanistic Interpretability [NeurIPS 2023]
- Neuron to Graph: Interpreting Language Model Neurons at Scale [arxiv 2305] [openreview]
- Discovering Variable Binding Circuitry with Desiderata [arxiv 2307]
- Discovering Knowledge-Critical Subnetworks in Pretrained Language Models [openreview]
- Attribution Patching Outperforms Automated Circuit Discovery [arxiv 2310]
- AtP*: An efficient and scalable method for localizing LLM behaviour to components [arxiv 2403]
- Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms [arxiv 2403]
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [arxiv 2403]
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs [arxiv 2405]
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models [arxiv 2405]
- Hypothesis Testing the Circuit Hypothesis in LLMs [ICML 2024 MI Workshop]
- Towards monosemanticity: Decomposing language models with dictionary learning [Transformer Circuits Thread]
- Sparse Autoencoders Find Highly Interpretable Features in Language Models [ICLR 2024]
- Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small [Alignment Forum]
- Attention SAEs Scale to GPT-2 Small [Alignment Forum]
- We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To [Alignment Forum]
- Understanding SAE Features with the Logit Lens [Alignment Forum]
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [Transformer Circuits Thread]
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models [arxiv 2405]
- Scaling and evaluating sparse autoencoders [arxiv 2406] [code]
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models [ICML 2024 MI Workshop]
- Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task [ICML 2024 MI Workshop]
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [ICML 2024 MI Workshop]
- Transcoders find interpretable LLM feature circuits [ICML 2024 MI Workshop]
- Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders [arxiv 2407]
- Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models [arxiv 2410]
- Mechanistic Permutability: Match Features Across Layers [arxiv 2410]
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [arxiv 2410]
- Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs [arxiv 2410]
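
The dictionary-learning papers above train variants of the same object: an overcomplete autoencoder with a sparsity penalty on its hidden code, fit to a model's activations. A plain-PyTorch sketch of the "vanilla" ReLU + L1 setup is below; the dimensions and coefficients are placeholders, and variants such as JumpReLU or top-k SAEs differ precisely in these choices.

```python
# Vanilla sparse-autoencoder sketch for LM activations (illustrative dimensions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_dict=768 * 8):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse codes
        x_hat = f @ self.W_dec + self.b_dec                         # reconstruction
        return x_hat, f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity/reconstruction trade-off (assumed value)

# `acts` would be a batch of residual-stream activations collected from a model;
# random data is used here only so the snippet runs on its own.
acts = torch.randn(4096, 768)

x_hat, f = sae(acts)
loss = (x_hat - acts).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
opt.step()
```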
- Interpreting Transformer's Attention Dynamic Memory and Visualizing the Semantic Information Flow of GPT [arxiv 2305] [github]
- Sparse AutoEncoder Visualization [github]
- SAE-VIS: Announcement Post [lesswrong]
- LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models [arxiv 2404] [github](https://github.com/facebookresearch/llm-transparency-tool)
- Tracr: Compiled Transformers as a Laboratory for Interpretability [arxiv 2301]
- Opening the AI black box: program synthesis via mechanistic interpretability [arxiv 2402]
- An introduction to graphical tensor notation for mechanistic interpretability [arxiv 2402]
- Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models [arxiv 2312]
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations [arxiv 2402]
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control [arxiv 2405]
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques [arxiv 2407]
- Circuit Component Reuse Across Tasks in Transformer Language Models [ICLR 2024 spotlight]
- Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures [arxiv 2410]
- From Tokens to Words: On the Inner Lexicon of LLMs [arxiv 2410]
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [EMNLP 2023]
- How Large Language Models Implement Chain-of-Thought? [openreview]
- Do Large Language Models Latently Perform Multi-Hop Reasoning? [arxiv 2402]
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning [arxiv 2402]
- Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning [arxiv 2402]
- Iteration Head: A Mechanistic Study of Chain-of-Thought [arxiv 2406]
- From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency [arxiv 2410]
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small [ICLR 2023]
- Entity Tracking in Language Models [ACL 2023]
- How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model [NeurIPS 2023]
- Can Transformers Learn to Solve Problems Recursively? [arxiv 2305]
- Analyzing And Editing Inner Mechanisms of Backdoored Language Models [NeurIPS 2023 Workshop]
- Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla [arxiv 2307]
- Refusal mechanisms: initial experiments with Llama-2-7b-chat [AlignmentForum 2312]
- Forbidden Facts: An Investigation of Competing Objectives in Llama-2 [arxiv 2312]
- How do Language Models Bind Entities in Context? [ICLR 2024]
- How Language Models Learn Context-Free Grammars? [openreview]
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity [arxiv 2401]
- Do Llamas Work in English? On the Latent Language of Multilingual Transformers [arxiv 2402]
- Evidence of Learned Look-Ahead in a Chess-Playing Neural Network [arxiv 2406]
- How much do contextualized representations encode long-range context? [arxiv 2410]
- Progress measures for grokking via mechanistic interpretability [ICLR 2023]
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks [NeurIPS 2023]
- Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition [openreview]
- Arithmetic with Language Models: from Memorization to Computation [openreview]
- Carrying over Algorithm in Transformers [openreview]
- A simple and interpretable model of grokking modular arithmetic tasks [openreview]
- Understanding Addition in Transformers [ICLR 2024]
- Increasing Trust in Language Models through the Reuse of Verified Circuits [arxiv 2402]
- Pre-trained Large Language Models Use Fourier Features to Compute Addition [arxiv 2406]
- In-context learning and induction heads [Transformer Circuits Thread]
- In-Context Learning Creates Task Vectors [EMNLP 2023 Findings]
- Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning [EMNLP 2023]
- EMNLP 2023 best paper
- LLMs Represent Contextual Tasks as Compact Function Vectors [ICLR 2024]
- Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions [ICLR 2024]
- Where Does In-context Machine Translation Happen in Large Language Models? [openreview]
- In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations [openreview]
- Analyzing Task-Encoding Tokens in Large Language Models [arxiv 2401]
- How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning [arxiv 2402]
- Parallel Structures in Pre-training Data Yield In-Context Learning [arxiv 2402]
- What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation [arxiv 2404]
- Task Diversity Shortens the ICL Plateau [arxiv 2410]
- Inference and Verbalization Functions During In-Context Learning [arxiv 2410]
- Dissecting Recall of Factual Associations in Auto-Regressive Language Models [EMNLP 2023]
- Characterizing Mechanisms for Factual Recall in Language Models [EMNLP 2023]
- Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs [openreview]
- A Mechanism for Solving Relational Tasks in Transformer Language Models [openreview]
- Overthinking the Truth: Understanding how Language Models Process False Demonstrations [ICLR 2024 spotlight]
- Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level [AlignmentForum 2312]
- Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models [arxiv 2402]
- Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals [arxiv 2402]
- A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia [arxiv 2403]
- Mechanisms of non-factual hallucinations in language models [arxiv 2403]
- Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models [arxiv 2403]
- Locating and Editing Factual Associations in Mamba [arxiv 2404]
- Probing Language Models on Their Knowledge Source [arxiv 2410](https://arxiv.org/abs/2410.05817)
- Do Llamas Work in English? On the Latent Language of Multilingual Transformers [arxiv 2402]
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [arxiv 2402]
- How do Large Language Models Handle Multilingualism? [arxiv 2402]
- Large Language Models are Parallel Multilingual Learners [arxiv 2403]
- Understanding the role of FFNs in driving multilingual behaviour in LLMs [arxiv 2404]
- How do Llamas process multilingual text? A latent exploration through activation patching [ICML 2024 MI Workshop]
- Concept Space Alignment in Multilingual LLMs [EMNLP 2024]
- On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task [EMNLP 2024 Findings]
- Interpreting CLIP's Image Representation via Text-Based Decomposition [ICLR 2024 oral]
- Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) [NeurIPS 2024]
- Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines [arxiv 2403]
- The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? [arxiv 2403]
- Understanding Information Storage and Transfer in Multi-modal Large Language Models [arxiv 2406]
- Towards Interpreting Visual Information Processing in Vision-Language Models [arxiv 2410]
- Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models [arxiv 2410]
- Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models [arxiv 2410]
- The Hydra Effect: Emergent Self-repair in Language Model Computations [arxiv 2307]
- Unveiling A Core Linguistic Region in Large Language Models [arxiv 2310]
- Exploring the Residual Stream of Transformers [arxiv 2312]
- Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation [arxiv 2312]
- Explorations of Self-Repair in Language Models [arxiv 2402]
- Massive Activations in Large Language Models [arxiv 2402]
- Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions [arxiv 2402]
- Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [arxiv 2403]
- The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models [arxiv 2403]
- Localizing Paragraph Memorization in Language Models [arxiv 2403]
- Awesome-Attention-Heads [github]
- A carefully compiled list that summarizes the diverse functions of the attention heads.
- In-context learning and induction heads [Transformer Circuits Thread]
- On the Expressivity Role of LayerNorm in Transformers' Attention [ACL 2023 Findings]
- On the Role of Attention in Prompt-tuning [ICML 2023]
- Copy Suppression: Comprehensively Understanding an Attention Head [ICLR 2024]
- Successor Heads: Recurring, Interpretable Attention Heads In The Wild [ICLR 2024]
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [arxiv 2024]
- Retrieval Head Mechanistically Explains Long-Context Factuality [arxiv 2404]
- Iteration Head: A Mechanistic Study of Chain-of-Thought [arxiv 2406]
- When Attention Sink Emerges in Language Models: An Empirical View [arxiv 2410]
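
Several of the head studies above, induction heads in particular, can be probed with a very small experiment: on a random token sequence repeated twice, an induction head attends from each token in the second copy back to the token that followed its first occurrence. A rough detection sketch with TransformerLens; the model choice and the 0.4 threshold are arbitrary.

```python
# Rough induction-head detector: on a sequence of random tokens repeated twice,
# an induction head at query position q attends mostly to position q - seq_len + 1.
# (Illustrative; model choice and the 0.4 threshold are arbitrary.)
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len, batch = 50, 4
rand = torch.randint(1000, 10000, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id)
tokens = torch.cat([bos, rand, rand], dim=1)  # BOS + sequence + its repeat

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, query, key]
    # Attention on the "induction stripe": key = query - (seq_len - 1).
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    # Keep only queries in the second copy and average into a per-head score.
    score = stripe[..., -seq_len:].mean(dim=(0, -1))
    for head, s in enumerate(score):
        if s.item() > 0.4:
            print(f"Layer {layer}, head {head}: induction score {s.item():.2f}")
```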
- Transformer Feed-Forward Layers Are Key-Value Memories [EMNLP 2021]
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space [EMNLP 2022]
- What does GPT store in its MLP weights? A case study of long-range dependencies [openreview]
- Understanding the role of FFNs in driving multilingual behaviour in LLMs [arxiv 2404]
- Toy Models of Superposition [Transformer Circuits Thread]
- Knowledge Neurons in Pretrained Transformers [ACL 2022]
- Polysemanticity and Capacity in Neural Networks [arxiv 2210]
- Finding Neurons in a Haystack: Case Studies with Sparse Probing [TMLR 2023]
- DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models [EMNLP 2023]
- Neurons in Large Language Models: Dead, N-gram, Positional [arxiv 2309]
- Universal Neurons in GPT2 Language Models [arxiv 2401]
- Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [arxiv 2402]
- How do Large Language Models Handle Multilingualism? [arxiv 2402]
- PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits [arxiv 2404]
- JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention [ICLR 2024]
- Learning Associative Memories with Gradient Descent [arxiv 2402]
- Mechanics of Next Token Prediction with Self-Attention [arxiv 2402]
- The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models [arxiv 2403]
- LLM Circuit Analyses Are Consistent Across Training and Scale [ICML 2024 MI Workshop]
- Geometric Signatures of Compositionality Across a Language Model's Lifetime [arxiv 2410]
- Progress measures for grokking via mechanistic interpretability [ICLR 2023]
- A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations [ICML 2023]
- The Mechanistic Basis of Data Dependence and Abrupt Learning in an In-Context Classification Task [ICLR 2024 oral]
- Among the highest review scores at ICLR 2024 (10, 10, 8, 8), and by a single author.
- Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs [ICLR 2024 spotlight]
- A simple and interpretable model of grokking modular arithmetic tasks [openreview]
- Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition [arxiv 2402]
- Interpreting Grokked Transformers in Complex Modular Arithmetic [arxiv 2402]
- Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models [arxiv 2402]
- Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks [arxiv 2406]
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [ICML 2024 MI Workshop]
- Studying Large Language Model Generalization with Influence Functions [arxiv 2308]
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks [ICLR 2024]
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking [ICLR 2024]
- The Hidden Space of Transformer Language Adapters [arxiv 2402]
- Dissecting Fine-Tuning Unlearning in Large Language Models [EMNLP 2024]
- Implicit Representations of Meaning in Neural Language Models [ACL 2021]
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [arxiv 2305]
- Observable Propagation: Uncovering Feature Vectors in Transformers [openreview]
- In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations [openreview]
- Challenges with unsupervised LLM knowledge discovery [arxiv 2312]
- Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks [arxiv 2307]
- Position Paper: Toward New Frameworks for Studying Model Representations [arxiv 2402]
- How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study [arxiv 2402]
- More than Correlation: Do Large Language Models Learn Causal Representations of Space [arxiv 2312]
- Do Large Language Models Mirror Cognitive Language Processing? [arxiv 2402]
- On the Scaling Laws of Geographical Representation in Language Models [arxiv 2402]
- Monotonic Representation of Numeric Properties in Language Models [arxiv 2403]
- Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? [arxiv 2404]
- Simple probes can catch sleeper agents [Anthropic Blog]
- PaCE: Parsimonious Concept Engineering for Large Language Models [arxiv 2406]
- The Geometry of Categorical and Hierarchical Concepts in Large Language Models [ICML 2024 MI Workshop]
- Concept Space Alignment in Multilingual LLMs [EMNLP 2024]
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [arxiv 2410]
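
A recurring methodology in the representation papers above is the linear probe: collect hidden states for labelled inputs, fit a linear classifier on them, and treat above-chance accuracy (with the usual caveats about probe capacity and correlation vs. causation) as evidence that the property is encoded. A minimal sketch follows; the toy sentiment data, layer, and lack of a held-out split are placeholders, not a real experiment.

```python
# Linear-probe sketch: fit a logistic-regression probe on residual-stream
# activations at one layer. (Illustrative; the tiny toy dataset, the layer,
# and the train-only evaluation are placeholders.)
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6  # arbitrary

texts = ["The movie was wonderful.", "I loved every minute of it.",
         "The movie was terrible.", "I hated every minute of it."]
labels = [1, 1, 0, 0]  # toy sentiment labels

feats = []
with torch.no_grad():
    for t in texts:
        _, cache = model.run_with_cache(t)
        feats.append(cache["resid_post", layer][0, -1])  # last-token state

X = torch.stack(feats).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
```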
- Actually, Othello-GPT Has A Linear Emergent World Representation [Neel Nanda's blog]
- Language Models Linearly Represent Sentiment [openreview]
- Language Models Represent Space and Time [openreview]
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [openreview]
- Linearity of Relation Decoding in Transformer Language Models [ICLR 2024]
- The Linear Representation Hypothesis and the Geometry of Large Language Models [arxiv 2311]
- Language Models Represent Beliefs of Self and Others [arxiv 2402]
- On the Origins of Linear Representations in Large Language Models [arxiv 2403]
- Refusal in LLMs is mediated by a single direction [Lesswrong 2024]
- ReFT: Representation Finetuning for Language Models [arxiv 2404] [github]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [NeurIPS 2023] [github]
- Activation Addition: Steering Language Models Without Optimization [arxiv 2308]
- Self-Detoxifying Language Models via Toxification Reversal [EMNLP 2023]
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [arxiv 2309]
- In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering [arxiv 2311]
- Steering Llama 2 via Contrastive Activation Addition [arxiv 2312]
- A Language Model's Guide Through Latent Space [arxiv 2402]
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment [arxiv 2311]
- Extending Activation Steering to Broad Skills and Multiple Behaviours [arxiv 2403]
- Spectral Editing of Activations for Large Language Model Alignment [arxiv 2405]
- Controlling Large Language Model Agents with Entropic Activation Steering [arxiv 2406]
- Analyzing the Generalization and Reliability of Steering Vectors [ICML 2024 MI Workshop]
- Towards Inference-time Category-wise Safety Steering for Large Language Models [arxiv 2410]
- A Timeline and Analysis for Representation Plasticity in Large Language Models [arxiv 2410]
- Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors [arxiv 2410]
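
The steering and inference-time intervention papers above share a common recipe: derive a direction in activation space, often from the difference of activations on contrastive prompts, and add a scaled copy of it to the residual stream at generation time. A contrastive-addition-style sketch with TransformerLens; the prompts, layer 6, and the coefficient are arbitrary choices, not values taken from any of the papers.

```python
# Activation-steering sketch: build a direction from two contrastive prompts
# and add it to the residual stream while generating. (Illustrative; the
# prompts, layer 6, and coefficient 4.0 are arbitrary choices.)
import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = utils.get_act_name("resid_post", layer)

# Direction = activation difference between a "positive" and a "negative"
# prompt, taken at the final token position.
_, cache_pos = model.run_with_cache("I am feeling extremely happy and excited")
_, cache_neg = model.run_with_cache("I am feeling extremely sad and miserable")
steer_dir = cache_pos[hook_name][0, -1] - cache_neg[hook_name][0, -1]
steer_dir = steer_dir / steer_dir.norm()

coeff = 4.0

def add_steering(resid, hook):
    # Shift every position of the residual stream along the steering direction.
    return resid + coeff * steer_dir

prompt = "I think that today is going to be"
with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    steered = model.generate(prompt, max_new_tokens=20, verbose=False)
print(steered)
```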
- Locating and Editing Factual Associations in GPT (ROME) [NeurIPS 2022] [github]
- Memory-Based Model Editing at Scale [ICML 2022]
- Editing models with task arithmetic [ICLR 2023]
- Mass-Editing Memory in a Transformer [ICLR 2023]
- Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark [ACL 2023 Findings]
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [ACL 2023]
- Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models [NeurIPS 2023]
- Inspecting and Editing Knowledge Representations in Language Models [arxiv 2304] [github]
- Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models [EACL 2023]
- Editing Common Sense in Transformers [EMNLP 2023]
- DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models [EMNLP 2023]
- MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions [EMNLP 2023]
- PMET: Precise Model Editing in a Transformer [arxiv 2308]
- Untying the Reversal Curse via Bidirectional Language Model Editing [arxiv 2310]
- Unveiling the Pitfalls of Knowledge Editing for Large Language Models [ICLR 2024]
- A Comprehensive Study of Knowledge Editing for Large Language Models [arxiv 2401]
- Trace and Edit Relation Associations in GPT [arxiv 2401]
- Model Editing with Canonical Examples [arxiv 2402]
- Updating Language Models with Unstructured Facts: Towards Practical Knowledge Editing [arxiv 2402]
- Editing Conceptual Knowledge for Large Language Models [arxiv 2403]
- Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models [arxiv 2406]
- Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing [arxiv 2410]
- Keys to Robust Edits: from Theoretical Insights to Practical Advances [arxiv 2410]
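
One of the simplest editing ideas listed above, task arithmetic, treats a fine-tuned model as base weights plus a "task vector" that can be scaled, added, or negated in weight space. A sketch under the assumption that a fine-tuned checkpoint is available; here the "fine-tuned" model is just a stand-in copy of the base model so the snippet runs on its own.

```python
# Task-arithmetic sketch: task vector = finetuned weights - base weights.
# Adding a scaled task vector edits the base model toward (or, with a negative
# scale, away from) the fine-tuned behaviour. (Illustrative; the checkpoint
# names and alpha are placeholders.)
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a real fine-tune

alpha = -0.5  # negative alpha "subtracts" the task, e.g. to reduce a behaviour

base_sd = base.state_dict()
ft_sd = finetuned.state_dict()

edited_sd = {}
for name, w in base_sd.items():
    if torch.is_floating_point(w):
        edited_sd[name] = w + alpha * (ft_sd[name] - w)  # apply scaled task vector
    else:
        edited_sd[name] = w  # leave integer/bool buffers untouched

edited = AutoModelForCausalLM.from_pretrained("gpt2")
edited.load_state_dict(edited_sd)
```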
- The Internal State of an LLM Knows When It's Lying [EMNLP 2023 Findings]
- Do Androids Know They're Only Dreaming of Electric Sheep? [arxiv 2312]
- INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection [ICLR 2024]
- TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space [arxiv 2402]
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [arxiv 2402]
- Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models [arxiv 2402]
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation [arxiv 2403]
- Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models [arxiv 2403]
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories [arxiv 2406]
- Not all Layers of LLMs are Necessary during Inference [arxiv 2403]
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [arxiv 2403]
- The Unreasonable Ineffectiveness of the Deeper Layers [arxiv 2403]
- The Remarkable Robustness of LLMs: Stages of Inference? [ICML 2024 MI Workshop]