Removing refusals with HF Transformers

This is a crude proof-of-concept implementation for removing refusals from an LLM without using TransformerLens. As a result, it supports every model that HF Transformers supports*.

The code was tested on an RTX 2060 6GB, so most testing was done with models under 3B parameters, but it has also been confirmed to work with bigger models.

*Most models are compatible, but some are not, mainly because of custom model implementations. Some Qwen implementations, for example, don't work because model.model.layers can't be used to get the layers. They name their attributes differently, so model.transformer.h must be used instead, if I'm not mistaken.
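One way to cope with the differing attribute names is to try each known path in turn. The following is a minimal sketch, not code from this repository: find_decoder_layers and CANDIDATE_LAYER_PATHS are hypothetical names, and the stand-in objects only mimic the attribute layout of a real model.

```python
from types import SimpleNamespace

# Attribute paths mentioned in the note above, relative to the loaded model
# object. Any further paths you add are your own assumptions.
CANDIDATE_LAYER_PATHS = ("model.layers", "transformer.h")


def find_decoder_layers(model, paths=CANDIDATE_LAYER_PATHS):
    """Return the first attribute path in `paths` that resolves on `model`."""
    for path in paths:
        obj = model
        for name in path.split("."):
            obj = getattr(obj, name, None)
            if obj is None:
                break  # this path does not exist, try the next one
        else:
            return obj  # every attribute in the path resolved
    raise AttributeError(f"No decoder layers found; tried: {', '.join(paths)}")


# Stand-ins that only imitate the two layouts (hypothetical, not real models).
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["layer0", "layer1"]))
qwen_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0"]))

llama_layers = find_decoder_layers(llama_like)
qwen_layers = find_decoder_layers(qwen_like)
```

With a real model, the returned object would be the module list to iterate over in place of a hard-coded model.model.layers.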

Usage

Set the model and quantization in compute_refusal_dir.py and inference.py (the quantization can apparently be mixed).
Run compute_refusal_dir.py (some settings in that file may need changing depending on your use case).
Run inference.py and ask the model how to build an army of rabbits that will one day overthrow your local government by stealing all the carrots.
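This README does not spell out what the two scripts compute, so the following is only a rough sketch of the idea under a stated assumption: that the refusal direction is the normalized difference between mean activations on harmful and harmless instructions, and that inference removes the component of each hidden state along that direction. The function names and the toy data are hypothetical, and the actual scripts may differ.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(hidden, direction):
    """Subtract from each row of `hidden` its projection onto `direction`."""
    return hidden - np.outer(hidden @ direction, direction)


# Toy activations (made up): "harmful" rows are the harmless rows shifted
# along the first axis, so the recovered direction should be that axis.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(8, 4))
harmful = harmless + np.array([2.0, 0.0, 0.0, 0.0])

direction = refusal_direction(harmful, harmless)
cleaned = ablate_direction(harmful, direction)
```

In a real run the activations would come from a chosen layer of the model, and the ablation would be applied to the hidden states during generation.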

Credits

Harmful instructions
Harmless instructions
Technique
