Work done with Logan Riggs, who wrote the original replication notebook. Thanks to Pierre Peigne for the data-generating code and Lee Sharkey for answering questions.

# Sparse Coding

Running `python replicate_toy_models.py` replicates the first half of the post *Taking features out of superposition with sparse autoencoders*.
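For orientation, the core object in that replication is a sparse autoencoder trained to reconstruct activations under an L1 penalty on its hidden code. The sketch below is a minimal illustration of that idea in PyTorch; the dimensions, optimizer, and L1 coefficient are illustrative choices, not the settings used in `replicate_toy_models.py`.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch: ReLU encoder + linear decoder, L1 sparsity on the code."""
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)

    def forward(self, x):
        code = torch.relu(self.encoder(x))
        recon = self.decoder(code)
        return recon, code

# Toy usage: reconstruct 64-dim "activations" with a 256-feature dictionary.
sae = SparseAutoencoder(activation_dim=64, dict_size=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity strength is a free hyperparameter

x = torch.randn(1024, 64)  # stand-in for real activation data
recon, code = sae(x)
loss = ((recon - x) ** 2).mean() + l1_coeff * code.abs().mean()
loss.backward()
opt.step()
```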

run.py contains a more flexible set of functions for generating datasets from Pile10k and then running sparse coding on activations from real models, including gpt-2-small and custom models.
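A rough sketch of that kind of pipeline, purely illustrative and not taken from run.py: pull text (here assumed to be the NeelNanda/pile-10k dataset on the Hugging Face Hub), run it through GPT-2 small with hidden states enabled, and collect a matrix of activations to train a dictionary on. The layer index and the slice sizes are arbitrary choices for the sketch.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

# Assumption: "Pile10k" refers to the NeelNanda/pile-10k dataset with a "text" column.
dataset = load_dataset("NeelNanda/pile-10k", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

layer = 6  # which residual-stream layer to collect (illustrative choice)
activations = []
with torch.no_grad():
    for text in dataset["text"][:32]:  # small slice just for the sketch
        tokens = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
        out = model(**tokens)
        # hidden_states[layer] has shape (1, seq_len, 768); drop the batch dim.
        activations.append(out.hidden_states[layer].squeeze(0))

activations = torch.cat(activations)  # (total_tokens, 768) matrix for the autoencoder
```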

The repo also contains utilities for running code on vast.ai machines, which can speed up these sweeps.

## Automatic Interpretation

This currently relies on OpenAI's automatic-interpretability repo, which can't be installed as a package, so the workaround is to clone automatic-interpretability and copy its neuron-explainer/neuron-explainer directory into the top level of this repo under the name neuron_explainer.
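In other words, the goal of the workaround is simply that `import neuron_explainer` resolves to the copied directory. A quick sanity check (the layout is an assumption of this repo, not something prescribed by either project):

```python
import importlib.util

# If the copied folder sits at the top level of this repo, Python should
# find it as a regular package without any pip install.
spec = importlib.util.find_spec("neuron_explainer")
print("neuron_explainer found at:", spec.origin if spec else None)
```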

## Training a custom small transformer

The next part of the sparse coding work uses a very small transformer to do some early tests using sparse autoencoders to find features. There doesn't appear to be an open-source model of this kind, and the original model is proprietary, so below are the instructions I followed to create a similar small transformer.

Make sure you have >200GB of disk space. Tested on a vast.ai RTX 3090 with the pytorch:latest Docker image.

```bash
git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
python -m venv .env
source .env/bin/activate
apt install -y build-essential
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

Change `config/train_gpt2.py` to have:

```python
import time
wandb_project = 'sparsecode'
wandb_run_name = 'supertiny-' + str(time.time())
n_layer = 6       # (same as train_shakespeare and Lee's work)
n_embd = 16       # (same as Lee's)
n_head = 8        # (needs to divide n_embd)
dropout = 0.2     # (used in shakespeare_char)
block_size = 256  # (just to make training faster?)
batch_size = 64
```
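For a sense of scale, a back-of-the-envelope parameter count under the standard nanoGPT GPT-2 architecture (learned positional embeddings, weight-tied output head, biases on, padded vocab of 50304): the model is tiny, and the embedding table dominates. This is an estimate, not a measurement from a trained checkpoint.

```python
# Rough parameter count for the config above (nanoGPT GPT-2 assumptions).
n_layer, n_embd, block_size, vocab_size = 6, 16, 256, 50304

wte = vocab_size * n_embd                    # token embeddings (shared with lm_head)
wpe = block_size * n_embd                    # positional embeddings
per_block = (
    4 * n_embd                               # two LayerNorms (weight + bias each)
    + (n_embd * 3 * n_embd + 3 * n_embd)     # attention qkv projection
    + (n_embd * n_embd + n_embd)             # attention output projection
    + (n_embd * 4 * n_embd + 4 * n_embd)     # MLP up-projection
    + (4 * n_embd * n_embd + n_embd)         # MLP down-projection
)
final_ln = 2 * n_embd
total = wte + wpe + n_layer * per_block + final_ln
print(f"~{total:,} parameters (embeddings dominate at this width)")
```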

To set up the dataset, run:

```bash
python data/openwebtext/prepare.py
```

Then, if using multiple GPUs, run:

```bash
torchrun --standalone --nproc_per_node={N_GPU} train.py config/train_gpt2.py
```

Otherwise, simply run:

```bash
python train.py config/train_gpt2.py
```
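Once training finishes, the checkpoint can feed back into the sparse coding pipeline above. A sketch of that step, assuming nanoGPT's default checkpoint layout (out/ckpt.pt with 'model' and 'model_args' keys) and its model.py on the path; run.py may handle this differently, and the block index here is an arbitrary illustrative choice.

```python
import torch
from model import GPT, GPTConfig  # nanoGPT's model.py

ckpt = torch.load("out/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["model_args"]))
# torch.compile prepends "_orig_mod." to state dict keys; strip it if present.
state_dict = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model"].items()}
model.load_state_dict(state_dict)
model.eval()

# Capture MLP activations from one block with a forward hook, to use as
# training data for a sparse autoencoder.
captured = []
hook = model.transformer.h[3].mlp.register_forward_hook(
    lambda module, inputs, output: captured.append(output.detach())
)
with torch.no_grad():
    tokens = torch.randint(0, ckpt["model_args"]["vocab_size"], (1, 64))
    model(tokens)
hook.remove()
print(captured[0].shape)  # (1, 64, n_embd)
```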
