This project was built in part as an application for the MATS program.
This project builds upon Francois Fleuret's culture project (with his permission).
# install my interp library & culture
pip install -r requirements.txt
pip install -e .
# patch TransformerLens
# culture MyGPT does not include a final layer norm, need to add support for this
git clone https://github.com/TransformerLensOrg/TransformerLens.git
cd TransformerLens
git apply ../transfomer_lens_final_ln.patch
pip install -e .
I've been following this project from Francois Fleuret for a while now.
https://fleuret.org/public/culture/draft-paper.pdf
The hypothesis is that intelligence emerges from "social competition" between different agents. The experiment trains 5 GPTs on programmatically generated 2D "world" quizzes; once the models have sufficiently learned the task (accuracy > 95%), they attempt to generate their own "culture" quizzes. A generated quiz is kept if 4 of the 5 models agree on the answer and one gets it wrong, which keeps quizzes that are valid but still sufficiently difficult.
The idea is that the models will start producing progressively more difficult quizzes as a result, and (ideally) new and unique concepts through social interaction.
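The acceptance rule described above can be sketched in a few lines. This is only an illustration of the "4 of 5 agree, one dissents" rule as stated here, not the culture repository's actual implementation; the function name `keep_quiz` and the representation of answers as plain hashable values are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Sequence


def keep_quiz(answers: Sequence[Hashable], n_agree: int = 4) -> bool:
    """Hypothetical filter: keep a quiz when exactly `n_agree` models share
    one answer and at least one other model disagrees."""
    if not answers:
        return False
    # Count how many models produced each distinct answer.
    counts = Counter(answers)
    # Vote count of the most common (majority) answer.
    _, votes = counts.most_common(1)[0]
    # With 5 models and n_agree=4 this means exactly one dissenter.
    return votes == n_agree and len(answers) - votes >= 1


# Five models, four agree, one dissents: the quiz is kept.
print(keep_quiz(["A", "A", "A", "A", "B"]))
```

A unanimous quiz is rejected as too easy, and a 3/2 split is rejected because the "correct" answer is too uncertain.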
I thought 5 GPTs trained in this group setting would be an interesting project from an interpretability perspective:
- Are there universal features shared across these models, as in "Universal Neurons in GPT2 Language Models"?
- We can compare and contrast features learned using an SAE
- This is similar to the "train with different seeds and compare features" approach
- Since the quizzes are synthetic, it should be easier to "interpret" the behaviour of these models, since I can easily partition the data into the different tasks, and dynamically generate new unseen tasks.
- Fundamentally these models are the same as a normal GPT, so I can use the same interpretability tools to understand the models.
- The feature visualizations on these grid tasks (rather than text sentences) would look pretty cool
This was quite ambitious, since the models were not compatible with the TransformerLens library and everything was implemented in a unique way. So there was a fair amount of standard software engineering work to integrate them with existing interpretability tools.
I always try to do something like this with my own projects anyway; implementing from scratch helps me learn much better than just calling `AutoModel.from_pretrained`.
There are a few "gotchas" in running the GPTs:
- Added sinusoidal support for TransformerLens
  - `use_past_kv_cache` is buggy; I think this is from hacking in sinusoidal positional encoding?
- There's no final layer norm in `MyGPT`, so I had to patch TransformerLens to support this too
- You must prepend the input with a `0` as a BOS token (the models generate the entire sequence when creating new quizzes, but not for eval)
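As an illustration of the BOS gotcha above, here is a minimal sketch of prepending `0` to each tokenized quiz before a forward pass. The helper name `prepend_bos` and the list-of-lists input format are assumptions for the example; the library's own preprocessing may handle this differently.

```python
from typing import List, Sequence

BOS_TOKEN = 0  # per the note above, the models expect a leading 0


def prepend_bos(batch: Sequence[Sequence[int]]) -> List[List[int]]:
    """Hypothetical helper: return a copy of each sequence with BOS in front.

    The token 0 may also occur inside a quiz, so this does not try to detect
    an existing BOS; call it exactly once per batch.
    """
    return [[BOS_TOKEN, *seq] for seq in batch]


tokens = prepend_bos([[5, 6, 7], [8, 9]])
print(tokens)
```

Each sequence grows by one token, so any position-dependent indexing (for example, locating the answer grid) needs to be shifted by one as well.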
Find model weights here
This is a fork of the original culture project, by Francois Fleuret.
Model weights are stored here: tommyp111/culture-gpt
Contains most of the lib functionality, including:
- `generate` function (using `model.generate`)
- `load_culture`: Loading the `MyGPT` models
- `load_hooked`: Converting `MyGPT` weights into a `HookedTransformer`
- `load_quizzes`: Loading the `QuizMachine` (culture quizzes)
- `run_tests`: Running tests on the models
Run `python -m interp.culture 0 1 2 3 --num_test_samples 100` to test most of the library functionality and run accuracy tests on the models. Setting `--num_test_samples` to 2000 is standard for eval and should achieve ~95% accuracy. I've found that using a smaller number of samples can give a lower accuracy (it should still be at least in the 80s).
Creates a HF tokenizer for the models. Required for `train_sae.py`.
It can be found on HF here: tommyp111/culture-tokenizer
- `repr_grid` for pretty printing
- `sinusoidal_positional_encoding` impl
- `TOK_PREPROCESS` & `prep_quiz` for preprocessing quizzes
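For readers unfamiliar with sinusoidal positional encodings, the sketch below shows the standard formulation from "Attention Is All You Need" in plain Python. It is not the `sinusoidal_positional_encoding` from this repository, which may differ in details such as the base, dimension ordering, or tensor type; the name `sinusoidal_pe_sketch` and its parameters are hypothetical.

```python
import math
from typing import List


def sinusoidal_pe_sketch(seq_len: int, d_model: int, base: float = 10000.0) -> List[List[float]]:
    """Standard sinusoidal encoding: even dims use sin, odd dims use cos.

    PE[pos][2k]   = sin(pos / base**(2k / d_model))
    PE[pos][2k+1] = cos(pos / base**(2k / d_model))
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos dimensions")
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even dimension indices, i.e. i = 2k.
        for i in range(0, d_model, 2):
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe


pe = sinusoidal_pe_sketch(seq_len=4, d_model=8)
print(len(pe), len(pe[0]))
```

Because the encoding is a fixed function of position rather than a learned table, it has to be added to the token embeddings at the right absolute positions, which is one plausible reason cached decoding could misbehave if positions are not offset correctly.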
Creates a 1M element HF dataset. Again used for `train_sae.py`.
Trains a sparse autoencoder on the models using `sae_lens.SAETrainingRunner`.
Load pretrained SAE from here: tommyp111/culture-sae