Red-teaming Red-teaming a simple language model like gpt2. Based on the anthropic Red-teaming paper. https://arxiv.org/abs/2202.03286 Requirements https://huggingface.co/unitary/toxic-bert https://github.com/neelnanda-io/TransformerLens install detoxify pip install detoxify