Red-Teaming Language Models with DSPy

We use the power of DSPy, a framework for structuring and optimizing language model programs, to red-team language models.

To our knowledge, this is the first attempt at using any auto-prompting framework to perform the red-teaming task. It is also probably the deepest publicly available architecture optimized with DSPy to date.

We accomplish this using a deep language program with several layers of alternating Attack and Refine modules in the following optimization loop:

Figure 1: Overview of DSPy for red-teaming. The DSPy MIPRO optimizer, guided by an LLM-as-a-judge, compiles our language program into an effective red-teamer against Vicuna.
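
To make the layered structure concrete, here is a minimal structural sketch in plain Python. It does not use the DSPy API and is not the code in redteam.py: the function name `run_layers`, the stub callables, and the choice to end with one final attack step after the last refinement are all assumptions made for illustration. In the real program each step would be a DSPy module backed by a language model, and the target would be Vicuna.

```python
from typing import Callable, List, Tuple

# Hypothetical stage signatures; every stage is a plain text-to-text callable here.
AttackFn = Callable[[str, str], str]       # (intent, critique) -> candidate prompt
RefineFn = Callable[[str, str, str], str]  # (intent, prompt, response) -> critique
TargetFn = Callable[[str], str]            # prompt -> target model response


def run_layers(
    intent: str,
    attack: AttackFn,
    refine: RefineFn,
    target: TargetFn,
    layers: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate Attack and Refine steps `layers` times, then attack once more."""
    critique = ""           # the first layer starts with no feedback
    trace: List[str] = []   # records the order in which stages ran
    for _ in range(layers):
        prompt = attack(intent, critique)
        trace.append("attack")
        response = target(prompt)
        critique = refine(intent, prompt, response)
        trace.append("refine")
    # Assumption: the last critique feeds one final attack step.
    return attack(intent, critique), trace


# Harmless stand-ins so the sketch runs without any model access.
def stub_attack(intent: str, critique: str) -> str:
    return f"{intent} [hints: {critique or 'none'}]"


def stub_target(prompt: str) -> str:
    return "REFUSED"


def stub_refine(intent: str, prompt: str, response: str) -> str:
    return f"target said {response!r}; rephrase"


final_prompt, trace = run_layers("PLACEHOLDER_INTENT", stub_attack, stub_refine, stub_target)
print(len(trace), trace[:4])
print(final_prompt)
```

Because each layer is its own step with its own inputs, an optimizer can in principle tune the instructions of every layer separately, which is what makes depth interesting here.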

The following table demonstrates the effectiveness of the chosen architecture, as well as the benefit of DSPy compilation:

Architecture	ASR
None (Raw Input)	10%
Architecture (5 Layer)	26%
Architecture (5 Layer) + Optimization	44%

Table 1: ASR with raw harmful inputs, un-optimized architecture, and architecture post DSPy compilation.
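
As a rough illustration of how an Attack Success Rate can be computed from judge outputs, the sketch below counts an attack as successful when its judge score meets a threshold. The 0-to-1 score scale, the 0.5 threshold, the function names, and the toy scores are assumptions for illustration, not the judge setup used in this repo. The last lines only redo the arithmetic on the percentages reported in Table 1.

```python
from typing import Sequence


def attack_success_rate(judge_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of attempts whose judge score is at or above `threshold`."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    successes = sum(1 for score in judge_scores if score >= threshold)
    return successes / len(judge_scores)


def relative_gain(asr_percent: float, baseline_percent: float) -> float:
    """How many times larger one ASR is than a baseline ASR."""
    return asr_percent / baseline_percent


# Toy scores, made up purely to exercise the function.
toy_scores = [1.0, 0.0, 0.2, 0.9]
print(attack_success_rate(toy_scores))

# Percentages taken from Table 1.
print(relative_gain(26, 10))  # un-optimized architecture vs. raw input
print(relative_gain(44, 10))  # optimized architecture vs. raw input
```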

With no specific prompt engineering, we achieve an Attack Success Rate (ASR) of 44%, 4.4x the raw-input baseline of 10%. This is by no means the SOTA, but considering that we spent essentially no effort designing the architecture and prompts, and that we used an off-the-shelf optimizer with almost no hyperparameter tuning (except to fit compute constraints), we think this result is pretty exciting!

Full exposition on the Haize Labs blog.
