This repository contains the code used to measure safety scores for pre-trained language models, based on the human-annotated ToxiGen dataset and the ImplicitHate dataset.
- We selected a subset of the ToxiGen and ImplicitHate datasets. The examples in the ImplicitHate subset are labeled either implicit-hate or neutral, and we down-sampled the neutral examples to obtain an equal number of harmful and benign examples. ImplicitHate does not provide any information about the target of the hate in each sentence.
- The examples in the ToxiGen subset include only the sentences on which all annotators agreed on whether the sentence is harmful, and more than two annotators agreed on the target group of the hate (see the sketch after this list).
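
For illustration, here is a minimal sketch of the two selection rules above. The record layout and the field names (`label`, `annotations`, `is_harmful`, `target_group`) are assumptions made for this example, not the actual schema of either dataset or of this repository's preprocessing code.

    import random
    from collections import Counter

    def balance_implicit_hate(examples, seed=0):
        # `label` and its values are hypothetical field names for illustration.
        harmful = [ex for ex in examples if ex["label"] == "implicit_hate"]
        neutral = [ex for ex in examples if ex["label"] == "neutral"]
        random.Random(seed).shuffle(neutral)
        # Keep only as many neutral examples as there are harmful ones.
        return harmful + neutral[: len(harmful)]

    def filter_toxigen(examples):
        # Keep a sentence only if every annotator gave the same harmful/benign
        # judgment and more than two annotators named the same target group.
        kept = []
        for ex in examples:
            harm_votes = {a["is_harmful"] for a in ex["annotations"]}
            targets = Counter(a["target_group"] for a in ex["annotations"])
            if len(harm_votes) == 1 and max(targets.values()) > 2:
                kept.append(ex)
        return kept

Down-sampling the majority class, rather than duplicating harmful examples, keeps the subset balanced without introducing repeated sentences.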
There are a few dependencies to install before running the safety score calculator; you can install them with `pip install -r requirements.txt`.
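
For example, in a fresh virtual environment (standard Python tooling; this repository does not specifically require venv):

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt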
Now you can run the following script:

    python safety_score.py \
        --data data/toxiGen.json \
        --output results \
        --model gpt2 \
        --lmHead clm \
        --force

where:
- `--data` is the path to the evaluation dataset,
- `--lmHead` is the type of language model head, i.e. causal (`clm`) or masked,
- `--force` overwrites the output path if it already exists.
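
To evaluate on the ImplicitHate subset instead, point `--data` at the corresponding file. The filename below is an assumption based on the ToxiGen example above, so check the `data/` directory for the actual name:

    python safety_score.py \
        --data data/implicitHate.json \
        --output results \
        --model gpt2 \
        --lmHead clm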