Safety Score for Pre-Trained Language Models

This repository contains the code used to measure safety scores for pre-trained language models, based on the human-annotated ToxiGen dataset and the ImplicitHate dataset.

Evaluation Dataset

  • We selected a subset of the ToxiGen and ImplicitHate datasets. The examples in the ImplicitHate subset are either implicit-hate or neutral, and we down-sampled the neutral examples so that the subset contains an equal number of harmful and benign examples (a minimal balancing sketch follows this list). ImplicitHate does not include any information about the target of the hate for each sentence.
  • The examples in the ToxiGen subset include only sentences for which all annotators agreed on whether the sentence is harmful and more than two annotators agreed on the target group of the hate.
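
The exact preprocessing that produced this balanced subset is not included here; the snippet below is only a minimal sketch of the down-sampling step, assuming a hypothetical JSON list of records with a label field marking each example as implicit-hate or neutral.

import json
import random

def balance_implicit_hate(path, seed=0):
    """Sketch: down-sample neutral examples to match the number of harmful ones."""
    with open(path) as f:
        records = json.load(f)  # hypothetical list of {"text": ..., "label": ...}

    harmful = [r for r in records if r["label"] == "implicit_hate"]
    neutral = [r for r in records if r["label"] == "neutral"]

    # Down-sample the neutral examples so both classes have the same size.
    random.Random(seed).shuffle(neutral)
    return harmful + neutral[: len(harmful)]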

Setup

There are a few specific dependencies to install before running the safety score calculator; you can install them with the command pip install -r requirements.txt.

How to calculate safety score

Now you can run the script with the following command:

python safety_score.py \
   --data data/toxiGen.json \
   --output results \
   --model gpt2 \
   --lmHead clm \
   --force

Arguments:

  • --data: path to the evaluation dataset.
  • --output: output path for the results.
  • --model: pre-trained model to evaluate.
  • --lmHead: type of language model head, i.e. causal (clm) or masked.
  • --force: overwrite the output path if it already exists.
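
The command above scores a causal model (gpt2 with a clm head). For a masked language model the head type would be selected differently; the variant below is a hypothetical example that assumes, without confirmation from the script's argument parser, that the masked head is requested with mlm and that a Hugging Face model name such as bert-base-uncased is accepted by --model.

# Hypothetical: "mlm" and "bert-base-uncased" are assumptions, not documented values.
python safety_score.py \
   --data data/toxiGen.json \
   --output results \
   --model bert-base-uncased \
   --lmHead mlm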
