This repository contains the LMentry benchmark from LMentry: A Language Model Benchmark of Elementary Language Tasks, as well as the code to evaluate it.
For any questions, feel free to open a GitHub issue or to contact us at avia.efrat@gmail.com 😊
Simply clone the repo:
git clone https://github.com/aviaefrat/lmentry.git
The data is in the `data` directory.
We provide functions for generating predictions with Hugging Face and OpenAI models (see below), but you can generate predictions using any method you choose.
For Hugging Face and OpenAI models, you can use the `generate_all_hf_predictions` and `generate_all_openai_predictions` functions from `predict.py`. These are what we used in our experiments.
The easiest and recommended way to evaluate predictions is to use `evaluate.py`:
python -m lmentry.evaluate
Don't forget to activate the lmentry environment (created from `environment.yml`) beforehand.
Using the optional `--num-procs=N` argument will score the predictions much faster.
`evaluate.py` will also automatically create files analyzing the results in a separate `results` directory.
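If you prefer to launch the evaluation from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_evaluate_command` is a hypothetical helper, and defaulting `--num-procs` to the number of CPU cores is an assumption about what suits your machine.

```python
import os
import sys


def build_evaluate_command(num_procs=None):
    """Build the argument list for `python -m lmentry.evaluate --num-procs=N`."""
    if num_procs is None:
        # Assumption: one process per available CPU core is a reasonable default.
        num_procs = os.cpu_count() or 1
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    # sys.executable is the interpreter of the currently active environment.
    return [sys.executable, "-m", "lmentry.evaluate", f"--num-procs={num_procs}"]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, with the lmentry environment active.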
To use `evaluate.py`, the predictions must follow the same structure as `lmentry_predictions.zip` (if you used our functions from `predict.py`, your predictions already follow this structure):
- The top-level directory should be named `predictions`. `predictions` needs to contain exactly 41 directories, named after the 41 files in `data` (the 25 task names + the 16 files for the argument content robustness).
- Each of the 41 task predictions directories should contain a prediction file for each model you want to evaluate. For example, to evaluate the predictions of a model named `my-model`, each of the 41 directories should contain a file named `my-model.json` with the model's predictions for this task.
- Each predictions file should contain values in the form `"<id>": {"prediction": <prediction>},` where the `id`s correspond to those in the task's file in `data`.
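If you generate predictions with your own code, the structure described above could be produced along these lines. This is a minimal sketch under stated assumptions: `write_task_predictions` is a hypothetical helper (not a function from `predict.py`), and it assumes a predictions file needs nothing beyond the `"<id>": {"prediction": ...}` entries.

```python
import json
from pathlib import Path


def write_task_predictions(predictions_root, task_name, model_name, predictions):
    """Write {id: prediction_text} to <predictions_root>/<task_name>/<model_name>.json."""
    task_dir = Path(predictions_root) / task_name
    task_dir.mkdir(parents=True, exist_ok=True)
    # Wrap each raw prediction in the {"prediction": ...} form described above.
    payload = {str(id_): {"prediction": text} for id_, text in predictions.items()}
    out_path = task_dir / f"{model_name}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return out_path
```

For example, `write_task_predictions("predictions", task_name, "my-model", {"1": "some output"})` would create `predictions/<task_name>/my-model.json`, where `task_name` must match one of the 41 file names in `data` and the ids must match those in that task's file.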
- Clone the repository.
- Unzip `lmentry_predictions.zip` into the top-level lmentry directory.
- Run `evaluate.py` (preferably with a not-very-small value for `--num-procs`, as there are 656 files to score...)
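Before starting a long scoring run, it can help to confirm that the archive unpacked as expected. The sketch below counts task directories and prediction files; `summarize_predictions_dir` is a hypothetical helper, and it assumes the archive unpacks into a `predictions` directory whose prediction files all end in `.json`.

```python
from pathlib import Path


def summarize_predictions_dir(predictions_root):
    """Return (number of task directories, total number of .json prediction files)."""
    root = Path(predictions_root)
    task_dirs = [d for d in root.iterdir() if d.is_dir()]
    # Count only regular .json files directly inside each task directory.
    n_files = sum(1 for d in task_dirs for f in d.glob("*.json") if f.is_file())
    return len(task_dirs), n_files
```

If the unzipped archive matches the description above, `summarize_predictions_dir("predictions")` should report 41 directories and 656 files.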