Skip to content

Latest commit

 

History

History

summarization_indosum

Summarization

IndoLEM uses IndoSum for extractive summarization. Our experiment is based on Liu and Lapata (2018) framework with three BERT models: IndoBERT, malayBERT, and mBERT.

Requirements

Tested with below configuration. Higher torch version is not suitable for PreSumm.

python==3.7.6
torch==1.1.0
torchvision==0.8.1
transformers==3.0.0
pyrouge==0.1.3
tensorboardX==2.1

Experiment

  1. First, download the data here and put the them (all folds) in folder data/
  2. Original implementation can be found here.
  3. Run three scripts for data preprocessing:
python make_datafiles_presum_indobert.py
python make_datafiles_presum_malaybert.py
python make_datafiles_presum_mbert.py
  1. Now you can run the experiment by using the script below:

IndoBERT

cd scripts
./train_indobert.sh
./eval_indobert.sh

MalayBERT

cd scripts
./train_malaybert.sh
./eval_malaybert.sh

mBERT

cd scripts
./train_mbert.sh
./eval_mbert.sh

In scripts/ run chmod +x * to enable bash execution. The training requires 3 GPUs (V100 16GB). If you have lower GPU size, please reduce the batch size.

Since we use 5-fold cross validation, the experiments are run 5 times with different folds in folder data/. Please adjust the script.

Evaluation

Please install pyrouge for evaluating the summary.