By Mike Pham and Jackson Lee
Academic publication:
Pham, Mike, & Jackson L. Lee. 2018. Mincing words: Balancing recovery and deletion in word truncation. Glossa: A Journal of General Linguistics, 3(1), 36. DOI: http://doi.org/10.5334/gjgl.269
This repo contains code, datasets, and results associated with this publication.
- `main.py`: Python script for running the different truncation models over the data files in the `data/` folder
- `data/`: directory for datasets
  - `pt_br_full.txt`: Brazilian Portuguese lexicon (8.7 MB)
  - `gold_standard.txt`: gold standard nouns with attested truncation, all annotated with predictions from models such as the binary foot models
- `plots_for_words/`: directory for output plots for individual words for log(left-complete counts) and log(right-complete counts)
- `results/`: directory for output files
  - error details in CSV
  - error distribution boxplot
  - evaluation
  - L- and R-complete counts for individual words in LaTeX and PDF
  - R code for making individual plots for test words
- `readme.md`: this readme file

NumPy, Pandas, and Seaborn are required to run `main.py`.
For reproducibility, the exact versions we use are pinned in `requirements.txt`,
and you can install these dependencies with `pip install -r requirements.txt`.
We are using Python 3.6.3.

Optional: if the following commands are recognized in your path environment:

- `Rscript`: `plot_word.R` (generated by `main.py`) is run and the plots for individual words are saved in `plots_for_words/`.
- `xelatex`: `individual_word_details.tex` (generated by `main.py`) is compiled and the resultant PDF is saved in `results/`.

`Rscript` comes from R, whereas `xelatex` comes from a LaTeX distribution
(such as `texlive-full`).
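
To check in advance whether these optional commands are visible on your PATH, something like the following minimal sketch may help. It only illustrates the idea with the standard library's `shutil.which`; the helper name `command_available` is hypothetical, and this is not necessarily how `main.py` itself detects the commands.

```python
import shutil


def command_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    # shutil.which returns the full path to the executable, or None if absent.
    return shutil.which(name) is not None


if __name__ == "__main__":
    # The two optional tools named in this readme.
    for tool in ("Rscript", "xelatex"):
        status = "found" if command_available(tool) else "not found"
        print(f"{tool}: {status}")
```

If either tool is reported as not found, the corresponding optional step (R plots or PDF compilation) would presumably be unavailable on that machine.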
Download this repository to your local drive by one of these two methods:

- Download and unzip https://github.com/jacksonllee/BP-truncation/archive/master.zip
- Clone this repository:

      $ git clone https://github.com/jacksonllee/BP-truncation.git
      $ cd BP-truncation

`main.py` can take optional arguments.
The argument `-h` brings up the help page with details of these arguments:

    $ python main.py -h
    usage: main.py [-h] [-f] [-l] [-r] [-d] [-x LEXICON] [-g GOLDSTANDARD]

    Modeling truncation in Brazilian Portuguese, by Mike Pham and Jackson Lee

    optional arguments:
      -h, --help            show this help message and exit
      -f, --freqtoken       Use token frequencies in lexicon (default: False)
      -l, --latex           Compile the output LaTeX file (default: False)
      -r, --run_r_script    Run R script (default: False)
      -d, --digraphsfixed   Change orthographic digraphs into monographs (default:
                            False)
      -x LEXICON, --lexicon LEXICON
                            Lexicon file (default: data/pt_br_full.txt)
      -g GOLDSTANDARD, --goldstandard GOLDSTANDARD
                            Gold standard file (default: data/gold_standard.txt)

The sample output files included in this repository are generated by this command:

    $ python main.py -lr

This command has most of the default settings as described in the help page
shown above, except that R code is run and LaTeX compilation is triggered.
If you don't want to run R and LaTeX, simply run `python main.py`
with no other arguments.

If, for instance, you'd like to make use of word token frequency information
in the models that involve right-completes and left-completes,
you should run `python main.py -flr` (still running R and LaTeX).
All output files for this command bear the suffix "-tokenfreq".

To activate orthographic digraph replacements, run `python main.py -dlr`.
All output files for this command bear the suffix "-nodigraphs".

The arguments `--lexicon` and `--goldstandard` may be used to override
the default lexicon and gold standard files, respectively.
See the sections below on their file format.
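
For readers who want to see how a command-line interface like this can be declared, here is a minimal sketch using the standard library's `argparse`. The flag names, defaults, and help strings are copied from the help page above; the function name `build_parser` and the overall structure are assumptions for illustration, not the actual code in `main.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help page (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Modeling truncation in Brazilian Portuguese, "
                    "by Mike Pham and Jackson Lee",
        # This formatter appends "(default: ...)" to each help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Boolean switches: absent means False, present means True.
    parser.add_argument("-f", "--freqtoken", action="store_true",
                        help="Use token frequencies in lexicon")
    parser.add_argument("-l", "--latex", action="store_true",
                        help="Compile the output LaTeX file")
    parser.add_argument("-r", "--run_r_script", action="store_true",
                        help="Run R script")
    parser.add_argument("-d", "--digraphsfixed", action="store_true",
                        help="Change orthographic digraphs into monographs")
    # File overrides with the defaults shown in the help page.
    parser.add_argument("-x", "--lexicon", default="data/pt_br_full.txt",
                        help="Lexicon file")
    parser.add_argument("-g", "--goldstandard", default="data/gold_standard.txt",
                        help="Gold standard file")
    return parser


# Short switches can be combined, so "-lr" turns on both -l and -r.
args = build_parser().parse_args(["-lr"])
print(args.latex, args.run_r_script, args.freqtoken)  # True True False
```

Combining single-letter switches, as in `-lr` or `-flr`, is standard `argparse` behavior for flags that take no value.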

The lexicon file `data/pt_br_full.txt` is a plain text file
where each line begins with a word, and then a space, and finally
the frequency count of that word.
Here are the first ten lines of the lexicon file:

    que 12021478
    não 9712854
    o 9578625
    de 8089861
    a 7188507
    é 6843557
    você 6211533
    e 5863939
    eu 5741437
    um 4589127

This lexicon file is from here (released with an MIT license), which in turn derived the lexicon and frequency counts from movie subtitles. The data is therefore highly representative of the spoken language.
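
As an illustration of the lexicon format, the sketch below reads word and frequency pairs into a dictionary. The function names `parse_lexicon_lines` and `load_lexicon` are hypothetical, and UTF-8 encoding is an assumption; `main.py` may read the file differently.

```python
def parse_lexicon_lines(lines):
    """Map each word to its integer frequency count."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: word first, count last.
        word, count = line.rsplit(maxsplit=1)
        freqs[word] = int(count)
    return freqs


def load_lexicon(path="data/pt_br_full.txt"):
    """Read a lexicon file from disk (UTF-8 is assumed here)."""
    with open(path, encoding="utf-8") as f:
        return parse_lexicon_lines(f)


# The first three lines of the lexicon file shown above.
sample = ["que 12021478", "não 9712854", "o 9578625"]
lexicon = parse_lexicon_lines(sample)
print(lexicon["não"])  # 9712854
```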

The gold standard file is a plain text file
where each line has one original (untruncated) word, and then a
space/tab, and finally the truncated form (TF) of that original word;
the TF is just for reference and is not used by `main.py` in any way.
The original word is annotated by the following symbols:

- `|`: where the true truncation point is for forming the truncated stem (TS). For instance, the TS for adrenalina (see below) is adren.
- `$`: where the truncation point is as predicted by the binLR model.
- `#`: where the truncation point is as predicted by the binRL model.

Here are the first ten lines of the default gold standard file in this
repository (`data/gold_standard.txt`):

    adr$en|#alina adrena
    an$alf|#abeto analfa
    bat$#er|ista batera
    bel$e|z#a belê
    berm|$ud#a bermas
    bij$#u|teria biju
    bis|$av#ó bisa
    bob|$eir#a bobis
    bot$equ|#im boteco
    burg|$#ês burga

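
To make the annotation scheme concrete, the following sketch recovers the plain word and the three stems (true, binLR, binRL) from one gold standard line. The names `MARKERS` and `parse_gold_line` are hypothetical, and this only demonstrates how the symbols can be read; it is not taken from `main.py`.

```python
# Annotation symbol -> label for the stem that ends at that symbol.
MARKERS = {"|": "gold", "$": "binLR", "#": "binRL"}


def parse_gold_line(line):
    """Split one gold standard line into the word, its TF, and the stems."""
    annotated, truncated_form = line.split()  # separated by a space or tab
    letters = []
    stems = {}
    for char in annotated:
        if char in MARKERS:
            # The stem is everything (minus symbols) seen before the marker.
            stems[MARKERS[char]] = "".join(letters)
        else:
            letters.append(char)
    return {
        "word": "".join(letters),
        "truncated_form": truncated_form,
        "stems": stems,
    }


entry = parse_gold_line("adr$en|#alina adrena")
print(entry["word"])            # adrenalina
print(entry["stems"]["gold"])   # adren
print(entry["stems"]["binLR"])  # adr
print(entry["stems"]["binRL"])  # adren
```

For adrenalina, the binRL stem coincides with the true truncated stem, while the binLR stem falls two letters short.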
MIT License