SAPPHIRE is a simple monolingual phrase aligner based on word embeddings.
SAPPHIRE depends only on a pre-trained word embedding.
Therefore, it is easily transferable to specific domains and different languages.
This library is designed for a pre-trained fastText model, but the model is easy to replace.
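Because the aligner only needs a vector for each word, any embedding that maps a token to a vector could in principle stand in for fastText. The sketch below illustrates that idea with a toy lookup table. The `ToyEmbedding` class, its `vector` method, and the three-dimensional vectors are all hypothetical and are not SAPPHIRE's actual interface. It computes the cosine similarity between every word pair of two tokenized sentences, which is shown here only as an assumed example of the kind of embedding-based signal a phrase aligner can build on.

```python
import math


class ToyEmbedding:
    """Hypothetical stand-in for a pre-trained model: maps a word to a vector."""

    def __init__(self, table, dim):
        self.table = table
        self.dim = dim

    def vector(self, word):
        # Unknown words get a zero vector in this toy example.
        return self.table.get(word, [0.0] * self.dim)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def similarity_matrix(model, tokens_a, tokens_b):
    """Rows follow tokens_a, columns follow tokens_b."""
    return [[cosine(model.vector(a), model.vector(b)) for b in tokens_b]
            for a in tokens_a]


# Made-up 3-dimensional vectors, for illustration only.
model = ToyEmbedding({"cat": [1.0, 0.0, 0.0],
                      "kitten": [0.9, 0.1, 0.0],
                      "car": [0.0, 1.0, 0.0]}, dim=3)
matrix = similarity_matrix(model, ["cat", "car"], ["kitten", "bus"])
for row in matrix:
    print([round(value, 3) for value in row])
```

Swapping in a different embedding would then only mean providing another object with the same word-to-vector method, under the assumptions stated above.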
- Python 3.6 or newer
- NumPy & SciPy
- fasttext
- Install requirements
After cloning this repository, go to its root directory and install the requirements.
$ pip install -r requirements.txt
- Install SAPPHIRE
Installing with the develop option allows you to change the parameters and add scripts for other word representations.
$ python setup.py develop
- Download a pre-trained fastText model (or prepare your own fastText model) and move it to the model directory.
$ curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip
$ unzip wiki-news-300d-1M-subword.bin.zip
$ mkdir model
$ mv wiki-news-300d-1M-subword.bin model/
$ python run_sapphire.py
To stop SAPPHIRE, enter exit when inputting a sentence.
>>> from sapphire import Sapphire
>>> aligner = Sapphire()
After preparing a tokenized sentence pair (tokenized_sentence_a: list and tokenized_sentence_b: list),
>>> result = aligner.align(tokenized_sentence_a, tokenized_sentence_b)
>>> alignment = result.top_alignment[0][0]
>>> print(alignment)
[(1, 3, 2, 3), (8, 9, 5, 6), (13, 13, 8, 8), (27, 27, 9, 9)]
Each tuple describes one phrase pair (x, y) as (x_start, x_end, y_start, y_end), where all positions are 1-indexed token indices.
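To read such an alignment back as text, the index tuples can be mapped onto the token lists. The following is a minimal sketch and not part of SAPPHIRE: the helper name `phrase_pairs`, the example sentences, and the example alignment are hypothetical (the alignment is not output from SAPPHIRE). It assumes that the end indices are inclusive, as the single-token span (13, 13, 8, 8) suggests, and that x refers to the first sentence passed to align.

```python
def phrase_pairs(alignment, tokens_a, tokens_b):
    """Turn 1-indexed (x_start, x_end, y_start, y_end) spans into phrase strings."""
    pairs = []
    for x_start, x_end, y_start, y_end in alignment:
        # Shift the start by one for 0-indexed slicing; the inclusive end
        # already equals the exclusive slice bound.
        phrase_x = " ".join(tokens_a[x_start - 1:x_end])
        phrase_y = " ".join(tokens_b[y_start - 1:y_end])
        pairs.append((phrase_x, phrase_y))
    return pairs


# Hypothetical sentences and alignment, for illustration only.
tokens_a = "the cat sat on the mat".split()
tokens_b = "a cat was sitting on a rug".split()
alignment = [(2, 2, 2, 2), (3, 3, 3, 4), (5, 6, 6, 7)]

for phrase_x, phrase_y in phrase_pairs(alignment, tokens_a, tokens_b):
    print(f"{phrase_x} <-> {phrase_y}")
```

Whitespace splitting is used here only to keep the example short; any tokenizer that produces a list of strings would work the same way.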