This repository holds the code for the paper:
- Xin Du and Kumiko Tanaka-Ishii. "Semantic Field of Words Represented as Nonlinear Functions", NeurIPS 2022.
We proposed a new word representation in a functional space rather than a vector
space, called FIeld REpresentation (FIRE). Each word $w$ is represented as a pair
$(\mu_w, f_w)$: a set of locations $\mu_w$ and a nonlinear function $f_w$, the
word's *semantic field*, defined over a low-dimensional space.
FIRE represents word polysemy by the multimodality of the semantic field $f_w$.

The similarity between two sentences $\Gamma_1$ and $\Gamma_2$ is computed as

$$\mathrm{sim}(\Gamma_1, \Gamma_2) = \int f_{\Gamma_2} \, \mathrm{d}\mu_{\Gamma_1} + \int f_{\Gamma_1} \, \mathrm{d}\mu_{\Gamma_2},$$

where the locations $\mu_\Gamma$ and the field $f_\Gamma$ of a sentence $\Gamma$ are obtained by summing those of its words.
*(Figure)* Overlapped semantic fields of `river` and `financial`, and their locations. The shape resembles that of `bank` in the image above, indicating FIRE's property of compositionality.
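As a minimal illustration of the similarity formula above, the sketch below evaluates toy Gaussian-bump fields at a handful of locations. The field parameterization and all names here are made up for illustration and are not the repository's actual API; with discrete, equally weighted locations, the two integrals reduce to means of field values.

```python
import torch

def field_value(centers, x):
    # Toy multimodal field: a mixture of Gaussian bumps centered at
    # `centers`, evaluated at the locations `x`.
    # Shapes: centers (K, D), x (N, D) -> (N,)
    sq_dist = torch.cdist(x, centers) ** 2   # (N, K) squared distances
    return torch.exp(-sq_dist).sum(dim=-1)   # sum the K bumps at each x

# Each word/sentence is a pair (locations mu, field f); here every field
# is parameterized only by its own bump centers, for simplicity.
mu_1, centers_1 = torch.randn(3, 2), torch.randn(4, 2)  # item 1
mu_2, centers_2 = torch.randn(3, 2), torch.randn(4, 2)  # item 2

# sim = integral of f_2 w.r.t. mu_1 + integral of f_1 w.r.t. mu_2,
# approximated by means over the discrete locations.
sim = field_value(centers_2, mu_1).mean() + field_value(centers_1, mu_2).mean()
print(float(sim))
```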
A challenge in implementing FIRE is to parallelize the evaluation of many
functions, each parameterized by its own neural network.
The usual way of using a neural network $\text{NN}$ is to process one data batch
at a time, which parallelizes over data only.
In FIRE-based language models, we instead require the parallelization of both neural networks and data. The desired behavior should include:
- plain mode: $\text{NN}_1(x_1), \text{NN}_2(x_2), \dots$
- cross mode:
  - $\text{NN}_1(x_1), \text{NN}_1(x_2), \dots$
  - $\text{NN}_2(x_1), \text{NN}_2(x_2), \dots$
  - $\cdots$
In other words, the separate neural networks must be batchified, much like
column vectors that are indexed from a matrix and recombined into a new matrix.
We call this process "stacking and slicing" and provide one solution in this
repository; see the `StackSlicing` class.
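As a rough illustration of the stacking idea (not the actual `StackSlicing` implementation; all shapes and names are hypothetical), the sketch below stacks the weights of $N$ one-hidden-layer MLPs into 3-D tensors, so that both the plain and cross modes reduce to batched matrix multiplications:

```python
import torch

N, D_in, D_h = 4, 2, 8          # number of networks, input dim, hidden dim

# Stacked parameters of N independent one-hidden-layer MLPs.
W1 = torch.randn(N, D_h, D_in)  # first-layer weights of all N networks
W2 = torch.randn(N, 1, D_h)     # output-layer weights of all N networks

def plain(x):
    # x: (N, D_in). Computes NN_i(x_i) for each i; returns shape (N,).
    h = torch.relu(torch.bmm(W1, x.unsqueeze(-1)))          # (N, D_h, 1)
    return torch.bmm(W2, h).reshape(N)                      # (N,)

def cross(x):
    # x: (M, D_in). Computes NN_i(x_j) for all i, j; returns shape (N, M).
    h = torch.relu(torch.einsum('nhd,md->nmh', W1, x))      # (N, M, D_h)
    return torch.einsum('nkh,nmh->nmk', W2, h).squeeze(-1)  # (N, M)

x = torch.randn(N, D_in)
print(plain(x).shape)   # torch.Size([4])
print(cross(x).shape)   # torch.Size([4, 4])
```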
We selected a subset of the "Core" WordNet dataset and constructed a list of
542 strongly polysemous / strongly monosemous words.
See `/data/wordnet-542.txt`.
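A minimal loader for the list might look like this (assuming one entry per line; check the file for its exact format):

```python
with open("data/wordnet-542.txt") as f:
    words = [line.strip() for line in f if line.strip()]
print(len(words))  # expected: 542
```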
We provide scripts in `/scripts/text8/` to train a FIRE model on the text8 dataset.
```bash
# download the *text8* corpus
$ bash scripts/text8/1_download_text8.sh

# tokenize the corpus with the NLTK tokenizer
$ bash scripts/text8/2_tokenize.sh

# build a vocabulary from the tokenized corpus
$ bash scripts/text8/3_build_vocab.sh

# train from scratch
$ bash scripts/text8/4_train.sh

# Training takes 2-3 hours, so it is recommended to run it in the background.
# For example:
$ CUDA_VISIBLE_DEVICES=0 nohup bash scripts/text8/4_train.sh > log.train.log 2>&1 &
```
The training is carried out with the SkipGram method.
For fast SkipGram-style sampling from the tokenized corpus, we used the Python
package `corpusit`, which is written in Rust (with bindings generated by PyO3).
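For illustration, the sketch below shows the kind of (center, context) pairs a SkipGram sampler produces; it is plain Python with a hypothetical function name and does not reflect corpusit's actual API:

```python
import random

def skipgram_pairs(tokens, window=5):
    # Yield (center, context) pairs: for each position, pair the center
    # word with neighbors inside a randomly shrunk window, as in word2vec.
    for i, center in enumerate(tokens):
        w = random.randint(1, window)  # dynamic window size
        for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
            if j != i:
                yield center, tokens[j]

tokens = "anarchism originated as a term of abuse".split()
print(list(skipgram_pairs(tokens, window=2))[:5])
```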