This repository holds the code for the paper:
- Xin Du and Kumiko Tanaka-Ishii. "Semantic Field of Words Represented as Non-Linear Functions", NeurIPS 2022.
We propose a new word representation in a functional space rather than a vector
space, called FIeld REpresentation (FIRE). Each word is represented by a pair of a function and a measure.
Compared with previous word representation methods, FIRE represents nonlinear
word polysemy while preserving a linear structure for additive semantic compositionality.
The word polysemy is represented by the multimodality of
The similarity between two sentences
where
Figure (left): words that are frequent and similar to the word `bank`, visualized in the semantic field of `bank`; the senses of `bank` are naturally separated with FIRE. Figure (right): overlapped semantic fields of `river` and `financial`, and their locations relative to `bank` in the image above, indicating FIRE's property of compositionality.
- Python >= 3.7
- Packages

      # With CUDA 11
      $ pip install -r requirements-cu11.txt
      # With CUDA 10
      $ pip install -r requirements-cu10.txt

- NLTK: you may need to download the NLTK `punkt` package. To do that, run `python` and execute the following:

      >>> import nltk
      >>> nltk.download('punkt')

- If you are using Windows or macOS, you need to install the Rust compiling toolchain (`cargo`) in advance, which is used to compile the `corpusit` package from source.
We provide the following pre-trained FIRE models:
- $D=2,L=4,K=1$ (23 parameters per word)
- $D=2,L=4,K=10$ (50 parameters per word)
- $D=2,L=8,K=20$ (100 parameters per word)
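As a sanity check, the three parameter counts above fit a simple budget. The sketch below reads the three numbers in each entry as a dimension `D`, a number of function layers `L`, and a third count called `K` here. The formula is an inference from the listed numbers only, not something stated in this repository: it assumes each function layer costs `2D + 1` parameters and each of the `K` measure components costs `D + 1` (a location plus a weight). The helper `params_per_word` is hypothetical.

```python
def params_per_word(D: int, L: int, K: int) -> int:
    """Hypothetical parameter budget, inferred from the listed counts."""
    function_params = L * (2 * D + 1)  # assumed cost of L function layers
    measure_params = K * (D + 1)       # assumed: K locations in R^D, each with a weight
    return function_params + measure_params


# The three configurations listed above.
for D, L, K in [(2, 4, 1), (2, 4, 10), (2, 8, 20)]:
    print(f"D={D}, L={L}, K={K}: {params_per_word(D, L, K)} parameters per word")
```

Under these assumptions the three configurations give 23, 50, and 100, which matches the list.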
You can run the following to download all three models.
$ bash scripts/benchmark/1_download_pretrained.sh
The models will be downloaded to `checkpoints/` and decompressed.
The saved models can be reloaded by
import firelang
model = firelang.FireWord.from_pretrained('checkpoints/v1.1/wacky_mlplanardiv_d2_l4_k10')
Execute the benchmarking script as follows:
$ bash scripts/benchmark/2_run_benchmark.sh
We integrated WandB functionality into the training program for experiment management and visualization. By default, you need to complete the following three steps to enable it:
- Create `wandb_config.py` from the template `wandb_config.template.py`
- Register a WandB account
- Fill in your username (`WANDB_ENTITY`) and token (`WANDB_API_KEY`) in `wandb_config.py`
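A filled-in `wandb_config.py` might look like the sketch below. Only the two variable names come from the steps above; the real template may define additional fields, and reading the values from environment variables is a suggestion here (so the token stays out of version control), not something the template is known to do.

```python
# wandb_config.py: a minimal sketch; the actual template may contain more fields.
import os

# Your username or team name (placeholder default, replace with your own).
WANDB_ENTITY = os.environ.get("WANDB_ENTITY", "your-username")

# Your API token. Taken from the environment so it is not committed by accident.
WANDB_API_KEY = os.environ.get("WANDB_API_KEY", "")
```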
If you do not plan to use WandB, delete the argument `--use_wandb` in `scripts/text8/4_train.sh` and `scripts/wacky/4_train.sh`.
We provide scripts in `/scripts/text8/` to train a FIRE model on the text8 dataset.
Text8 is smaller (~100MB) and is publicly available.
# download the text8 corpus
$ bash scripts/text8/1_download_text8.sh
# tokenize the corpus with the NLTK tokenizer
$ bash scripts/text8/2_tokenize.sh
# build a vocabulary with the tokenized corpus
$ bash scripts/text8/3_build_vocab.sh
# training from scratch
$ bash scripts/text8/4_train.sh
# This takes 2-3 hours, so it is recommended to run the process in the background.
# For example:
# $ CUDA_VISIBLE_DEVICES=0 nohup bash scripts/text8/4_train.sh > log.train.text8.log 2>&1 &
The training process is carried out with the SkipGram method.
For fast SkipGram-style sampling from the tokenized corpus, we use
another Python package, `corpusit`, which is written in Rust (and bound to Python with PyO3).
On Windows or macOS, installing `corpusit` with `pip` compiles the
package from source; in this case, you need to have `cargo` installed in advance.
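For readers unfamiliar with SkipGram sampling, the sketch below enumerates (center, context) pairs within a fixed window. It only illustrates what such a pair is. The function `skipgram_pairs` is a hypothetical pure-Python helper and does not reflect the API or the sampling strategy of `corpusit`.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) for every context token within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                yield center, tokens[j]


pairs = list(skipgram_pairs(["the", "river", "bank"], window=1))
print(pairs)
# [('the', 'river'), ('river', 'the'), ('river', 'bank'), ('bank', 'river')]
```

A full enumeration like this does not scale to a large corpus, which is presumably why a fast sampler is used for training instead.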
The WaCKy corpus is a concatenation of two corpora, `ukWaC` and `WaCkypedia_EN`.
Both are provided at https://wacky.sslmit.unibo.it/doku.php?id=download upon request.
After you get the two corpora, put the concatenated file at `/data/corpus/wacky/wacky.txt`.
Then, you can run the scripts under `/scripts/wacky/`
(for tokenization, vocabulary construction, and training) to train a
FIRE model on the concatenated corpus. The process takes 10-20 hours depending on the hardware.
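The concatenation step can be done with any tool. Below is a minimal Python sketch, assuming both corpora have already been extracted to plain-text files (the downloaded format may need extra preprocessing first). The helper `concatenate` and the file names in the demo are hypothetical. The demo runs on two tiny stand-in files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def concatenate(sources: Iterable[Path], target: Path) -> None:
    """Write the source files to `target` one after another, streaming in chunks."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as out:
        for src in sources:
            with src.open("rb") as f:
                shutil.copyfileobj(f, out)


# Demo on two small stand-in files (not the real corpora).
with tempfile.TemporaryDirectory() as tmp:
    part_a = Path(tmp) / "part_a.txt"
    part_b = Path(tmp) / "part_b.txt"
    part_a.write_text("first corpus line\n")
    part_b.write_text("second corpus line\n")
    merged = Path(tmp) / "wacky" / "wacky.txt"
    concatenate([part_a, part_b], merged)
    result = merged.read_text()
```

For the real data, pass the two corpus files as `sources` and the path given above as `target`. If the first file does not end with a newline, its last line will run into the first line of the second file.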
A challenge for implementing FIRE is to parallelize
the evaluation of (neural-network) functions.
The usual way of using a neural network NN is to process a data batch at a time,
that is, the parallelization of $\text{NN}(x_1)$, $\text{NN}(x_2)$, $\text{NN}(x_3)$ ...
In FIRE-based language models, we instead require the parallelization of both neural networks and data. The desired behaviors should include:
- paired mode: output a vector.
  - $\text{NN}_1(x_1)$, $\text{NN}_2(x_2)$, $\text{NN}_3(x_3)$ ...

  In analogy to element-wise multiplication: $\text{NN}*x$, where $\text{NN}=[\text{NN}_1,\text{NN}_2,\cdots]^\text{T}$ and $x=[x_1,x_2,\cdots]^\text{T}$.
- cross mode: output a matrix.
  - $\text{NN}_1(x_1)$, $\text{NN}_1(x_2)$, $\text{NN}_1(x_3)$ ...
  - $\text{NN}_2(x_1)$, $\text{NN}_2(x_2)$, $\text{NN}_2(x_3)$ ...
  - $\text{NN}_3(x_1)$, $\text{NN}_3(x_2)$, $\text{NN}_3(x_3)$ ...
  - $\cdots$

  In analogy to matrix multiplication of vectors: $\text{NN}$ @ $x$.
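The two modes can be pinned down with a small reference sketch. Everything here is illustrative: the toy "networks" are scalar functions $v\tanh(wx+b)$ stored as parameter tuples, and the names `apply_nn`, `paired`, and `cross` are hypothetical. The plain loops only define the expected outputs; an actual implementation would replace them with batched tensor operations.

```python
import math
from typing import List, Tuple

# Each toy network is f(x) = v * tanh(w * x + b), stored as (w, b, v).
Params = Tuple[float, float, float]


def apply_nn(p: Params, x: float) -> float:
    w, b, v = p
    return v * math.tanh(w * x + b)


def paired(nns: List[Params], xs: List[float]) -> List[float]:
    """Paired mode: entry i is NN_i(x_i); needs as many networks as inputs."""
    if len(nns) != len(xs):
        raise ValueError("paired mode requires len(nns) == len(xs)")
    return [apply_nn(p, x) for p, x in zip(nns, xs)]


def cross(nns: List[Params], xs: List[float]) -> List[List[float]]:
    """Cross mode: entry (i, j) is NN_i(x_j); the sizes may differ."""
    return [[apply_nn(p, x) for x in xs] for p in nns]


nns = [(1.0, 0.0, 1.0), (2.0, -1.0, 0.5), (-0.5, 0.3, 2.0)]
xs = [0.1, 0.2, 0.3]

vec = paired(nns, xs)  # 3 values
mat = cross(nns, xs)   # 3 x 3 values

# The diagonal of the cross-mode matrix is the paired-mode vector.
assert all(math.isclose(mat[i][i], vec[i]) for i in range(3))
```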
We call such a collection of networks a stacked function.
Because the parameters for all words are stored together, slicing
is required to extract the parameters for a given set of
words and recombine them into a new stacked function.
For word-vector representations, slicing is natively supported
for the word-vector matrix `V`:
- Slicing:
vecs1 = V[[id_apple, id_pear, ...]]    # (n1, D)
vecs2 = V[[id_iphone, id_fruit, ...]]  # (n2, D)
- Computation of paired similarity (where n1 == n2 must hold):
sim = (vecs1 * vecs2).sum(-1) # (n1,)
- Computation of cross similarity (n1 and n2 can be different):
sim = vecs1 @ vecs2.T # (n1, n2)
In this repository, we provide an analogous implementation for parallelizing neural networks.
In FIRE, each neural network (a function or a measure) is treated like a vector.
Multiple neural networks are stacked, like the stacking of vectors into a matrix.
Slicing and similarity computation are done in a way similar to the vector case.
- Slicing:
x1 = model[["apple", "pear", ...]]    # FIRETensor: (n1, D)
x2 = model[["iphone", "fruit", ...]]  # FIRETensor: (n2, D)
- Computation of paired similarity (where n1 == n2):
sim = (
    x2.measures.integral(x1.funcs)      # (n1,)
    + x1.measures.integral(x2.funcs)    # (n2,)
)
- Computation of cross similarity:
sim = (
    x2.measures.integral(x1.funcs, cross=True)        # (n1, n2)
    + x1.measures.integral(x2.funcs, cross=True).T    # (n2, n1) -> transpose -> (n1, n2)
)
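To make the two integral terms concrete, here is a toy model in plain Python. It assumes each measure is a finite weighted sum of point masses, so that integrating a function against it is a weighted sum of function values. The functions are ordinary Python callables rather than neural networks, and `integral`, `paired_similarity`, and `cross_similarity` are hypothetical helpers, not the `firelang` API.

```python
import math
from typing import Callable, List, Tuple

Func = Callable[[float], float]
Measure = List[Tuple[float, float]]  # point masses as (location, weight)


def integral(f: Func, mu: Measure) -> float:
    """Integrate f against a discrete measure: sum of weight * f(location)."""
    return sum(m * f(s) for s, m in mu)


def paired_similarity(f1s, mu1s, f2s, mu2s) -> List[float]:
    """sim[i] = integral(f1[i], mu2[i]) + integral(f2[i], mu1[i])."""
    return [integral(f1, mu2) + integral(f2, mu1)
            for f1, mu1, f2, mu2 in zip(f1s, mu1s, f2s, mu2s)]


def cross_similarity(f1s, mu1s, f2s, mu2s) -> List[List[float]]:
    """sim[i][j] = integral(f1[i], mu2[j]) + integral(f2[j], mu1[i])."""
    return [[integral(f1, mu2) + integral(f2, mu1)
             for f2, mu2 in zip(f2s, mu2s)]
            for f1, mu1 in zip(f1s, mu1s)]


# Two toy "words" on each side.
f1s = [math.sin, math.cos]
mu1s = [[(0.0, 1.0)], [(1.0, 0.5), (2.0, 0.5)]]
f2s = [math.tanh, lambda s: s * s]
mu2s = [[(0.5, 1.0)], [(-1.0, 2.0)]]

pair = paired_similarity(f1s, mu1s, f2s, mu2s)  # 2 values
full = cross_similarity(f1s, mu1s, f2s, mu2s)   # 2 x 2 values
```

As with word vectors, the paired result is the diagonal of the cross result.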
In addition to the way above, where `integral` must be explicitly invoked,
a friendlier way is also provided, as below:
# paired similarity
# sim: (n1,)
sim = (
    x1.funcs * x2.measures      # (n1,)
    + x1.measures * x2.funcs    # (n2,)
)
# cross similarity
sim = x1.funcs @ x2.measures + x1.measures @ x2.funcs # (n1, n2)
Furthermore, the two steps above can be done in one line:
sim = model[["apple", "pear", "melon"]] @ model[["iphone", "fruit"]] # (3, 2)
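The operator forms can be read as thin wrappers around integration. The sketch below shows one way such sugar could be wired up with `__mul__` (paired) and `__matmul__` (cross). The classes `Funcs` and `Measures` are toy stand-ins that use plain callables and discrete measures. They are not the classes in `firelang` and say nothing about how the library implements these operators.

```python
import math


def integral(f, mu):
    """Integrate f against point masses given as (location, weight) pairs."""
    return sum(m * f(s) for s, m in mu)


class Measures:
    def __init__(self, items):
        self.items = list(items)

    def __mul__(self, funcs):
        # paired: entry i integrates funcs[i] against measures[i]
        return funcs * self

    def __matmul__(self, funcs):
        # cross: entry (i, j) integrates funcs[j] against measures[i]
        return [[integral(f, mu) for f in funcs.items] for mu in self.items]


class Funcs:
    def __init__(self, items):
        self.items = list(items)

    def __mul__(self, measures):
        # paired: entry i integrates funcs[i] against measures[i]
        return [integral(f, mu) for f, mu in zip(self.items, measures.items)]

    def __matmul__(self, measures):
        # cross: entry (i, j) integrates funcs[i] against measures[j]
        return [[integral(f, mu) for mu in measures.items] for f in self.items]


x1_funcs, x1_measures = Funcs([math.sin, math.cos]), Measures([[(0.0, 1.0)], [(1.0, 1.0)]])
x2_funcs, x2_measures = Funcs([math.tanh, abs]), Measures([[(0.5, 1.0)], [(-1.0, 2.0)]])

# Paired similarity, one value per word pair.
pair = [p + q for p, q in zip(x1_funcs * x2_measures, x1_measures * x2_funcs)]

# Cross similarity, one value per (word in x1, word in x2).
a = x1_funcs @ x2_measures
b = x1_measures @ x2_funcs
sim = [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]
```

Plain lists do not add element-wise, hence the explicit comprehensions; a tensor-backed implementation would support `+` directly.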
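For the functions in FIRE, we implemented arithmetic operators so that (stacked) functions behave more like vectors.
For example, the regularization of the similarity scores by the following formula

$$\text{sim}_\text{reg}(w_1, w_2) = \int f_1\,\mathrm{d}\mu_2 + \int f_2\,\mathrm{d}\mu_1 - \int f_1\,\mathrm{d}\mu_1 - \int f_2\,\mathrm{d}\mu_2$$

(where $f_i$ and $\mu_i$ are the function and the measure of word $w_i$) is done by: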
x1 = model[["apple", "pear"]]
x2 = model[["fruit", "iphone"]]
sim_reg = x2.measures.integral(x1.funcs) + x1.measures.integral(x2.funcs) \
    - x1.measures.integral(x1.funcs) - x2.measures.integral(x2.funcs)
or equivalently:
sim_reg = (x2.funcs - x1.funcs) * x1.measures + (x1.funcs - x2.funcs) * x2.measures
where `x2.funcs - x1.funcs` produces a new functional.
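The compact form agrees with the four-term form (cross terms minus each word's self-integral) because integration is linear in the function. The sketch below checks this numerically for one word pair, assuming discrete measures given as weighted point masses. The helpers `integral` and `sub` are hypothetical and stand in for the library's integration and function subtraction.

```python
import math


def integral(f, mu):
    """Integrate f against a discrete measure given as (location, weight) pairs."""
    return sum(m * f(s) for s, m in mu)


def sub(f, g):
    """Pointwise difference f - g, which is again a function."""
    return lambda s: f(s) - g(s)


# Toy function/measure pairs for two words.
f1, mu1 = math.sin, [(0.0, 0.3), (1.0, 0.7)]
f2, mu2 = math.cos, [(0.5, 1.0), (2.0, 0.5)]

# Four-term form: cross terms minus each word's self-integral.
explicit = (integral(f1, mu2) + integral(f2, mu1)
            - integral(f1, mu1) - integral(f2, mu2))

# Compact form: integrate differences of functions.
compact = integral(sub(f2, f1), mu1) + integral(sub(f1, f2), mu2)

assert math.isclose(explicit, compact, abs_tol=1e-12)
```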
Please cite the following paper:
@inproceedings{
du2022semantic,
title={Semantic Field of Words Represented as Non-Linear Functions},
author={Xin Du and Kumiko Tanaka-Ishii},
booktitle={Thirty-Sixth Conference on Neural Information Processing Systems},
year={2022},
}