The following table presents the Precision@K for the MUSE test dataset:
Model | Precision@1 | Precision@5 |
---|---|---|
Trained FastText | 0.3464 | 0.5663 |
Pre-trained FastText | 0.3513 | 0.6206 |
While the pre-trained FastText model performs slightly better, the model trained from scratch on Wikipedia data delivers competitive results. Detailed logs are available in `logs/alignment.log`.
An ablation study was conducted to observe the impact of dataset size on the performance of the trained and pre-trained models. Here are the results:
Model | Precision@1 | Precision@5 |
---|---|---|
Trained FastText (5,000 words) | 0.0209 | 0.0405 |
Trained FastText (10,000 words) | 0.2887 | 0.4595 |
Trained FastText (20,000 words) | 0.3415 | 0.5799 |
Pre-trained FastText (5,000 words) | 0.2219 | 0.4587 |
Pre-trained FastText (10,000 words) | 0.3513 | 0.6100 |
Pre-trained FastText (20,000 words) | 0.3625 | 0.6306 |
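For context, Precision@K is the fraction of test-dictionary source words whose correct translation appears among the K nearest target-language neighbours of the mapped source embedding. A minimal sketch of that computation is shown below; it is illustrative only and may differ from the repository's `evaluation.py`:

```python
import numpy as np

def precision_at_k(mapped_src, tgt, test_pairs, src_word2id, tgt_word2id, k=5):
    """Fraction of test pairs whose gold target word is among the k nearest
    target vectors (by cosine similarity) of the mapped source vector."""
    # Normalise rows so dot products equal cosine similarities.
    src_n = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

    hits, evaluated = 0, 0
    for src_word, tgt_word in test_pairs:
        if src_word not in src_word2id or tgt_word not in tgt_word2id:
            continue  # skip pairs with out-of-vocabulary words
        evaluated += 1
        sims = tgt_n @ src_n[src_word2id[src_word]]   # (n_tgt,)
        top_k = np.argpartition(-sims, k)[:k]         # indices of k best targets
        if tgt_word2id[tgt_word] in top_k:
            hits += 1
    return hits / max(evaluated, 1)
```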
Start by cloning the GitHub repository containing all the necessary scripts and configuration files:
git clone https://github.com/anshulsc/cross-lingual-alignment.git
cd cross-lingual-alignment
Next, install the required Python packages by running the following command:
pip install -r requirements.txt
Ensure that you have Python 3.x installed, along with conda for environment management if needed.
You need to set up directories to store raw, processed, and extracted data. Run the following commands to create the necessary folder structure:
mkdir -p data/raw
mkdir -p data/extracted
mkdir -p data/processed
mkdir -p embedding/trained
mkdir -p embedding/pretrained
mkdir -p lexicon
This structure ensures that all data files and trained models are organized appropriately.
The first step is to download Wikipedia data. Wikipedia dumps are large files that contain all the articles in a given language. Use the provided `wiki_download.py` script to automate the download process.
Run the following command to download the dump:
python vectorization/wiki_download.py
The dumps will be saved in the `data/raw/` directory:
data/raw/hiwiki-latest-pages-articles.xml.bz2 # Hindi Wikipedia dump
data/raw/enwiki-latest-pages-articles.xml.bz2 # English Wikipedia dump
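The script's exact implementation isn't reproduced here, but a dump downloader can be as small as the sketch below. The dump URLs follow Wikimedia's standard naming and, like the chunk size, are assumptions rather than the repository's code:

```python
import requests

# Standard Wikimedia dump URLs (an assumption; adjust if the config differs).
DUMPS = {
    "data/raw/hiwiki-latest-pages-articles.xml.bz2":
        "https://dumps.wikimedia.org/hiwiki/latest/hiwiki-latest-pages-articles.xml.bz2",
    "data/raw/enwiki-latest-pages-articles.xml.bz2":
        "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
}

for path, url in DUMPS.items():
    # Stream each dump to disk in 1 MB chunks to keep memory usage low.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```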
Once the Wikipedia dump is downloaded, you'll need to extract articles from the XML format using WikiExtractor.
- Install WikiExtractor:
pip install wikiextractor
- Run WikiExtractor to extract plain text and store the result in the `data/extracted/` directory:
wikiextractor -o data/extracted/extracted_hi --json --no-templates data/raw/hiwiki-latest-pages-articles.xml.bz2
Alternatively, you can run the `extract_articles.py` script to automate this process:
python vectorization/extract_articles.py
The extracted articles will be saved as `.txt` files in the `data/processed/` directory:
data/processed/final_en.txt
data/processed/final_hi.txt
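With `--json`, WikiExtractor writes one JSON object per line, each carrying a `"text"` field. The sketch below shows what a consolidation step like `extract_articles.py` might do; the function name and exact paths are illustrative:

```python
import json
from pathlib import Path

def collect_articles(extracted_dir: str, out_file: str) -> None:
    """Flatten WikiExtractor's JSON-lines output (wiki_00, wiki_01, ...)
    into a single plain-text file with one article per line."""
    with open(out_file, "w", encoding="utf-8") as out:
        for part in sorted(Path(extracted_dir).rglob("wiki_*")):
            with open(part, encoding="utf-8") as f:
                for line in f:
                    text = json.loads(line).get("text", "").strip()
                    if text:
                        out.write(text.replace("\n", " ") + "\n")

collect_articles("data/extracted/extracted_hi", "data/processed/final_hi.txt")
```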
Before training the model, the extracted articles must be preprocessed. This step involves cleaning the text (removing HTML tags, punctuation, and stopwords) and tokenizing it.
Run the `preprocess_text.py` script to clean and preprocess the data:
python vectorization/preprocess_text.py
The preprocessed text will be saved in the `data/processed/` directory:
data/processed/preprocessed_en.txt
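For reference, a minimal cleaning pass of the kind described above could look like the following. The regular expressions and the tiny stopword set are purely illustrative and may differ from what `preprocess_text.py` actually does:

```python
import re

def preprocess_line(line: str, stopwords: set) -> str:
    """Strip HTML tags and punctuation, lowercase, and drop stopwords."""
    line = re.sub(r"<[^>]+>", " ", line)    # remove leftover HTML tags
    line = re.sub(r"[^\w\s]", " ", line)    # remove punctuation (keeps Devanagari letters)
    tokens = line.lower().split()           # simple whitespace tokenization
    return " ".join(t for t in tokens if t not in stopwords)

# Tiny illustrative stopword set; a real run would use a full list (e.g. NLTK's).
english_stopwords = {"the", "a", "an", "of", "and", "is", "in", "to"}

with open("data/processed/final_en.txt", encoding="utf-8") as src, \
     open("data/processed/preprocessed_en.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        cleaned = preprocess_line(raw, english_stopwords)
        if cleaned:
            dst.write(cleaned + "\n")
```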
With the preprocessed data, you're now ready to train the FastText model. This model is efficient for capturing subword information, making it particularly useful for languages with rich morphology.
Run the `train_embedding.py` script to train the FastText model:
python vectorization/train_embedding.py
The trained embedding model will be saved in the `embedding/trained/` directory:
embedding/trained/fasttext_hi.bin
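The hyperparameters used by `train_embedding.py` aren't shown here, but training a FastText model on a preprocessed corpus with the official fasttext bindings takes only a few lines. The input path, dimensionality, and epoch count below are assumptions for illustration:

```python
import fasttext

# Skip-gram FastText on the preprocessed Hindi corpus; dim/epoch/minCount are
# illustrative values, and the input path assumes the preprocessing step above
# also produced a Hindi file alongside the English one.
model = fasttext.train_unsupervised(
    "data/processed/preprocessed_hi.txt",
    model="skipgram",
    dim=300,
    epoch=5,
    minCount=5,
)
model.save_model("embedding/trained/fasttext_hi.bin")
```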
If you'd prefer to use pre-trained embeddings, download them and store them in the `embedding/pretrained/` directory. Pre-trained embeddings can then be used directly in your tasks without further training.
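For example, fastText distributes official pre-trained Hindi vectors, and one way to fetch them is with the fasttext package's download helper, sketched below. Whether the pipeline expects the `.bin` or `.vec` format is not specified here, so treat the file handling as an assumption:

```python
import shutil
import fasttext
import fasttext.util

# Download fastText's official Hindi vectors (cc.hi.300.bin) into the current
# directory, then move them under embedding/pretrained/.
filename = fasttext.util.download_model("hi", if_exists="ignore")
shutil.move(filename, f"embedding/pretrained/{filename}")

model = fasttext.load_model(f"embedding/pretrained/{filename}")
print(model.get_dimension())   # 300
```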
The next step involves downloading the MUSE dataset, which provides bilingual dictionaries for training and testing cross-lingual alignment.
Download the MUSE dataset (English-Hindi bilingual lexicon) from the official MUSE GitHub repository:
- Download the full bilingual lexicon for English-Hindi and store it in the `lexicon/` folder:
wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt -P lexicon/
- Similarly, download the test dictionary and store it in the same folder:
wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.5000-6500.txt -P lexicon/
Rename the downloaded file to en-hi.test.txt.
The `lexicon/` folder should now contain the following files:
lexicon/en-hi.txt # Full dictionary for training
lexicon/en-hi.test.txt # Test dictionary
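Both files are plain text with one whitespace-separated word pair per line, so loading them is straightforward. The helper below is illustrative and may not match the repository's `data_loader.py`:

```python
def load_lexicon(path: str):
    """Read a MUSE bilingual dictionary: one 'source target' pair per line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:            # skip malformed or empty lines
                pairs.append((parts[0], parts[1]))
    return pairs

train_pairs = load_lexicon("lexicon/en-hi.txt")
test_pairs = load_lexicon("lexicon/en-hi.test.txt")
print(len(train_pairs), len(test_pairs))
```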
Below is the directory structure for the project:
/project-root
│
├── /data/                     # Store raw, extracted, and processed data
│   ├── /raw/                  # Raw Wikipedia dumps
│   ├── /extracted/            # Extracted Wikipedia articles
│   └── /processed/            # Preprocessed text data
│
├── /embedding/                # Store trained and pre-trained embeddings
│   ├── /trained/              # Trained embedding models
│   └── /pretrained/           # Pre-trained embedding models
│
├── /lexicon/                  # Store bilingual lexicons
│   ├── en-hi.txt              # English-Hindi bilingual lexicon (train)
│   └── en-hi.test.txt         # English-Hindi test lexicon
│
├── /vectorization/            # Python scripts for each step
│   ├── wiki_download.py       # Download Wikipedia dumps
│   ├── extract_articles.py    # Extract articles from the dump
│   ├── preprocess_text.py     # Preprocess and clean text
│   └── train_embedding.py     # Train FastText embedding model
│
├── /cross-align/              # Cross-lingual alignment and evaluation scripts
│   ├── alignment.py
│   ├── data_loader.py
│   └── evaluation.py
│
├── config/config.yaml         # Configuration file for paths, parameters
├── README.md                  # Project documentation
└── requirements.txt           # Python dependencies
To perform cross-lingual alignment with trained or pre-trained models, you can use the command-line interface. For trained models, use the following command:
python main.py --trained
To use pre-trained models (default behavior):
python main.py
The `--trained` flag lets you switch between the custom-trained and the pre-trained embeddings depending on your needs.
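Under the hood, the supervised alignment mentioned below solves the orthogonal Procrustes problem: given matrices X and Y whose rows are the source and target vectors of the training dictionary pairs, the optimal orthogonal map is W = UVᵀ, where UΣVᵀ is the SVD of YᵀX. A compact sketch follows; variable and function names are illustrative, not taken from `alignment.py`:

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal matrix W minimising ||W X^T - Y^T||_F, so that W @ x
    maps a source vector x into the target space.
    X, Y: (n_pairs, dim) rows of aligned source/target embeddings."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy usage: with real data, the rows of X and Y come from the training lexicon.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
Y = rng.normal(size=(1000, 300))
W = procrustes(X, Y)
mapped_src = X @ W.T            # mapped source vectors, ready for retrieval
```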
Unsupervised Alignment: Implement an unsupervised alignment method, such as Cross-Domain Similarity Local Scaling (CSLS) combined with adversarial training, as described in the MUSE paper. CSLS is particularly useful for addressing the hubness problem, which occurs when a small number of word embeddings act as neighbors for many other embeddings.
Compare its performance with the supervised Procrustes method currently implemented in the pipeline.
CSLS and adversarial training are more robust in low-resource settings where supervision (such as a bilingual lexicon) is unavailable or limited.
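For concreteness, CSLS rescales the cosine similarity between a mapped source vector and a target vector by subtracting each word's average similarity to its k nearest neighbours in the other language, which penalises hub vectors. A small sketch of CSLS scoring is given below; k = 10 follows the MUSE paper's default, and the rest is illustrative:

```python
import numpy as np

def csls_scores(mapped_src: np.ndarray, tgt: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS score matrix of shape (n_src, n_tgt):
    csls[i, j] = 2*cos(x_i, y_j) - r_src[i] - r_tgt[j],
    where r_src[i] is the mean cosine similarity of mapped source word i to its
    k nearest target neighbours, and r_tgt[j] is the analogue for target word j."""
    src_n = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src_n @ tgt_n.T                               # (n_src, n_tgt)

    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # (n_src,)
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # (n_tgt,)

    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Retrieval: translate each source word to the target with the highest CSLS score.
# best_tgt_ids = csls_scores(mapped_src, tgt).argmax(axis=1)
```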