Cross-Lingual-Alignment

Results

Performance Metrics

The following table presents the Precision@K for the MUSE test dataset:

Model                  Precision@1   Precision@5
Trained FastText       0.3464        0.5663
Pre-trained FastText   0.3513        0.6206

While the pre-trained FastText model performs slightly better, the model trained from scratch on Wikipedia data delivers competitive results. Detailed logs are available in logs/alignment.log.
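
For reference, Precision@K here is the fraction of test-set source words whose gold translation appears among the K nearest target words (by cosine similarity) of the mapped source vector. Below is a minimal sketch of the metric, assuming word-to-vector dictionaries; function and argument names are illustrative, not the project's evaluation.py API:

import numpy as np

def precision_at_k(mapped_src, tgt_emb, test_pairs, k=5):
    # mapped_src: dict word -> source vector already mapped into the target space
    # tgt_emb:    dict word -> target-language vector
    # test_pairs: list of (source_word, target_word) gold entries
    tgt_words = list(tgt_emb)
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])
    tgt_matrix /= np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

    hits = 0
    for src_word, tgt_word in test_pairs:
        if src_word not in mapped_src or tgt_word not in tgt_emb:
            continue  # out-of-vocabulary pairs count as misses here
        query = mapped_src[src_word]
        query = query / np.linalg.norm(query)
        scores = tgt_matrix @ query                   # cosine similarities
        top_k = np.argsort(-scores)[:k]               # indices of the k nearest targets
        if tgt_word in {tgt_words[i] for i in top_k}:
            hits += 1
    return hits / len(test_pairs)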

Ablation Study

An ablation study was conducted to observe the impact of dataset size on the performance of the trained and pre-trained models. The results are shown below:

Model                                  Precision@1   Precision@5
Trained FastText (5,000 words)         0.0209        0.0405
Trained FastText (10,000 words)        0.2887        0.4595
Trained FastText (20,000 words)        0.3415        0.5799
Pre-trained FastText (5,000 words)     0.2219        0.4587
Pre-trained FastText (10,000 words)    0.3513        0.6100
Pre-trained FastText (20,000 words)    0.3625        0.6306

Cosine Similarity Graph



Setting Up the Project

Step 1: Clone the Repository

Start by cloning the GitHub repository containing all the necessary scripts and configuration files:

git clone https://github.com/anshulsc/cross-lingual-alignment.git
cd cross-lingual-alignment

Step 2: Install Dependencies

Next, install the required Python packages by running the following command:

pip install -r requirements.txt

Ensure that Python 3.x is installed; conda can optionally be used for environment management.

Step 3: Create Necessary Directories

You need to set up directories to store raw, processed, and extracted data. Run the following commands to create the necessary folder structure:

mkdir -p data/raw
mkdir -p data/extracted
mkdir -p data/processed
mkdir -p embedding/trained
mkdir -p embedding/pretrained
mkdir -p lexicon

This structure ensures that all data files and trained models are organized appropriately.


Pipeline for Training FastText Embeddings

1. Download Wikipedia Dumps

The first step is to download Wikipedia data. Wikipedia dumps are large files that contain all the articles in a given language. Use the provided wiki_download.py script to automate the download process.

Run the following command to download the dump:

python vectorization/wiki_download.py

The dumps will be saved in the data/raw/ directory:

data/raw/hiwiki-latest-pages-articles.xml.bz2  # Hindi Wikipedia dump
data/raw/enwiki-latest-pages-articles.xml.bz2  # English Wikipedia dump
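
For context, the script essentially streams each dump from dumps.wikimedia.org to disk. Here is a hedged sketch of that step, assuming the standard latest-dump URLs; the actual wiki_download.py may use different URLs or options:

import requests

DUMP_URLS = {
    "hi": "https://dumps.wikimedia.org/hiwiki/latest/hiwiki-latest-pages-articles.xml.bz2",
    "en": "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
}

def download_dump(lang, out_dir="data/raw"):
    url = DUMP_URLS[lang]
    target = f"{out_dir}/{url.rsplit('/', 1)[-1]}"
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                fh.write(chunk)

for lang in ("hi", "en"):
    download_dump(lang)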

2. Extract Articles from Wikipedia Dumps

Once the Wikipedia dump is downloaded, you'll need to extract articles from the XML format using WikiExtractor.

  1. Install WikiExtractor:

    pip install wikiextractor
  2. Run WikiExtractor to extract plain text and store the result in the data/extracted/ directory:

    wikiextractor -o data/extracted/extracted_hi --json --no-templates data/raw/hiwiki-latest-pages-articles.xml.bz2

Alternatively, you can run the extract_articles.py script to automate this process:

python vectorization/extract_articles.py

The extracted articles will be saved in .txt format in the data/processed/ directory:

data/processed/final_en.txt
data/processed/final_hi.txt
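
For context, with the --json flag WikiExtractor emits one JSON object per line (the article body is under the "text" key) in files named wiki_00, wiki_01, … inside the output directory. A hedged sketch of merging that output into a single text file follows; extract_articles.py may differ:

import json
from pathlib import Path

def collect_articles(extracted_dir, out_file):
    # Concatenate the "text" field of every JSON-lines record into one plain-text file.
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(extracted_dir).rglob("wiki_*")):
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    text = json.loads(line).get("text", "").strip()
                    if text:
                        out.write(text + "\n")

collect_articles("data/extracted/extracted_hi", "data/processed/final_hi.txt")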

3. Preprocess Extracted Text

Before training the model, the extracted articles must be preprocessed. This step involves cleaning the text, such as removing HTML tags, punctuation, and stopwords, as well as tokenizing the text.

Run the preprocess_text.py script to clean and preprocess the data:

python vectorization/preprocess_text.py

The preprocessed text will be saved in the data/processed/ directory:

data/processed/preprocessed_en.txt
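
As an illustration, a minimal cleaning pass might look like the sketch below (lowercasing, stripping residual markup and punctuation, whitespace tokenization); the actual preprocess_text.py may use a different tokenizer and stopword list:

import re

TAG_RE = re.compile(r"<[^>]+>")        # residual HTML/XML tags
PUNCT_RE = re.compile(r"[^\w\s]")      # punctuation (\w also covers Devanagari letters)

def clean_line(line, stopwords=frozenset()):
    line = TAG_RE.sub(" ", line.lower())
    line = PUNCT_RE.sub(" ", line)
    tokens = [tok for tok in line.split() if tok not in stopwords]
    return " ".join(tokens)

with open("data/processed/final_en.txt", encoding="utf-8") as src, \
     open("data/processed/preprocessed_en.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        cleaned = clean_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")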

4. Train the FastText Model

With the preprocessed data, you're now ready to train the FastText model. FastText captures subword information, which makes it particularly useful for morphologically rich languages such as Hindi.

Run the train_embedding.py script to train the FastText model:

python vectorization/train_embedding.py

The trained embedding model will be saved in the embedding/trained/ directory:

embedding/trained/fasttext_hi.bin
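
A hedged sketch of this step using the fasttext Python package is shown below; train_embedding.py may instead rely on gensim or use different hyperparameters, and the input path is assumed:

import fasttext

model = fasttext.train_unsupervised(
    "data/processed/preprocessed_hi.txt",  # assumed path to the preprocessed Hindi corpus
    model="skipgram",   # skip-gram tends to handle rare words better than CBOW
    dim=300,            # embedding dimensionality
    minCount=5,         # ignore very rare tokens
    epoch=5,
)
model.save_model("embedding/trained/fasttext_hi.bin")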

Using Pretrained FastText Embeddings

If you'd prefer to use pre-trained embeddings, download them into the embedding/pretrained/ directory; they can then be used directly without further training.
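
For example, the official fastText vectors published at fasttext.cc (e.g. cc.en.300.vec and cc.hi.300.vec) can be downloaded into embedding/pretrained/ and loaded with gensim. The file names below refer to those public releases, not files shipped with this repository:

from gensim.models import KeyedVectors

# limit= caps the vocabulary so the vectors fit more easily in memory
en_vecs = KeyedVectors.load_word2vec_format(
    "embedding/pretrained/cc.en.300.vec", binary=False, limit=200_000)
hi_vecs = KeyedVectors.load_word2vec_format(
    "embedding/pretrained/cc.hi.300.vec", binary=False, limit=200_000)
print(en_vecs["alignment"].shape)  # (300,)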


Download MUSE Dataset for Cross-Lingual Alignment

The next step involves downloading the MUSE dataset, which provides bilingual dictionaries for training and testing cross-lingual alignment.

Download the MUSE dataset (English-Hindi bilingual lexicon) from the official MUSE GitHub repository.

  1. Download the bilingual lexicon for English-Hindi and store it in the lexicon/ folder:
     wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt -P lexicon/
  2. Similarly, download the test dictionary into the same folder and rename the downloaded file to en-hi.test.txt:
     wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.5000-6500.txt -P lexicon/

The lexicon/ folder should now contain the following files:

lexicon/en-hi.txt       # Full dictionary for training
lexicon/en-hi.test.txt  # Test dictionary
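
The MUSE dictionaries are plain text with one source-target pair per line, separated by whitespace. A minimal loader sketch (not necessarily how the repo's data_loader.py reads them):

def load_lexicon(path):
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:            # skip malformed lines
                pairs.append((parts[0], parts[1]))
    return pairs

train_pairs = load_lexicon("lexicon/en-hi.txt")
test_pairs = load_lexicon("lexicon/en-hi.test.txt")
print(len(train_pairs), len(test_pairs))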

Directory Structure

Below is the directory structure for the project:

/project-root
│
├── /data/                    # Store raw, extracted, and processed data
│   ├── /raw/                 # Raw Wikipedia dumps
│   ├── /extracted/           # Extracted Wikipedia articles
│   └── /processed/           # Preprocessed text data
│
├── /embedding/               # Store trained and pre-trained embeddings
│   ├── /trained/             # Trained embedding models
│   └── /pretrained/          # Pre-trained embedding models
│
├── /lexicon/                 # Store bilingual lexicons
│   ├── en-hi.txt             # English-Hindi bilingual lexicon (train)
│   └── en-hi.test.txt        # English-Hindi test lexicon
│
├── /vectorization/           # Python scripts for each step
│   ├── wiki_download.py      # Download Wikipedia dumps
│   ├── extract_articles.py   # Extract articles from the dump
│   ├── preprocess_text.py    # Preprocess and clean text
│   └── train_embedding.py    # Train FastText embedding model
│
├── /cross-align/             # Cross-lingual alignment and evaluation scripts
│   ├── alignment.py
│   ├── data_loader.py
│   └── evaluation.py
│
├── config/config.yaml        # Configuration file for paths, parameters
├── README.md                 # Project documentation
└── requirements.txt          # Python dependencies

Cross-Lingual Alignment

To perform cross-lingual alignment with trained or pre-trained models, you can use the command-line interface. For trained models, use the following command:

python main.py --trained

To use pre-trained models (default behavior):

python main.py

This command-line argument provides flexibility in selecting between trained and pre-trained models based on your needs.
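
For context, the supervised mapping mentioned under Next Steps is the orthogonal Procrustes solution: stacking the training-lexicon pairs into matrices X (source vectors) and Y (the vectors of their translations), the best orthogonal map is W = UVᵀ, where UΣVᵀ is the SVD of XᵀY. A hedged NumPy sketch, not necessarily how alignment.py is organised:

import numpy as np

def procrustes(X, Y):
    # X, Y: (n, d) arrays whose i-th rows hold a source vector and the vector
    # of its translation, taken from the training lexicon.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt          # orthogonal W minimising ||X @ W - Y||_F

# Usage: map all source-language vectors into the target space with `src @ W`,
# then retrieve translations by nearest-neighbour search over the target vectors.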


Next Steps

Unsupervised Alignment: Implement an unsupervised alignment method, such as Cross-Domain Similarity Local Scaling (CSLS) combined with adversarial training, as described in the MUSE paper. CSLS is particularly useful for addressing the hubness problem, which occurs when a small number of word embeddings act as neighbors for many other embeddings.

Compare its performance with the supervised Procrustes method currently implemented in the pipeline.

CSLS and adversarial training are more robust in low-resource settings where supervision (such as a bilingual lexicon) is unavailable or limited.
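
For reference, CSLS rescales plain cosine similarity by penalising vectors that sit in dense neighbourhoods: CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) is the mean cosine of x with its k nearest target neighbours and r_S(y) the analogue for y (k = 10 in the MUSE paper). A hedged sketch of the scoring step:

import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    # mapped_src: (n_src, d) source vectors already mapped into the target space
    # tgt:        (n_tgt, d) target vectors; both assumed L2-normalised row-wise
    sims = mapped_src @ tgt.T                                  # cosine similarities
    r_src = np.mean(np.sort(sims, axis=1)[:, -k:], axis=1)     # r_T(x) per source word
    r_tgt = np.mean(np.sort(sims, axis=0)[-k:, :], axis=0)     # r_S(y) per target word
    return 2 * sims - r_src[:, None] - r_tgt[None, :]          # higher is better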
