An experiment-driven search engine project, developed to index documents and retrieve the best matches for a given query using an ensemble of models (see the sketch after the outline below).
- Filter Models
  - BM25
  - TF-IDF
- Voter Models
  - MPNET
  - RoBERTa
- Phase 1
  - Data Analysis & Pipeline
  - Model Pipeline
  - Evaluation Pipeline
- Phase 2
  - BM25 Model + MPNet Model
  - Hyperparameter tuning
  - Ensemble Pipeline
- Phase 3
  - RoBERTa Model
  - Ensemble enhancement
- Experimentation
  - Finetune ColBERT
  - Implement Clustering of docs
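The outline suggests a two-stage setup: the lexical filter models (BM25 / TF-IDF) cheaply shortlist candidate documents, and the neural voter models (MPNet / RoBERTa) re-rank that shortlist. The snippet below is a minimal sketch of that pattern using `rank_bm25` and `sentence-transformers` for illustration; the data, names, and parameters are illustrative assumptions, not the project's actual API (the real pipelines are run from the scripts and notebooks under `./tests`, described below).

```python
# Illustrative two-stage retrieval: a BM25 filter shortlists documents,
# then a sentence-embedding voter re-ranks the shortlist.
# Names and data are illustrative; this is not the project's actual API.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "how to build a bm25 ranker",
    "fine tuning transformer models for document retrieval",
    "cooking recipes for pasta",
]
query = "transformer based document retrieval"

# Stage 1: lexical filter (BM25) shortlists the top-k candidates.
bm25 = BM25Okapi([d.split() for d in docs])
filter_scores = bm25.get_scores(query.split())
shortlist = np.argsort(filter_scores)[::-1][:2]

# Stage 2: neural voter (MPNet sentence encoder) re-ranks the shortlist
# by cosine similarity between the query and document embeddings.
voter = SentenceTransformer("all-mpnet-base-v2")
q_emb = voter.encode(query, convert_to_tensor=True)
d_emb = voter.encode([docs[i] for i in shortlist], convert_to_tensor=True)
order = util.cos_sim(q_emb, d_emb)[0].argsort(descending=True)

print([docs[shortlist[i]] for i in order.tolist()])
```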
Note: The project was tested on Linux and macOS. (Windows has dependency issues; refer to Troubleshooting below.)
- Clone the repository

  ```bash
  $ git clone https://github.com/TF4ces/TF4ces-search-engine.git
  ```
- Set up the environment

  ```bash
  $ python3 -m venv venv
  $ source venv/bin/activate       # Linux/macOS
  $ .\venv\Scripts\activate        # Windows
  $ pip install -r src/requirements.txt
  ```
- Download the pre-loaded embeddings from GDrive to this path: `./dataset/embeddings_test`

  Note: To generate embeddings from scratch, run the `./tests/test_evaluate_model.py` script twice, setting `MODEL` to `all-mpnet-base-v2` and `all-roberta-large-v1` in turn.

  WARNING: use a GPU machine; generation is expected to take about 1 hour.
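  For orientation, the snippet below is a minimal sketch of how such embeddings are typically produced with `sentence-transformers`; the corpus, batch size, and output file name are illustrative assumptions, and the project's own generation logic lives in `./tests/test_evaluate_model.py`.

  ```python
  # Minimal sketch of generating document embeddings with sentence-transformers.
  # The corpus, batch size, and output file name below are illustrative; the
  # project's own generation logic is in ./tests/test_evaluate_model.py.
  from pathlib import Path

  import numpy as np
  from sentence_transformers import SentenceTransformer

  MODEL = "all-mpnet-base-v2"  # run again with "all-roberta-large-v1"
  docs = ["first document text", "second document text"]  # placeholder corpus

  model = SentenceTransformer(MODEL)
  embeddings = model.encode(docs, batch_size=64, show_progress_bar=True)

  out_dir = Path("./dataset/embeddings_test")
  out_dir.mkdir(parents=True, exist_ok=True)
  np.save(out_dir / f"{MODEL}.npy", np.asarray(embeddings))
  ```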
- Run the TF4ces Search Engine (install Jupyter with `pip install jupyter notebook`, then start it with `jupyter notebook`):
  - Run the evaluation pipeline from the `./tests/notebooks/TF4ces_Search_Eval.ipynb` notebook.
  - Run the prediction demo pipeline from the `./tests/notebooks/TF4ces_Search_Demo.ipynb` notebook.
Troubleshooting:

- Windows systems are known to have issues reading data with `ir-datasets==0.4.1`: on Windows, `doc.iter` may throw a decoding error while reading the TSV files. You would need to change the encoding in the dependency's source files as described in this issue.