Multilingual Information Retrieval Toolkit

This project implements several information retrieval techniques on Persian and English datasets. We preprocess the text, compute TF, IDF, and TF-IDF weights, and score document-query pairs with cosine similarity and the Jaccard coefficient.

Datasets

Persian Dataset: Download here
English Dataset: Here. To use it, create a shortcut to the dataset in your Google Drive account and mount the drive in Google Colab.
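
The mounting step can be sketched as follows. This is a minimal example, not the project's actual loading code: the folder name `english_dataset` and the local fallback directory `data` are hypothetical placeholders, and the shortcut is assumed to sit at the top level of My Drive.

```python
from pathlib import Path


def dataset_dir(shortcut_name: str, local_fallback: str = "data") -> Path:
    """Return the dataset folder, mounting Google Drive when running in Colab."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        # Outside Colab: assume the dataset was copied to a local folder.
        return Path(local_fallback) / shortcut_name
    drive.mount("/content/drive")
    # Assumption: the shortcut was created at the top level of My Drive.
    return Path("/content/drive/MyDrive") / shortcut_name


# "english_dataset" is a hypothetical shortcut name.
english_path = dataset_dir("english_dataset")
```

Outside Colab the function falls back to a local path, so the same notebook code can run on a workstation.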

Preprocessing

We use the Hazm library for Persian and NLTK for English to apply the following preprocessing steps to both datasets:

Normalization
Stemming
Tokenization
Removal of words shorter than 3 characters after stemming
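
The steps above can be sketched as a small pipeline. To keep the example runnable without extra dependencies, it uses standard-library stand-ins (`simple_normalize`, `simple_tokenize`, `toy_stem`, all hypothetical names); in the project these slots would presumably be filled by Hazm's `Normalizer`, `word_tokenize`, and `Stemmer` for Persian and by NLTK's tokenizer and a stemmer such as `PorterStemmer` for English. Since stemming operates on individual tokens, this sketch tokenizes before it stems.

```python
import re
import unicodedata
from typing import Callable, List


def simple_normalize(text: str) -> str:
    # Stand-in normalizer: Unicode NFKC folding plus lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def simple_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer: runs of Unicode word characters.
    return re.findall(r"\w+", text)


def toy_stem(token: str) -> str:
    # Toy English suffix stripper, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(
    text: str,
    normalize: Callable[[str], str] = simple_normalize,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
    stem: Callable[[str], str] = toy_stem,
    min_len: int = 3,
) -> List[str]:
    """Normalize, tokenize, stem, then drop stems shorter than min_len."""
    tokens = tokenize(normalize(text))
    stems = [stem(token) for token in tokens]
    return [s for s in stems if len(s) >= min_len]


print(preprocess("no money for gambling"))
```

Passing the normalizer, tokenizer, and stemmer as arguments lets the same function serve both languages.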

TF, IDF, and TF-IDF Computation

TF (Term Frequency), IDF (Inverse Document Frequency), and TF-IDF (Term Frequency-Inverse Document Frequency) are computed based on the posting list generated in the Boolean retrieval system assignment.
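
A minimal sketch of this computation is shown below. The README does not state the exact weighting scheme, so the choices here are assumptions: TF is the raw term count in a document, IDF is log10(N / df), and the posting list is a dictionary mapping each term to its per-document counts. The function names are illustrative.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Postings = Dict[str, Dict[str, int]]  # term -> {doc_id: term count}


def build_postings(docs: Dict[str, List[str]]) -> Postings:
    """Build a posting list with term counts from preprocessed documents."""
    postings: Postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            postings[term][doc_id] = count
    return dict(postings)


def idf(postings: Postings, n_docs: int) -> Dict[str, float]:
    # Assumed variant: idf(t) = log10(N / df(t)), df = number of docs containing t.
    return {term: math.log10(n_docs / len(plist)) for term, plist in postings.items()}


def tf_idf(postings: Postings, n_docs: int) -> Dict[str, Dict[str, float]]:
    """Return doc_id -> {term: tf * idf}, with raw counts as tf."""
    idf_values = idf(postings, n_docs)
    weights: Dict[str, Dict[str, float]] = defaultdict(dict)
    for term, plist in postings.items():
        for doc_id, count in plist.items():
            weights[doc_id][term] = count * idf_values[term]
    return dict(weights)


docs = {"d1": ["money", "gambl", "money"], "d2": ["money", "bear"]}
postings = build_postings(docs)
weights = tf_idf(postings, n_docs=len(docs))
```

With this IDF variant, a term that occurs in every document (here `money`) gets weight 0, so it does not contribute to ranking.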

Cosine Similarity

Cosine similarity scores for document-query pairs are calculated by summing the TF-IDF values of the words shared by the query and the document. The documents are then ranked by cosine similarity, and the top 10 documents are displayed.
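
The scoring and ranking can be sketched as follows. This example uses the textbook cosine formula, in which the sum over shared terms is the numerator and the result is divided by the lengths of both TF-IDF vectors; whether the project applies this normalization is an assumption. The names `cosine` and `rank` are illustrative.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # term -> TF-IDF weight


def cosine(query: Vector, doc: Vector) -> float:
    """Cosine similarity between two sparse TF-IDF vectors."""
    common = query.keys() & doc.keys()  # only shared terms contribute
    dot = sum(query[t] * doc[t] for t in common)
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in doc.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank(query: Vector, doc_vectors: Dict[str, Vector], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


query = {"a": 1.0, "b": 1.0}
doc_vectors = {"d1": {"a": 1.0, "b": 1.0}, "d2": {"a": 1.0}, "d3": {"c": 2.0}}
top = rank(query, doc_vectors, k=2)
```

The default `k=10` mirrors the top 10 documents mentioned above.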

Jaccard Coefficient

For each document-query pair, the Jaccard coefficient is calculated as the size of the intersection of sets A and B divided by the size of their union, where A is the set of query terms and B is the set of document terms.
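
A minimal sketch of this measure on preprocessed token lists is shown below. The function name `jaccard` is illustrative, and returning 0.0 when both sets are empty is a convention chosen here, not something the README specifies.

```python
from typing import Iterable


def jaccard(query_terms: Iterable[str], doc_terms: Iterable[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the two term sets."""
    a, b = set(query_terms), set(doc_terms)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(union)


score = jaccard(["money", "gambl"], ["money", "bear", "gambl", "hous"])
```

Because the inputs are converted to sets, repeated terms count once, so this measure ignores term frequency.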

Queries

English Queries:

𝑄1: "Mr. Henry Dashwood had one son"
𝑄2: "no money for gambling"
𝑄3: "All through the day Miss Abbott had seemed to Philip like a goddess"
𝑄4: "Are bears any good at discovering it?"
𝑄5: "I'd like a shilling"
𝑄6: "On a January evening of the early seventies"

Persian Queries:

𝑄1: "ایران در عهد باستان"
𝑄2: "موضوع این کتاب"
𝑄3: "دریا گسترهای بس زیبا، فریبنده و شگفتانگیز است"
𝑄4: "روانشناسی کودک"
𝑄5: "فیزیکدان معاصر استیون هاوکینگ"
𝑄6: "مهارتهای مطالعه برای دانشآموزان و دانشجویان"
𝑄7: "اشکانیان از سویی دیرپاترین دودمان فرمانروای ایران و طولانیترین دوران تاریخ ما"

Contact Us

We're excited to hear from you! If you have any questions, suggestions, or need assistance, don't hesitate to reach out. Feel free to contact us via email at:

We're here to help and would love to hear about your experience using this project.

Repository: MehrnazSadeghieh/Multilingual-IR-Toolkit
