This project focuses on implementing various information retrieval techniques on Persian and English datasets. We perform preprocessing, compute TF, IDF, and TF-IDF, and implement cosine similarity and Jaccard coefficient for document-query pairs.
- Persian Dataset: Download here
- English Dataset: Here you have to create a shortcut in your google drive account and mount it in your google colab.
We utilized the Hazm library for Persian and NLTK for English to perform the following preprocessing steps on both datasets:
- Normalization
- Stemming
- Tokenization
- Removal of words with a length less than 3 after stemming
TF (Term Frequency), IDF (Inverse Document Frequency), and TF-IDF (Term Frequency-Inverse Document Frequency) are computed based on the posting list generated in the Boolean retrieval system assignment.
Cosine similarity scores for document-query pairs are calculated using the sum of TF-IDF values of common words in the query and the document. The documents are then ranked based on cosine similarity, and the top 10 documents are displayed.
Jaccard coefficient is calculated as the intersection of sets A and B divided by the union of A and B for document-query pairs.
𝑄1: "Mr. Henry Dashwood had one son"
𝑄2: "no money for gambling"
𝑄3: "All through the day Miss Abbott had seemed to Philip like a goddess"
𝑄4: "Are bears any good at discovering it?"
𝑄5: "I'd like a shilling"
𝑄6: "On a January evening of the early seventies"
𝑄1: "ایران در عهد باستان"
𝑄2: "موضوع این کتاب"
𝑄3: "دریا گسترهای بس زیبا، فریبنده و شگفتانگیز است"
𝑄4: "روانشناسی کودک"
𝑄5: "فیزیکدان معاصر استیون هاوکینگ"
𝑄6: "مهارتهای مطالعه برای دانشآموزان و دانشجویان"
𝑄7: "اشکانیان از سویی دیرپاترین دودمان فرمانروای ایران و طولانیترین دوران تاریخ ما"
We're excited to hear from you! If you have any questions, suggestions, or need assistance, don't hesitate to reach out. Feel free to contact us via email at:
We're here to help and would love to hear about your experience using this project.