The Boolean Retrieval System is a project for processing Persian books and English movie summaries and retrieving matching documents in response to Boolean queries. The project is divided into three main parts:
- Text Preprocessing:
  - Tokenization
  - Normalization
  - Stemming
  - Stop word removal
- Inverted Index Creation:
  - Uses the dictionary-and-postings structure, in which each term in the dictionary points to a list of the documents that contain it
  - Orders items by how often they are repeated in the documents
- Boolean Retrieval Model:
  - Implements a binary term-document matrix that records whether each token appears in each document
  - Allows users to input queries in a specified format for retrieval
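The first two parts can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the function names (`naive_stem`, `preprocess`, `build_inverted_index`), the tiny stop-word list, and the suffix-stripping stemmer are all assumptions made for the example, and it handles English only. Persian text would need its own normalizer, stemmer, and stop-word list. Ordering each postings list by term frequency is one possible reading of "orders items by repetition".

```python
from collections import defaultdict
import re

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is"}


def naive_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize (lowercase), tokenize, drop stop words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            counts[term][doc_id] += 1
    # Order each postings list by how often the term repeats in a document,
    # most frequent first; ties fall back to the lower doc_id.
    return {
        term: sorted(postings.items(), key=lambda p: (-p[1], p[0]))
        for term, postings in counts.items()
    }


docs = [
    "The taxi waited in the city.",
    "Taxi drivers and taxi riders filled the city streets.",
    "A church stands in the old town.",
]
index = build_inverted_index(docs)
print(index["taxi"])  # [(1, 2), (0, 1)]
```

Here document 1 comes first in the postings list for `taxi` because the term occurs twice in it.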
The required datasets can be downloaded directly in Google Colab. The links to the datasets are provided in the notebook.
To use the Boolean Retrieval System, follow these steps:
- Data Preprocessing:
  - Ensure your datasets are in the correct format (Persian books and English movie summaries).
  - Run the preprocessing script to tokenize, normalize, stem, and remove stop words.
- Inverted Index Creation:
  - Run the script to generate the inverted index based on the processed data.
- Boolean Retrieval:
  - Execute the Boolean Retrieval script.
  - Enter queries in the specified format.
  - Receive a list of document indices where the tokens appear.
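As an illustration of the retrieval step, the sketch below builds a binary term-document matrix from already-preprocessed documents and returns the document indices for a single token. The helper names (`build_incidence_matrix`, `documents_containing`) and the sample tokens are hypothetical and are not taken from the project's scripts.

```python
def build_incidence_matrix(tokenized_docs):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if term i occurs in doc j."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    doc_sets = [set(doc) for doc in tokenized_docs]
    matrix = [[1 if term in doc else 0 for doc in doc_sets] for term in vocabulary]
    return vocabulary, matrix


def documents_containing(term, vocabulary, matrix):
    """List the indices of documents whose entry is 1 in the term's row."""
    if term not in vocabulary:
        return []
    row = matrix[vocabulary.index(term)]
    return [doc_id for doc_id, bit in enumerate(row) if bit]


# Sample input: each document is a list of preprocessed tokens (assumed data).
tokenized_docs = [
    ["taxi", "wait", "city"],
    ["taxi", "driver", "city", "street"],
    ["church", "old", "town"],
]
vocabulary, matrix = build_incidence_matrix(tokenized_docs)
print(documents_containing("taxi", vocabulary, matrix))  # [0, 1]
```

A dense list-of-lists matrix is easy to read but grows with vocabulary size times document count, so a larger corpus would likely call for bit vectors or a sparse representation.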
Queries use the following operators:
- `.` (dot): Represents logical AND.
- `!`: Represents logical NOT.
- `+`: Represents logical OR.

Example query: `city+!church.taxi+!six`
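To show how such a query might be evaluated, here is a small sketch that works on sets of document indices. The README does not state operator precedence, so the example assumes the conventional order: NOT binds tightest, then AND, then OR. The function name `evaluate_query` and the postings data are made up for illustration.

```python
def evaluate_query(query, postings, all_docs):
    """Evaluate a query using `.` (AND), `+` (OR), and `!` (NOT).

    Assumed precedence (not stated in the README): NOT, then AND, then OR.
    """

    def eval_factor(factor):
        # A factor is a term, optionally prefixed by one or more `!`.
        term = factor.lstrip("!")
        negations = len(factor) - len(term)
        result = set(postings.get(term, set()))
        if negations % 2 == 1:
            result = all_docs - result
        return result

    matches = set()
    for or_part in query.split("+"):
        # Start from every document and narrow down with each AND-ed factor.
        and_result = set(all_docs)
        for factor in or_part.split("."):
            and_result &= eval_factor(factor.strip())
        matches |= and_result
    return sorted(matches)


# Hypothetical postings: term -> set of document indices.
postings = {
    "city": {0, 1},
    "church": {2, 3},
    "taxi": {0, 3},
    "six": {1, 2, 3},
}
all_docs = {0, 1, 2, 3, 4}
print(evaluate_query("city+!church.taxi+!six", postings, all_docs))  # [0, 1, 4]
```

Under the assumed precedence, the example query is read as `city OR ((NOT church) AND taxi) OR (NOT six)`. If the project evaluates operators strictly left to right, the result may differ.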
We're excited to hear from you! If you have any questions, suggestions, or need assistance, don't hesitate to reach out. Feel free to contact us via email at:
We're here to help and would love to hear about your experience using this project.