ItRadar

ItRadar is a search engine that scrapes IT blogs and returns the most relevant links for a user's query. It builds an inverted index over the scraped articles and ranks matches with the BM25 algorithm.

  ____________    ________    ________ ___________          _____         ____________       _____        ___________      
 /            \  /        \  /        \\          \       /      |_       \           \    /      |_      \          \     
|\___/\  \\___/||\         \/         /|\    /\    \     /         \       \           \  /         \      \    /\    \    
 \|____\  \___|/| \            /\____/ | |   \_\    |   |     /\    \       |    /\     ||     /\    \      |   \_\    |   
       |  |     |  \______/\   \     | | |      ___/    |    |  |    \      |   |  |    ||    |  |    \     |      ___/    
  __  /   / __   \ |      | \   \____|/  |      \  ____ |     \/      \     |    \/     ||     \/      \    |      \  ____ 
 /  \/   /_/  |   \|______|  \   \      /     /\ \/    \|\      /\     \   /           /||\      /\     \  /     /\ \/    \
|____________/|            \  \___\    /_____/ |\______|| \_____\ \_____\ /___________/ || \_____\ \_____\/_____/ |\______|
|           | /             \ |   |    |     | | |     || |     | |     ||           | / | |     | |     ||     | | |     |
|___________|/               \|___|    |_____|/ \|_____| \|_____|\|_____||___________|/   \|_____|\|_____||_____|/ \|_____|

This project is built for educational purposes, allowing you to explore the inner workings of web scraping, information retrieval, and search engine ranking methods.

Features

Web Scraping: Automatically scrapes IT-related blogs and stores the content locally.
Inverted Indexing: Efficiently indexes scraped content to enable fast query retrieval.
BM25 Ranking Algorithm: Uses the BM25 algorithm to rank search results based on relevance.
Flask Web Interface: Provides a simple, user-friendly interface for querying and displaying results.

How It Works

Scraping: ItRadar scrapes IT blogs to extract the article titles, URLs, and content. The scraped data is stored in JSON files.
Indexing: The scraped content is processed using an inverted index, which allows efficient querying by mapping keywords to documents.
Querying: When a user inputs a query, ItRadar calculates the relevance of each document using the BM25 algorithm and returns the most relevant results.
Ranking: The search results are ranked based on their BM25 scores and displayed with their corresponding URLs, titles, and snippets of content.
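
The indexing and querying steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `tokenize`, `build_inverted_index`, and `candidates` helpers and the `title`/`url`/`content` field names are assumptions chosen to match the description, and the sample documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to a dict of {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, doc in enumerate(documents):
        # Index the title together with the body (an assumption for this sketch).
        for term in tokenize(doc["title"] + " " + doc["content"]):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def candidates(index, query):
    """Return the ids of documents that contain at least one query term."""
    ids = set()
    for term in tokenize(query):
        ids.update(index.get(term, {}))
    return ids


# Hypothetical sample data standing in for the scraped JSON records.
docs = [
    {"title": "Intro to CSS grid", "url": "https://example.com/a",
     "content": "Grid layout basics"},
    {"title": "Scaling databases", "url": "https://example.com/b",
     "content": "Sharding and replication"},
]
index = build_inverted_index(docs)
matching = candidates(index, "grid")
```

Only the documents returned by `candidates` need to be scored, which is what makes the lookup cheap compared with scanning every article on each query.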

Technical Details

Scrapy: Used for web scraping.
Inverted Index: A data structure that maps keywords to the documents in which they appear.
BM25: A ranking function used to measure the relevance of documents to a given query.
Flask: Provides a simple web interface for user interaction.
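
To make the BM25 entry more concrete, here is a small, self-contained sketch of the scoring function. It is an assumption-laden illustration rather than ItRadar's implementation: the parameter values `k1=1.5` and `b=0.75` are common defaults, the IDF uses the non-negative `log(... + 1)` variant, and the function name `bm25_scores` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document in `docs` (a list of strings)."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs  # average document length
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python web scraping with scrapy",
    "css grid layout tricks",
    "scaling python services",
]
scores = bm25_scores("python scraping", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this sketch a document that matches no query term scores exactly zero, so it can be dropped from the results before they are displayed.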

Installation

Clone the repository:

git clone https://github.com/Balcus/ItRadar.git

Go to the project's folder (the commands in this guide use Windows PowerShell path syntax; on Linux or macOS use `cd ItRadar`):
```
cd .\ItRadar\
```
Install the requirements:
```
pip install -r requirements.txt
```
If you have multiple versions of Python installed and want to make sure you are using the pip that belongs to Python 3, use:
```
pip3 install -r requirements.txt
```
Or, to run pip through a specific Python interpreter:
```
python3 -m pip install -r requirements.txt
```
OPTIONAL: A few websites have already been scraped, and their data is in the json_folder. These articles may not include the latest content, so re-scrape the sites if you want up-to-date results. To get started, navigate to the crawlers folder:
```
cd .\crawlers\
```
Then navigate into the nested crawlers folder (the Scrapy project contains a second folder with the same name):
```
cd .\crawlers\
```
Inside it should look something like this:
```
  Mode                 LastWriteTime         Length Name
   ----                 -------------         ------ ----
 d-----         9/26/2024   3:49 PM                spiders
 d-----         9/26/2024   3:49 PM                __pycache__
 -a----         9/26/2024   3:49 PM            276 items.py
 -a----         9/26/2024   3:49 PM           3755 middlewares.py
 -a----         9/26/2024   3:49 PM            375 pipelines.py
 -a----         9/26/2024   3:49 PM           3816 settings.py
 -a----         9/26/2024   3:49 PM              0 __init__.py
```
Go to the spiders folder:
```
cd .\spiders\
```
Now you can run any of the spiders with the following command:
```
scrapy crawl [name_of_the_spider] -O [name_of_json_file].json
```
The names of the spiders can be found inside the blog_spider.py file or in the following list:
- danluuspider for https://danluu.com/
- jvnsspider for https://jvns.ca/
- 2alityspider for https://2ality.com/index.html
- cleancoderspider for https://blog.cleancoder.com/
- pragmaticengineerspider for https://blog.pragmaticengineer.com/
- techradarspider for https://www.techradar.com/
- arsspider for https://arstechnica.com/gadgets/
- a_list_apart_spider for https://alistapart.com/articles/
- hsspider for https://highscalability.com/
- css for https://css-tricks.com/category/articles/
The names of the JSON files should match the ones in the `json_folder`.

After scraping the websites and storing the data in JSON files, you can move these new files into the json_folder, replacing the existing files with the updated content. This ensures that you have the latest articles available for your search engine.
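
Before replacing the existing files, it can help to check that a freshly scraped file loads cleanly. The sketch below is one possible way to do that, under stated assumptions: the `load_documents` helper is hypothetical, each file is assumed to hold one JSON array of articles (which is what `scrapy crawl ... -O file.json` writes), and the field names `title`, `url`, and `content` are guesses based on the description above. The demo writes a made-up sample file to a temporary folder instead of touching the real json_folder.

```python
import json
import tempfile
from pathlib import Path

# Assumed field names; adjust them to whatever the spiders actually emit.
REQUIRED_FIELDS = ("title", "url", "content")


def load_documents(folder):
    """Load every *.json file in `folder` and return the complete articles."""
    documents = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            items = json.load(fh)  # one JSON array of articles per file
        for item in items:
            # Skip records with a missing or empty required field.
            if all(item.get(field) for field in REQUIRED_FIELDS):
                documents.append(item)
    return documents


# Demo on a temporary folder with hypothetical sample data.
with tempfile.TemporaryDirectory() as tmp:
    sample = [
        {"title": "Post", "url": "https://example.com/post", "content": "Body text"},
        {"title": "Broken", "url": "", "content": ""},
    ]
    (Path(tmp) / "sample.json").write_text(json.dumps(sample), encoding="utf-8")
    docs = load_documents(tmp)
```

Pointing `load_documents` at the real json_folder after copying the new files would show how many usable articles each scrape produced.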

After this, move back up to the project root by running the following command three times:
```
cd ..
```
Enter the app folder:
```
cd .\app\
```
Run the app:
```
python app.py
```
Or, with python3:
```
python3 app.py
```
Open your browser and go to:
```
http://localhost:5000
```
Have fun with the app!

Contact

For any questions or feedback, please reach out via:

Email: bbalcus04@gmail.com
GitHub Issues: Issues

Thank you for using ItRadar!
