1cbyc Web Scraper is a Python-based tool for collecting data from websites. It uses the requests and BeautifulSoup libraries to retrieve and parse web pages, and stores the extracted data in an SQLite database.
- can scrape multiple web pages
- can handle pagination
- can store scraped data in an SQLite database (support for more storage backends coming soon)
- can mimic a web browser by setting custom request headers
- (not to be a brag but) i added a way to classify the scraped data as individual text entries
- (since i have lazy pals) i added a way to read the sqlite file without stressing about having sqlite on your machine
- adding more features soon (there is a rough sketch of how the current pieces fit together right after this list)
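
to make the feature list concrete, here is a minimal sketch of the scrape-and-store loop. the headers, the `p` tag selector, the table schema, and the file path are illustrative assumptions, not the exact contents of main.py:

```python
import requests
import sqlite3
from bs4 import BeautifulSoup

# browser-like headers so the target site treats the scraper as a normal visitor
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}

def scrape_pages(base_url, num_pages, db_path='data/scraped_data.db'):
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, text TEXT)')
    for page in range(1, num_pages + 1):
        # pagination: base_url is assumed to end in 'page/', so the page number is appended
        response = requests.get(f'{base_url}{page}', headers=HEADERS, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # store each paragraph as an individual text record ('p' is a placeholder selector)
        for element in soup.find_all('p'):
            conn.execute('INSERT INTO data (text) VALUES (?)', (element.get_text(strip=True),))
        conn.commit()
        print(f'scraped page {page}/{num_pages}')
    conn.close()
```

the real script may pick different tags or columns; the shape of the loop is the point.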
- just clone my repository:

```bash
git clone https://github.com/1cbyc/1cbyc-web-scraper.git
cd 1cbyc-web-scraper
```
- then install the required packages:

```bash
pip install -r requirements.txt
```
- update the base_url and num_pages variables in main.py to match the target website and the number of pages you want to scrape:

```python
base_url = 'http://nsisong.com/page/'  # replace with the actual base URL
num_pages = 5  # adjust based on how many pages the target website has
```
- then, run the scraper:

```bash
python main.py
```
- make sure to check the console output to see the progress and results of the scraping.
- to view the scraped data, you can use the provided function in database.py to print all data:

```python
from scraper.database import print_all_data

print_all_data()
```
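
for the curious, a helper like that might look roughly like the sketch below. this is a guess that assumes the table is named `data` and the file lives at `data/scraped_data.db`; it is not the actual contents of database.py:

```python
import sqlite3

def print_all_data(db_path='data/scraped_data.db'):
    # connect to the scraper's database and dump every row in the data table
    conn = sqlite3.connect(db_path)
    for row in conn.execute('SELECT * FROM data'):
        print(row)
    conn.close()
```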
i also added a read_db.py file to this project as a shorter way to read the data, but i think i should not be an advocate for shortcuts. so, just do this instead: download and install DB Browser for SQLite:

- go to the DB Browser for SQLite website.
- download and install the version suitable for your pc.
- open DB Browser.
- click "Open Database" and navigate to the `data` dir.
- select the `.db` file you want to check and click "Open".
- use the "Browse Data" tab to view the contents of the `data` table.
- you can also run SQL queries using the "Execute SQL" tab.
- since i have pushed the project for general use, you can now visit webscraper.nsisong.com to get started. however, because i did not push the "data" folder from my local machine to github, i had to create it before starting the app:

```bash
mkdir -p data && python app.py
```
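
if you would rather not rely on the shell for that, the same directory check can live in python at startup. a minimal sketch (the `data` path matches the command above; where it would sit in app.py is my assumption):

```python
import os

# make sure the data directory exists before anything touches the database
os.makedirs('data', exist_ok=True)
```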
- also, i wrote a deployment shell script you should check out, deploy.sh, in this repo.
- i am warning you guys to use DB Browser for SQLite to read the file; otherwise, you're on your own.
all in all, you now know you can open the `scraped_data.db` file using an SQLite browser to inspect the data and skip my shitty method.
to be honest, i want you guys to fork this repository, make improvements, and submit pull requests; we would probably get a v2.1 release faster that way. suggest new features too, and i promise to work on them (if they make sense).