The Apple Books Scraper Project is a Python-based tool designed to automate the process of extracting detailed information about books from specified URLs or lists of URLs. It leverages the power of Selenium for web navigation and BeautifulSoup for parsing HTML content, enabling users to gather comprehensive data on books, including titles, authors, descriptions, and URLs. This tool is particularly useful for researchers, marketers, and book enthusiasts who seek to compile and analyze book data efficiently.
- Dynamic Web Scraping: Uses Selenium to interact with and scrape data from dynamic web pages that rely on JavaScript for content loading.
- HTML Content Parsing: Employs BeautifulSoup to parse and extract structured data from HTML, ensuring accurate retrieval of book details.
- Concurrency: Implements Python's
ThreadPoolExecutor
for concurrent requests, significantly speeding up the data collection process from multiple URLs. - Flexible Input Options: Supports input through direct URL specification or by reading a list of URLs from a text file, offering flexibility in how sources are specified.
- Clean and Structured Output: Cleans and normalizes extracted text data, producing structured and easily readable output.
- Python 3.6 or higher
- pip (Python package installer)
The project requires the following Python packages:
selenium
beautifulsoup4
chromedriver-autoinstaller
Install all required packages by running:
pip install selenium beautifulsoup4 chromedriver-autoinstaller
-u
,--url
(optional): Specifies a single URL to fetch book details from.-f
,--file
(optional): Specifies the path to a text file containing a list of URLs (one per line) to fetch book details from.
Note: At least one of -u
or -f
must be provided.
To scrape book details from a single URL:
python abscraper.py -u <URL>
To scrape book details from a list of URLs in a file:
python abscraper.py -f <file_path>
The script outputs a CSV file containing the book details. For each book, the following information is provided:
- Title: The title of the book.
- Author: The author(s) of the book.
- Description: A brief description of the book.
- URL: The direct URL to the book's page.
The CSV file is named based on the title of the page or pages from which the data was scraped, with special characters removed or replaced for file system compatibility.
Given a URL "https://books.example.com/top-picks", running the scraper with -u https://books.example.com/top-picks
will generate a CSV file named after the page's title, containing the scraped book details.
Contributions to the Apple Books Scraper Project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.
This project is open-sourced under the MIT License. See the LICENSE file for more details.
This project utilizes Selenium and BeautifulSoup, thanks to their contributors for providing such powerful tools for web scraping.