Skip to content

raleighguevarra/AppleBooksScraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Apple Books Scraper Project

The Apple Books Scraper Project is a Python-based tool designed to automate the process of extracting detailed information about books from specified URLs or lists of URLs. It leverages the power of Selenium for web navigation and BeautifulSoup for parsing HTML content, enabling users to gather comprehensive data on books, including titles, authors, descriptions, and URLs. This tool is particularly useful for researchers, marketers, and book enthusiasts who seek to compile and analyze book data efficiently.

Features

  • Dynamic Web Scraping: Uses Selenium to interact with and scrape data from dynamic web pages that rely on JavaScript for content loading.
  • HTML Content Parsing: Employs BeautifulSoup to parse and extract structured data from HTML, ensuring accurate retrieval of book details.
  • Concurrency: Implements Python's ThreadPoolExecutor for concurrent requests, significantly speeding up the data collection process from multiple URLs.
  • Flexible Input Options: Supports input through direct URL specification or by reading a list of URLs from a text file, offering flexibility in how sources are specified.
  • Clean and Structured Output: Cleans and normalizes extracted text data, producing structured and easily readable output.

Installation

Prerequisites

  • Python 3.6 or higher
  • pip (Python package installer)

Dependencies

The project requires the following Python packages:

  • selenium
  • beautifulsoup4
  • chromedriver-autoinstaller

Install all required packages by running:

pip install selenium beautifulsoup4 chromedriver-autoinstaller

Usage

Command Line Arguments

  • -u, --url (optional): Specifies a single URL to fetch book details from.
  • -f, --file (optional): Specifies the path to a text file containing a list of URLs (one per line) to fetch book details from.

Note: At least one of -u or -f must be provided.

Running the ABScraper

To scrape book details from a single URL:

python abscraper.py -u <URL>

To scrape book details from a list of URLs in a file:

python abscraper.py -f <file_path>

Output

The script outputs a CSV file containing the book details. For each book, the following information is provided:

  • Title: The title of the book.
  • Author: The author(s) of the book.
  • Description: A brief description of the book.
  • URL: The direct URL to the book's page.

The CSV file is named based on the title of the page or pages from which the data was scraped, with special characters removed or replaced for file system compatibility.

Example

Given a URL "https://books.example.com/top-picks", running the scraper with -u https://books.example.com/top-picks will generate a CSV file named after the page's title, containing the scraped book details.

Contributing

Contributions to the Apple Books Scraper Project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.

License

This project is open-sourced under the MIT License. See the LICENSE file for more details.

Acknowledgments

This project utilizes Selenium and BeautifulSoup, thanks to their contributors for providing such powerful tools for web scraping.

About

Apple Books Scraper

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages