forked from lorien/awesome-web-scraping
# Ruby Web Scraping

This list contains Ruby libraries related to web scraping and data processing.

* [Ruby Web Scraping](#ruby-web-scraping)
    * [Network](#network)
    * [Web-Scraping Frameworks](#web-scraping-frameworks)
    * [HTML/XML Parsing](#htmlxml-parsing)
    * [Text Processing](#text-processing)
    * [Specific Formats Processing](#specific-formats-processing)
    * [Natural Language Processing](#natural-language-processing)
    * [Downloader](#downloader)
    * [Browser Automation and Emulation](#browser-automation-and-emulation)
    * [Multiprocessing](#multiprocessing)
    * [Queue](#queue)
    * [Cloud Computing](#cloud-computing)
    * [Email](#email)
    * [URL Manipulation](#url-manipulation)
    * [Web Content Extracting](#web-content-extracting)
    * [Asynchronous](#asynchronous)
    * [WebSocket](#websocket)
    * [DNS Resolving](#dns-resolving)
    * [Computer Vision](#computer-vision)
    * [Geolocation](#geolocation)
    * [Other Ruby Lists](#other-ruby-lists)

## Network

* [httparty](https://github.com/jnunemaker/httparty) - Makes http fun again!
* [faraday](https://github.com/lostisland/faraday) - Simple, but flexible HTTP client library, with support for multiple backends.
* [http](https://github.com/tarcieri/http) - A simple Ruby DSL for making HTTP requests
* [excon](https://github.com/excon/excon) - Usable, fast, simple HTTP(S) 1.1 for Ruby
* [nestful](https://github.com/maccman/nestful) - Simple Ruby HTTP/REST client with a sane API
* [EM-HTTP-Request](https://github.com/igrigorik/em-http-request) - EventMachine based asynchronous HTTP client

## Web-Scraping Frameworks

* TODO

## HTML/XML Parsing

* [nokogiri](https://github.com/sparklemotion/nokogiri) - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
* [loofah](https://github.com/flavorjones/loofah) - HTML/XML manipulation and sanitization based on Nokogiri

## Text Processing

*Libraries for parsing and manipulating plain texts.*

* General
    * TODO

## Specific Formats Processing

*Libraries for parsing and manipulating specific text formats.*

* Office
    * [Yomu](https://github.com/Erol) - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    * [spreadsheet](https://github.com/zdavatz/spreadsheet) - Reads and writes spreadsheet documents.
    * [roo](https://github.com/Empact/roo) - Read access for all spreadsheet types and read/write access for Google spreadsheets.
    * [google-spreadsheet-ruby](https://github.com/gimite/google-spreadsheet-ruby) - A library to read and write Google Spreadsheets.
    * [rubyXL](https://github.com/weshatheleopard/rubyXL) - A gem for parsing, creating, and manipulating Microsoft Excel (.xlsx/.xlsm) documents
    * [remote_table](https://github.com/seamusabshere/remote_table) - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
    * [sheets](https://github.com/bspaulding/Sheets) - Work with spreadsheets easily in a native Ruby format.
    * [workbook](https://github.com/murb/workbook) - Models a workbook as tables containing rows containing cells; reads and writes Excel, ODS, CSV, and tab-separated files.
    * [oxcelix](https://github.com/gbiczo/oxcelix) - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    * [wrap_excel](https://github.com/tomiacannondale/wrap_excel) - Wraps win32ole to make Excel operations easy to use from Ruby; see the README for a detailed description.

## Natural Language Processing

*Libraries for working with human languages.*

* [Treat](https://github.com/louismullie/treat) - Treat is a toolkit for natural language processing and computational linguistics in Ruby

## Downloader

*Libraries for downloading.*

* TODO

## Browser Automation and Emulation

* TODO

## Multiprocessing

* [Celluloid](https://github.com/celluloid/celluloid) - Actor-based concurrent object framework for Ruby
* [Parallel](https://github.com/grosser/parallel) - Ruby parallel processing made simple and fast

## Asynchronous

*Libraries for asynchronous network programming.*

* [EventMachine](https://github.com/eventmachine/eventmachine) - Event-driven I/O and lightweight concurrency library

## Queue

* [Resque](https://github.com/resque/resque) - A Redis-backed Ruby library for creating background jobs and placing them on multiple queues.
* [Delayed::Job](https://github.com/tobi/delayed_job) - Database-backed asynchronous priority queue.
* [Qu](https://github.com/bkeepers/qu) - A Ruby library for queuing and processing background jobs.
* [Sidekiq](https://github.com/mperham/sidekiq) - Simple, efficient background processing for Ruby

## Cloud Computing

* TODO

## Email

*Libraries for parsing email.*

* [mail](https://github.com/mikel/mail) - A Really Ruby Mail Library

## URL Manipulation

*Libraries for parsing URLs.*

* TODO

## Web Content Extracting

*Libraries for extracting web contents.*

* TODO

## WebSocket

*Libraries for working with WebSocket.*

* [em-websocket](https://github.com/igrigorik/em-websocket) - EventMachine based WebSocket server

## DNS Resolving

* TODO

## Computer Vision

* TODO

## Geolocation

* [geocoder](https://github.com/alexreisner/geocoder) - Complete Ruby geocoding solution
* [Geokit](https://github.com/geokit/geokit) - Geokit gem provides geocoding and distance/heading calculations.

## Other Ruby Lists

* TODO