
Web Crawler

The Crawler class scrapes data starting from a given URL, following links down to a given depth of traversal. It returns a dictionary with each crawled URL as key and its HTML content as value. The class can serve as a starting point for search applications or other analytics projects.

Working:

Given a base URL, the crawler scrapes that page and all pages linked from it, up to the given crawl depth. Useful information can then be extracted from the scraped data by applying appropriate regular expressions, as sketched below.
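
For example, once crawl() has returned the URL-to-HTML dictionary (see Usage below), page titles can be pulled out of the raw markup with a regular expression. This is only a sketch of the kind of post-processing meant here; extract_titles is a hypothetical helper, not part of the Crawler class:

    import re

    def extract_titles(crawled):
        # The crawler returns {url: html}, so any regex can be applied
        # to the values. Here we grab each page's <title> text.
        titles = {}
        for url, html in crawled.items():
            match = re.search(r'<title>(.*?)</title>', html,
                              re.IGNORECASE | re.DOTALL)
            titles[url] = match.group(1).strip() if match else None
        return titles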

Project setup

A sample web has been provided for proof of concept and testing. It consists of a mix of relative and absolute URLs, along with loops in the link graph (see the loop-handling sketch at the end of this section). The Crawler class has been tested against these commonly encountered scenarios. The files can be served using Python's built-in HTTP server:

$ cd sample_web
$ python -m SimpleHTTPServer 8000
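
Note: SimpleHTTPServer exists only under Python 2. On Python 3 the equivalent module is http.server:

$ cd sample_web
$ python3 -m http.server 8000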

Browse the sample web at http://localhost:8000/page1.html

Warning: The sample web is nowhere near the scale and variety of content encountered on the real web. This is just a starter project and can be extended to suit specific requirements.
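
Because of the loops in the sample web, a naive crawler that blindly follows every link would never terminate; the standard defence is to remember which URLs have already been visited. The sketch below illustrates that idea only; fetch and extract_links are hypothetical helpers, and this is not necessarily how Crawler.py is implemented:

    from collections import deque

    def bfs_crawl(start_url, depth, fetch, extract_links):
        # Breadth-first traversal with a visited set, so loops in the
        # web cannot cause the same page to be fetched twice.
        seen = {start_url}
        frontier = deque([(start_url, 0)])
        pages = {}
        while frontier:
            url, d = frontier.popleft()
            pages[url] = fetch(url)
            if d < depth:
                for link in extract_links(pages[url], url):
                    if link not in seen:
                        seen.add(link)
                        frontier.append((link, d + 1))
        return pages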

Test run
$ python example.py

or

$ python Crawler.py

Usage

Example code for using the class is provided in the file example.py. The individual steps are listed below; a minimal end-to-end sketch follows the sample output.

  • Import Crawler class
    import Crawler as mycrawler
  • Set a base url to start crawling
   start_url = 'http://localhost:8000/page1.html'
  • Initialize the crawler object
   crawler = mycrawler.Crawler(start_url, 3)

NOTE: The first argument is the base URL and the second is the depth of crawling. Depth is not the same as the number of links crawled; it is the graph depth from the base URL. Choose the depth according to how dense the part of the web you want to crawl is.

  • Perform crawling
   data = crawler.crawl()
  • Sample output is given below
   {
       'link1': 'html content for link 1',
       'link2': 'html content for link 2',
       .
       .
       .
   }
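
Putting the steps together, a minimal end-to-end script (equivalent in spirit to example.py, assuming the Crawler(start_url, depth) constructor and crawl() method shown above) looks like this:

    import Crawler as mycrawler

    # Base URL served by the sample web (see Project setup above).
    start_url = 'http://localhost:8000/page1.html'

    # Depth 3 means pages up to three link-hops from the base URL are
    # visited; it is not a cap on the total number of links crawled.
    crawler = mycrawler.Crawler(start_url, 3)

    # crawl() returns {url: html_content} for every page visited.
    data = crawler.crawl()

    for url, html in data.items():
        print('%s -> %d bytes of HTML' % (url, len(html)))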
