Analysis of site source code and AJAX requests; construction of XPath queries.
Analysis of anti-scraping protections; emulation of requests, cookies, and headers.
Spider development in Python with Scrapy and Selenium.
Database design and export of scraped data to a database (MySQL, MongoDB, PostgreSQL).
Spider failover at any stage, with crawl state and data persisted.
Proxy rotation and captcha solving, including reCAPTCHA.
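Two of the techniques above, proxy rotation and failover with persisted state, can be sketched in plain Python. This is a minimal illustration, not the actual implementation behind this profile; the proxy addresses, file path, and helper names are placeholders.

```python
import itertools
import json
import pathlib

# Placeholder proxy pool; in production this would come from a provider.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
_proxy_cycle = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Return the next proxy in round-robin order."""
    return next(_proxy_cycle)

def save_checkpoint(path: pathlib.Path, done_urls: set) -> None:
    """Persist the set of already-scraped URLs so a restart can skip them."""
    path.write_text(json.dumps(sorted(done_urls)))

def load_checkpoint(path: pathlib.Path) -> set:
    """Restore crawl state; an absent file means a fresh crawl."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()
```

In a Scrapy project the same ideas usually map to a rotating-proxy downloader middleware and the built-in JOBDIR setting, which persists the scheduler queue and dedup filter between runs.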
Largest scraping results:
Speed: 1,000+ pages per minute
Volume: 10M objects in the database
Well-known sites: Google, Facebook, Booking, TripAdvisor, LinkedIn, Amazon, Walmart
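The database-export step listed above can also be sketched briefly. The profile targets MySQL, MongoDB, and PostgreSQL; the stdlib sqlite3 module stands in here so the example is self-contained, and the table and field names are illustrative.

```python
import sqlite3

def export_items(conn: sqlite3.Connection, items: list) -> int:
    """Create the target table if needed, bulk-insert scraped items,
    and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR IGNORE keys on url, so re-exporting after a failover
    # restart is idempotent: duplicates are silently skipped.
    conn.executemany(
        "INSERT OR IGNORE INTO items (url, title) VALUES (:url, :title)",
        items,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Keying the table on the item URL is one common way to make export resumable: a spider restarted mid-crawl can re-emit items it already stored without creating duplicates.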