
seed urls not loading #363

Open
arun477 opened this issue Feb 2, 2019 · 7 comments

Comments

arun477 commented Feb 2, 2019

I'm trying the Frontera general crawler example, but it isn't picking up URLs from the seeds file. I'm getting the following output in the terminal:

2019-02-02 18:47:39 [manager] DEBUG: GET_NEXT_REQUESTS(out) returned_requests=0
2019-02-02 18:47:49 [manager] DEBUG: GET_NEXT_REQUESTS(in) max_next_requests=256
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Overused keys: []
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Pending: 0

arun477 commented Feb 2, 2019

I used the following command to add the seed URLs:

python3 -m frontera.utils.add_seeds --config logging --seeds-file ./seeds_es_smp.txt

Output:

2019-02-02 18:45:46,219 INFO __main__ Starting local seeds addition from file ./seeds_es_smp.txt
2019-02-02 18:45:46,219 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,219 INFO manager Starting Frontier Manager...
2019-02-02 18:45:46,222 INFO manager Frontier Manager Started!
2019-02-02 18:45:46,222 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO __main__ Seeds addition finished
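(For reference: as far as I can tell, the seeds file is expected to be plain text with one full URL, scheme included, per line. The file below is a made-up stand-in for the questioner's seeds_es_smp.txt.)

```shell
# Hypothetical seeds file: one fully qualified URL per line.
cat > seeds.txt <<'EOF'
https://example.com/
https://example.org/start
EOF

wc -l < seeds.txt   # 2 seed URLs, ready to pass via --seeds-file
```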

sibiryakov commented

Hi, which backend have you used, @coolArun?


arun477 commented Apr 4, 2019

Hi sibiryakov, I tried with the in-memory DB. The Frontera documentation for integrating with Scrapy was good, but adding seed URLs directly through code instead of the command line isn't clearly covered anywhere in the docs. Pointing me to some resources would be helpful, thanks. (Note: Python 3, Scrapy, Frontera; load URLs from a text file and crawl.)

sibiryakov commented

This will not work with the memory backend, because there is nowhere to persist the seeds, queue, etc.: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html#inject-the-seed-urls
Also check the distributed quick start if you're doing a distributed setup.
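(A minimal sketch of what a persistent single-process configuration might look like with the SQLAlchemy backend instead of the memory one. Setting names are taken from Frontera 0.8-era docs and should be treated as assumptions; verify them against the quick-start page for your installed version.)

```python
# settings.py -- sketch only; confirm setting names against the Frontera docs.
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# SQLite file on disk, so seeds and the queue survive between runs
# (unlike the in-memory backend discussed above).
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'

# Matches the batch size visible in the GET_NEXT_REQUESTS log lines above.
MAX_NEXT_REQUESTS = 256
```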


arun477 commented Apr 4, 2019

Thanks. Is there any way to add seed URLs through code instead of the command line?

sibiryakov commented

By using the crawling strategy. There is a whole guide about it: https://frontera.readthedocs.io/en/latest/topics/custom_crawling_strategy.html
The idea is that your crawling strategy contains the logic for adding the seeds. If you describe what problem you're trying to solve with Frontera, I could suggest something more specific.
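(To make the shape of this concrete: in the linked guide, a strategy subclasses Frontera's BaseCrawlingStrategy and implements a read_seeds hook that creates and schedules a request per URL. Frontera itself isn't imported below, so the base class, create_request, and schedule here are hand-rolled stand-ins that only mirror that shape; they are not the Frontera API.)

```python
from io import BytesIO

class BaseCrawlingStrategy:          # stand-in, NOT the Frontera class
    """Mimics the create_request/schedule surface the frontier provides."""
    def __init__(self):
        self.scheduled = []

    def create_request(self, url):   # Frontera would return a Request object
        return {'url': url}

    def schedule(self, request, score=1.0):
        self.scheduled.append((request, score))

class MySeedStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        """Read one URL per line from a seeds stream and schedule each one."""
        for line in stream:
            url = line.decode().strip() if isinstance(line, bytes) else line.strip()
            if url:
                self.schedule(self.create_request(url))

strategy = MySeedStrategy()
strategy.read_seeds(BytesIO(b"https://example.com/\n\nhttps://example.org/\n"))
print(len(strategy.scheduled))  # -> 2 (the blank line is skipped)
```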


arun477 commented Apr 4, 2019

Thanks for the resource link, sibiryakov. My use case: I have more than 10k seed URLs and I want to crawl all of them using a breadth-first search strategy. The problem I'm facing with Scrapy alone is that each site I try to crawl is so huge that it seems like it will never finish. To solve this, I'm trying to eliminate all unwanted URLs during the crawl itself and to order the URLs that need to be crawled first, which means performing a lot of classification and filtering while crawling.
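(A toy sketch of that filter-then-BFS idea, independent of Frontera: breadth-first over an in-memory "link graph", discarding unwanted URLs before they ever enter the queue. The LINKS graph and the is_wanted rule are made up for illustration; in a real crawl the classifier would replace is_wanted and the frontier would replace the deque.)

```python
from collections import deque

# Fake link graph: page URL -> URLs it links to.
LINKS = {
    'https://a.example/':         ['https://a.example/docs', 'https://a.example/login'],
    'https://a.example/docs':     ['https://a.example/docs/api', 'https://a.example/login'],
    'https://a.example/docs/api': [],
    'https://a.example/login':    [],
}

def is_wanted(url):
    # Classification step: drop URLs we never want to crawl.
    return 'login' not in url

def bfs_crawl(seeds):
    seen, order = set(), []
    queue = deque(u for u in seeds if is_wanted(u))
    seen.update(queue)
    while queue:
        url = queue.popleft()          # FIFO queue -> breadth-first order
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen and is_wanted(link):
                seen.add(link)
                queue.append(link)
    return order

print(bfs_crawl(['https://a.example/']))
# -> ['https://a.example/', 'https://a.example/docs', 'https://a.example/docs/api']
```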
