
seed urls not loading #363

Open
arun477 opened this issue Feb 2, 2019 · 7 comments

Comments

arun477 commented Feb 2, 2019

I'm trying the Frontera general crawler example, but it isn't picking up URLs from the seeds file. I'm getting the following output in the terminal:

2019-02-02 18:47:39 [manager] DEBUG: GET_NEXT_REQUESTS(out) returned_requests=0
2019-02-02 18:47:49 [manager] DEBUG: GET_NEXT_REQUESTS(in) max_next_requests=256
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Overused keys: []
2019-02-02 18:47:49 [overusedbuffer] DEBUG: Pending: 0

arun477 commented Feb 2, 2019

I used the following command to add the seed URLs:

python3 -m frontera.utils.add_seeds --config logging --seeds-file ./seeds_es_smp.txt

Output:

2019-02-02 18:45:46,219 INFO __main__ Starting local seeds addition from file ./seeds_es_smp.txt
2019-02-02 18:45:46,219 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,219 INFO manager Starting Frontier Manager...
2019-02-02 18:45:46,222 INFO manager Frontier Manager Started!
2019-02-02 18:45:46,222 INFO manager --------------------------------------------------------------------------------
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO states-context Flushing states
2019-02-02 18:45:46,232 INFO states-context Flushing of states finished
2019-02-02 18:45:46,232 INFO __main__ Seeds addition finished
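(For reference: as far as I can tell, the seeds file is expected to be plain text with one full URL, scheme included, per line. The file below is a made-up stand-in for the questioner's seeds_es_smp.txt.)

```shell
# Hypothetical seeds file: one fully qualified URL per line.
cat > seeds.txt <<'EOF'
https://example.com/
https://example.org/start
EOF

wc -l < seeds.txt   # 2 seed URLs, ready to pass via --seeds-file
```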

sibiryakov commented

Hi, which backend have you used, @coolArun?


arun477 commented Apr 4, 2019

Hi sibiryakov, I tried with the in-memory DB. The Frontera documentation for integrating with Scrapy was good, but adding seed URLs directly through code instead of the command line isn't clearly covered anywhere in the docs. Pointing me to some resources would be helpful, thanks. (Note: Python 3, Scrapy, Frontera; load URLs from a text file and crawl.)

sibiryakov commented

This will not work with the memory backend, because there is nowhere to persist the seeds, queue, etc.: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html#inject-the-seed-urls
Also check the distributed quick start if you're doing a distributed setup.
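(A minimal sketch of what a persistent single-process configuration might look like with the SQLAlchemy backend instead of the memory one. Setting names are taken from Frontera 0.8-era docs and should be treated as assumptions; verify them against the quick-start page for your installed version.)

```python
# settings.py -- sketch only; confirm setting names against the Frontera docs.
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# SQLite file on disk, so seeds and the queue survive between runs
# (unlike the in-memory backend discussed above).
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'

# Matches the batch size visible in the GET_NEXT_REQUESTS log lines above.
MAX_NEXT_REQUESTS = 256
```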


arun477 commented Apr 4, 2019

Thanks. Is there any way to add seed URLs through code instead of the command line?

sibiryakov commented

By using the crawling strategy. There is a whole guide about it: https://frontera.readthedocs.io/en/latest/topics/custom_crawling_strategy.html
The idea is that your crawling strategy contains the logic for adding the seeds. If you describe what problem you're trying to solve with Frontera, I could suggest something more specific.
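(To make the shape of this concrete: in the linked guide, a strategy subclasses Frontera's BaseCrawlingStrategy and implements a read_seeds hook that creates and schedules a request per URL. Frontera itself isn't imported below, so the base class, create_request, and schedule here are hand-rolled stand-ins that only mirror that shape; they are not the Frontera API.)

```python
from io import BytesIO

class BaseCrawlingStrategy:          # stand-in, NOT the Frontera class
    """Mimics the create_request/schedule surface the frontier provides."""
    def __init__(self):
        self.scheduled = []

    def create_request(self, url):   # Frontera would return a Request object
        return {'url': url}

    def schedule(self, request, score=1.0):
        self.scheduled.append((request, score))

class MySeedStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        """Read one URL per line from a seeds stream and schedule each one."""
        for line in stream:
            url = line.decode().strip() if isinstance(line, bytes) else line.strip()
            if url:
                self.schedule(self.create_request(url))

strategy = MySeedStrategy()
strategy.read_seeds(BytesIO(b"https://example.com/\n\nhttps://example.org/\n"))
print(len(strategy.scheduled))  # -> 2 (the blank line is skipped)
```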


arun477 commented Apr 4, 2019

Thanks for the resource link, sibiryakov. My use case: I have more than 10k seed URLs and I want to crawl all of them using a breadth-first search strategy. The problem I'm facing with Scrapy alone is that each site I try to crawl is so huge that it seems like it will never finish. To solve this, I'm trying to eliminate all unwanted URLs during the crawl itself and to order the URLs that need to be crawled first, which means performing a lot of classification and filtering while crawling.
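(A toy sketch of that filter-then-BFS idea, independent of Frontera: breadth-first over an in-memory "link graph", discarding unwanted URLs before they ever enter the queue. The LINKS graph and the is_wanted rule are made up for illustration; in a real crawl the classifier would replace is_wanted and the frontier would replace the deque.)

```python
from collections import deque

# Fake link graph: page URL -> URLs it links to.
LINKS = {
    'https://a.example/':         ['https://a.example/docs', 'https://a.example/login'],
    'https://a.example/docs':     ['https://a.example/docs/api', 'https://a.example/login'],
    'https://a.example/docs/api': [],
    'https://a.example/login':    [],
}

def is_wanted(url):
    # Classification step: drop URLs we never want to crawl.
    return 'login' not in url

def bfs_crawl(seeds):
    seen, order = set(), []
    queue = deque(u for u in seeds if is_wanted(u))
    seen.update(queue)
    while queue:
        url = queue.popleft()          # FIFO queue -> breadth-first order
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen and is_wanted(link):
                seen.add(link)
                queue.append(link)
    return order

print(bfs_crawl(['https://a.example/']))
# -> ['https://a.example/', 'https://a.example/docs', 'https://a.example/docs/api']
```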
