seed urls not loading #363
I used the following command to add seed URLs.
Hi, which backend have you used, @coolArun?
Hi @sibiryakov, I tried with the in-memory backend. The Frontera documentation for integrating with Scrapy was good, but adding seed URLs directly through code instead of the command line isn't clearly covered anywhere in the docs. Pointing me to some resources would be helpful, thanks. (Note: Python 3, Scrapy, Frontera; load URLs from a text file and crawl.)
This will not work with the memory backend, because there is nowhere to persist the seeds, queue, etc. See https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html#inject-the-seed-urls
Thanks. Is there any way to add seed URLs through code instead of the command line?
By using the crawling strategy. There is a whole guide about it: https://frontera.readthedocs.io/en/latest/topics/custom_crawling_strategy.html
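For reference, the "seeds in code" part of that guide boils down to implementing read_seeds in a strategy class. A minimal sketch against the Frontera 0.8 strategy API (method names are worth checking against the guide and your installed version); a complete strategy also has to implement the other hooks (page_crawled, links_extracted, etc.), which the guide covers:

```python
# my_strategy.py -- minimal sketch, assuming the Frontera 0.8 strategy API.
from frontera.strategy import BaseCrawlingStrategy


class MyStrategy(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        # "stream" is the open seeds file handed to the strategy;
        # one URL per line is assumed here.
        for line in stream:
            url = line.strip()
            if url:
                # Build a frontier request and push it to the queue.
                self.schedule(self.create_request(url))
```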
Thanks for the resource link, @sibiryakov. My use case: I have more than 10k seed URLs and I want to crawl them all using a breadth-first strategy. The problem I'm facing with Scrapy alone is that each site I try to crawl is so huge that it seems like it will never finish. To work around this, I'm trying to eliminate all unwanted URLs during the crawl itself and to order the URLs that should be crawled first, which means doing a lot of classification and filtering while crawling.
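That per-link control (filtering and ordering) is exactly what the strategy hooks are for: filter_extracted_links can drop unwanted URLs before they ever hit the backend, and schedule takes a score the queue can order by, so a depth-based score approximates breadth-first. A hedged sketch, again against the Frontera 0.8 API; the my_depth meta key and the UNWANTED regex are my own conventions for illustration, not Frontera built-ins:

```python
import re

from frontera.core.components import States
from frontera.strategy import BaseCrawlingStrategy

# Hypothetical filter: skip media files and calendar-style crawler traps.
UNWANTED = re.compile(r'\.(jpg|png|gif|pdf|zip)$|/calendar/', re.IGNORECASE)


class BreadthFirstFiltering(BaseCrawlingStrategy):
    def read_seeds(self, stream):
        for line in stream:
            url = line.strip()
            if url:
                request = self.create_request(url)
                request.meta[b'my_depth'] = 0  # custom key, not built-in
                self.schedule(request, score=1.0)  # seeds get top priority

    def filter_extracted_links(self, request, links):
        # Classification/filtering happens here, before state lookups.
        return [link for link in links if not UNWANTED.search(link.url)]

    def links_extracted(self, request, links):
        # Assumes custom meta keys survive the round trip through the backend.
        depth = request.meta.get(b'my_depth', 0) + 1
        for link in links:
            if link.meta[b'state'] == States.NOT_CRAWLED:
                link.meta[b'my_depth'] = depth
                # Shallower pages get higher scores, so the queue
                # drains roughly breadth-first.
                self.schedule(link, score=1.0 / (depth + 1))
                link.meta[b'state'] = States.QUEUED

    def page_crawled(self, response):
        response.meta[b'state'] = States.CRAWLED

    def request_error(self, request, error):
        request.meta[b'state'] = States.ERROR
```

Note this needs a backend that can actually persist the queue and state (e.g. an SQLAlchemy-backed one), not the memory backend discussed above.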
I'm trying the Frontera general crawler example, but it's not taking URLs from the seeds. I'm getting the following output in the terminal: