limit scanning reqs/second #18

Open
jimpriest opened this issue Sep 15, 2016 · 4 comments

@jimpriest

We run Linkchecker daily and 99% of the time it behaves, but on occasion it seems to run amok and scan a lot of links in a short amount of time. Not sure why this occurs - none of my settings change (it runs via Jenkins).

I was thinking of adding something like wget's '--wait' flag to limit the requests made. Any thoughts on where the best place to do this would be?

I will take a stab at it and submit a pull-request when complete.

Thanks!
jim

@bartdag
Owner

bartdag commented Sep 15, 2016

Hi Jim,

What kind of workers are you using (process / thread / green threads), and how many? The only time I observed pylinkvalidator scan many links quickly was when the links were quickly returning a bad response (e.g., 404).

A wait flag would definitely make sense. I'll check tonight where it would work best and post it here.

@jimpriest
Author

--workers=2
--timeout=20
--format=csv
--mode=process
--parser=lxml

We did have someone publish a bad link, which resulted in an unusually large number of 404s.

I appreciate the insight!! I'll poke around the code this afternoon as well.

@bartdag
Owner

bartdag commented Sep 16, 2016

Hi Jim, here are my notes about the wait flag:

  1. I think the wait flag should represent the minimum time each worker should wait before making a request: the number of workers will control the concurrency.
  2. The flag should first be added to the command-line options.
  3. The flag should then be added to WorkerConfig, which is sent to every worker so it can configure itself.
  4. Each worker eventually initializes a PageCrawler (one instance per worker). The page crawler should have a timestamp, e.g., last_fetch_timestamp.
  5. Before opening a URL, the page crawler should check whether it should sleep (i.e., sleep if now - last_fetch_timestamp < wait_time); see the sketch after this list.
  6. Finally, I would add a test with a wait time, say 250 ms or 500 ms, and check that the test execution took at least X ms (250 ms * number of pages crawled). Example of a test that could be copied and modified.
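
For step 5, a minimal sketch of the throttle check could look like the following. The `PageCrawler` and `worker_config` shapes here are simplified stand-ins for illustration, not the actual pylinkvalidator classes, and `wait_time` is the proposed flag value in seconds:

```python
import time


class PageCrawler(object):
    """Simplified stand-in for the per-worker page crawler (illustration only)."""

    def __init__(self, worker_config):
        # wait_time would come from WorkerConfig (step 3); 0 disables throttling.
        self.wait_time = getattr(worker_config, "wait_time", 0)
        self.last_fetch_timestamp = None

    def _throttle(self):
        """Sleep so at least wait_time seconds separate two consecutive fetches."""
        if self.wait_time <= 0 or self.last_fetch_timestamp is None:
            return
        elapsed = time.time() - self.last_fetch_timestamp
        if elapsed < self.wait_time:
            time.sleep(self.wait_time - elapsed)

    def open_url(self, url):
        self._throttle()
        self.last_fetch_timestamp = time.time()
        # ... perform the actual fetch here ...
```

Since each worker throttles independently, with --workers=2 and --mode=process the overall rate would be roughly (number of workers) / wait_time requests per second.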

@jimpriest
Author

Thanks so much for the detailed response! I came up with similar steps. Will see if I can find some time this weekend to hack on some code :)
