limit scanning reqs/second #18

Open
jimpriest opened this issue Sep 15, 2016 · 4 comments

@jimpriest

We run Linkchecker daily and 99% of the time it behaves, but on occasion it seems to run amok and scan a lot of links in a short amount of time. Not sure why this occurs - none of my settings change (it runs via Jenkins).

I was thinking of adding something like wget's '--wait' flag to limit the requests made. Any thoughts on where the best place to do this would be?

I will take a stab at it and submit a pull-request when complete.

Thanks!
jim

@bartdag
Owner

bartdag commented Sep 15, 2016

Hi Jim,

What kind of workers are you using (process / thread / green threads), and how many? The only time I observed pylinkvalidator scan many links quickly was when the links were quickly returning a bad response (e.g., 404).

A wait flag would definitely make sense. I'll check tonight where it would work best and post it here.

@jimpriest
Author

--workers=2
--timeout=20
--format=csv
--mode=process
--parser=lxml

We did have someone publish a bad link, which resulted in an unusually large number of 404s.

I appreciate the insight!! I'll poke around the code this afternoon as well.

@bartdag
Owner

bartdag commented Sep 16, 2016

Hi Jim, here are my notes about the wait flag:

  1. I think the wait flag should represent the minimum time each worker should wait before making a request: the number of workers will control the concurrency.
  2. The flag should first be added to the command-line options.
  3. The flag should then be added to WorkerConfig, which is sent to every worker so it can configure itself.
  4. Each worker eventually initializes a PageCrawler (one instance per worker). The page crawler should have a timestamp, e.g., last_fetch_timestamp.
  5. Before opening a URL, the page crawler should check whether it should sleep (i.e., sleep if now - last_fetch_timestamp < wait_time); see the sketch after this list.
  6. Finally, I would add a test with a wait time, say 250 ms or 500 ms, and check that the test execution took at least X ms (250 ms * number of pages crawled). Example of a test that could be copied and modified.
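
For step 5, a minimal sketch of the throttle check could look like the following. The `PageCrawler` and `worker_config` shapes here are simplified stand-ins for illustration, not the actual pylinkvalidator classes, and `wait_time` is the proposed flag value in seconds:

```python
import time


class PageCrawler(object):
    """Simplified stand-in for the per-worker page crawler (illustration only)."""

    def __init__(self, worker_config):
        # wait_time would come from WorkerConfig (step 3); 0 disables throttling.
        self.wait_time = getattr(worker_config, "wait_time", 0)
        self.last_fetch_timestamp = None

    def _throttle(self):
        """Sleep so at least wait_time seconds separate two consecutive fetches."""
        if self.wait_time <= 0 or self.last_fetch_timestamp is None:
            return
        elapsed = time.time() - self.last_fetch_timestamp
        if elapsed < self.wait_time:
            time.sleep(self.wait_time - elapsed)

    def open_url(self, url):
        self._throttle()
        self.last_fetch_timestamp = time.time()
        # ... perform the actual fetch here ...
```

Since each worker throttles independently, with --workers=2 and --mode=process the overall rate would be roughly (number of workers) / wait_time requests per second.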

@jimpriest
Author

Thanks so much for the detailed response! I came up with similar steps. Will see if I can find some time this weekend to hack on some code :)
