[Bug]: Crawl Configuration Inconsistency: Max Depth and Include Any Linked Page #693
Description
Browsertrix Version
v1.11.7-7a61568
What did you expect to happen? What happened instead?
I have a question about the settings Max Depth and Include Any Linked Page.
I think there might be an error in the implementation of the rules.
As I understand it:
- Max Depth applies to pages that are in scope: pages beyond the limit are cut, so the setting restricts which pages count as in scope. It is described as "Limits how many hops away the crawler can visit while staying within the Start URL Scope."
- Include Any Linked Page ("one hop out") adds all pages that are linked from pages in scope: "If checked, the crawler will visit pages one link away outside of Start URL Scope." (See the sketch below for how I read these two rules together.)
Right now, the Browsertrix crawler is not behaving the way I would expect.
Reproduction instructions
Example:
I've created an example page for illustration: https://monaulrich.online/scopetest/start/
Crawl Config "Scope Domain & Subdomain / Max Depth 1"
Here the crawler includes what I would expect: the start URL and the four linked pages on the same domain/subdomain are archived. These pages are in scope.
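As a cross-check, here is this configuration run through the expected_crawl sketch above, on a small hypothetical link graph loosely modeled on the test site (the real page structure may differ; the example.org URLs are made up):

```python
START = "https://monaulrich.online/scopetest/start/"

# Hypothetical link graph; the real test site may be structured differently.
LINKS = {
    START: [
        "https://monaulrich.online/scopetest/a/",
        "https://monaulrich.online/scopetest/b/",
        "https://monaulrich.online/scopetest/c/",
        "https://monaulrich.online/scopetest/d/",
        "https://example.org/linked-from-start/",    # out of scope
    ],
    "https://monaulrich.online/scopetest/a/": [
        "https://example.org/linked-from-depth-1/",  # out of scope
    ],
}

def in_scope(url):
    # Crude stand-in for the "Domain & Subdomain" scope check.
    return "monaulrich.online" in url

# Scope Domain & Subdomain / Max Depth 1, extra hop disabled:
pages = expected_crawl(START, LINKS, in_scope, max_depth=1, include_any_linked=False)
print(sorted(pages))  # start URL plus the four in-scope linked pages, as observed
```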
Crawl Config "Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True"
I would expect that all pages that are in scope under Scope Domain & Subdomain with Max Depth 1 would then be checked for any linked pages. Instead, only the pages one link away from the start URL are included.
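Running the same hypothetical graph with the extra hop enabled shows where my expectation and the observed behavior diverge:

```python
# Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True
pages = expected_crawl(START, LINKS, in_scope, max_depth=1, include_any_linked=True)

# Expected (my reading): both out-of-scope pages are captured, since each
# one is linked from an in-scope page.
assert "https://example.org/linked-from-start/" in pages
assert "https://example.org/linked-from-depth-1/" in pages

# Observed (per this report): only pages one link away from the start URL
# are archived, so /linked-from-depth-1/ is missing from the crawl.
```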
I am not sure whether I misunderstand the settings or whether there is an inconsistency in the implementation.
Thanks in advance.
Screenshots / Video
No response
Environment
No response
Additional details
No response