[Bug]: Crawl Configuration Inconsistency: Max Depth and Include Any Linked Page #693

Closed
@mona-ul

Description

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I have a question about the Max Depth and Include Any Linked Page settings.
I think there might be an error in the implementation of the rules.

As I understand it,

  • Max Depth applies to pages that are in scope. It limits which in-scope pages are crawled, so pages beyond the max depth are cut off. "Limits how many hops away the crawler can visit while staying within the Start URL Scope."
  • Include Any Linked Page (one hop out) adds all pages that are linked from pages in scope.
    "If checked, the crawler will visit pages one link away outside of Start URL Scope."

Right now, the Browsertrix crawler is not behaving as I would expect.

Reproduction instructions

Example:
I've created an example page for illustration: https://monaulrich.online/scopetest/start/

Crawl Config "Scope Domain & Subdomain / Max Depth 1"
Here the crawler includes what I would expect.
The start URL and the 4 linked pages that have the same domain/subdomain are archived.
These pages are in scope.
[Image: example_structure_in_scope]

Crawl Config "Scope Domain & Subdomain / Max Depth 1 / Include Any Linked Page: True"
I would expect every page that is in scope under the settings Scope Domain & Subdomain and Max Depth: 1 to be checked for linked pages, and those linked pages to be included. Instead, only the pages one hop away from the start URL are included.

Browsertrix Output:
[Image: example_structure_scope_linked_pages_actual]

What I would expect:
[Image: example_structure_scope_linked_pages_expected]
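
To make the two readings concrete, here is a minimal toy model in Python. It is only a sketch of how I understand the rules, not Browsertrix's actual implementation. The link graph, the page names, and the `in_scope` and `crawl` helpers are all hypothetical and do not mirror the real structure of the test page. The model always assumes Include Any Linked Page is checked; the flag only switches between the observed and the expected behavior.

```python
from collections import deque

# Hypothetical link graph (an assumption for illustration, not the real test page).
LINKS = {
    "start": ["in-a", "in-b", "out-x"],
    "in-a": ["in-c", "out-y"],
    "in-b": [],
    "in-c": ["out-z"],
}


def in_scope(page):
    # Assumption: the start page and pages named "in-*" match the Start URL Scope.
    return page == "start" or page.startswith("in-")


def crawl(start, max_depth, one_hop_from_all_in_scope):
    """Return the set of visited pages, with Include Any Linked Page checked.

    one_hop_from_all_in_scope=False: links found on pages at max depth are
    dropped (the behavior described as actual output).
    one_hop_from_all_in_scope=True: every in-scope page within max depth has
    all of its links visited once (the behavior described as expected).
    """
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        # Links are only extracted from in-scope pages within max depth.
        if not in_scope(page) or depth > max_depth:
            continue
        for link in LINKS.get(page, []):
            if link in visited:
                continue
            within_depth = depth + 1 <= max_depth
            if within_depth or one_hop_from_all_in_scope:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited


observed = crawl("start", max_depth=1, one_hop_from_all_in_scope=False)
expected = crawl("start", max_depth=1, one_hop_from_all_in_scope=True)
print(sorted(observed))  # ['in-a', 'in-b', 'out-x', 'start']
print(sorted(expected))  # ['in-a', 'in-b', 'in-c', 'out-x', 'out-y', 'start']
```

In this toy graph the difference between the two readings is exactly the pages linked from `in-a`, which is in scope at depth 1: `in-c` and `out-y` are only visited under the reading I expected.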

I am not sure whether I misunderstand the settings or whether there is an inconsistency in the implementation.
Thanks in advance.