Enhance WebContentLoader to Support Recursive Link Parsing and Custom Headers #190

Open
@leandrosilvaferreira

Description

Is your feature request related to a problem? Please describe.
WebContentLoader cannot recursively parse content from the internal links of a given URL. It also offers no way to customize request headers, so requests made through Python HTTP clients can be blocked by services such as Cloudflare or by web application firewalls.

Describe the solution you'd like
I would like the WebContentLoader to have:

  1. A recursive parsing feature that, when enabled via a parameter, follows every internal link reachable from the main URL and parses the content of each page.
  2. The ability to override the default request headers, including User-Agent and authentication headers, through optional parameters.
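
As a rough illustration of the two requested behaviors, here is a minimal standard-library sketch. It is not WebContentLoader's API: `crawl`, `fetch`, `max_pages`, `fetcher`, and `DEFAULT_HEADERS` are hypothetical names chosen for this example, and "internal" is assumed to mean "same host as the start URL".

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

# Hypothetical defaults for this sketch; not taken from WebContentLoader.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)"}


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url: str, headers: Dict[str, str]) -> str:
    """Fetch one page, sending the caller-supplied headers."""
    with urlopen(Request(url, headers=headers), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl(
    start_url: str,
    headers: Optional[Dict[str, str]] = None,
    max_pages: int = 50,
    fetcher: Callable[[str, Dict[str, str]], str] = fetch,
) -> Dict[str, str]:
    """Breadth-first crawl of same-host links; returns {url: html}."""
    # Caller headers override the defaults (e.g. User-Agent, Authorization).
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    host = urlparse(start_url).netloc
    start = urldefrag(start_url).url
    queue = deque([start])
    seen = {start}
    pages: Dict[str, str] = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetcher(url, merged)
        except OSError:
            continue  # skip pages that fail to load (URLError is an OSError)
        pages[url] = html

        parser = _LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A call such as `crawl("https://example.com/", headers={"Authorization": "Bearer <token>"})` would send the merged headers on every request. The `max_pages` cap and the `seen` set are there so that link cycles and very large sites cannot make the crawl run unbounded; a real implementation would probably also want a depth limit and rate limiting.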

Describe alternatives you've considered
An alternative would be to create separate utilities for recursive link parsing and custom headers, but integrating these features directly into WebContentLoader will provide a more seamless and efficient solution.

Additional context
This enhancement will make the WebContentLoader more robust and versatile, allowing it to handle more complex web scraping scenarios and avoid blocks by various web services.
