
[net]: Proxy Request Redundancy #3491

Open · wants to merge 9 commits into master
## Conversation

@czaky commented May 17, 2024

### Context

Anecdotally, using SearXNG over unreliable proxies, like Tor, seems to be quite error prone. SearXNG puts considerable effort into measuring the performance and reliability of engines, most likely owing to those aspects being of significant concern.

### What does this PR do?

The patch here proposes to mitigate the related problems by issuing concurrent, redundant requests through all of the specified proxies at once and returning the first response that is not an error.

### Why is this change important?

It enables use of SearXNG through Tor proxies with the least latency possible, while greatly enhancing user privacy.

### How to test this PR locally?

The functionality is enabled using the `proxy_request_redundancy` parameter within the outgoing network settings or the engine settings.

Example:

```yaml
outgoing:
    request_timeout: 8.0
    proxies:
        "all://":
            - socks5h://tor:9050
            - socks5h://tor1:9050
            - socks5h://tor2:9050
            - socks5h://tor3:9050
    proxy_request_redundancy: 4
```

In this example, each network request will be sent 4 times, once through every proxy. The first (non-error) response wins.
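The first-response-wins behavior described above can be sketched with plain `asyncio`. This is an illustrative stand-in, not the PR's actual code: `fetch_via`, the proxy latencies, and the status codes are all made up to simulate a fast healthy exit node racing a slow one and a banned one.

```python
import asyncio

async def fetch_via(proxy: str) -> tuple[str, int]:
    # Simulate one proxied request; a real version would use an HTTP client.
    delay, status = {
        "socks5h://tor:9050": (0.3, 504),   # slow exit node, gateway timeout
        "socks5h://tor1:9050": (0.1, 200),  # fast and healthy
        "socks5h://tor2:9050": (0.2, 403),  # exit node banned by the engine
    }[proxy]
    await asyncio.sleep(delay)
    return proxy, status

async def first_good_response(proxies):
    # Fire the same request through every proxy at once.
    pending = {asyncio.create_task(fetch_via(p)) for p in proxies}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                proxy, status = task.result()
                if status < 400:  # first non-error response wins
                    return proxy, status
        return None  # every proxy returned an error
    finally:
        for task in pending:  # abandon the slower in-flight requests
            task.cancel()

result = asyncio.run(first_good_response(
    ["socks5h://tor:9050", "socks5h://tor1:9050", "socks5h://tor2:9050"]
))
print(result)  # ('socks5h://tor1:9050', 200)
```

The overall latency becomes that of the fastest healthy proxy, rather than the timeout of the slowest one.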

### Results

In my testing environment using several Tor proxy endpoints, this approach almost entirely removes engine errors related to timeouts and denied requests. The latency of the network system also improves.

### Implementation

The implementation uses an `AsyncParallelTransport(httpx.AsyncBaseTransport)` wrapper around multiple sub-transports, and `asyncio.wait` to wait for the first completed request.

The existing implementation of network proxy cycling has also been moved into the `AsyncParallelTransport` class, which should improve network client memoization and performance.
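The cycling-plus-redundancy idea can be sketched as follows. `ParallelTransport` and `pick` are hypothetical names for illustration only; the real class wraps httpx transports rather than bare proxy URLs.

```python
import itertools

class ParallelTransport:
    """Toy model: draw `redundancy` proxies per request from a rotating cycle."""

    def __init__(self, proxies, redundancy=1):
        assert 1 <= redundancy <= len(proxies)
        self._cycle = itertools.cycle(proxies)
        self._redundancy = redundancy

    def pick(self):
        # Each request consumes `redundancy` entries from the cycle, so
        # redundancy=1 degenerates to plain round-robin proxy cycling.
        return [next(self._cycle) for _ in range(self._redundancy)]

t = ParallelTransport(["p0", "p1", "p2", "p3"], redundancy=2)
print(t.pick())  # ['p0', 'p1']
print(t.pick())  # ['p2', 'p3']
```

With `redundancy=1` this reproduces the previous round-robin behavior, which is why the cycling logic can live inside the same class.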

### Testing

  • Unit tests for the new functions and classes.
  • Tested on a desktop PC with 10+ upstream proxies and comparable request redundancy.

@czaky czaky changed the title Proxy Request Redundancy [net]: Proxy Request Redundancy May 17, 2024
@return42 return42 requested a review from dalf May 17, 2024 06:01
@unixfox (Member) commented May 17, 2024

That's a great way to get all the proxies banned by the engine at the same time, instead of having one banned and then using the others for the request.

For me, I don't think SearXNG should be the tool for checking whether multiple proxies are working. That's the job of an external tool. When you configure proxies, you should take care to configure good ones.

@czaky (Author) commented May 17, 2024

> That's a great way to get all the proxies banned at the same time by the engine, instead of having one being banned then using other ones for the request.
>
> For me, I don't think searxng should be the tool for checking if multiple proxies are working. It's the job of an external tool. When you configure proxies you should be aware to configure good proxies.

Those are good points, and I agree. If you are able to configure good proxies, the optional request redundancy proposed here should stay disabled.

When using Tor, there is no way to verify whether the current exit node is banned by a specific engine. The exit nodes cycle every 10 minutes, and the client has no way to influence the choice.

I have tried several strategies and tools to check whether proxies are working and to choose good proxies at runtime. All of those strategies are inferior to this approach, where the decision is made at the application level, i.e. using the HTTP status codes from each response.

czaky added 2 commits May 17, 2024:
- to verify why this fails remotely
- Fixed race condition with the 404 test.
@unixfox (Member) commented May 17, 2024

I think what you are looking for is the `retry_on_http_error` parameter: https://docs.searxng.org/admin/settings/settings_engine.html

If there is an error, this parameter will retry with another proxy. You can specify the number of retries with the `retries` parameter: https://docs.searxng.org/admin/settings/settings_outgoing.html

@czaky (Author) commented May 17, 2024

> I think what you are looking for is the parameter retry_on_http_error: https://docs.searxng.org/admin/settings/settings_engine.html
>
> If there is an error, this parameter will retry with another proxy. You can specify the number of retries with the parameter retries: https://docs.searxng.org/admin/settings/settings_outgoing.html

Thank you so much for this advice.

Correct me if I am wrong:

The `retries` param allows retrying a request on any error, including timeouts and random network-stack exceptions. The `retry_on_http_error` engine param instructs the network code to treat 400-599 HTTP errors like the other exceptions.
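That understanding of the two parameters can be sketched as a simple sequential retry loop. This is an illustration of the semantics described above, not SearXNG's actual code; `send_with_retries` and `fake_send` are made-up names.

```python
def send_with_retries(send, retries=2, retry_on_http_error=True):
    """Sequentially attempt `send` up to retries + 1 times."""
    last = None
    for attempt in range(retries + 1):
        try:
            status = send(attempt)
        except TimeoutError:
            last = "timeout"
            continue  # network-level errors are always retried
        if retry_on_http_error and 400 <= status <= 599:
            last = status
            continue  # HTTP error treated like the other exceptions
        return status
    return last  # all attempts exhausted

# First attempt times out, second returns 502, third succeeds:
responses = iter([TimeoutError(), 502, 200])
def fake_send(_attempt):
    r = next(responses)
    if isinstance(r, Exception):
        raise r
    return r

print(send_with_retries(fake_send))  # 200
```

The key point for the latency argument below: because each attempt runs only after the previous one fails, the worst case is roughly `(retries + 1) * request_timeout`, whereas parallel redundant requests pay only the latency of the fastest healthy proxy.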

The reason the `retries` solution suggested above is not satisfactory is that Tor network requests take a long time, with tail latencies in the tens of seconds.

After adding the `retries` option, we are looking at 10 to 30 second response times just to learn that more than half of the engines timed out or had some other issue. On top of that, most of the engines get suspended quickly, degrading the experience.

On the other hand, the solution presented here delivers the full result in less than 5 seconds, with no engines being suspended.

Tested it with:

  • standard Tor proxies;

  • multiple Tor proxies behind a HAProxy round-robin, with an app-level connectivity check;

  • multiple Tor proxies compiled to minimum circuit length;

  • the SearXNG (vanilla) Docker version; and

  • a version built from branch/HEAD (patched; A/B tested via the `proxy_request_redundancy` param).

```yaml
# ...
outgoing:
  request_timeout: 10.0
  proxies:
    "all://":
      # - socks5h://192.168.0.50:9050
      # - socks5h://192.168.0.51:9050
      - socks5h://tor:9050
      - socks5h://tor1:9050
      - socks5h://tor2:9050
      - socks5h://tor3:9050
      - socks5h://tor4:9050
      - socks5h://tor5:9050
      - socks5h://tor6:9050
      - socks5h://tor7:9050
      - socks5h://tor8:9050
      - socks5h://tor9:9050
  retries: 2
  proxy_request_redundancy: 1  # Or higher for parallel execution

# ...
engines:
  # `retry_on_http_error` set on all the other engines too.
  - name: google
    engine: google
    shortcut: go
    retry_on_http_error: True

# ...
```

Or to speak in pictures...

the difference from (on a good run):
[screenshot]

to (every time):
[screenshot]

(on a bad run using the standard code it is just):
[screenshot]

@unixfox (Member) commented May 18, 2024

You are correct about the description of the parameters, but you need to extend `request_timeout` per engine section.

In the outgoing section, `request_timeout` is the default timeout unless the engine overrides it. It's actually written here:

request_timeout: 3.0

And about `retry_on_http_error`: you may also use it globally, in the outgoing section.

If you go to /preferences, you will see the timeout configured per engine ("Max time") in the engines section.
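To illustrate the per-engine override described above (a sketch with made-up values; SearXNG's engine settings accept a `timeout` key that overrides the global default):

```yaml
outgoing:
  request_timeout: 3.0   # default for all engines

engines:
  - name: google
    engine: google
    shortcut: go
    timeout: 10.0        # per-engine override, shown as "Max time" in /preferences
```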

@czaky (Author) commented May 18, 2024

> You are correct about the description of the parameters but
>
> You need to extend "request_timeout" per engine section.
>
> In outgoing section, "request_timeout" is the default timeout unless the engine overrides it. It's actually written here:
>
> request_timeout: 3.0
>
> And about "retry_on_http_error" you may also use it globally, in the outgoing section.
>
> If you go in the /preferences you see in the engines section the timeout configured per engine. "Max time"

Thank you.

As per my previous comment, `request_timeout` was already extended to 10 s (or 20, or 30). In the parallel approach, I only need to extend it to 5 s (except for the "wikidata" engine init step).

I have also changed `max_request_timeout`, which acts as the upper bound on the request across all engines, and other params, and also changed the `extra_proxy_timeout` param, which seems to only affect engines with an .onion endpoint.

I very much appreciate the ongoing support and all the advice and questions around the configuration options.

Please let me know how I can better highlight the issues with SearXNG over Tor, and especially the nature of the Tor proxy network. I would like to help shape our discussion here to be more effective and more to the point.

@czaky (Author) commented Jun 6, 2024

Hello again.

If this approach is not considered technically valid in a wider context, would you consider starting a design discussion?

BTW: I was really impressed by the project's prime directives and the proposition:
"Don't hesitate, just clone SearXNG's sources and start hacking right now ... we are happy to receive your pull request."

@return42 (Member) commented Jun 7, 2024

> BTW: I was really impressed about the project prime directives and the proposition:
> "Don't hesitate, just clone SearXNG's sources and start hacking right now ... we are happy to receive your pull request."

We are very thankful for your contribution, that's not the point.

> would you consider to start a design discussion?

Yeah, that's the point ... we are currently working on:

and to be honest, I'm not as deep into the subject as @dalf and @unixfox ..

@czaky mentioned this pull request Jun 7, 2024