implement FlareSolverr to solve *some* CF captchas #1619

Open
hoopyfrood opened this issue Jun 7, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@hoopyfrood

Is your feature request related to a problem? Please describe.
One of the sites I am scraping suddenly moved behind Cloudflare, and I keep getting presented with a CAPTCHA challenge on changedetection, but not on e.g. Jackett, which does have FlareSolverr implemented.

Describe the solution you'd like
Please bake in support for FlareSolverr and add the ability to specify the FlareSolverr API URL in the [protocol:-http]://[fqdn or ip:-localhost]:[port:-8191] format.
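
A minimal sketch of how such a setting could be resolved with the defaults given above. The environment variable names (`FLARESOLVERR_PROTOCOL`, `FLARESOLVERR_HOST`, `FLARESOLVERR_PORT`) and the function name are hypothetical and are not existing changedetection.io options.

```python
import os


def flaresolverr_api_url(env=None):
    """Build the FlareSolverr base URL, falling back to the defaults above.

    The variable names are hypothetical, chosen only for this sketch.
    An unset or empty value falls back to the default, like ${VAR:-default}.
    """
    env = os.environ if env is None else env
    protocol = env.get("FLARESOLVERR_PROTOCOL") or "http"
    host = env.get("FLARESOLVERR_HOST") or "localhost"
    port = env.get("FLARESOLVERR_PORT") or "8191"
    return f"{protocol}://{host}:{port}"
```

With nothing configured this gives the default local address; setting only the host, for example, would keep the default protocol and port.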

Thank you!

@hoopyfrood hoopyfrood added the enhancement New feature or request label Jun 7, 2023
@plutocrat

I have been rejected by Cloudflare and Cloudfront a lot in the last week or two. It seems their browser fingerprinting has caught up with Playwright / Chrome Docker containers. I would appreciate some solution, as the number of sites I can actually access has dropped by about 20%.

@restyler

It's important to understand that Cloudflare fingerprinting for API endpoints is usually limited to non-interactive methods, for example TLS fingerprinting. So if you manage to find an API-like endpoint in the Chrome network tab, you should try fetching that endpoint directly. Another mitigation is to try higher-quality residential proxies.

@plutocrat

Interestingly, just came across https://github.com/lwthiker/curl-impersonate
This seems to bypass Cloudfront blocking, if used with a correct User-Agent

@natecovington

I wonder if you update the HOSTS file on the machine that's running Change Detection, will it go directly to the IP of the host server and bypass CloudFlare...?

@plutocrat

Interesting idea. However, that would depend on you knowing the web server's direct IP address, and half the reason people hide behind Cloudflare or Cloudfront is to obscure that address and to take advantage of the DDoS protection afforded by those platforms.

@wpigoury

For what it's worth, I managed to use FlareSolverr to fetch a page protected by Cloudflare.
It's not very straightforward: it relies on jq's regex filtering, which makes extracting data from HTML a bit tricky.
I don't know if this will work in every case, but it should help in most simple ones.

First of all, I'm on a Synology NAS and have both FlareSolverr and changedetection.io installed as docker containers.

You have to set up FlareSolverr and make sure you can reach it from changedetection.io.
To ease the configuration, my FlareSolverr container has a 'hostname: flaresolverr' setting and runs on port 8090.
If both containers are on the same network and set up in the same Docker instance, that should work.

On changedetection.io, here is the configuration for the page you want to fetch:

  1. In 'Request' tab

Fetch Method > Basic fast Plaintext/HTTP Client (doesn't seem to work with Playwright, not sure why)
Proxy > No proxy (might work with a proxy, I couldn't test as I don't use one)
Click on Show advanced options:
Request method > POST
Request body >

{
  "cmd": "request.get",
  "url":"URL TO BE FETCHED",
  "maxTimeout": 60000
}

Request header >
Content-Type: application/json

  2. In 'Filters & Triggers' tab

CSS/JSONPath/JQ/XPath Filters >
jq:.solution.response | capture("REGEX TO EXTRACT CONTENT FROM HTML">(?<name>[^<]*)</b>"; "gm")

FlareSolverr returns JSON with the HTML in the 'solution.response' attribute. I then added the 'capture' filter, which in this case lists the items I want to fetch from the page based on a regex.
Here it really depends on your page and on what you want to extract.
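
To check the request and the regex outside changedetection.io, the same exchange can be sketched in Python. This is only an illustration under a few assumptions: the base address reuses the 'flaresolverr' hostname and port 8090 from my setup, `/v1` is the FlareSolverr API path, and the function names and the sample pattern are made up for this example.

```python
import json
import re
import urllib.request

# Assumption: hostname and port from the setup described above; /v1 is the FlareSolverr API path.
FLARESOLVERR_URL = "http://flaresolverr:8090/v1"


def build_payload(target_url, max_timeout_ms=60000):
    """Return the JSON request body for a FlareSolverr 'request.get' command."""
    body = {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}
    return json.dumps(body).encode("utf-8")


def extract_names(flaresolverr_json, pattern):
    """Apply a regex with a named group 'name' to the HTML in solution.response."""
    html = flaresolverr_json["solution"]["response"]
    return [match.group("name") for match in re.finditer(pattern, html)]


def fetch_via_flaresolverr(target_url):
    """POST the command to FlareSolverr and return the decoded JSON reply."""
    request = urllib.request.Request(
        FLARESOLVERR_URL,
        data=build_payload(target_url),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Client timeout is set a little above maxTimeout so FlareSolverr can answer first.
    with urllib.request.urlopen(request, timeout=70) as response:
        return json.load(response)
```

Calling `extract_names(fetch_via_flaresolverr("URL TO BE FETCHED"), r"<b>(?P<name>[^<]*)</b>")` would play the role of the jq 'capture' filter. Note that Python writes named groups as `(?P<name>...)`, while jq uses `(?<name>...)`.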

@dgtlmoon
Owner

@wpigoury thanks for the info! I was able to get some responses using FlareSolverr, super interesting project, looks like undetected-chromedriver is actually binary patched!

Trying to think of a workflow here

  • Site gets 403, goes into flare-solverr mode
  • on next request, it asks flare-solverr for the headers (cookies etc)
  • those cookies are stored in some kind of database so any other watch for the same domain name could re-use those credentials (would have to add some extra API perhaps to flare-solverr with a mini-db, store in-memory with python dict or something)
  • add those cookies/headers to the next request of the watch

Should the site get 403 again, then I think it can just repeat the above steps...
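
The steps above could be sketched roughly like this. It is not changedetection.io code: `CookieCache`, `fetch_with_fallback`, and the `fetch` and `solve` callables are hypothetical names, and the cache is just an in-memory dict keyed by hostname.

```python
from urllib.parse import urlsplit


class CookieCache:
    """In-memory, per-hostname store for cookies obtained from a solver."""

    def __init__(self):
        self._by_host = {}

    def get(self, url):
        # Return a copy so callers cannot mutate the cached cookies.
        return dict(self._by_host.get(urlsplit(url).hostname, {}))

    def put(self, url, cookies):
        self._by_host[urlsplit(url).hostname] = dict(cookies)


def fetch_with_fallback(url, fetch, solve, cache):
    """Fetch a URL, asking the solver for fresh cookies only after a 403.

    fetch(url, cookies) -> (status, body) and solve(url) -> cookies dict
    are hypothetical callables supplied by the caller.
    """
    status, body = fetch(url, cache.get(url))
    if status == 403:
        # Blocked: get new cookies, share them with other watches on this host, retry once.
        cache.put(url, solve(url))
        status, body = fetch(url, cache.get(url))
    return status, body
```

Any other watch on the same hostname would pick up the stored cookies on its first request, so the solver is only called again when a 403 comes back.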

@dgtlmoon
Owner

dgtlmoon commented Jan 31, 2024

I have to add: on the sites where I hit the Cloudflare block, simply moving to a better residential IP pretty much solved it, and I didn't need FlareSolverr...

@weikinhuang

I'm self-hosting changedetection at home via Docker with a self-hosted browserless Chrome instance, and I'm running into the Cloudflare CAPTCHA for sites like www.bhphotovideo.com when trying to monitor for restocks. There are a few other sites that consistently do the same thing as well.

@dgtlmoon
Owner

OK yeah, I'll add this to my next list of tasks :) Lots of requests coming in.

@mimnix

mimnix commented Dec 21, 2024

Hi guys! First of all, thanks @dgtlmoon for this great project, you rock! I've just made this adapter: https://github.com/mimnix/FlareProxy. You can deploy it alongside FlareSolverr and add it as a proxy in the Changedetection proxy settings, and every time you need to demonstrate you're just a human watching a web page protected by Cloudflare, the transparent proxy is right there for you 😉


8 participants