Unexpectedly High CPU Usage during Load test #2490
It seems like this only happens a few minutes after the load test starts? What happens if you run the load test longer, say twice as long? Does CPU usage go down again, or only when the load test stops? Did you run the same test against v10? We did some work on connection handling, especially when running out of file descriptors and other resources... maybe the backlog of unserved requests just grows huge after a certain amount of time, because the load generator creates more requests than are actually handled?
Also: where is the throughput in requests measured? At the load-generating side? Are you sure the load is actually balanced between the two pods properly?
First of all, thank you for the quick response.
Not always at the start, but it almost always happens during the load test: more often at the start, and sometimes a few minutes into the test.
The CPU spike only lasts for 3-4 minutes; then the HPA scales up replicas based on CPU, new pods come up, and CPU usage returns to normal.
No, we have not. Will try that as well.
The QPS screenshot I attached is measured on the PostgREST pod(s), so we are 100% sure that it's not a load-balancing issue. @steve-chavez Regarding the perf fixes in v9.0.1: we have already disabled the parallel & idle GC by setting the corresponding env var for PostgREST, roughly as sketched below.
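For reference, this is the kind of setting we applied (a sketch only; it assumes the PostgREST binary honours GHC's GHCRTS environment variable, where -qg disables the parallel GC and -I0 disables the idle-time GC):

```yaml
# Illustrative container env snippet, not our exact manifest:
env:
  # GHC runtime flags: -qg turns off the parallel GC, -I0 turns off the idle-time GC.
  # Only takes effect if the binary was built with RTS options enabled.
  - name: GHCRTS
    value: "-qg -I0"
```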
Any suggestions here?
@bhupixb Try the latest pre-release and set the new timeout.
@bhupixb Did the new timeout help? Do you have new results you can share?
Hi @steve-chavez, no, unfortunately.
Test 1: I tested this at 500 RPS and increased the RPS to 600 to cause 504s. As soon as the first 504 was received (at 20:00, see the bottom-right panel, 5XX QPS by service), I decreased the RPS from 600 to ~300 (see the QPS panel) at 20:00. The CPU spiked suddenly and remained high for some time even though the RPS had been reduced.
Test 2: Didn't see a single 504 Gateway Timeout. The resource usage image is attached below:
Update: We found a workaround by giving each pod a sufficiently large DB connection pool (and setting the idle connection timeout to 5 min) so that PostgREST doesn't run out of DB connections when the load is spiky; after that, this error didn't occur. A rough sketch of the settings is below.
My theory: one thing we noticed that consistently happens is that CPU usage goes unexpectedly high when there are not enough DB connections available to serve the current requests. I am not sure if it's a bug on the HTTP server side that is not handling the queued requests correctly, or if this is expected behaviour.
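Roughly what we applied, expressed as container env vars (the exact option names depend on the PostgREST version; PGRST_DB_POOL and PGRST_DB_POOL_MAX_IDLETIME are the spellings in recent releases, so verify them against the docs for the version in use):

```yaml
# Illustrative values, not a sizing recommendation:
env:
  # Pool large enough that a traffic spike doesn't exhaust available DB connections.
  - name: PGRST_DB_POOL
    value: "100"
  # Close connections that stay idle for more than 5 minutes so the pool shrinks back.
  - name: PGRST_DB_POOL_MAX_IDLETIME
    value: "300"
```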
@bhupixb Sorry for the late reply, thanks for the detailed report.
I believe this happens because we're still doing work (parsing the querystring, searching the PostgREST schema cache, etc.) when it doesn't make sense to do so, because we don't have available connections in the pool. One option to avoid the unnecessary work could be that once a number of 504s happen, we get into a temporary state where we quickly reject new requests; not sure if that's the correct way to handle this. pgbouncer has server_login_retry, which looks related in intention.
Having similar issues with the postgrest Docker latest image. I am running a Gatling load test on Kubuntu 22.0, Intel i7, 16 GB RAM. After ~1 minute of the load test, I get a "too many connections" error.
Maybe try setting https://postgrest.org/en/stable/references/configuration.html#jwt-cache-max-lifetime to 3600 and see if it reduces CPU usage.
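In env-var form (assuming the usual PGRST_ prefix mapping for config options), that would be something like:

```yaml
# Illustrative: cache validated JWTs for up to one hour instead of re-validating on every request.
env:
  - name: PGRST_JWT_CACHE_MAX_LIFETIME
    value: "3600"
```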
That should only happen if you set
Environment
Description of issue
We are doing some load tests on PostgREST. PostgREST is running on Kubernetes pods with 2 replicas and connects to an AWS-managed PostgreSQL RDS instance.
During the load testing, we are seeing that of the 2 pods running, one pod's CPU spikes very high. In the pic below, CPU usage of one of the pods went to almost 6 cores.
Initially, we suspected that one pod might be serving more requests than the other, but from the metrics we saw that both pods are serving almost the same number of requests per second (QPS):
This happens every time we do the load test, even if we scale to more than 2 pods, say 5. We then see similar behaviour with 1-2 pods while the rest keep using normal CPU.
Expected Behaviour
Since both pods are serving the same number of requests per second, the CPU usage difference should not be this huge.
Actual Behaviour
CPU usage for a few pods is quite high compared to others, despite them serving almost the same amount of traffic.
Pod config for postgrest
Postgrest Configuration:
Sample Deployment file:
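The full manifest isn't reproduced here; a minimal sketch of the kind of Deployment in use (image tag, resource numbers, and secret names are illustrative assumptions, not our exact values):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgrest
spec:
  replicas: 2
  selector:
    matchLabels:
      app: postgrest
  template:
    metadata:
      labels:
        app: postgrest
    spec:
      containers:
        - name: postgrest
          image: postgrest/postgrest:v9.0.1   # illustrative tag
          ports:
            - containerPort: 3000
          env:
            - name: PGRST_DB_URI
              valueFrom:
                secretKeyRef:
                  name: postgrest-db          # hypothetical secret
                  key: db-uri
            - name: PGRST_DB_SCHEMAS
              value: "public"
            - name: PGRST_DB_ANON_ROLE
              value: "web_anon"
            - name: PGRST_DB_POOL
              value: "20"
          resources:
            requests:
              cpu: "1"
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 1Gi
```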
Any suggestions from the community on what can be checked to find the root cause would be appreciated. Please let me know if I need to share any other info.