Promote Services shell out of flaky. Demote Elasticsearch #10989
Conversation
GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.
These are unrelated e2e test failures:
I will look into it independently; for now I'm kicking the bot to rerun the tests.
@k8s-bot please test
@k8s-bot test this please
GCE e2e build/test failed for commit 71bcfac64ce9329d1af1a4a0b264066840311840.
I don't think the Elasticsearch test should be demoted. It is doing its job. Demote the test that makes nodes unhealthy -- a situation which is then detected and flagged by Elasticsearch.
One of the failures I saw yesterday was due to a node that I think was made unhealthy by a previous test. The Elasticsearch test makes sure there are at least two healthy nodes before proceeding and declares failure if there are not. I reported this to @quinton-hoole, and to me it seems wrong to demote this test when its environment was probably sabotaged by another test.

The last failure I saw was due to what I think is an ongoing problem with our system: every so often, a request made via a service cannot be mapped to a pod to serve it -- even when pods were detected and ready earlier in the test (as confirmed by the log below). I think this is exposing a bug -- a bug our customers will face in a similar situation. Rather than demoting the test, let us work out whether this is a genuine issue, and if so file a bug and fix it. If instead this is expected behaviour (which in my opinion would be very undesirable for our customers), then let us document that clearly, state how such failures should be handled, and implement that process here. Let's not demote this test, because -- as always -- it's pointing out things we need to think about.
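To make the precondition described above concrete, here is a minimal sketch (illustrative only, not the actual e2e test code): a helper that counts `Ready` nodes from `kubectl get nodes`-style output on stdin and refuses to proceed unless enough are healthy. The function name and output format are invented for this example.

```shell
# Hypothetical precondition check. Pipe `kubectl get nodes` output into it;
# the second whitespace-separated column is assumed to be the STATUS column.
require_healthy_nodes() {
  min="$1"
  # Skip the header row; count lines whose STATUS is exactly "Ready".
  ready=$(awk 'NR > 1 && $2 == "Ready"' | wc -l)
  if [ "$ready" -lt "$min" ]; then
    echo "only $ready healthy node(s), need $min" >&2
    return 1
  fi
  echo "$ready healthy nodes"
}
```

Usage would look like `kubectl get nodes | require_healthy_nodes 2`, declaring failure up front rather than letting the test run against a cluster another test has left unhealthy.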
I'm not sure I understood correctly, so please correct me if I misunderstood something.
As I said before, this connection error occurs after there have already been successful GET requests earlier in the test. The cluster health check via Elasticsearch is done after other successful requests.
I did find a bad use of
I will dig into this further to make sure I am not going mad.
@wojtek-t this is the first attempt to speak to the application endpoint. It happens after the pods are determined to be ready and is retried to give enough time for the application inside the pod to come to life. Surely this is enough? Or am I wrong?
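The retry described above -- polling the application endpoint after the pods report ready, to give the app inside the pod time to come to life -- can be sketched roughly like this. This is an illustrative stand-in, not the actual e2e code; the function name, the use of `curl`, and the retry/delay defaults are all assumptions.

```shell
# Hypothetical retry loop: poll an endpoint until it answers or we give up.
wait_for_endpoint() {
  endpoint="$1"
  retries="${2:-30}"   # number of attempts (illustrative default)
  delay="${3:-2}"      # seconds between attempts (illustrative default)
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -s: silent, -f: treat HTTP errors / unreachable targets as failure
    if curl -sf "$endpoint" > /dev/null; then
      echo "endpoint is up"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "endpoint never became ready" >&2
  return 1
}
```

The question in the comment is essentially whether this kind of bounded retry window is long enough, or whether the service-to-pod mapping can still be missing even after the endpoint has answered once.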
OK - I changed the PR so that it doesn't demote the ElasticSearch test - we should investigate it independently (on the other hand, I haven't seen the failure since #11001 was merged). PTAL
LGTM |
Thank you @wojtek-t -- I am still trying to get to the bottom of this -- although my vacation to Europe in the next two weeks might get in the way.
GCE e2e build/test passed for commit dc711ee. |
Hi @wojtek-t This PR seems to have completely broken e2e-gce-parallel. Could you add "Services.*shell" to GCE_PARALLEL_FLAKY_TEST_REGEX until it's parallel friendly?
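For context, the requested change amounts to appending a pattern to the flaky-test regex used by the parallel GCE suite. The variable name comes from the comment above; the existing value shown here is a made-up placeholder, and the append idiom is just one way to do it.

```shell
# Placeholder for whatever patterns are already in the flaky list (illustrative).
GCE_PARALLEL_FLAKY_TEST_REGEX="Elasticsearch"

# Append the Services shell test so the parallel suite treats it as flaky:
GCE_PARALLEL_FLAKY_TEST_REGEX="${GCE_PARALLEL_FLAKY_TEST_REGEX}|Services.*shell"

echo "$GCE_PARALLEL_FLAKY_TEST_REGEX"
```

Any test whose name matches the alternation (e.g. one containing "Services ... shell") would then be routed to the flaky bucket instead of running in the parallel job.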
Elasticsearch is unfortunately still flaky
The shell services test was fixed by #10978
cc @davidopp @satnam6502 @quinton-hoole