Weaviate appears to hang if its HTTP and gRPC ports are hit before everything's ready #6656

Open
1 task done
n-tucker opened this issue Dec 17, 2024 · 0 comments
n-tucker commented Dec 17, 2024

How to reproduce this bug?

  • Configure and deploy an equivalent Weaviate Fargate service
  • Add an ALB listener and target group for HTTP/gRPC, changing nothing else about the service

What is the expected behavior?

  • Weaviate should start without issues, just as it does locally using a docker compose file generated via the configurator
  • Weaviate should also log entries indicating that it is listening on ports 8080 and 50051 (in our experience this takes ~3 minutes on Fargate):
    • "grpc server listening at [::]:50051"
    • "Serving weaviate at http://[::]:8080"

What is the actual behavior?

  • Weaviate seems to get stuck at the log line "completed registering modules" and never fulfills requests made to ports 8080 and 50051
  • It never progresses beyond that log line, even if left for much longer than 3 minutes
  • The ALB marks the tasks as unhealthy, making it impossible to deploy the standard Weaviate image behind an ALB
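
To make the two states above easier to tell apart when comparing task logs, here is a minimal sketch of a helper that classifies startup progress from log text on stdin. The marker strings are the log lines quoted in this issue. The function name and the state labels are made up for illustration and are not part of Weaviate.

```shell
#!/bin/sh
# Classify Weaviate startup progress from log text read on stdin.
# Hypothetical helper: the name and the printed labels are illustrative only.
weaviate_startup_state() {
  log="$(cat)"
  if printf '%s\n' "$log" | grep -qF 'Serving weaviate at' &&
     printf '%s\n' "$log" | grep -qF 'grpc server listening at'; then
    # Both listeners have announced themselves (expected behavior).
    echo "serving"
  elif printf '%s\n' "$log" | grep -qF 'completed registering modules'; then
    # Modules are registered but no listener line has appeared yet; if this
    # persists well beyond ~3 minutes it matches the hang described here.
    echo "modules-registered-only"
  else
    echo "starting"
  fi
}
```

For example, `docker logs weaviate 2>&1 | weaviate_startup_state` would print one of the three labels, assuming a local container named `weaviate`.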

Supporting information

We saw a similar issue when attempting to implement the HTTP listener on an ALB: using the HTTP healthcheck endpoint available in Weaviate caused the service to hang in the same way. We worked around this by running health-checker alongside Weaviate on a different port and using that as the healthcheck for the target group:

http_health_check() {
  # listen for health checks on port 8081, execute a dummy command in the
  # background that's always true
  exec /opt/health-checker --log-level warning \
    --listener "0.0.0.0:8081" \
    --port 8081 \
    --script "true"
}

This allowed Weaviate to start up properly, but it isn't ideal for two reasons:

  1. This doesn't check the health of Weaviate
  2. This requires us to build a new image, complicating our tooling

To avoid adding too much to our Weaviate image, we attempted to bypass the same issue with the gRPC port by delaying port forwarding to Weaviate by 5 mins:

grpc_check_with_delay() {
  start_delay=300
  listen_port=50052
  forward_to="localhost:50051"

  echo "Delaying GRPC healthcheck for ${start_delay} seconds to allow Weaviate to start"
  sleep "${start_delay}"

  echo "Delay for ${start_delay} seconds complete, initialising port forwarding"
  exec socat "TCP-LISTEN:${listen_port},fork,reuseaddr" "TCP:${forward_to}"
}

We were optimistic that delaying the forwarding of traffic to Weaviate would work, but unfortunately it didn't, and we're not sure why. We see Weaviate hanging at the same log line even when the ALB is configured on a different port, so it seems like something is still interfering with Weaviate's startup.

The reason we're confident that this interference is the root cause is that the Weaviate helm charts set quite substantial delays on their startup and liveness probes, at 300 and 900 seconds respectively! These are really large values that have been set explicitly, so our understanding is that this is a known issue with Weaviate's startup? Unfortunately, health checks are required on ALBs whose target is a Fargate service, and there's no way to delay when the health checks start 😢 We wanted to ask if folks know of any better way to work around this issue 🙏

TIA, and LMK if you need additional info!

gRPC workaround

EDIT: I applied a workaround for gRPC similar to the one for the HTTP healthcheck, spinning up a gRPC healthcheck on a completely separate port, and this appears to work now! Here is the snippet from the Weaviate Dockerfile:

...
&& git clone -b v1.68.0 --depth 1 https://github.com/grpc/grpc-go \
&& cd grpc-go/examples/helloworld/greeter_server \
&& go build main.go \
&& mv main /opt/grpc-health-checker \
&& chmod +x /opt/grpc-health-checker
...

Simply changing Weaviate to use port 50052 (since the helloworld example in grpc-go uses 50051) and changing the ALB healthcheck to reference the helloworld example is enough to get Weaviate to start up correctly. I'm not clear why the port forwarding implementation doesn't work, so I'm hoping for some clarity here!
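
For anyone trying to reproduce the workaround, a rough sketch of how the container entrypoint could wire the two processes together follows. It is not the exact script we run. It rests on a few assumptions: that Weaviate reads its gRPC port from a `GRPC_PORT` environment variable, that the Weaviate binary lives at `/bin/weaviate`, and that it accepts the `--host`, `--port` and `--scheme` flags seen in generated docker compose files. The function name `start_with_standin` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical entrypoint sketch; the paths and flags below are assumptions.
HEALTH_CHECKER="${HEALTH_CHECKER:-/opt/grpc-health-checker}"  # binary built in the Dockerfile snippet
WEAVIATE_BIN="${WEAVIATE_BIN:-/bin/weaviate}"                 # assumed location of the Weaviate binary

start_with_standin() {
  # Stand-in gRPC server (the grpc-go helloworld example) listens on 50051
  # and is the only thing the ALB health check talks to.
  "$HEALTH_CHECKER" &

  # Assumption: GRPC_PORT moves Weaviate's own gRPC listener off 50051.
  export GRPC_PORT=50052
  exec "$WEAVIATE_BIN" --host 0.0.0.0 --port 8080 --scheme http
}
```

Calling `start_with_standin` as the last line of the entrypoint replaces the shell with Weaviate, so container signals reach it directly. As with the HTTP workaround, the stand-in says nothing about Weaviate's actual health.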

Server Version

1.25.10

Weaviate Setup

Single Node

Nodes count

1

@n-tucker n-tucker added the bug label Dec 17, 2024