Parallel catchup V2 #178
Conversation
Nice work! Just some initial comments. I still need to learn about the new tools being used in this PR.
I did a round of addressing comments, and will continue addressing the rest of them tomorrow. Meanwhile I've started another production workload run; we'll see if the recent changes resulted in further timing improvements.
A few nits noted in comments above, nothing critical.
Looking at the finished product .. I also just remain a little hesitant about this whole thing: the configuration layering is such that you've now got at least 3 layers of nested untyped and not-especially-debuggable languages (the Helm templating system, then YAML, then the final bunch of Bourne shell / Python scripts and sed / awk / redis command / whatever strings).
This is kinda what we were dealing with in the pre-supercluster days (back then it was ruby, some other templating system, and dockerfiles) and one of the motives for consolidating all the logic in supercluster's F# code: there are real types, real variables your IDE and compiler can find and complain about mismatches between, you're not as likely to have to keep re-running the whole job just to find typos and nested-quoting mistakes.
I guess if the F# and API-driven approach is sufficiently alien-seeming, relative to the templates-of-YAML-driven style of Kubernetes configuration that's standard in ops environments, then .. I won't stop you. But I guess I just want to double check with you, @jacekn, @sisuresh, @marta-lokhova and anyone else who's been hacking supercluster for a while: we're all ok continuing on this path, having spent some time on it and looking critically at the result?
(This is not a total objection, just a slight hesitation; in general I think you did great work here and it's exactly what you said you were going to build and I agreed was an experiment worth conducting! I am just wondering how you and others feel about the result. I'm a bit on the fence. If it seems it might be preferable to several folks?)
I think what Jay has here works well for jobs like parallel catchup where we had very little client-side logic (that client-side logic was mainly just orchestration too). I think F# should still be used for all of our other missions for ease of development and debugging, but if this V2 pubnet parallel catchup improves the reliability of the existing mission, then I think the tradeoff is worth it.
I've pushed a few changes that should address all the outstanding comments.
There were a couple of experiments:
@graydon to address your comment: in my opinion the benefits of the new approach (I listed some of them on top) outweigh the cons you mentioned. The parallel catchup mission is a fairly straightforward and static setup, and the Helm/YAML approach makes the structure clear and more visual. All of the container logic involving shell scripts is fairly small and manageable (including the worker logic). The extra layer of configuration can appear a little intimidating at first, but some of this is because we wanted to have a single point of entry for all supercluster missions. I'd also like to point out that there is an extra layer which is the Jenkins job, so the overall pipeline is:

For other more intricate setups (such as experimenting with topology), F# programmability still offers tremendous benefits and likely can't be replaced.
# Description

`redis-cli` is needed for the container to communicate with the Redis host. The new parallel catchup mission stellar/supercluster#178 requires access to the Redis task queue. This is only added as a test dependency, i.e. package version with `*~buildtests`, and does not affect production stellar-core packages.

# Checklist

- [ ] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document
- [ ] Rebased on top of master (no merge commits)
- [ ] Ran `clang-format` v8.0.0 (via `make format` or the Visual Studio extension)
- [ ] Compiles
- [ ] Ran all tests
- [ ] If change impacts performance, include supporting evidence per the [performance document](https://github.com/stellar/stellar-core/blob/master/performance-eval/performance-eval.md)
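As a rough, hypothetical illustration of why the containers need `redis-cli` (the actual workers in the Helm chart are shell scripts; the host name, queue key, and range format below are assumptions, not the real configuration), a worker could pull one ledger-range task off the Redis queue like this:

```python
# Hypothetical sketch, not the actual worker logic: pop one ledger-range task
# off the Redis queue via redis-cli. Host name, queue key, and the
# "start/count" range format are assumptions for illustration only.
import subprocess

REDIS_HOST = "redis-queue"    # assumed in-cluster Redis service name
QUEUE_KEY = "ledger-ranges"   # assumed Redis list holding catchup ranges


def pop_next_range():
    """Return the next queued range, or None once the queue is drained."""
    out = subprocess.run(
        ["redis-cli", "-h", REDIS_HOST, "LPOP", QUEUE_KEY],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out or None


if __name__ == "__main__":
    rng = pop_next_range()
    if rng is not None:
        # A real worker would hand this range to `stellar-core catchup`.
        print(f"next catchup range: {rng}")
```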
What
Resolves https://github.com/stellar/stellar-core-internal/issues/264, by adding:
- `./src/MissionParallelCatchup/parallel_catchup_helm` - a Helm project that encapsulates the core parallel catchup logic -- job queues and core workers -- inside Kubernetes
- `./src/MissionParallelCatchup/job_monitor.py` - a (containerized) application monitoring the component statuses and exposing them via an HTTP endpoint `/status`
- `MissionHistoryPubnetParallelCatchupV2` - a new mission which manages the Helm project and monitors job status
- Upgrades the .NET version to 8.0, and brings `stellar-dotnet-sdk` and `KubernetesClient` to the latest versions

To try the new mission, run this command on your local machine (you need to first have access to a Kubernetes cluster, e.g. via sshuttle to the ssc):

```
$ dotnet run mission HistoryPubnetParallelCatchupV2 --image=docker-registry.services.stellar-ops.com/dev/stellar-core:latest --pubnet-parallel-catchup-num-workers=6 --pubnet-parallel-catchup-starting-ledger=52500000
```
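As a rough illustration of the job monitor's role (a minimal sketch, not the actual `job_monitor.py`; the JSON field names and the way counts are gathered are assumptions), a containerized monitor exposing an aggregate `/status` endpoint could look like this:

```python
# Minimal sketch of a /status HTTP endpoint; the real job_monitor.py gathers
# its counts from the Redis task queue and may expose different fields.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def gather_status():
    # Placeholder: a real monitor would query the Redis queue for remaining,
    # in-progress, and failed ledger ranges here.
    return {"jobs_remaining": 0, "jobs_in_progress": 0, "jobs_failed": 0}


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        body = json.dumps(gather_status()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()
```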
Why
The current parallel catchup mission frequently suffers from instability (e.g. see https://github.com/stellar/stellar-supercluster/issues/280 and #150). The main reason is that we manage the lifecycle of pods, manage the job queue, and perform status monitoring all inside the mission itself (outside the Kubernetes cluster), by constantly querying the Kubernetes API. There are many potential points of failure (e.g. a pod fails, a pod cannot be allocated, the K8s API times out), each of which results in a total mission failure.
By moving all of the core logic inside Kubernetes (leveraging K8s's rich resource and workload management capabilities), this design is more resilient to any single point of failure. It also eliminates virtually all Kubernetes API queries by moving status checking to the job monitor's HTTP endpoint (a single HTTP (not API) request every 30 seconds), thus eliminating a main source of unreliability.
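For illustration only, the mission-side check could be as simple as the following polling loop (the actual mission is written in F#; the service name, port, field names, and 30-second interval below are assumptions):

```python
# Illustrative polling loop against the job monitor's /status endpoint.
# URL, field names, and interval are assumptions, not the real mission code.
import json
import time
from urllib.request import urlopen

STATUS_URL = "http://job-monitor:8080/status"  # assumed in-cluster service name


def wait_for_completion(poll_interval_secs=30):
    while True:
        with urlopen(STATUS_URL) as resp:
            status = json.load(resp)
        if status.get("jobs_failed", 0) > 0:
            raise RuntimeError(f"catchup worker failed: {status}")
        if status.get("jobs_remaining", 0) == 0 and status.get("jobs_in_progress", 0) == 0:
            return  # all ledger ranges have caught up
        time.sleep(poll_interval_secs)
```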
Besides these, there are a few other improvements/benefits:

- A single, configurable (via `values.yaml`) set of yaml files defining the Kubernetes objects -- better readability and maintainability

Check out the Design doc for more details.