
Parallel catchup V2 #178

Merged
27 commits merged into main from mppc-v2 on Jul 30, 2024

Conversation

@jayz22 jayz22 (Contributor) commented Jul 13, 2024

What

Resolves https://github.com/stellar/stellar-core-internal/issues/264, by adding:

  • ./src/MissionParallelCatchup/parallel_catchup_helm - a Helm project that encapsulates the core parallel catchup logic -- job queues and core workers -- inside Kubernetes
  • ./src/MissionParallelCatchup/job_monitor.py - a containerized application that monitors the component statuses and exposes them via an HTTP endpoint /status (see the sketch after this list)
  • MissionHistoryPubnetParallelCatchupV2 - a new mission which manages the Helm project and monitors job status
  • As a pass-by improvement, upgrades the .NET version to 8.0 and brings stellar-dotnet-sdk and KubernetesClient up to their latest versions
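
For illustration, here is a minimal Python sketch of the kind of status monitor described above (the Redis queue names, port, and response fields are assumptions made for the sketch, not necessarily what job_monitor.py actually uses):

# Illustrative sketch only, not the actual job_monitor.py.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
import redis  # third-party "redis" package

r = redis.Redis(host="redis", port=6379, decode_responses=True)

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        # Hypothetical queue names; the real chart may name them differently.
        status = {
            "num_remain": r.llen("ranges-remaining"),
            "jobs_in_progress": r.lrange("ranges-in-progress", 0, -1),
            "jobs_failed": r.lrange("ranges-failed", 0, -1),
        }
        body = json.dumps(status).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), StatusHandler).serve_forever()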

To try the new mission, run this command:
$ dotnet run mission HistoryPubnetParallelCatchupV2 --image=docker-registry.services.stellar-ops.com/dev/stellar-core:latest --pubnet-parallel-catchup-num-workers=6 --pubnet-parallel-catchup-starting-ledger=52500000
on your local machine (you first need access to a Kubernetes cluster, e.g. via sshuttle to the ssc).

Why

The current parallel catchup mission frequently suffers from instability (e.g. see https://github.com/stellar/stellar-supercluster/issues/280 and #150). The main reason is that we manage the lifecycle of pods, manage the job queue, and perform status monitoring all inside the mission itself (outside the Kubernetes cluster, by constantly querying the Kubernetes API). There are many potential points of failure (e.g. a pod fails, a pod cannot be allocated, the K8s API times out), each of which would result in a total mission failure.

By moving all of the core logic inside Kubernetes (leveraging K8s's rich resource and workload management capabilities), this design is more resilient to any single point of failure. It also eliminates virtually all Kubernetes API queries by moving status checking onto the job monitor's HTTP endpoint (a single HTTP, not API, query every 30 seconds), removing a main source of unreliability.
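
For illustration, the mission-side status check then reduces to a polling loop like the following (a Python sketch for brevity; the actual mission is written in F#, and the service name, port, and field handling are assumptions):

# Illustrative only: the real mission is F#; the service URL is hypothetical.
import json
import time
import urllib.request

STATUS_URL = "http://parallel-catchup-job-monitor:8080/status"

while True:
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        status = json.loads(resp.read())
    if status.get("jobs_failed"):
        raise RuntimeError("some catchup ranges failed: %s" % status["jobs_failed"])
    if status["num_remain"] == 0 and not status["jobs_in_progress"]:
        break  # all ranges done
    time.sleep(30)  # one cheap HTTP query every 30 seconds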

Besides that, there are a few other improvements/benefits:

  • Resilience and partial progress - the worker set can be partially allocated and still make progress (when the cluster is resource constrained). Any pod that fails for Kubernetes-level reasons will be retried later.
  • Eliminates the overhead of pod creation/deletion, saving the parallel catchup ~3 hrs (or ~8%) of total run time.
  • Stores worker timing metrics and makes them available in Prometheus.
  • Mostly static (with customizable inputs via values.yaml) set of YAML files defining the Kubernetes objects -- better readability and maintainability.
  • Separates out the range generation script, making it easy to optimize the range algorithm in the future (see the sketch after this list).
  • Tweaked the resource requests/limits of worker pods to improve performance (this is still ongoing work).
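
For illustration, uniform range generation amounts to something like the following (a Python sketch only; the actual script is a shell script, and the "target_ledger/count" output format is inferred from the job monitor messages further down):

# Illustrative sketch of uniform range partitioning, not the actual script.
def uniform_ranges(starting_ledger: int, latest_ledger: int, num_ranges: int):
    total = latest_ledger - starting_ledger
    count = (total + num_ranges - 1) // num_ranges  # ledgers per range, rounded up
    ranges, target = [], starting_ledger
    while target < latest_ledger:
        step = min(count, latest_ledger - target)
        target += step
        ranges.append(f"{target}/{step}")  # catch up to `target`, replaying `step` ledgers
    return ranges

print(uniform_ranges(52500000, 52900000, 25))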

Check out the Design doc for more details

@jayz22 jayz22 marked this pull request as ready for review July 15, 2024 22:26
@jayz22 jayz22 requested review from sisuresh, graydon and jacekn July 15, 2024 22:26
@sisuresh sisuresh (Contributor) left a comment

Nice work! Just some initial comments. I still need to learn about the new tools being used in this PR.

@jayz22 jayz22 (Contributor, Author) commented Jul 19, 2024

I did a round of addressing comments and will continue with the rest of them tomorrow. Meanwhile, I've started another production workload run to see whether the recent changes yield further timing gains.

@graydon graydon (Contributor) left a comment

A few nits noted in comments above, nothing critical.

Looking at the finished product ... I also just remain a little hesitant about this whole thing: the configuration layering is such that you've now got at least 3 layers of nested, untyped, and not-especially-debuggable languages (the Helm templating system, then YAML, then the final bunch of Bourne shell / Python scripts and sed / awk / redis-command / whatever strings).

This is kinda what we were dealing with in the pre-supercluster days (back then it was ruby, some other templating system, and dockerfiles) and one of the motives for consolidating all the logic in supercluster's F# code: there are real types, real variables your IDE and compiler can find and complain about mismatches between, you're not as likely to have to keep re-running the whole job just to find typos and nested-quoting mistakes.

I guess if the F# and API-driven approach is sufficiently alien-seeming, relative to the templates-of-YAML-driven style of Kubernetes configuration that's standard in ops environments, then .. I won't stop you. But I guess I just want to double check with you, @jacekn, @sisuresh, @marta-lokhova and anyone else who's been hacking supercluster for a while: we're all ok continuing on this path, having spent some time on it and looking critically at the result?

(This is not a total objection, just a slight hesitation; in general I think you did great work here and it's exactly what you said you were going to build and I agreed was an experiment worth conducting! I am just wondering how you and others feel about the result. I'm a bit on the fence. If it seems it might be preferable to several folks?)

@sisuresh sisuresh (Contributor) commented

(Quoting @graydon's comment above in full.)

I think what Jay has here works well for jobs like parallel catchup, where we had very little client-side logic (and that logic was mainly just orchestration anyway). I think F# should still be used for all of our other missions for ease of development and debugging, but if this V2 pubnet parallel catchup improves the reliability of the existing mission, then I think the tradeoff is worth it.

@jayz22 jayz22 (Contributor, Author) commented Jul 26, 2024

I've pushed a few changes that should address all the outstanding comments.
Besides what's already mentioned in the review threads, here are the main improvements since the last review:

  • Simplified parameter overriding in the F# mission. Parameters can now be overridden directly on the command line, additively with the values.yaml file.
  • Improved context in the job monitor messages: for the "in-progress" and "failed" queues it now includes the actual ranges (not just the count). Entries in the failed queue also carry the pod name for better traceability, and worker statuses contain an additional field with the detailed status from stellar-core:info (see the parsing sketch after this list). An example response now looks like:
[00:03:11 INF] job monitor response: {"num_remain": 560, "num_succeeded": 0, "jobs_failed": [118400/1920|stellar-core-1 ], "jobs_in_progress": ["52516223/16320", "52548223/16320", "52596223/16320"], "workers": [{"worker_id": 0, "status": "running", "info": ["Catching up to ledger 52596223: Waiting: apply-buckets : 0/1 children completed"]}, {"worker_id": 1, "status": "running", "info": ["Catching up to ledger 52548223: Waiting: apply-buckets : 0/1 children completed"]}, {"worker_id": 2, "status": "running", "info": ["Catching up to ledger 52516223: Waiting: apply-buckets : 0/1 children completed"]}]}
  • Implemented copying of the log stream of the pod that owned a failed range, dumping it to a local file.
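
As a small illustration of how a caller might consume the /status response above (a Python sketch only, assuming the endpoint returns these fields as well-formed JSON; the real mission consumes it from F#):

# Illustrative only: summarizes a /status response like the one logged above.
import json

def summarize(status_json: str) -> str:
    s = json.loads(status_json)
    workers = ", ".join(f"worker {w['worker_id']}: {w['status']}" for w in s["workers"])
    return (f"{s['num_remain']} ranges remaining, "
            f"{len(s['jobs_in_progress'])} in progress, "
            f"{len(s['jobs_failed'])} failed; {workers}")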

There were also a couple of experiments:

  • Tried a full-history catchup with a "piecewise logarithmic" partitioning method (see logarithmic_range_generator.sh); a sketch of the idea follows this list. The result was much, much worse than the current uniform partitioning. The hope was that ledger apply time roughly follows an exponential trend throughout history, so putting more ranges in the front part could reduce the overall weight of bucket download & apply time (making the total time for different ranges more even). In reality the distribution of ledger apply time is more complicated, and some middling ranges (now longer than before) take forever to finish, resulting in a highly uneven workload distribution. So I've switched back to the normal uniform range generation for now; more experiments (and data) are needed for better optimization.
  • Experimented with writing the core-worker script in Python. This is possible because the stellar-core image already contains Python, so we can invoke the script directly from the command (no need for a new Docker image carrying the Python code). This did not make the script any simpler: the worker script mostly performs Unix commands (redis-cli, file processing, OS calls), and moving them into Python made it worse (some basic file-processing commands such as tac and ls -t have no standard Python equivalent and had to be implemented by hand), leaving mostly a bunch of subprocess calls. I did not commit the Python version of the worker.
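
For reference, the piecewise-logarithmic idea was roughly the following (a Python sketch under assumed parameters, not the actual logarithmic_range_generator.sh):

# Illustrative only: short ranges at the front of history, with each successive
# piece covering twice as many ledgers; the real script may differ in detail.
def logarithmic_ranges(start: int, end: int, base_count: int = 16000):
    ranges, target, count = [], start, base_count
    while target < end:
        step = min(count, end - target)
        target += step
        ranges.append(f"{target}/{step}")
        count *= 2  # geometrically growing range length toward recent history
    return ranges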

(Quoting @graydon's comment above in full.)

@graydon, to address your comment: in my opinion the benefits of the new approach (I listed some of them at the top) outweigh the cons you mentioned. The parallel catchup mission is a fairly straightforward and static setup, and the Helm/YAML approach makes the structure clear and more visual. All of the container logic involving shell scripts is fairly small and manageable (including the worker logic).

The extra layer of configuration can appear a little intimidating at first, but part of that is because we want a single point of entry for all supercluster missions. I'd also point out that there is yet another layer, the Jenkins job, so the overall pipeline is:
Jenkins --> (dotnet) parallel-catchup mission --> Helm config --> pod configs.
We don't strictly need the dotnet layer (it exists only to keep a single point of entry with the other missions); Jenkins could configure Helm and launch the project directly. So in the future, if we want, we could spin the parallel catchup mission out to run completely standalone inside Kubernetes (which wouldn't be possible for other missions), with Jenkins configuring it via the Helm command line, and that would simplify the overall flow a lot.

For other more intricate setups (such as experimenting with topology), F# programmability still offers tremendous benefits and likely can't be replaced.

github-merge-queue bot pushed a commit to stellar/stellar-core that referenced this pull request Jul 26, 2024
# Description

`redis-cli` is needed for the container to communicate with the Redis host.
The new parallel catchup mission (stellar/supercluster#178) requires access to the Redis task queue.
This is only added as a test dependency, i.e. a package version with `*~buildtests`, and does not affect production stellar-core packages.

@graydon graydon merged commit 5d1921d into main Jul 30, 2024
1 check passed
@graydon graydon deleted the mppc-v2 branch July 30, 2024 18:47