
test: perf worker #18007

Draft · wants to merge 6 commits into base: main

Conversation

SimonRichardson (Member) commented Sep 2, 2024

How hard can we actually hit dqlite from inside of Juju? Let's find out!


The premise is simple: can we afford to run multiple read transactions with wild abandon?

Before this change, a spike was done to test the premise from outside of Juju, going via the CLI and the apiserver. The problem we hit pretty quickly was that we required a file mutex to access the local YAML files found in ~/.local/share/juju. Hacking that out of the way allowed us to reach 800 tps (transactions per second). After the spike, the thought was to remove the overhead of the CLI, WebSockets, and apiserver/facades to get a true figure.

This proposal introduces a model worker. The model worker can acquire the service factory for a given model and then run several read-only queries. This removes the potential for accidental writes into the database whilst testing this setup.


The perf worker creates a new step worker every 30 seconds. Each step worker then runs a series of steps (currently 20 iterations) every second. The load therefore builds up over time; there is currently no upper limit (this will likely change).

Each step then makes a series of calls against the controller and model databases.

To add more work and transactions, add more models.
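
For readers who want the shape of this in code, here is a minimal sketch of the spawn-every-30-seconds / iterate-every-second behaviour described above. This is not the code in the branch; names like `perfWorker`, `stepWorker`, `runSteps` and `stepsPerIteration` are illustrative only.

```go
// Sketch only: the real perf worker runs inside juju's dependency engine;
// this just illustrates the spawn-every-30s / iterate-every-1s shape.
package main

import (
	"context"
	"time"
)

const stepsPerIteration = 20 // matches the "currently 20 iterations" above

// runSteps stands in for the series of read-only queries against the
// controller and model databases.
func runSteps(ctx context.Context, n int) {
	for i := 0; i < n; i++ {
		// a read-only query would go here
	}
}

// stepWorker runs its batch of steps once a second.
func stepWorker(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runSteps(ctx, stepsPerIteration)
		}
	}
}

// perfWorker spawns a new step worker every 30 seconds, with no upper
// limit, mirroring the description above.
func perfWorker(ctx context.Context) {
	spawn := time.NewTicker(30 * time.Second)
	defer spawn.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-spawn.C:
			go stepWorker(ctx)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	perfWorker(ctx) // blocks until cancelled in this sketch
}
```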

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing

QA steps

Download the bundle and overlay: https://gist.github.com/SimonRichardson/0283807cdc026c4ecfe557891958f5b6
Alternatively, deploy the COS stack to monitor the controller.

```sh
$ juju bootstrap lxd test --build-agent --bootstrap-base=ubuntu@20.04
$ juju add-user prometheus
$ juju grant prometheus read controller
$ juju change-user-password prometheus
$ juju switch controller
$ juju deploy ./bundle.yaml --overlay ./overlay.yaml --map-machines=existing
```

Links

Jira card: https://warthogs.atlassian.net/browse/JUJU-6625

jnsgruk (Member) commented Sep 2, 2024

Interested in the results of this!

SimonRichardson (Member, Author) commented Sep 3, 2024

Test scenario:

  1. Deploy the controller (see QA steps)
  2. Add 20 models: for i in {1..20}; do juju add-model "model-$i" && sleep 1; done

Initial results:

[screenshot: Grafana "juju2 controllers" dashboard, 2024-09-03 14:50]

Observations:

  1. We hit over 1.5k read transactions per second.
  2. Load on the machine is relatively low, with most cores under-utilized (I'll grab specs from this next time).
  3. The number of retries is low compared to the initial spike (where 1 in 3 transactions succeeded), which suggests there are not enough writes in this performance scenario (which is fine for this exercise).

Thoughts:

  1. I've run this several times and we always seem to hit about 1.5k tps. I think it's likely that we're being throttled intentionally by the semaphore (see below). The easiest way to test this hypothesis is to remove it completely and see where the strain is. The semaphore was originally added from the research done in the original dqlite benchmark ([JUJU-3843] Dqlite benchmark tests #15623). That benchmark performed more mixed-type loads (writes and reads), so the semaphore was required to allow more writes to go through without retries.
  2. We lack mechanical sympathy when we get a rash of errors. We should potentially handle database errors earlier and reject database requests that we know are likely to fail. Having only the semaphore creates a backlog of transaction requests that are funneled through the semaphore filter, and each is turned away only after performing additional work. A circuit breaker (or similar) could be used to turn requests away once we encounter a repeated specific error, only unlocking once the error has cleared. The way dqlite is set up, one non-transactional error in one database seems to propagate through to other databases, so a central gate is logical.

```go
if err := t.semaphore.Acquire(ctx, 1); err != nil {
	return errors.Trace(err)
}
defer t.semaphore.Release(1)
```
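
To make the circuit-breaker idea concrete, here is a rough sketch of what a central gate in front of the semaphore might look like. Nothing like this exists in the branch; the type, the threshold, the cooldown, and the error are all invented for illustration.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// breaker is a crude circuit breaker that could sit in front of the
// semaphore, turning transactions away cheaply while dqlite is known to be
// unhappy, instead of queueing them up only for them to fail later.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var errBreakerOpen = errors.New("database temporarily unavailable")

// Allow reports whether a new transaction should be attempted at all.
func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().Before(b.openUntil) {
		return errBreakerOpen
	}
	return nil
}

// Record trips the breaker after a run of consecutive failures and resets it
// on any success. Thresholds are arbitrary placeholders.
func (b *breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= 5 {
		b.openUntil = time.Now().Add(500 * time.Millisecond)
		b.failures = 0
	}
}
```

The transaction runner would call Allow before semaphore.Acquire and Record after each attempt, so a run of dqlite errors stops generating work before it even reaches the gate.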

How hard can we actually hit dqlite from inside of juju. Let's find out!
jujubot added a commit that referenced this pull request Sep 4, 2024
#18012

The atomic context can be pooled to prevent the creation of objects on the heap. There was originally a TODO about doing this, but it was left until the state base was finalized. There are a few services already using the atomic setup, so it feels like we should do this before it's forgotten.

There should be no changes to the testing; this is just an implementation detail and should be transparent to the tests.

This was observed when testing #18007
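
The actual change is in #18012; purely to illustrate the pooling pattern being described, a sync.Pool version might look like the following. The package name and the atomicContext fields are placeholders, not the real domain types.

```go
package domain

import "sync"

// atomicContext is a stand-in for the real type; only the pooling pattern
// matters for this sketch.
type atomicContext struct {
	ops []string // per-transaction state, reset between uses
}

var atomicContextPool = sync.Pool{
	New: func() any { return &atomicContext{} },
}

// getAtomicContext fetches a context from the pool instead of allocating a
// fresh object on the heap for every transaction.
func getAtomicContext() *atomicContext {
	return atomicContextPool.Get().(*atomicContext)
}

// putAtomicContext resets the context and returns it to the pool once the
// transaction completes.
func putAtomicContext(c *atomicContext) {
	c.ops = c.ops[:0]
	atomicContextPool.Put(c)
}
```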


## Checklist


- [x] Code style: imports ordered, good names, simple structure, etc
- [x] Comments saying why design decisions were made
- [x] Go unit tests, with comments saying what you're testing

## QA steps

Existing tests should pass.

Adding units and checking that no regressions have been introduced is also a good test.

```sh
$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy ubuntu
$ juju remove-application ubuntu
```

## Links


**Jira card:** https://warthogs.atlassian.net/browse/JUJU-6625
SimonRichardson (Member, Author) commented

Test scenario:

  1. Remove the semaphore completely.
  2. Deploy the controller (see QA steps)
  3. Add 20 models: for i in {1..20}; do juju add-model "model-$i" && sleep 1; done

Initial results:

[screenshot: Grafana "juju2 controllers" dashboard, 2024-09-04 15:34]

Observations:

  1. We hit over 800 read transactions per second.
  2. The load on the machine was higher than in the first test.
  3. The ramp-up time is a lot faster, which is expected because we don't have a semaphore throttling the transactions.

Thoughts:

  1. Run another test with more models, to see if we can push the transaction throughput higher. We might be seeing a case of going slower to move faster...

SimonRichardson (Member, Author) commented

I ran another test without the semaphore and wasn't able to push past 800 transactions per second. I'll explore whether the semaphore should be tuned higher or lower.

To get high numbers using SQLite, it is advised to have a dedicated read connection. I'm unsure if that translates to dqlite, but we could potentially spike a prototype to see if it's feasible.
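
For plain SQLite, the dedicated-read-connection advice usually translates to one write connection plus a read-only pool. Whether dqlite's driver can do the same is exactly the open question above, so treat this as a database/sql sketch under that assumption; the driver choice and DSNs are illustrative only.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // illustrative driver; dqlite ships its own
)

func main() {
	// One connection that owns all writes, so writers never contend with
	// each other at the driver level.
	writer, err := sql.Open("sqlite3", "file:model.db")
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()
	writer.SetMaxOpenConns(1)

	// A separate read-only pool for the read-heavy workload.
	readers, err := sql.Open("sqlite3", "file:model.db?mode=ro")
	if err != nil {
		log.Fatal(err)
	}
	defer readers.Close()
	readers.SetMaxOpenConns(4)
	readers.SetMaxIdleConns(4)

	// Reads would go through readers.QueryContext(...), writes through
	// writer.ExecContext(...) or transactions on writer.
}
```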

hpidcock added the 4.0 label Sep 10, 2024
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 23, 2024
To compare apples to apples, we need a baseline, this starts by
bringing over the perf-worker from 4.0.

See: juju#18007
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 24, 2024
To compare apples to apples, we need a baseline, this starts by
bringing over the perf-worker from 4.0.

See: juju#18007
SimonRichardson (Member, Author) commented

See 3.6 results #18131 (comment)

Takeaway: it looks like Mongo can hit a higher number of read queries, even if we take into account the difference between raw queries and queries inside a transaction. A transaction is unlikely to contain the ~6 queries that would be needed to reach the equivalent of 8000+ queries per second (2-3 queries per transaction is more likely).

I've got another round of testing to perform, which might give us the apples-vs-apples comparison we're looking for, but we might have to revisit the semaphore-and-retry solution and try to optimize it in the future.

jameinel and others added 4 commits September 26, 2024 11:08
It takes a lot to wire raw metrics all the way through, and the test suites are broken because manifolds in test suites
expect things that aren't there. But it does show up in `juju_metrics`.
chore: expose metrics to the perf worker
If you use a normal prometheus.Counter, each one has to have a unique
namespace, or you get a collision from the same Collector being registered
twice. So we move to a CounterVec and include the model-uuid, which gets us
around this. It is probably better to have it anyway.
The issue really was that we were registering the same (Name, Subsystem) multiple times.
The only way around this is to push the registration of the perf worker higher and
have it create a CounterVec. We can re-use ForModel. I don't know that it is ideal,
as it means the agent/engine knows about additional metrics that are specific to one worker,
but we're just hacking it together here.
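
The commit messages above describe the move from per-model Counters to a single CounterVec keyed by the model UUID. Roughly, with the metric and label names made up for illustration rather than taken from the branch:

```go
package perfworker

import "github.com/prometheus/client_golang/prometheus"

// One CounterVec registered once at the agent level; each model worker then
// selects its own child via the model UUID label, avoiding duplicate
// registration of the same (Namespace, Subsystem, Name).
var stepCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "juju",
		Subsystem: "perfworker",
		Name:      "steps_total",
		Help:      "Number of perf worker steps executed.",
	},
	[]string{"model_uuid"},
)

// register wires the vector into the agent's registry exactly once.
func register(r prometheus.Registerer) error {
	return r.Register(stepCounter)
}

// forModel mirrors the "re-use ForModel" idea: hand each model worker a
// pre-labelled counter.
func forModel(modelUUID string) prometheus.Counter {
	return stepCounter.WithLabelValues(modelUUID)
}
```

Registering the vector once and handing each model worker its WithLabelValues child is what avoids the duplicate-registration collision described above.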
jameinel (Member) commented

I worked with Simon to add actual metrics around "how many steps of the perf-worker can run". Then we took the 3.6 version and compared it with the 4.0 version. We still need to dig in to really understand the details, but the initial results are a bit disheartening.
[image]

The first image is from running both perf-worker-3.6 and perf-worker simultaneously, setting them up to run 50 models each (50 models × 10 step workers is 500+ steps/second trying to run).

perf-worker-3.6 got up to 10k/second, while perf-worker was pretty consistent at around 1k/second.
The second image is from first running perf-worker-3.6 and having it scale to 100 models, and then perf-worker scaling to 100 models. This time perf-worker-3.6 got to around 20k/s, while perf-worker again stayed close to 1k/s. At the same time, CPU load capped out for perf-worker at around 300%. So some of this is just our internal semaphore throttling.

As another data point:

```sh
$ time juju models -c perf-36 | wc -l
53

real    0m0.098s
user    0m0.048s
sys     0m0.068s
$ time juju models -c perf-40 | wc -l
54

real    0m14.606s
user    0m0.097s
sys     0m0.023s
```

So juju models slowed down quite a bit as the number of models increased. In the separate testing:

```sh
$ time juju models -c perf-40 | wc -l
103

real    1m9.098s
user    0m0.079s
sys     0m0.033s
```

This might just be poorly optimized queries, though, and not just about load. (It is easy for juju models to be at least linear with a bad constant factor, and to trivially become quadratic if we haven't optimized it.)

jameinel (Member) commented

One very positive note: all of this was juju bootstrap lxd; juju consume mk8s:prometheus; juju integrate -m controller controller prometheus, and those cross-model relations worked across a 3.5 microk8s controller, a 3.6 test controller, and a 4.0 test controller. There were some other issues to address (being able to label your data feeds), but the ease of getting Grafana up and getting it scraping your controllers was really good.

SimonRichardson (Member, Author) commented Sep 27, 2024

[screenshot: 2024-09-27 12:20]

I ran the same setup: microk8s + cos-lite, with 3.6 and 4.0 relating to that Prometheus. Mongo stayed at 4000 req/s and dqlite was in the low 300 req/s. I added more models (up to 40) and it just highlighted that Mongo can perform more queries, whilst dqlite hovered around the same amount. It's clear that the rate-limiting is just hurting us.

It's also clear we can't run without the semaphore (performance dropped), so we need to work out a better strategy for this; I'm unsure what.

Observations

Adding more models whilst under load is a lot slower for dqlite. This is obvious once you realize we're being rate-limited.

SimonRichardson (Member, Author) commented

I've experimented with creating a poor man's hash ring in a branch.

Long story short: in the dbaccessor worker, I create 3 dqlite apps instead of one. Each dqlite app is created with its own transaction runner (each transaction runner has its own semaphore), and depending on the hash of the namespace, a request is sent to the appropriate dqlite app. The controller namespace (global) is special, as it is always on app 0. Each dqlite app is currently bound to an abstract domain socket (@node-%d) to make things a lot simpler.

The distribution of the namespaces isn't perfect and has a tendency to favor one index over the others (see below; a sketch of the mapping follows the listing). We could potentially use round-robin instead, which would give a fairer distribution. The code gives no consideration at all to scaling down the dqlite apps if a model goes away, although I do have ideas on how to do this.

namespace: "a04af513-ea0e-418a-8953-554e19b90a30", hash: 2146607202, index: 0
namespace: "f8d0aafe-bd6a-497a-8f98-714c93d060d7", hash: 2735222685, index: 0
namespace: "d4459721-aea3-42da-85f1-57e7e6b7083f", hash: 1944514803, index: 0
namespace: "77b9f43d-8a35-4add-8f4e-b6e4e422e6f1", hash: 3112230964, index: 1
namespace: "703cf2df-3c4d-414d-8533-d4663cffc7cb", hash: 931743233, index: 2
namespace: "b1625b7f-d93e-45f4-8df5-0f0f396dc273", hash: 1924279564, index: 1
namespace: "d4e51406-03ea-49f2-84b0-68cb0c6969ce", hash: 1975105366, index: 1
namespace: "407e4d07-3dc0-4d1f-8bb5-e523e93b1ca1", hash: 2777876998, index: 1
namespace: "995ffc68-4a01-4ab3-8ff4-04df003d300e", hash: 314819688, index: 0
namespace: "ccb8b4ca-fe24-4e8c-8923-4cb3786f25b5", hash: 3402146965, index: 1
namespace: "b2b789f7-f592-46b8-834d-38c34a1687e4", hash: 3388300066, index: 1
namespace: "9bf174b9-ce87-4516-8c06-ce2d02627f81", hash: 3670112825, index: 2
namespace: "87e51109-0c68-4592-8a74-795f231e77b8", hash: 2295687420, index: 0
namespace: "53fb30b7-5f32-4e06-8914-ceeb76d599ed", hash: 327607678, index: 1
namespace: "1393e58a-b872-4e90-8c02-606db9606468", hash: 3960529581, index: 0
namespace: "b74a01d6-f0e6-4252-8932-17594308ae0a", hash: 336590602, index: 1
namespace: "c0fc501e-062a-430c-8a00-91512d86b4a9", hash: 2142247282, index: 1
namespace: "71815fa4-1495-417b-82af-22953bb2c19e", hash: 1480114457, index: 2
namespace: "611422c6-2112-40ee-82d2-c6de01b58cc1", hash: 1377263489, index: 2
namespace: "2f93adfa-f616-459c-8051-ef88b72ad3e7", hash: 3778284257, index: 2
namespace: "ea952dbe-a72e-4535-8a38-78bb877ae05f", hash: 2429903874, index: 0
namespace: "3112b3ae-b580-4ecb-8918-7558373914e7", hash: 1486829634, index: 0

The great thing about this change was that all of the code changes were located in two places: the dbaccessor worker and the bootstrap dqlite code.


Results:

Note: this run wasn't testing 3.6 at the same time, although in the prior runs the tests weren't taxing the system, so keep that in mind.

[screenshot: 2024-09-27 14:23]

The number of dqlite applications was 3, with namespaces hash-indexed between the 3 and a bias towards index 1. The number of iterations rose above 1000, peaking at 1337.

Observations

We're not scaling linearly: linear scaling should put us in the 900 ballpark (300 req/s × 3 dqlite apps), and we peaked above that. I wonder if we scale much better when we're simply not on the controller application. Reads vs writes is coming into play.

SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 27, 2024
SimonRichardson (Member, Author) commented Sep 27, 2024

Further to the prior comment, I created 6 dqlite applications and we did not scale superlinearly: we top out at 1736 req/s. I then added more models and the writes just tanked all the reads, as you can see in the screenshot below, where the iteration count drops off sharply.

[screenshot: 2024-09-27 15:33]


Observations

If the reads are performing well, a series of writes can dramatically ruin the read performance. A potential solution is to split writes and reads, which is possible in SQLite; I doubt it is possible for dqlite, but it is worth exploring.

SimonRichardson (Member, Author) commented Sep 27, 2024

Doubling the semaphore reduces the number of req/s to 859. This is with 3 dqlite apps running with a rate limit of 48 (weighted semaphore).

[screenshot: 2024-09-27 16:21]


I'm going to try reducing the semaphore number by half and see whether a semaphore matching the machine's processor count turns out to be the sweet spot.

SimonRichardson (Member, Author) commented Sep 27, 2024

Cutting the semaphore in half (24 to 12) didn't reveal any difference in req/s. The number ended up being 1326 req/s.

[screenshot: 2024-09-27 16:43]

Observations

There seems to be a fine balance between backing off on the semaphore and overwhelming dqlite. Removing the semaphore or doubling it makes things dramatically worse; reducing it by half didn't make any real difference. I'm wondering if setting it to 2 would show any change, or whether we've just maxed out the throughput of dqlite.

SimonRichardson (Member, Author) commented

Setting the semaphore to 2 reveals that we can burst up to 2489 req/s.

[screenshot: 2024-09-27 17:18]

Observations

We see that we're connecting to dqlite a lot, so I believe setting the number of open connections to 1 (or a low number) should force us to reuse the connection rather than spending time connecting to dqlite all the time.

SimonRichardson (Member, Author) commented

Setting SetMaxOpenConns and SetMaxIdleConns to 2 to ensure pool reuse didn't do anything: 2355 req/s.

[screenshot: 2024-09-27 17:39]

The next attempt is to remove the semaphore and rely solely on pool reuse.
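
For reference, the knobs being tweaked here are the standard database/sql pool settings; something along these lines, with everything other than the values from the comment being illustrative:

```go
package dbaccessor

import "database/sql"

// tunePool caps the connection pool so transactions reuse a couple of
// long-lived dqlite connections instead of spending time establishing
// new ones for each request.
func tunePool(db *sql.DB) {
	db.SetMaxOpenConns(2)
	db.SetMaxIdleConns(2)
}
```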

SimonRichardson (Member, Author) commented

Removing the semaphore, while still setting SetMaxOpenConns and SetMaxIdleConns to 2, peaks at 3353 req/s.

[screenshot: 2024-09-27 17:59]

jameinel (Member) commented

I also tried removing the semaphore, going to db.SetMaxConnections(2), and ended up with a system that only hits 200% CPU but is able to hit 3.6k simultaneous requests.

[image]

The actual perf worker had 6 options plus a default, but was using % 6,
which would never have returned a 6.
It also did 20 steps per step worker every second, which is a nice round
number, but means that it didn't sample the actual steps evenly.
21 divides evenly by 7, so it gives a fairer sampling and test.
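
The bug described here is the classic modulo off-by-one: with seven cases (indices 0-6), i % 6 can never select the last one. A small illustrative sketch of the fixed, evenly sampled selection (step bodies are placeholders, not the real steps):

```go
package main

import "fmt"

// Seven cases in total: the six "options" plus the default, mirroring the
// commit message above.
var steps [7]func(int)

func init() {
	for i := range steps {
		idx := i
		steps[idx] = func(iteration int) {
			fmt.Println("iteration", iteration, "ran step", idx) // placeholder work
		}
	}
}

// With 21 iterations per second and i % len(steps), each of the 7 steps runs
// exactly 3 times; the old i % 6 could never reach index 6.
func runIteration(i int) {
	steps[i%len(steps)](i)
}

func main() {
	for i := 0; i < 21; i++ {
		runIteration(i)
	}
}
```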
jameinel (Member) commented

As a data point, Simon asked "so how much can we throw at 3.6?". So I scaled up to 200 models, which at most runs 40k steps/second. I then took it a step further and went to 300 models, which should cap out at 60k steps/second.
At this point, I'm running at 80% CPU max and not being thermally throttled, but only hitting a peak of 39k steps/second.

```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2377 1000000   20   0 3165296   1.7g 102848 S  1162   5.3 275:35.97 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju+
   2326 1000000   20   0 4971728 906080  53724 S 726.9   2.8 177:14.17 /snap/juju-db/178/bin/mongod --config /var/snap/juju-db/common/juju-+
```

So about 18 CPU cores fully engaged. (I have 12 E cores, 8 P cores, for a total of 28 threads.)

[image]
