
test: perf worker #18007

Draft · wants to merge 6 commits into base: main

Conversation

SimonRichardson (Member) commented Sep 2, 2024

How hard can we actually hit dqlite from inside of Juju? Let's find out!


The premise is simple: can we afford to run multiple read transactions with wild abandon?

Before this change, a spike was done to test the premise from outside of Juju, going via the CLI and the apiserver. The problem we hit pretty quickly was that we required a file mutex to access the local YAML files found in ~/.local/share/juju. Hacking that out of the way allowed us to reach 800 tps (transactions per second). After the spike, the thought was to remove the overhead of the CLI, WebSockets, and apiserver/facades to get a true figure.

This proposal introduces a model worker. The model worker can acquire the service factory for a given model and then run several read-only queries. This removes the potential for accidental writes into the database whilst testing this setup.


The perf worker creates a new step worker every 30 seconds. Each step worker then runs a series of steps (currently 20 iterations) every second. The load therefore builds up over time; there is currently no upper limit (this will likely change).

Each step then makes a series of calls against the controller and model databases.

To add more work and transactions, add more models.
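
For readers who want the shape of this in code, here is a minimal sketch of the spawn-every-30-seconds / iterate-every-second behaviour described above. This is not the code in the branch; names like `perfWorker`, `stepWorker`, `runSteps` and `stepsPerIteration` are illustrative only.

```go
// Sketch only: the real perf worker runs inside juju's dependency engine;
// this just illustrates the spawn-every-30s / iterate-every-1s shape.
package main

import (
	"context"
	"time"
)

const stepsPerIteration = 20 // matches the "currently 20 iterations" above

// runSteps stands in for the series of read-only queries against the
// controller and model databases.
func runSteps(ctx context.Context, n int) {
	for i := 0; i < n; i++ {
		// a read-only query would go here
	}
}

// stepWorker runs its batch of steps once a second.
func stepWorker(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runSteps(ctx, stepsPerIteration)
		}
	}
}

// perfWorker spawns a new step worker every 30 seconds, with no upper
// limit, mirroring the description above.
func perfWorker(ctx context.Context) {
	spawn := time.NewTicker(30 * time.Second)
	defer spawn.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-spawn.C:
			go stepWorker(ctx)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	perfWorker(ctx) // blocks until cancelled in this sketch
}
```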

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing

QA steps

Download the bundle and overlay: https://gist.github.com/SimonRichardson/0283807cdc026c4ecfe557891958f5b6
Alternatively, deploy the COS stack to monitor the controller.

```sh
$ juju bootstrap lxd test --build-agent --bootstrap-base=ubuntu@20.04
$ juju add-user prometheus
$ juju grant prometheus read controller
$ juju change-user-password prometheus
$ juju switch controller
$ juju deploy ./bundle.yaml --overlay ./overlay.yaml --map-machines=existing
```

Links

Jira card: https://warthogs.atlassian.net/browse/JUJU-6625

jnsgruk (Member) commented Sep 2, 2024

Interested in the results of this!

SimonRichardson (Member, Author) commented Sep 3, 2024

Test scenario:

  1. Deploy the controller (see QA steps)
  2. Add 20 models: for i in {1..20}; do juju add-model "model-$i" && sleep 1; done

Initial results:

[screenshot: Grafana "juju2 controllers" dashboard, 2024-09-03 14:50]

Observations:

  1. We hit over 1.5k read transactions per second.
  2. Load on the machine is relatively low, with most cores under-utilized (I'll grab specs from this next time).
  3. The number of retries is low compared to the initial spike (where 1 in 3 transactions succeeded), which suggests there are not enough writes in this performance scenario (which is fine for this exercise).

Thoughts:

  1. I've run this several times and we always seem to hit about 1.5k tps. I think it's likely that we're being throttled intentionally by the semaphore (see below). The easiest way to test this hypothesis is to remove it completely and see where the strain is. The semaphore was originally added from the research done in the original dqlite benchmark ([JUJU-3843] Dqlite benchmark tests #15623). That benchmark performed more mixed-type loads (writes and reads), so the semaphore was required to allow more writes to go through without retries.
  2. We lack mechanical sympathy when we get a rash of errors. We should potentially handle database errors earlier and reject database requests that we know are likely to fail. Having only the semaphore creates a backlog of transaction requests that are funneled through the semaphore filter, and each is turned away only after performing additional work. A circuit breaker (or similar) could be used to turn requests away once we encounter a repeated specific error, only unlocking once the error has cleared. The way dqlite is set up, one non-transactional error in one database seems to propagate through to other databases, so a central gate is logical.

```go
if err := t.semaphore.Acquire(ctx, 1); err != nil {
	return errors.Trace(err)
}
defer t.semaphore.Release(1)
```
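
To make the circuit-breaker idea concrete, here is a rough sketch of what a central gate in front of the semaphore might look like. Nothing like this exists in the branch; the type, the threshold, the cooldown, and the error are all invented for illustration.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// breaker is a crude circuit breaker that could sit in front of the
// semaphore, turning transactions away cheaply while dqlite is known to be
// unhappy, instead of queueing them up only for them to fail later.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var errBreakerOpen = errors.New("database temporarily unavailable")

// Allow reports whether a new transaction should be attempted at all.
func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if time.Now().Before(b.openUntil) {
		return errBreakerOpen
	}
	return nil
}

// Record trips the breaker after a run of consecutive failures and resets it
// on any success. Thresholds are arbitrary placeholders.
func (b *breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= 5 {
		b.openUntil = time.Now().Add(500 * time.Millisecond)
		b.failures = 0
	}
}
```

The transaction runner would call Allow before semaphore.Acquire and Record after each attempt, so a run of dqlite errors stops generating work before it even reaches the gate.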

How hard can we actually hit dqlite from inside of juju. Let's find out!
jujubot added a commit that referenced this pull request Sep 4, 2024
#18012

The atomic context can be pooled to prevent the creation of objects on the heap. There was originally a TODO about doing this, but it was left until the state base was finalized. There are a few services already using the atomic setup, so it feels like we should do this before it's forgotten.

There should be no changes to the testing; this is just an implementation detail and should be transparent to the tests.

This was observed when testing #18007
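
The actual change is in #18012; purely to illustrate the pooling pattern being described, a sync.Pool version might look like the following. The package name and the atomicContext fields are placeholders, not the real domain types.

```go
package domain

import "sync"

// atomicContext is a stand-in for the real type; only the pooling pattern
// matters for this sketch.
type atomicContext struct {
	ops []string // per-transaction state, reset between uses
}

var atomicContextPool = sync.Pool{
	New: func() any { return &atomicContext{} },
}

// getAtomicContext fetches a context from the pool instead of allocating a
// fresh object on the heap for every transaction.
func getAtomicContext() *atomicContext {
	return atomicContextPool.Get().(*atomicContext)
}

// putAtomicContext resets the context and returns it to the pool once the
// transaction completes.
func putAtomicContext(c *atomicContext) {
	c.ops = c.ops[:0]
	atomicContextPool.Put(c)
}
```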


## Checklist


- [x] Code style: imports ordered, good names, simple structure, etc
- [x] Comments saying why design decisions were made
- [x] Go unit tests, with comments saying what you're testing

## QA steps

Existing tests should pass.

Adding units and checking that no regressions have been introduced is also a good test.

```sh
$ juju bootstrap lxd test
$ juju add-model default
$ juju deploy ubuntu
$ juju remove-application ubuntu
```

## Links


**Jira card:** https://warthogs.atlassian.net/browse/JUJU-6625
SimonRichardson (Member, Author) commented

Test scenario:

  1. Remove the semaphore completely.
  2. Deploy the controller (see QA steps)
  3. Add 20 models: for i in {1..20}; do juju add-model "model-$i" && sleep 1; done

Initial results:

[screenshot: Grafana "juju2 controllers" dashboard, 2024-09-04 15:34]

Observations:

  1. We hit over 800 read transactions per second.
  2. The load on the machine was higher than in the first test.
  3. The ramp-up time is a lot faster, which is expected because we don't have a semaphore throttling the transactions.

Thoughts:

  1. Run another test with more models, to see if we can push the transaction throughput higher. We might be seeing a case of going slower to move faster...

SimonRichardson (Member, Author) commented

I ran another test without the semaphore and wasn't able to push past 800 transactions per second. I'll explore whether the semaphore should be tuned higher or lower.

To get high numbers using SQLite, it is advised to have a dedicated read connection. I'm unsure if that translates to dqlite, but we could potentially spike a prototype to see if it's feasible.
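
For plain SQLite, the dedicated-read-connection advice usually translates to one write connection plus a read-only pool. Whether dqlite's driver can do the same is exactly the open question above, so treat this as a database/sql sketch under that assumption; the driver choice and DSNs are illustrative only.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // illustrative driver; dqlite ships its own
)

func main() {
	// One connection that owns all writes, so writers never contend with
	// each other at the driver level.
	writer, err := sql.Open("sqlite3", "file:model.db")
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()
	writer.SetMaxOpenConns(1)

	// A separate read-only pool for the read-heavy workload.
	readers, err := sql.Open("sqlite3", "file:model.db?mode=ro")
	if err != nil {
		log.Fatal(err)
	}
	defer readers.Close()
	readers.SetMaxOpenConns(4)
	readers.SetMaxIdleConns(4)

	// Reads would go through readers.QueryContext(...), writes through
	// writer.ExecContext(...) or transactions on writer.
}
```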

hpidcock added the 4.0 label Sep 10, 2024
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 23, 2024
To compare apples to apples, we need a baseline, this starts by
bringing over the perf-worker from 4.0.

See: juju#18007
SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 24, 2024
To compare apples to apples, we need a baseline, this starts by
bringing over the perf-worker from 4.0.

See: juju#18007
SimonRichardson (Member, Author) commented

See 3.6 results #18131 (comment)

Takeaway: it looks like Mongo can hit a higher number of read queries, even if we take into account the difference between raw queries and queries inside a transaction. A transaction is unlikely to contain the ~6 queries that would be needed to reach the equivalent of 8000+ queries per second (2-3 queries per transaction is more likely).

I've got another round of testing to perform, which might give us the apples-vs-apples comparison we're looking for, but we might have to revisit the semaphore-and-retry solution and try to optimize it in the future.

jameinel and others added 4 commits September 26, 2024 11:08
It takes a lot to wire raw metrics all the way through, and the test suites are broken because manifolds in test suites
expect things that aren't there. But it does show up in `juju_metrics`.
chore: expose metrics to the perf worker
If you use a normal prometheus.Counter, each one has to have a unique
namespace, or you get a collision from the same Collector being registered
twice. So we move to a CounterVec and include the model-uuid, which gets us
around this. It is probably better to have it anyway.
The issue really was that we were registering the same (Name, Subsystem) multiple times.
The only way around this is to push the registration of the perf worker higher and
have it create a CounterVec. We can re-use ForModel. I don't know that it is ideal,
as it means the agent/engine knows about additional metrics that are specific to one worker,
but we're just hacking it together here.
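
The commit messages above describe the move from per-model Counters to a single CounterVec keyed by the model UUID. Roughly, with the metric and label names made up for illustration rather than taken from the branch:

```go
package perfworker

import "github.com/prometheus/client_golang/prometheus"

// One CounterVec registered once at the agent level; each model worker then
// selects its own child via the model UUID label, avoiding duplicate
// registration of the same (Namespace, Subsystem, Name).
var stepCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "juju",
		Subsystem: "perfworker",
		Name:      "steps_total",
		Help:      "Number of perf worker steps executed.",
	},
	[]string{"model_uuid"},
)

// register wires the vector into the agent's registry exactly once.
func register(r prometheus.Registerer) error {
	return r.Register(stepCounter)
}

// forModel mirrors the "re-use ForModel" idea: hand each model worker a
// pre-labelled counter.
func forModel(modelUUID string) prometheus.Counter {
	return stepCounter.WithLabelValues(modelUUID)
}
```

Registering the vector once and handing each model worker its WithLabelValues child is what avoids the duplicate-registration collision described above.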
jameinel (Member) commented

I worked with Simon to add actual metrics around "how many steps of the perf-worker can run". Then we took the 3.6 version and compared it with the 4.0 version. We still need to dig in to really understand the details, but the initial results are a bit disheartening.
[image]

The first image is from running both perf-worker-3.6 and perf-worker simultaneously, setting them up to run 50 models each (50 models × 10 step workers is 500+ steps/second trying to run).

perf-worker-3.6 got up to 10k/second, while perf-worker was pretty consistent at around 1k/second.
The second image is from first running perf-worker-3.6 and having it scale to 100 models, and then perf-worker scaling to 100 models. This time perf-worker-3.6 got to around 20k/s, while perf-worker again stayed close to 1k/s. At the same time, CPU load capped out for perf-worker at around 300%. So some of this is just our internal semaphore throttling.

As another data point:

```sh
$ time juju models -c perf-36 | wc -l
53

real    0m0.098s
user    0m0.048s
sys     0m0.068s
$ time juju models -c perf-40 | wc -l
54

real    0m14.606s
user    0m0.097s
sys     0m0.023s
```

So juju models slowed down quite a bit as the number of models increased. In the separate testing:

```sh
$ time juju models -c perf-40 | wc -l
103

real    1m9.098s
user    0m0.079s
sys     0m0.033s
```

This might just be poorly optimized queries, though, and not just about load. (It is easy for juju models to be at least linear with a bad constant factor, and to trivially become quadratic if we haven't optimized it.)

jameinel (Member) commented

One very positive note: all of this was juju bootstrap lxd; juju consume mk8s:prometheus; juju integrate -m controller controller prometheus, and those cross-model relations worked across a 3.5 microk8s controller, a 3.6 test controller, and a 4.0 test controller. There were some other issues to address (being able to label your data feeds), but the ease of getting Grafana up and getting it scraping your controllers was really good.

SimonRichardson (Member, Author) commented Sep 27, 2024

[screenshot: 2024-09-27 12:20]

I ran the same setup: microk8s + cos-lite, with 3.6 and 4.0 relating to that Prometheus. Mongo stayed at 4000 req/s and dqlite was in the low 300 req/s. I added more models (up to 40) and it just highlighted that Mongo can perform more queries, whilst dqlite hovered around the same amount. It's clear that the rate-limiting is just hurting us.

It's also clear we can't run without the semaphore (performance dropped), so we need to work out a better strategy for this; I'm unsure what.

Observations

Adding more models whilst under load is a lot slower for dqlite. This is obvious once you realize we're being rate-limited.

SimonRichardson (Member, Author) commented

I've experimented with creating a poor man's hash ring in a branch.

Long story short: in the dbaccessor worker, I create 3 dqlite apps instead of one. Each dqlite app is created with its own transaction runner (each transaction runner has its own semaphore), and depending on the hash of the namespace, a request is sent to the appropriate dqlite app. The controller namespace (global) is special, as it is always on app 0. Each dqlite app is currently bound to an abstract domain socket (@node-%d) to make things a lot simpler.

The distribution of the namespaces isn't perfect and has a tendency to favor one index over the others (see below; a sketch of the mapping follows the listing). We could potentially use round-robin instead, which would give a fairer distribution. The code gives no consideration at all to scaling down the dqlite apps if a model goes away, although I do have ideas on how to do this.

namespace: "a04af513-ea0e-418a-8953-554e19b90a30", hash: 2146607202, index: 0
namespace: "f8d0aafe-bd6a-497a-8f98-714c93d060d7", hash: 2735222685, index: 0
namespace: "d4459721-aea3-42da-85f1-57e7e6b7083f", hash: 1944514803, index: 0
namespace: "77b9f43d-8a35-4add-8f4e-b6e4e422e6f1", hash: 3112230964, index: 1
namespace: "703cf2df-3c4d-414d-8533-d4663cffc7cb", hash: 931743233, index: 2
namespace: "b1625b7f-d93e-45f4-8df5-0f0f396dc273", hash: 1924279564, index: 1
namespace: "d4e51406-03ea-49f2-84b0-68cb0c6969ce", hash: 1975105366, index: 1
namespace: "407e4d07-3dc0-4d1f-8bb5-e523e93b1ca1", hash: 2777876998, index: 1
namespace: "995ffc68-4a01-4ab3-8ff4-04df003d300e", hash: 314819688, index: 0
namespace: "ccb8b4ca-fe24-4e8c-8923-4cb3786f25b5", hash: 3402146965, index: 1
namespace: "b2b789f7-f592-46b8-834d-38c34a1687e4", hash: 3388300066, index: 1
namespace: "9bf174b9-ce87-4516-8c06-ce2d02627f81", hash: 3670112825, index: 2
namespace: "87e51109-0c68-4592-8a74-795f231e77b8", hash: 2295687420, index: 0
namespace: "53fb30b7-5f32-4e06-8914-ceeb76d599ed", hash: 327607678, index: 1
namespace: "1393e58a-b872-4e90-8c02-606db9606468", hash: 3960529581, index: 0
namespace: "b74a01d6-f0e6-4252-8932-17594308ae0a", hash: 336590602, index: 1
namespace: "c0fc501e-062a-430c-8a00-91512d86b4a9", hash: 2142247282, index: 1
namespace: "71815fa4-1495-417b-82af-22953bb2c19e", hash: 1480114457, index: 2
namespace: "611422c6-2112-40ee-82d2-c6de01b58cc1", hash: 1377263489, index: 2
namespace: "2f93adfa-f616-459c-8051-ef88b72ad3e7", hash: 3778284257, index: 2
namespace: "ea952dbe-a72e-4535-8a38-78bb877ae05f", hash: 2429903874, index: 0
namespace: "3112b3ae-b580-4ecb-8918-7558373914e7", hash: 1486829634, index: 0

The great thing about this change was that all of the code changes were located in two places: the dbaccessor worker and the bootstrap dqlite code.


Results:

Note: this run wasn't testing 3.6 at the same time, although in the prior runs the tests weren't taxing the system, so keep that in mind.

[screenshot: 2024-09-27 14:23]

The number of dqlite applications was 3, with namespaces hash-indexed between the 3 and a bias towards index 1. The number of iterations rose above 1000, peaking at 1337.

Observations

We're not scaling linearly: linear scaling should put us in the 900 ballpark (300 req/s × 3 dqlite apps), and we peaked above that. I wonder if we scale much better when we're simply not on the controller application. Reads vs writes is coming into play.

SimonRichardson added a commit to SimonRichardson/juju that referenced this pull request Sep 27, 2024
SimonRichardson (Member, Author) commented Sep 27, 2024

Further to the prior comment, I created 6 dqlite applications and we did not scale superlinearly: we top out at 1736 req/s. I then added more models and the writes just tanked all the reads, as you can see in the screenshot below, where the iteration count drops off sharply.

[screenshot: 2024-09-27 15:33]


Observations

If the reads are performing well, a series of writes can dramatically ruin the read performance. A potential solution is to split writes and reads, which is possible in SQLite; I doubt it is possible for dqlite, but it is worth exploring.

SimonRichardson (Member, Author) commented Sep 27, 2024

Doubling the semaphore reduces the number of req/s to 859. This is with 3 dqlite apps running with a rate limit of 48 (weighted semaphore).

[screenshot: 2024-09-27 16:21]


I'm going to try reducing the semaphore number by half and see whether a semaphore matching the machine's processor count turns out to be the sweet spot.

SimonRichardson (Member, Author) commented Sep 27, 2024

Cutting the semaphore in half (24 to 12) didn't reveal any difference in req/s. The number ended up being 1326 req/s.

[screenshot: 2024-09-27 16:43]

Observations

There seems to be a fine balance between backing off on the semaphore and overwhelming dqlite. Removing the semaphore or doubling it makes things dramatically worse; reducing it by half didn't make any real difference. I'm wondering if setting it to 2 would show any change, or whether we've just maxed out the throughput of dqlite.

SimonRichardson (Member, Author) commented

Setting the semaphore to 2 reveals that we can burst up to 2489 req/s.

[screenshot: 2024-09-27 17:18]

Observations

We see that we're connecting to dqlite a lot, so I believe setting the number of open connections to 1 (or a low number) should force us to reuse the connection rather than spending time connecting to dqlite all the time.

SimonRichardson (Member, Author) commented

Setting SetMaxOpenConns and SetMaxIdleConns to 2 to ensure pool reuse didn't do anything: 2355 req/s.

[screenshot: 2024-09-27 17:39]

The next attempt is to remove the semaphore and rely solely on pool reuse.
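
For reference, the knobs being tweaked here are the standard database/sql pool settings; something along these lines, with everything other than the values from the comment being illustrative:

```go
package dbaccessor

import "database/sql"

// tunePool caps the connection pool so transactions reuse a couple of
// long-lived dqlite connections instead of spending time establishing
// new ones for each request.
func tunePool(db *sql.DB) {
	db.SetMaxOpenConns(2)
	db.SetMaxIdleConns(2)
}
```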

SimonRichardson (Member, Author) commented

Removing the semaphore, while still setting SetMaxOpenConns and SetMaxIdleConns to 2, peaks at 3353 req/s.

[screenshot: 2024-09-27 17:59]

jameinel (Member) commented

I also tried removing the semaphore, going to db.SetMaxConnections(2), and ended up with a system that only hits 200% CPU but is able to hit 3.6k simultaneous requests.

[image]

The actual perf worker had 6 options plus a default, but was using % 6,
which would never have returned a 6.
It also did 20 steps per step worker every second, which is a nice round
number, but means that it didn't sample the actual steps evenly.
21 divides evenly by 7, so it gives a fairer sampling and test.
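
The bug described here is the classic modulo off-by-one: with seven cases (indices 0-6), i % 6 can never select the last one. A small illustrative sketch of the fixed, evenly sampled selection (step bodies are placeholders, not the real steps):

```go
package main

import "fmt"

// Seven cases in total: the six "options" plus the default, mirroring the
// commit message above.
var steps [7]func(int)

func init() {
	for i := range steps {
		idx := i
		steps[idx] = func(iteration int) {
			fmt.Println("iteration", iteration, "ran step", idx) // placeholder work
		}
	}
}

// With 21 iterations per second and i % len(steps), each of the 7 steps runs
// exactly 3 times; the old i % 6 could never reach index 6.
func runIteration(i int) {
	steps[i%len(steps)](i)
}

func main() {
	for i := 0; i < 21; i++ {
		runIteration(i)
	}
}
```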
jameinel (Member) commented

As a data point, Simon asked "so how much can we throw at 3.6?". So I scaled up to 200 models, which at most runs 40k steps/second. I then took it a step further and went to 300 models, which should cap out at 60k steps/second.
At this point, I'm running at 80% CPU max and not being thermally throttled, but only hitting a peak of 39k steps/second.

```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   2377 1000000   20   0 3165296   1.7g 102848 S  1162   5.3 275:35.97 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju+
   2326 1000000   20   0 4971728 906080  53724 S 726.9   2.8 177:14.17 /snap/juju-db/178/bin/mongod --config /var/snap/juju-db/common/juju-+
```

So about 18 CPU cores fully engaged. (I have 12 E cores, 8 P cores, for a total of 28 threads.)

[image]
