feat: prepare perf-worker for 3.6 #18131

Closed
wants to merge 5 commits into from

@SimonRichardson SimonRichardson commented Sep 23, 2024

To compare apples to apples, we need a baseline. This PR starts by bringing over the perf-worker from 4.0.

See: #18007


Getting read stats isn't quite as easy as with the domain services in the 4.0 branch, which were designed to expose db metrics from the get-go. Instead, we can cheat and brute-force our way through mongostat, which returns the number of queries (reads) happening against the mongo cluster. This isn't quite the same, as gridfs is included, but it should be good enough for now.

Also, keep in mind that with dqlite we're tracking transactions, each of which may contain multiple queries, whereas with mongo we're tracking individual queries.

Log in to mongostat with the following bash script:

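#!/bin/bash

# Optional arguments: machine (default 0) and model (default controller).
machine=${1:-0}
model=${2:-controller}

# Build the commands to run on the target machine. The quoted 'EOF' stops
# local expansion, so everything below is evaluated on the remote side.
# Note: the remote commands echo the agent user and password before starting
# mongostat, which then samples every 15 seconds.
read -d '' -r cmds <<'EOF'
conf=/var/lib/juju/agents/machine-*/agent.conf
user=$(sudo grep tag $conf | cut -d' ' -f2)
password=$(sudo grep statepassword $conf | cut -d' ' -f2)
if [ -f /snap/bin/juju-db.mongostat ]; then
  client=/snap/bin/juju-db.mongostat
elif [ -f /usr/lib/juju/mongo*/bin/mongostat ]; then
  client=/usr/lib/juju/mongo*/bin/mongostat
else
  client=/usr/bin/mongostat
fi
echo $user $password
$client --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password" mongodb://127.0.0.1:37017 15
EOF
juju ssh -m "${model}" "${machine}" "${cmds}"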

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing
  • Integration tests, with comments saying what you're testing
  • doc.go added or updated in changed packages

QA steps

Links

Jira card: JUJU-

@hpidcock hpidcock added the 3.6 label Sep 23, 2024
SimonRichardson commented Sep 24, 2024

Test scenario:

  1. Deploy the controller (see QA steps)
  2. Add 20 models: for i in {1..20}; do juju add-model "model-$i" && sleep 1; done

Initial results:

[Screenshot: screencapture-10-50-139-18-3000-d-adyve57goqk1sd-juju3-controllers-2024-09-24-15_32_35]

$ juju mongostat
2024-09-24T14:17:39.354+0000    WARNING: On some systems, a password provided directly using --password may be visible to system status programs such as `ps` that may be invoked by other users. Consider omitting the password to provide it via stdin, or using the --config option to specify a configuration file with the password.
2024-09-24T14:17:39.354+0000    WARNING: --sslAllowInvalidCertificates and --sslAllowInvalidHostnames are deprecated, please use --tlsInsecure instead
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
    *0    43      1     *0       4     1|0  0.0% 0.5%       0 2.11G 488M 0|0 0|0  8.33k   96.8k   26 juju  PRI Sep 24 14:17:54.379
    46   158      5     *0      30    21|0  0.0% 0.5%       0 2.11G 494M 0|0 0|0   192k    246k   26 juju  PRI Sep 24 14:18:09.381
    46   228      9     *0      40    35|0  0.0% 0.5%       1 2.12G 502M 0|0 0|0  68.7k    328k   37 juju  PRI Sep 24 14:18:24.380
    86   390     13     *0      57    47|0  0.0% 0.5%       0 2.13G 509M 0|0 0|0   109k   11.4m   46 juju  PRI Sep 24 14:18:39.381
   116   553     14     *0      68    49|0  0.0% 0.6%       0 2.14G 518M 0|0 0|0   141k   11.5m   54 juju  PRI Sep 24 14:18:54.382
   131   801     67     *0      88    60|0  0.0% 0.6%       0 2.19G 529M 0|0 0|0   188k   12.0m   65 juju  PRI Sep 24 14:19:09.381
   142  1051     14     *0      78    51|0  0.0% 0.6%       1 2.20G 538M 0|0 0|0   223k   11.9m   74 juju  PRI Sep 24 14:19:24.381
   142  1318     14     *0      73    46|0  0.0% 0.6%       0 2.21G 544M 0|0 0|0   260k   12.3m   78 juju  PRI Sep 24 14:19:39.383
   140  1619     10     *0      61    34|0  0.0% 0.6%       0 2.22G 552M 0|0 0|0   295k   12.6m   91 juju  PRI Sep 24 14:19:54.381
   179  2120     13     *0      78    38|0  0.1% 0.6%       0 2.22G 557M 0|0 0|0   383k   9.62m   91 juju  PRI Sep 24 14:20:09.383
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
   111  2384      1     *0      33     3|0  0.0% 0.6%       1 2.24G 561M 0|0 0|0   387k   2.43m   93 juju  PRI Sep 24 14:20:24.382
   126  2752      1     *0      37     3|0  0.0% 0.6%       0 2.24G 564M 0|0 0|0   445k   2.79m   95 juju  PRI Sep 24 14:20:39.384
   148  3248      1     *0      40     2|0  0.0% 0.6%       0 2.24G 568M 0|0 0|0   524k   3.28m   97 juju  PRI Sep 24 14:20:54.381
   164  3629      1     *0      38     2|0  0.0% 0.6%       0 2.24G 571M 0|0 1|0   584k   3.66m   97 juju  PRI Sep 24 14:21:09.380
   180  4009      1     *0      42     5|0  0.0% 0.6%       1 2.25G 578M 0|0 0|0   644k   4.04m  106 juju  PRI Sep 24 14:21:24.382
   208  4551      1     *0      49     4|0  0.1% 0.6%       0 2.28G 583M 0|0 0|0   732k   4.57m  114 juju  PRI Sep 24 14:21:39.381
   228  5000      1     *0      51     5|0  0.1% 0.6%       0 2.29G 592M 0|0 0|0   803k   5.02m  125 juju  PRI Sep 24 14:21:54.380
   240  5275      1     *0      52     2|0  0.1% 0.6%       0 2.30G 596M 0|0 0|0   846k   5.29m  125 juju  PRI Sep 24 14:22:09.380
   259  5802      1     *0      57     2|0  0.0% 0.6%       1 2.30G 602M 0|0 0|0   930k   5.82m  125 juju  PRI Sep 24 14:22:24.384
   300  6316      1     *0      65     4|0  0.1% 0.6%       0 2.30G 609M 0|0 0|0  1.02m   6.34m  125 juju  PRI Sep 24 14:22:39.381
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
   309  6607      1     *0      69     3|0  0.1% 0.6%       0 2.31G 614M 0|0 0|0  1.06m   6.61m  125 juju  PRI Sep 24 14:22:54.379
   317  7045      1     *0      71     5|0  0.1% 0.6%       0 2.33G 622M 0|0 0|0  1.13m   7.05m  134 juju  PRI Sep 24 14:23:09.382
   370  7642      1     *0      72     5|0  0.1% 0.6%       1 2.34G 630M 0|0 0|0  1.23m   7.65m  134 juju  PRI Sep 24 14:23:24.380
   361  7955      1     *0      74     8|0  0.1% 0.7%       0 2.37G 641M 0|0 0|0  1.27m   7.95m  155 juju  PRI Sep 24 14:23:39.380
   377  8090      1     *0      78     4|0  0.1% 0.7%       0 2.39G 647M 0|0 0|0  1.30m   8.08m  155 juju  PRI Sep 24 14:23:54.379
   381  8379     54     21     174     5|0  0.1% 0.7%       0 2.43G 655M 0|0 0|0  1.35m   8.40m  155 juju  PRI Sep 24 14:24:09.379
   389  8672      1     *0      86     4|0  0.1% 0.7%       1 2.43G 662M 0|0 0|0  1.39m   8.67m  155 juju  PRI Sep 24 14:24:24.387
   397  8694      1     *0      93     5|0  0.1% 0.7%       0 2.44G 677M 0|0 0|0  1.39m   8.69m  162 juju  PRI Sep 24 14:24:39.383
   404  8598      1     *0      94     5|0  0.1% 0.7%       0 2.44G 683M 0|0 0|0  1.38m   8.60m  162 juju  PRI Sep 24 14:24:54.380
   392  8773      1     *0      85     2|0  0.1% 0.7%       0 2.44G 690M 1|0 0|0  1.40m   8.77m  162 juju  PRI Sep 24 14:25:09.387
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
   389  8756      1     *0      86     3|0  0.1% 0.7%       1 2.44G 696M 0|0 0|0  1.40m   8.75m  162 juju  PRI Sep 24 14:25:24.380
   398  8655      1     *0      86     3|0  0.1% 0.7%       0 2.44G 702M 4|0 6|0  1.39m   8.66m  162 juju  PRI Sep 24 14:25:39.381
   395  8625      1     *0      83     3|0  0.1% 0.7%       0 2.44G 708M 0|0 0|0  1.38m   8.63m  162 juju  PRI Sep 24 14:25:54.385
   395  8723      1     *0      83     3|0  0.2% 0.8%       0 2.46G 714M 0|0 0|0  1.40m   8.72m  162 juju  PRI Sep 24 14:26:09.380
   383  8698      1     *0      76     2|0  0.1% 0.8%       1 2.46G 720M 0|0 0|0  1.39m   8.69m  162 juju  PRI Sep 24 14:26:24.382
   395  8548      1     *0      83     2|0  0.2% 0.8%       0 2.46G 727M 0|0 0|0  1.37m   8.55m  162 juju  PRI Sep 24 14:26:39.382
   395  8748      1     *0      86     2|0  0.2% 0.8%       0 2.48G 732M 2|0 0|0  1.40m   8.75m  162 juju  PRI Sep 24 14:26:54.386
   387  8636      1     *0      88     2|0  0.2% 0.8%       0 2.48G 739M 0|0 1|0  1.38m   8.63m  162 juju  PRI Sep 24 14:27:09.381
   391  8701      1     *0      78     2|0  0.2% 0.8%       1 2.48G 745M 0|0 0|0  1.39m   8.70m  162 juju  PRI Sep 24 14:27:24.384
   423  8720      1     *0      91     4|0  0.2% 0.8%       0 2.50G 752M 0|0 1|0  1.40m   8.73m  162 juju  PRI Sep 24 14:27:39.381
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
   405  8720      1     *0      87     3|0  0.2% 0.8%       0 2.50G 758M 0|0 0|0  1.40m   8.72m  162 juju  PRI Sep 24 14:27:54.380
   395  8697      1     *0      85     4|0  0.2% 0.8%       0 2.50G 764M 0|0 0|0  1.39m   8.69m  162 juju  PRI Sep 24 14:28:09.380
   392  8666      1     *0      83     4|0  0.2% 0.8%       1 2.50G 769M 0|0 0|0  1.39m   8.66m  162 juju  PRI Sep 24 14:28:24.381
   414  8759      4      6     104    15|0  0.2% 0.8%       0 2.52G 777M 0|0 1|0  1.41m   8.76m  162 juju  PRI Sep 24 14:28:39.381
   400  8017      8     17     121    34|0  0.2% 0.9%       0 2.52G 787M 0|0 0|0  1.31m   8.02m  162 juju  PRI Sep 24 14:28:54.381
   359  7069     25     13      99    21|0  0.2% 0.9%       0 2.56G 795M 0|0 4|0  1.15m   7.07m  162 juju  PRI Sep 24 14:29:09.381
   324  6128      9     20     108    36|0  0.2% 0.8%       1 2.56G 801M 0|0 0|0  1.01m   6.14m  162 juju  PRI Sep 24 14:29:24.384
   275  4986      9     17      92    34|0  0.2% 0.8%       0 2.56G 801M 0|0 0|0   831k   5.03m  162 juju  PRI Sep 24 14:29:39.384
   211  3769      8     18      73    31|0  0.2% 0.8%       0 2.56G 801M 0|0 0|0   632k   3.83m  162 juju  PRI Sep 24 14:29:54.382
   173  2656      8     19      60    34|0  0.2% 0.8%       0 2.56G 803M 0|0 0|0   459k   2.69m  162 juju  PRI Sep 24 14:30:09.386
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
   110  1503      8     17      11    33|0  0.1% 0.7%       1 2.56G 806M 0|0 0|0   269k   1.55m  162 juju  PRI Sep 24 14:30:24.381
    40   324      4     10       6    16|0  0.1% 0.7%       0 2.56G 806M 0|0 0|0  67.8k    379k  162 juju  PRI Sep 24 14:30:39.379
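
To read the peak off output like the above without scanning it by eye, a small filter can be used. This is only a minimal sketch: `peak_query` is a hypothetical helper name (not part of Juju or mongostat), and it assumes the plain-text mongostat format shown here, with a header row that starts with `insert` and contains a `query` column.

```shell
#!/bin/bash
# peak_query: read mongostat text output on stdin and print the largest
# value seen in the "query" column. Header and warning lines are skipped.
# (Hypothetical helper, written against the output format shown above.)
peak_query() {
  awk '
    # Locate the query column from the header row rather than hard-coding it.
    $1 == "insert" { for (i = 1; i <= NF; i++) if ($i == "query") col = i; next }
    # Data rows: the first field is a count, optionally prefixed with "*".
    col && $1 ~ /^\*?[0-9]+$/ {
      v = $col; sub(/^\*/, "", v)
      if (v + 0 > max) max = v + 0
    }
    END { print max + 0 }
  '
}

# Example using three rows copied from the output above.
peak_query <<'EOF'
insert query update delete getmore command dirty used flushes vsize  res qrw arw net_in net_out conn  set repl                time
    *0    43      1     *0       4     1|0  0.0% 0.5%       0 2.11G 488M 0|0 0|0  8.33k   96.8k   26 juju  PRI Sep 24 14:17:54.379
   392  8773      1     *0      85     2|0  0.1% 0.7%       0 2.44G 690M 1|0 0|0  1.40m   8.77m  162 juju  PRI Sep 24 14:25:09.387
    40   324      4     10       6    16|0  0.1% 0.7%       0 2.56G 806M 0|0 0|0  67.8k    379k  162 juju  PRI Sep 24 14:30:39.379
EOF
```

For the three sample rows this prints 8773. Piping a saved capture of the full run through `peak_query` would work the same way, since the WARNING lines and repeated headers are ignored.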

Observations:

  1. We hit over 8k read operations (the mongostat query column peaked at 8773).
  2. The load on the machine was higher than that of the dqlite tests with a semaphore.

Thoughts:

  1. We're not comparing apples with apples. A dqlite transaction can contain many queries, whereas mongostat reports individual operations, which are closer to the queries inside a dqlite transaction. We don't know the total number of queries running for dqlite, and it is unlikely to be the same as the 8k mark.
  2. How much can we trust mongostat? I'm not saying it's lying, but we have seen misreporting from mongo tools in the past.
  3. Adding prometheus metrics around mongo so that we can validate the numbers is quite a large task, and one that we don't want to tackle right now.
  4. The semaphore in Juju is giving a nice back-off mechanism, but it's not allowing us to reach peak performance.

Next steps:

  1. Hit just the API without going through the CLI and see if it's possible to get a more apples-to-apples view.

@SimonRichardson SimonRichardson mentioned this pull request Sep 24, 2024
3 tasks
jameinel and others added 4 commits September 26, 2024 08:53
It takes a lot to wire raw metrics all the way through, and the test suites are broken because manifolds in the test suites
expect things that aren't there. But it does show up in `juju_metrics`.
chore: expose metrics to the perf worker
If you use a normal prometheus.Counter, they have to have a unique
namespace, or you get a collision of the same Collector being registered
2x. So we move to a CounterVec and include the model-uuid, which gets us
around this. It is probably better to have it, anyway.
The issue really was that we were registering the same (Name,Subsystem) multiple times.
The only way around this is to push up the registration of the perf worker higher, and
have it create a CounterVec. We can re-use ForModel. I don't know that it is ideal,
as it means the agent/engine knows about additional metrics that are specific to 1 worker.
But we're just hacking it together here.
@SimonRichardson SimonRichardson added the do not merge Even if a PR has been approved, do not merge the PR! label Sep 27, 2024
@SimonRichardson
Member Author

Closing this; we're not going to ship the perf worker with 3.6. This was a good exercise in how to match 4.0 by passing State into the model workers. Performance from that perspective would have been a bonus, instead of going via the API.
