Proxy release 2024-10-10 #9341

vipvap · 2024-10-10T06:02:05Z

Proxy release 2024-10-10

Please merge this Pull Request using the 'Create a merge commit' button

The apt install stage before this commit:

0 upgraded, 391 newly installed, 0 to remove and 9 not upgraded. Need to get 261 MB of archives.

after:

0 upgraded, 367 newly installed, 0 to remove and 9 not upgraded. Need to get 220 MB of archives.

Address minor technical debt in Layer inspired by #9224:
- layer usage as arg same as in spans
- avoid one Weak::upgrade

Because:
- it's nice to be up-to-date,
- we already had axum 0.7 in our dependency tree, so this avoids having to compile two versions, and
- it removes one of the remaining dependencies on hyper version 0

Also bumps the 'tokio-tungstenite' dependency, to avoid having two versions in the dependency tree.

Follow-up of #9234 to give hyper 1.0 the version-free name, and the legacy version of hyper the one with the version number inside. As we move away from hyper 0.14, we can remove the `hyper0` name piece by piece. Part of #9255

## Problem

`Oversized vectored read [...]` logs are spewing in prod because we have a few keys that are unexpectedly large:
* reldir/relblock - these are unbounded, so it's known technical debt
* slru block - they can be a bit bigger than 128KiB due to storage format overhead

## Summary of changes

* Bump threshold to 130KiB
* Don't warn on oversized reldir and dbdir keys

Closes #8967
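The two changes in the summary can be read as one predicate. The sketch below is only an illustration: the function name, the string labels for key kinds, and the strict `>` comparison are assumptions, not taken from the pageserver code.

```python
KIB = 1024

# Threshold named in the summary above (bumped from 128KiB to 130KiB).
OVERSIZED_READ_THRESHOLD = 130 * KIB

# Key kinds that are known to be unbounded. The string labels are hypothetical.
UNBOUNDED_KEY_KINDS = frozenset({"reldir", "dbdir"})


def should_warn_oversized(key_kind: str, read_size: int) -> bool:
    """Return True if a vectored read of this size deserves a warning."""
    if key_kind in UNBOUNDED_KEY_KINDS:
        # Unbounded keys are known technical debt, so stay quiet about them.
        return False
    return read_size > OVERSIZED_READ_THRESHOLD
```

With this shape, an slru block slightly over 128KiB no longer triggers the warning, while anything above 130KiB still does.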

Panic was triggered only when dump selected no timelines. sentry report: https://neondatabase.sentry.io/issues/5832368589/

…#9253) See [this comment](#8888 (comment)).

* I had to install `m4` in order to be able to run locally
* The docs/docker.md was missing a pointer to where the compute node code is

(Was originally on #8888 but I am pulling this out)

Add wrappers for a few commands that didn't have them before. Move the logic to generate tenant and timeline IDs from NeonCli to the callers, so that NeonCli is more purely just a type-safe wrapper around 'neon_local'.

In passing, rename it to NeonLocalCli, to reflect that the binary is called 'neon_local'. Add a wrapper for the 'timeline_import' command, eliminating the last raw call to the raw_cli() function from tests, except for a few in test_neon_cli.py which are about testing 'neon_local' itself. All the other calls are now made through the strongly-typed wrapper functions.

…ds (#9195) This makes it more clear that the functions in NeonLocalCli are just typed wrappers around the corresponding 'neon_local' commands.

…9195)

## Problem

The S3 tests couldn't use SSO authentication for local tests against S3.

## Summary of changes

Enable the `sso` feature of `aws-config`. Also run `cargo hakari generate` which made some updates to `workspace_hack`.

## Problem

Secondary tenant heatmaps were always downloaded, even when they hadn't changed. This can be avoided by using a conditional GET request passing the `ETag` of the previous heatmap.

## Summary of changes

The `ETag` was already plumbed down into the heatmap downloader, and just needed further plumbing into the remote storage backends.

* Add a `DownloadOpts` struct and pass it to `RemoteStorage::download()`.
* Add an optional `DownloadOpts::etag` field, which uses a conditional GET and returns `DownloadError::Unmodified` on match.
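A minimal Python sketch of the conditional-download idea described above. It only borrows the names `DownloadOpts`, `etag` and `Unmodified` from the summary; the in-memory backend, the SHA-256 based ETag and the `refresh_heatmap` helper are hypothetical and do not reflect the actual Rust `RemoteStorage` API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


class Unmodified(Exception):
    """The caller's ETag matches the stored object (like HTTP 304)."""


@dataclass
class DownloadOpts:
    # ETag of the copy the caller already holds; None means "always download".
    etag: Optional[str] = None


@dataclass
class Download:
    body: bytes
    etag: str


class InMemoryStorage:
    """Hypothetical stand-in for a remote storage backend."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def upload(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def download(self, key: str, opts: Optional[DownloadOpts] = None) -> Download:
        body = self._objects[key]
        # Assumption: derive the ETag from the content, as many object stores do.
        etag = hashlib.sha256(body).hexdigest()
        if opts is not None and opts.etag == etag:
            raise Unmodified(key)
        return Download(body, etag)


def refresh_heatmap(storage: InMemoryStorage, key: str,
                    cached: Optional[Download]) -> Download:
    """Re-download only if the object changed since `cached` was fetched."""
    opts = DownloadOpts(etag=cached.etag if cached else None)
    try:
        return storage.download(key, opts)
    except Unmodified:
        assert cached is not None  # Unmodified implies an ETag was sent
        return cached
```

The caller keeps the previous `Download` around and passes it back in; an unchanged object costs only the ETag comparison rather than a full body transfer.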

## Problem

Creation of a timeline during a reconciliation can lead to unavailability if the user attempts to start a compute before the storage controller has notified cplane of the cut-over.

## Summary of changes

Create timelines on all currently attached locations. For the latest location, we still look at the database (this is as previously). With this change we also look into the observed state to find *other* attached locations.

Related #9144

NeonWALReader needs to know the LSN before which WAL is not available locally, that is, the basebackup LSN. Previously it was taken from WalpropShmemState, but that's racy, as walproposer sets it there only after successful election. Get it directly with GetRedoStartLsn. Should fix flakiness of test_ondemand_wal_download_in_replication_slot_funcs etc. ref #9201

1. Adds local-proxy to compute image and vm spec
2. Updates local-proxy config processing, writing PID to a file eagerly
3. Updates compute-ctl to understand local proxy compute spec and to send SIGHUP to local-proxy over that pid.

closes neondatabase/cloud#16867
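The PID-file handshake in steps 2 and 3 can be sketched roughly as follows. This is a Unix-only illustration, not the compute-ctl or local-proxy code: the function names are hypothetical, and the `kill` parameter exists only so the sketch can be exercised without signalling a real process.

```python
import os
import signal
from pathlib import Path
from typing import Callable


def write_pid_file(path: Path) -> int:
    """Process side: record our PID eagerly, before doing any other work."""
    pid = os.getpid()
    path.write_text(f"{pid}\n")
    return pid


def signal_reload(path: Path,
                  kill: Callable[[int, int], None] = os.kill) -> int:
    """Controller side: read the PID file and ask that process to reload."""
    pid = int(path.read_text().strip())
    kill(pid, signal.SIGHUP)
    return pid
```

Writing the PID before anything else narrows the window in which the controller finds no file to read, which is presumably why the change writes it eagerly.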

Fixes (#9020)
- Use the compute::COULD_NOT_CONNECT for connection error message;
- Eliminate logging for one connection attempt;
- Typo fix.

If a peer safekeeper needs a garbage-collected segment, it will now be fetched from S3 using on-demand WAL download. This reduces the danger of running out of disk space when a safekeeper fails.

## Problem

See #9199

## Summary of changes

Fix update of hits/misses for LFC and prefetch introduced in 78938d1

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

#9266)

rename console -> control_plane
rename web -> console_redirect

I think these names are a little more representative.

The prefetch-queue hash table uses a BufferTag struct as the hash key, and it's hashed using hash_bytes(). It's important that all the padding bytes in the key are cleared, because hash_bytes() will include them.

I was getting compiler warnings like this on v14 and v15, when compiling with -Warray-bounds:

    In function ‘prfh_lookup_hash_internal’,
        inlined from ‘prfh_lookup’ at pg_install/v14/include/postgresql/server/lib/simplehash.h:821:9,
        inlined from ‘neon_read_at_lsnv’ at pgxn/neon/pagestore_smgr.c:2789:11,
        inlined from ‘neon_read_at_lsn’ at pgxn/neon/pagestore_smgr.c:2904:2:
    pg_install/v14/include/postgresql/server/storage/relfilenode.h:90:43: warning: array subscript ‘PrefetchRequest[0]’ is partly outside array bounds of ‘BufferTag[1]’ {aka ‘struct buftag[1]’} [-Warray-bounds]
       89 | ((node1).relNode == (node2).relNode && \
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       90 | (node1).dbNode == (node2).dbNode && \
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
       91 | (node1).spcNode == (node2).spcNode)
          | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    pg_install/v14/include/postgresql/server/storage/buf_internals.h:116:9: note: in expansion of macro ‘RelFileNodeEquals’
      116 | RelFileNodeEquals((a).rnode, (b).rnode) && \
          | ^~~~~~~~~~~~~~~~~
    pgxn/neon/neon_pgversioncompat.h:25:31: note: in expansion of macro ‘BUFFERTAGS_EQUAL’
       25 | #define BufferTagsEqual(a, b) BUFFERTAGS_EQUAL(*(a), *(b))
          | ^~~~~~~~~~~~~~~~
    pgxn/neon/pagestore_smgr.c:220:34: note: in expansion of macro ‘BufferTagsEqual’
      220 | #define SH_EQUAL(tb, a, b) (BufferTagsEqual(&(a)->buftag, &(b)->buftag))
          | ^~~~~~~~~~~~~~~
    pg_install/v14/include/postgresql/server/lib/simplehash.h:280:77: note: in expansion of macro ‘SH_EQUAL’
      280 | #define SH_COMPARE_KEYS(tb, ahash, akey, b) (ahash == SH_GET_HASH(tb, b) && SH_EQUAL(tb, b->SH_KEY, akey))
          | ^~~~~~~~
    pg_install/v14/include/postgresql/server/lib/simplehash.h:799:21: note: in expansion of macro ‘SH_COMPARE_KEYS’
      799 | if (SH_COMPARE_KEYS(tb, hash, key, entry))
          | ^~~~~~~~~~~~~~~
    pgxn/neon/pagestore_smgr.c: In function ‘neon_read_at_lsn’:
    pgxn/neon/pagestore_smgr.c:2742:25: note: object ‘buftag’ of size 20
     2742 | BufferTag buftag = {0};
          | ^~~~~~

This commit silences those warnings, although it's not clear to me why the compiler complained like that in the first place. I found the issue with padding bytes while looking into those warnings, but that was coincidental, I don't think the padding bytes explain the warnings as such.

In v16, the BUFFERTAGS_EQUAL macro was replaced with a static inline function, and that also silences the compiler warning. Not clear to me why.
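Why uncleared padding matters for a byte-wise hash can be shown with a small `ctypes` sketch. The `Key` struct below is a hypothetical two-field layout chosen only because it has padding on common platforms; it is not the real `BufferTag`, and SHA-256 stands in for `hash_bytes()`.

```python
import ctypes
import hashlib


class Key(ctypes.Structure):
    # A 1-byte field followed by a 4-byte field: with the usual 4-byte
    # alignment of c_uint32, the compiler leaves 3 padding bytes in between.
    _fields_ = [("fork", ctypes.c_uint8), ("block", ctypes.c_uint32)]


def hash_raw_bytes(key: Key) -> str:
    """Hash the struct's whole memory image, padding included."""
    return hashlib.sha256(bytes(key)).hexdigest()


def make_zeroed(fork: int, block: int) -> Key:
    # ctypes zero-initialises new structures, so the padding is all zeros.
    return Key(fork, block)


def make_dirty(fork: int, block: int) -> Key:
    # Start from garbage memory, then assign only the named fields.
    key = Key.from_buffer_copy(b"\xff" * ctypes.sizeof(Key))
    key.fork = fork
    key.block = block
    return key


clean = make_zeroed(1, 2)
dirty = make_dirty(1, 2)
same_fields = (clean.fork, clean.block) == (dirty.fork, dirty.block)
same_hash = hash_raw_bytes(clean) == hash_raw_bytes(dirty)
```

Both keys compare equal field by field, yet their raw bytes differ in the padding, so a hash over the raw bytes puts them in different buckets. Zeroing the whole struct before filling in the fields avoids this.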

…9031) Requires #9086 first to have `local_proxy_config`. This logic can still be reviewed implementation-wise. Create the roles related to JWT auth functionality without attributes and without the `neon_superuser` group. Read the JWT-related roles from the `local_proxy_config` `JWKS` settings and handle them differently than other console-created roles.

…9294) In PostgreSQL v16, the BUFFERTAGS_EQUAL macro was replaced with a static inline function, BufferTagsEqual. Let's use the new name going forward, and have backwards-compatibility glue to allow using the new name on v14 and v15, rather than the other way round. This also makes BufferTagsEqual consistent with InitBufferTag, for which we were already using the new name.

I'm trying to debug a situation with the LR benchmark publisher not being in the correct state. This should aid in debugging, while just being generally useful. PR: #9265 Signed-off-by: Tristan Partin <tristan@neon.tech>

Update hyper and tonic again in the storage broker, this time with a fix for the issue that made us revert the update last time. The first commit is a revert of #9268, the second a fix for the issue. fixes #9231.

In short: Currently we reserve 75% of memory for the LFC, meaning that we scale up to keep postgres using less than 25% of the compute's memory. This means that for certain memory-heavy workloads, we end up scaling much higher than is actually needed: in the worst case, up to 4x, although in practice it tends not to be quite so bad. Part of neondatabase/autoscaling#1030.
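The 4x worst case follows directly from the 75% reservation: if postgres may use at most 25% of the compute's memory, the compute must be sized at postgres usage divided by 0.25. A small sketch of that arithmetic, with a hypothetical function name and GiB as an arbitrary unit:

```python
# Share of compute memory reserved for the LFC, per the description above.
LFC_FRACTION = 0.75


def required_compute_memory(postgres_gib: float,
                            lfc_fraction: float = LFC_FRACTION) -> float:
    """Smallest compute size that keeps postgres within its non-LFC share."""
    return postgres_gib / (1.0 - lfc_fraction)


# A workload whose postgres processes need 2 GiB forces an 8 GiB compute.
needed = required_compute_memory(2.0)
scale_factor = needed / 2.0
```

Here `scale_factor` is 4.0, which matches the "up to 4x" figure: that is the worst case in which the workload gets no benefit from the memory set aside for the LFC.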

The neon_cli functions print the command that gets executed, which contains the same information.

Before:

2024-10-07 22:32:28.884 INFO [neon_fixtures.py:3927] Stopping safekeeper 1
2024-10-07 22:32:28.884 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 1"
2024-10-07 22:32:28.989 INFO [neon_fixtures.py:3927] Stopping safekeeper 2
2024-10-07 22:32:28.989 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 2"
2024-10-07 22:32:29.93 INFO [neon_fixtures.py:3927] Stopping safekeeper 3
2024-10-07 22:32:29.94 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 3"
2024-10-07 22:32:29.251 INFO [neon_cli.py:450] Stopping pageserver with ['pageserver', 'stop', '--id=1']
2024-10-07 22:32:29.251 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local pageserver stop --id=1"

After:

2024-10-07 22:32:28.884 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 1"
2024-10-07 22:32:28.989 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 2"
2024-10-07 22:32:29.94 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local safekeeper stop 3"
2024-10-07 22:32:29.251 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local pageserver stop --id=1"

## Summary of changes CI: Collect stats for Github Workflows Runs

- PostGIS 3.5.0
- pgrouting 3.6.2
- h3 4.1.3
- unit 7.9
- pgjwt version (f3d82fd)
- pg_hashids 1.2.1
- ip4r 2.4.2
- prefix 1.2.10
- postgresql-hll 2.18
- pg_roaringbitmap 0.5.4
- pg-semver 0.40.0

update support of extensions for v14-v16:
- unit 7.7 -> 7.9
- pgjwt 9742dab -> f3d82fd

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

This seems to paper over a behavioral difference in Python 3.9 and Python 3.12 with how dataclasses work with mutable variables. On Python 3.12, I get the following error: ValueError: mutable default <class 'dict'> for field EXTRACTORS is not allowed: use default_factory This obviously doesn't occur in our testing environment. When I do what the error tells me, EXTRACTORS doesn't seem to exist as an attribute on the class in at least Python 3.9. The solution provided in this commit seems like the least amount of friction to keep the wheels turning. Signed-off-by: Tristan Partin <tristan@neon.tech>
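For reference, the error quoted above and the two standard ways around it look roughly like this. The class names are hypothetical and the commit's actual fix is not shown in this message; the sketch only illustrates why following the error's advice makes `EXTRACTORS` disappear from the class itself.

```python
from dataclasses import dataclass, field
from typing import Callable, ClassVar, Dict


def mutable_default_error() -> str:
    """Define a dataclass with a dict default and return the error text."""
    try:
        @dataclass
        class Broken:
            EXTRACTORS: dict = {}
    except ValueError as exc:
        return str(exc)
    return ""


@dataclass
class WithFactory:
    # What the error suggests: every instance gets its own dict, but no
    # class-level attribute named EXTRACTORS is left behind.
    EXTRACTORS: Dict[str, Callable[[str], str]] = field(default_factory=dict)


@dataclass
class WithClassVar:
    # ClassVar annotations are not treated as dataclass fields, so a shared
    # class-level dict is accepted and stays reachable on the class.
    EXTRACTORS: ClassVar[Dict[str, Callable[[str], str]]] = {"upper": str.upper}
```

`WithFactory.EXTRACTORS` raises `AttributeError` while `WithFactory().EXTRACTORS` works, which is consistent with the observation that the attribute "doesn't seem to exist on the class". If the value is meant to be shared class-level data rather than a per-instance field, the `ClassVar` form is one option.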

Fixes some types, adds some types, and adds some override annotations. Signed-off-by: Tristan Partin <tristan@neon.tech>

… port (#9298) neondatabase/cloud#18349 Use the `-local-proxy` suffix to make sure we get the 10432 local_proxy port back from cplane.

github-actions · 2024-10-10T06:51:30Z

5094 tests run: 4887 passed, 0 failed, 207 skipped (full report)

Flaky tests (1)

Postgres 17

test_scrubber_physical_gc_ancestors_split: debug-x86-64

Code coverage* (full report)

functions: 31.4% (7546 of 24012 functions)
lines: 49.2% (60348 of 122535 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
306094a at 2024-10-10T06:51:29.975Z :recycle:}

cloneable · 2024-10-10T07:12:10Z

Relevant PRs:

danieltprice · 2024-10-10T20:01:42Z

Reviewed for changelog

hlinnaka and others added 30 commits October 3, 2024 10:05

chore: smaller layer changes (#9247)

dbef1b0

Address minor technical debt in Layer inspired by #9224:
- layer usage as arg same as in spans
- avoid one Weak::upgrade

Rename hyper 1.0 to hyper and hyper 0.14 to hyper0 (#9254)

9d93dd4

Follow-up of #9234 to give hyper 1.0 the version-free name, and the legacy version of hyper the one with the version number inside. As we move away from hyper 0.14, we can remove the `hyper0` name piece by piece. Part of #9255

safekeeper: fix panic in debug_dump. (#9097)

d785fcb

Panic was triggered only when dump selected no timelines. sentry report: https://neondatabase.sentry.io/issues/5832368589/

Revert hyper and tonic updates (#9268)

e3d6eca

chore: remove unnecessary comments in compute/Dockerfile.compute-node (…

2fac0b7

…#9253) See [this comment](#8888 (comment)).

chore: makes some onboarding document improvements (#9216)

4e9b32c

* I had to install `m4` in order to be able to run locally
* The docs/docker.md was missing a pointer to where the compute node code is

(Was originally on #8888 but I am pulling this out)

tests: Rename NeonLocalCli functions to match the 'neon_local' comman…

8ef0c38

…ds (#9195) This makes it more clear that the functions in NeonLocalCli are just typed wrappers around the corresponding 'neon_local' commands.

tests: Add a comment explaining the rules of NeonLocalCli wrappers (#…

52232dd

…9195)

Cargo.toml: enable sso for aws-config (#9261)

60fb840

## Problem

The S3 tests couldn't use SSO authentication for local tests against S3.

## Summary of changes

Enable the `sso` feature of `aws-config`. Also run `cargo hakari generate` which made some updates to `workspace_hack`.

remote_storage: add head_object integration test (#9274)

04a6222

proxy: exclude triple logging of connect compute errors (#9277)

2d248ae

Fixes (#9020)
- Use the compute::COULD_NOT_CONNECT for connection error message;
- Eliminate logging for one connection attempt;
- Typo fix.

safekeeper: remove local WAL files ignoring peer_horizon_lsn. (#8900)

eae4470

If a peer safekeeper needs a garbage-collected segment, it will now be fetched from S3 using on-demand WAL download. This reduces the danger of running out of disk space when a safekeeper fails.

proxy: rename console -> control_plane, rename web -> console_redirect (

8cd7b5b

#9266)

rename console -> control_plane
rename web -> console_redirect

I think these names are a little more representative.

proxy: Move module base files into module directory (#9297)

ad267d8

Improve logging on changes in a compute's status

6eba29c

I'm trying to debug a situation with the LR benchmark publisher not being in the correct state. This should aid in debugging, while just being generally useful. PR: #9265 Signed-off-by: Tristan Partin <tristan@neon.tech>

storage_broker: update hyper and tonic again (#9299)

912d47e

Update hyper and tonic again in the storage broker, this time with a fix for the issue that made us revert the update last time. The first commit is a revert of #9268, the second a fix for the issue. fixes #9231.

hlinnaka and others added 7 commits October 9, 2024 15:51

added workflow Report Workflow Stats (#9330)

108a211

## Summary of changes CI: Collect stats for Github Workflows Runs

local_proxy: integrate with pg_session_jwt extension (#9086)

7543406

Improve some typing in test_runner

d346458

Fixes some types, adds some types, and adds some override annotations. Signed-off-by: Tristan Partin <tristan@neon.tech>

add local-proxy suffix to wake-compute requests, respect the returned…

306094a

… port (#9298) neondatabase/cloud#18349 Use the `-local-proxy` suffix to make sure we get the 10432 local_proxy port back from cplane.

vipvap requested review from a team as code owners October 10, 2024 06:02

vipvap requested review from jcsp, conradludgate, knizhnik, petuhovskiy and clipperhouse and removed request for a team October 10, 2024 06:02

cloneable requested review from cloneable and removed request for conradludgate October 10, 2024 07:13

cloneable approved these changes Oct 10, 2024


cloneable merged commit a202b1b into release-proxy Oct 10, 2024
220 of 226 checks passed

cloneable deleted the rc/proxy/2024-10-10 branch October 10, 2024 07:17
