docs(rfc): Independent compute release flow (neondatabase#8881)
Related to neondatabase/cloud#11698
Showing 1 changed file with 343 additions and 0 deletions.
@@ -0,0 +1,343 @@
# Independent compute release

Created at: 2024-08-30. Author: Alexey Kondratov (@ololobus)

## Summary

This document proposes an approach to a fully independent compute release flow. It aims to
provide the following properties:

- The process is automated as much as possible to minimize human error.
- Compute<->storage protocol compatibility is ensured.
- A transparent release history is available with an easy rollback strategy.
- Although not in the scope of this document, there is a viable way to extend the proposed release
  flow to achieve the canary and/or blue-green deployment strategies.

## Motivation

Previously, the compute release was tightly coupled to the storage release. This meant that once
some storage nodes were restarted with a newer version, all new compute starts using these nodes
automatically got the new version. Thus, two releases happened in parallel, which increased the blast
radius and made ownership fuzzy.

Now, we practice a manual v0 independent compute release flow -- after getting a new compute release
image and tag, we pin it region by region using the Admin UI. It's better, but it still has its own flaws:

1. It's a simple but fairly manual process, as you need to click through a few pages.
2. It's prone to human error, e.g., you could mistype or copy the wrong compute tag.
3. We now require an additional approval in the Admin UI, which partially solves problem 2
   but also makes the whole process pretty annoying, as you constantly need to go back
   and forth between two people.

## Non-goals

It's not the goal of this document to propose a design for some general-purpose release tool like Helm.
The document considers how the current compute fleet is orchestrated at Neon. Even if we later
decide to split the control plane further (e.g., introduce a separate compute controller), the proposed
release process shouldn't change much, i.e., the releases table and API will reside in
one of the parts.

Achieving the canary and/or blue-green deploy strategies is out of the scope of this document. They
were kept in mind, though, so it's expected that the proposed approach will lay down the foundation
for implementing them in future iterations.

## Impacted components

Compute, control plane, CI, observability (some Grafana dashboards may require changes).

## Prior art

One very close example is how Helm tracks its [release history](https://helm.sh/docs/helm/helm_history/).

In the code:

- [Release](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/release.go#L20-L43)
- [Release info](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/info.go#L24-L40)
- [Release status](https://github.com/helm/helm/blob/2b30cf4b61d587d3f7594102bb202b787b9918db/pkg/release/status.go#L18-L42)

TL;DR: a Helm release has several important attributes:

- Revision -- unique release ID/primary key. It is not the same as the application version,
  because the same version can be deployed several times, e.g., after a newer version rollback.
- App version -- version of the application chart/code.
- Config -- set of overrides to the default config of the application.
- Status -- current status of the release in the history.
- Timestamps -- tracks when a release was created and deployed.

## Proposed implementation

### Separate release branch

We will use a separate release branch, `release-compute`, to have a clean history for releases and commits.
In order to avoid confusion with storage releases, we will use a different prefix for compute [git release
tags](https://github.com/neondatabase/neon/releases) -- `release-compute-XXXX`. We will use the same tag for
Docker images as well. The resulting image name, `neondatabase/compute-node-v16:release-compute-XXXX`, is longer
and a bit redundant, but it's better to have image and git tags in sync.

Currently, the control plane relies on the numeric compute and storage release versions to decide on compute->storage
compatibility. Once we implement this proposal, we should drop this code, as release numbers will be completely
independent. The only constraint we want is that release numbers must increase monotonically within the same release branch.

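The monotonicity constraint could be enforced with a small check in CI. The following is a minimal sketch, assuming that `XXXX` in `release-compute-XXXX` is a plain decimal number; the helper names `parse_compute_tag` and `is_next_release_valid` are hypothetical and are not part of any existing tooling.

```python
import re

# Assumption: compute release tags look like `release-compute-<digits>`.
TAG_RE = re.compile(r"^release-compute-(\d+)$")


def parse_compute_tag(tag: str) -> int:
    """Extract the numeric release version from a compute release tag."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a compute release tag: {tag!r}")
    return int(match.group(1))


def is_next_release_valid(existing_tags: list[str], new_tag: str) -> bool:
    """Return True if `new_tag` is strictly newer than every existing tag."""
    new_version = parse_compute_tag(new_tag)
    return all(new_version > parse_compute_tag(tag) for tag in existing_tags)
```

For example, with `release-compute-9000` and `release-compute-9057` already present, `release-compute-9100` is accepted, while re-using `release-compute-9057` is rejected.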
### Compute config/settings manifest

We will create a new sub-directory `compute` and file `compute/manifest.yaml` with the following structure:

```yaml
pg_settings:
  # Common settings for primaries and secondaries of all versions.
  common:
    wal_log_hints: "off"
    max_wal_size: "1024"

  per_version:
    14:
      # Common settings for both replica and primary of version PG 14
      common:
        shared_preload_libraries: "neon,pg_stat_statements,extension_x"
    15:
      common:
        shared_preload_libraries: "neon,pg_stat_statements,extension_x"
      # Settings that should be applied only to replicas
      replica:
        # Available only starting from Postgres 15
        recovery_prefetch: "off"
    # ...
    17:
      common:
        # For example, if third-party `extension_x` is not yet available for PG 17
        shared_preload_libraries: "neon,pg_stat_statements"
      replica:
        recovery_prefetch: "off"
```

**N.B.** Setting values should be strings: `on|off` for booleans and a number (as a string)
without units for all numeric settings. That's how the control plane currently operates.

The priority of settings will be as follows (a higher number means a higher priority):

1. Any static and hard-coded settings in the control plane
2. `pg_settings->common`
3. Per-version `common`
4. Per-version `replica`
5. Any per-user/project/endpoint overrides in the control plane
6. Any dynamic setting calculated based on the compute size

**N.B.** For simplicity, we do not apply any custom logic to `shared_preload_libraries`, so it's completely
overridden if specified on some level. Make sure that you include all necessary extensions in it when you
do any overrides.

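The priority order above amounts to merging dictionaries from the lowest to the highest priority, with later layers overwriting earlier ones. Below is a minimal sketch of that merge, assuming the manifest has already been parsed into a Python dict; the function name `effective_pg_settings` and its parameters are hypothetical and do not describe the actual control plane code.

```python
def effective_pg_settings(manifest, pg_version, is_replica,
                          static_settings=None, endpoint_overrides=None,
                          dynamic_settings=None):
    """Merge settings layers; later (higher-priority) layers win."""
    pg = manifest.get("pg_settings", {})
    per_version = pg.get("per_version", {}).get(pg_version, {})
    layers = [
        static_settings or {},                               # 1. hard-coded
        pg.get("common", {}),                                # 2. pg_settings->common
        per_version.get("common", {}),                       # 3. per-version common
        per_version.get("replica", {}) if is_replica else {},  # 4. per-version replica
        endpoint_overrides or {},                            # 5. user/project/endpoint
        dynamic_settings or {},                              # 6. based on compute size
    ]
    result = {}
    for layer in layers:
        # Plain overwrite: `shared_preload_libraries` is replaced as a whole,
        # not merged, matching the N.B. above.
        result.update(layer)
    return result


# A trimmed copy of the example manifest (YAML integer keys become ints).
MANIFEST = {
    "pg_settings": {
        "common": {"wal_log_hints": "off", "max_wal_size": "1024"},
        "per_version": {
            15: {
                "common": {"shared_preload_libraries": "neon,pg_stat_statements,extension_x"},
                "replica": {"recovery_prefetch": "off"},
            },
            17: {
                "common": {"shared_preload_libraries": "neon,pg_stat_statements"},
                "replica": {"recovery_prefetch": "off"},
            },
        },
    }
}
```

With this sketch, a PG 17 replica gets `recovery_prefetch: "off"` and the shorter `shared_preload_libraries`, while a PG 15 primary gets no `recovery_prefetch` entry at all.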
**N.B.** There is a tricky question about what to do with the custom compute image pinning we sometimes
do for particular projects and customers. That's usually ad-hoc work and the images are based on
the latest compute image, so it's relatively safe to assume that we can use settings from the latest compute
release. If for some reason that's not true and further overrides are needed, they can also be applied
on the project level together with pinning the image, so it's the on-call/engineer/support responsibility to
ensure that compute starts with the specified custom image. The only real risk is that the compute image will get
stale and settings from new releases will drift away, so eventually it will get something incompatible,
but i) this is an operational issue, as we do not want stale images anyway, and ii) base settings
receive something really new so rarely that the chance of this happening is very low. If we want to solve it completely,
then together with pinning the image we could also pin the matching release revision in the control plane.

The compute team will own the content of `compute/manifest.yaml`.

### Control plane: releases table

In order to store information about releases, the control plane will use a table `compute_releases` with the following
schema:

```sql
CREATE TABLE compute_releases (
  -- Unique release ID
  -- N.B. Revision won't be synchronized across all regions, because all control planes are technically independent
  -- services. We have the same situation with Helm releases as well because they could be deployed and rolled back
  -- independently in different clusters.
  revision BIGSERIAL PRIMARY KEY,
  -- Numeric version of the compute image, e.g. 9057
  version BIGINT NOT NULL,
  -- Compute image tag, e.g. `release-9057`
  tag TEXT NOT NULL,
  -- Current release status. Currently, it will be a simple enum
  --   * `deployed` -- release is deployed and used for new compute starts.
  --                   Exactly one release can have this status at a time.
  --   * `superseded` -- release has been replaced by a newer one.
  -- But we can always extend it in the future when we need more statuses
  -- for more complex deployment strategies.
  status TEXT NOT NULL,
  -- Any additional metadata for compute in the corresponding release
  manifest JSONB NOT NULL,
  -- Timestamp when release record was created in the control plane database
  created_at TIMESTAMP NOT NULL DEFAULT now(),
  -- Timestamp when release deployment was finished
  deployed_at TIMESTAMP
);
```

We keep track of old releases not only for the sake of audit, but also because we usually have ~30% of
computes still running an image from one of the previous releases. When users want to reconfigure
such computes without restarting them, the control plane needs to know which settings are applicable to them,
so information about the previous releases must be readily available. Some other auxiliary info could be
needed as well: supported extensions, compute flags, etc.

**N.B.** Here, we can end up in an ambiguous situation when the same compute image is deployed twice, e.g.,
it was deployed once, then rolled back, and then deployed again, potentially with a different manifest. Yet,
we could've started some computes with the first deployment and some with the second. Thus, when we need to
look up the manifest for a compute by its image tag, we will see two records in the table with the same tag
but different revision numbers. We can assume that this could happen only in case of rollbacks, so we
can just take the latest revision for the given tag.

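The "take the latest revision for the given tag" rule is a single `ORDER BY revision DESC LIMIT 1` query. Below is a minimal sketch that uses an in-memory SQLite database as a stand-in for the control plane's Postgres; the simplified schema, the sample manifests, and the helper name `manifest_for_tag` are assumptions made for illustration only.

```python
import json
import sqlite3


def manifest_for_tag(conn, tag):
    """Return the manifest of the latest revision with the given image tag."""
    row = conn.execute(
        "SELECT manifest FROM compute_releases "
        "WHERE tag = ? ORDER BY revision DESC LIMIT 1",
        (tag,),
    ).fetchone()
    return json.loads(row[0]) if row is not None else None


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the schema above (SQLite has no BIGSERIAL/JSONB).
conn.execute(
    "CREATE TABLE compute_releases ("
    " revision INTEGER PRIMARY KEY AUTOINCREMENT,"
    " version INTEGER NOT NULL,"
    " tag TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " manifest TEXT NOT NULL)"
)
# The same tag is deployed twice with different (hypothetical) manifests.
rows = [
    (9057, "release-9057", "superseded", json.dumps({"max_wal_size": "1024"})),
    (9050, "release-9050", "superseded", json.dumps({"max_wal_size": "512"})),
    (9057, "release-9057", "deployed", json.dumps({"max_wal_size": "2048"})),
]
conn.executemany(
    "INSERT INTO compute_releases (version, tag, status, manifest) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
```

Here `manifest_for_tag(conn, "release-9057")` resolves to the manifest of revision 3, the later of the two deployments of that tag.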
### Control plane: management API

The control plane will implement new API methods to manage releases:

1. `POST /management/api/v2/compute_releases` to create a new release. With payload

   ```json
   {
     "version": 9057,
     "tag": "release-9057",
     "manifest": {}
   }
   ```

   and response

   ```json
   {
     "revision": 53,
     "version": 9057,
     "tag": "release-9057",
     "status": "deployed",
     "manifest": {},
     "created_at": "2024-08-15T15:52:01.0000Z",
     "deployed_at": "2024-08-15T15:52:01.0000Z"
   }
   ```

   Here, we can actually mix custom (remote) extensions metadata into the `manifest`, so that the control plane
   gets information about all available extensions not bundled into the compute image. The corresponding
   workflow in `neondatabase/build-custom-extensions` should produce it as an artifact and make
   it accessible to the workflow in `neondatabase/infra`. See the complete release flow below. Doing that,
   we put a constraint that a new custom extension requires a new compute release, which is good for safety
   but is not exactly what we want operationally (we want to be able to deploy new extensions without new
   images). Yet, it can be solved incrementally: v0 -- do not do anything with extensions at all;
   v1 -- put them into the same manifest; v2 -- make them separate entities with their own lifecycle.

   **N.B.** This method is intended to be used in CI workflows, and CI/network can be flaky. It's reasonable
   to assume that we could retry the request several times, even though it has already succeeded. Although it's
   not a big deal to create several identical releases one by one, it's better to avoid it, so the control plane
   should check whether the latest release is identical and just return `304 Not Modified` in this case.

2. `POST /management/api/v2/compute_releases/rollback` to roll back to any previously deployed release. With payload
   including the revision of the release to roll back to:

   ```json
   {
     "revision": 52
   }
   ```

   Rollback marks the current release as `superseded` and creates a new release with all the same data as the
   requested revision, but with a new revision number.

   This rollback API is not strictly needed, as we can just use the `infra` repo workflow to deploy any
   available tag. It's still nice to have for on-call and any urgent matters, for example, if we need
   to roll back and GitHub is down. It's much easier to specify only the revision number than to craft
   all the necessary data for the new release payload.

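The semantics of both methods (idempotent create and rollback as a new revision) can be sketched with an in-memory store. This is only an illustration under stated assumptions: `Release`, `ReleaseStore`, and their methods are hypothetical names, the list stands in for the `compute_releases` table, and timestamps are omitted for brevity.

```python
import copy
from dataclasses import dataclass


@dataclass
class Release:
    revision: int
    version: int
    tag: str
    manifest: dict
    status: str = "deployed"


class ReleaseStore:
    """In-memory stand-in for the `compute_releases` table."""

    def __init__(self) -> None:
        self.releases: list[Release] = []

    def _deploy(self, version: int, tag: str, manifest: dict) -> Release:
        # Keep the "exactly one deployed release" invariant.
        for release in self.releases:
            if release.status == "deployed":
                release.status = "superseded"
        new = Release(len(self.releases) + 1, version, tag, copy.deepcopy(manifest))
        self.releases.append(new)
        return new

    def create(self, version: int, tag: str, manifest: dict):
        """Return (http_status, release); a retried identical request gets 304."""
        latest = self.releases[-1] if self.releases else None
        if latest is not None and (
            (latest.version, latest.tag, latest.manifest) == (version, tag, manifest)
        ):
            return 304, latest
        return 200, self._deploy(version, tag, manifest)

    def rollback(self, revision: int) -> Release:
        """Re-deploy the data of an old revision under a new revision number."""
        target = next((r for r in self.releases if r.revision == revision), None)
        if target is None:
            raise KeyError(f"unknown revision: {revision}")
        return self._deploy(target.version, target.tag, target.manifest)
```

Note that a rollback never rewrites history in this sketch: the old record stays as it was, and the rolled-back data appears again as the newest revision.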
### Compute->storage compatibility tests

In order to safely release new compute versions independently from storage, we need to ensure that the currently
deployed storage is compatible with the new compute version. Currently, we maintain backward compatibility
in storage, but newer computes may require a newer storage version.

Remote end-to-end (e2e) tests [already accept](https://github.com/neondatabase/cloud/blob/e3468d433e0d73d02b7d7e738d027f509b522408/.github/workflows/testing.yml#L43-L48)
`storage_image_tag` and `compute_image_tag` as separate inputs. That means that we could reuse e2e tests to ensure
compatibility between storage and compute:

1. Pick the latest storage release tag and use it as `storage_image_tag`.
2. Pick a new compute tag built in the current compute release PR and use it as `compute_image_tag`.
   Here, we should use a temporary ECR image tag, because the final tag will be known only after the release PR is merged.
3. Trigger e2e tests as usual.

### Release flow

```mermaid
sequenceDiagram
    actor oncall as Compute on-call person
    participant neon as neondatabase/neon
    box private
        participant cloud as neondatabase/cloud
        participant exts as neondatabase/build-custom-extensions
        participant infra as neondatabase/infra
    end
    box cloud
        participant preprod as Pre-prod control plane
        participant prod as Production control plane
        participant k8s as Compute k8s
    end
    oncall ->> neon: Open release PR into release-compute
    activate neon
    neon ->> cloud: CI: trigger e2e compatibility tests
    activate cloud
    cloud -->> neon: CI: e2e tests pass
    deactivate cloud
    neon ->> neon: CI: pass PR checks, get approvals
    deactivate neon
    oncall ->> neon: Merge release PR into release-compute
    activate neon
    neon ->> neon: CI: pass checks, build and push images
    neon ->> exts: CI: trigger extensions build
    activate exts
    exts -->> neon: CI: extensions are ready
    deactivate exts
    neon ->> neon: CI: create release tag
    neon ->> infra: Trigger release workflow using the produced tag
    deactivate neon
    activate infra
    infra ->> infra: CI: pass checks
    infra ->> preprod: Release new compute image to pre-prod automatically <br/> POST /management/api/v2/compute_releases
    activate preprod
    preprod -->> infra: 200 OK
    deactivate preprod
    infra ->> infra: CI: wait for per-region production deploy approvals
    oncall ->> infra: CI: approve deploys region by region
    infra ->> k8s: Prewarm new compute image
    infra ->> prod: POST /management/api/v2/compute_releases
    activate prod
    prod -->> infra: 200 OK
    deactivate prod
    deactivate infra
```

## Further work

As briefly mentioned in other sections, eventually, we would like to use more complex deployment strategies.
For example, we can pass a fraction of the total compute starts that should use the new release. Then we can
mark the release as `partial` or `canary` and monitor its performance. If everything is fine, we can promote it
to `deployed` status. If not, we can roll back to the previous one.

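One possible way to route a fraction of compute starts to a canary release is deterministic hashing of some stable identifier. The sketch below is purely illustrative and is not part of this proposal: the function `pick_release_tag`, the use of an endpoint ID as the hashing key, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def pick_release_tag(endpoint_id: str, stable_tag: str, canary_tag: str,
                     canary_fraction: float) -> str:
    """Pick the canary tag for roughly `canary_fraction` of endpoint IDs."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be within [0, 1]")
    digest = hashlib.sha256(endpoint_id.encode("utf-8")).digest()
    # Map the ID to a 64-bit bucket and compare against an integer threshold.
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(canary_fraction * 2**64)
    return canary_tag if bucket < threshold else stable_tag
```

Because the choice depends only on the identifier, the same endpoint consistently lands on the same side of the split for a given fraction, which should make canary behavior easier to monitor and reason about.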
## Alternatives

In theory, we can try using Helm as-is:

1. Write a compute Helm chart. It will actually contain only a config map, which the control plane can access and read.
   N.B. We could reuse the control plane chart as well, but then it's not a fully independent release again and ownership is even fuzzier.
2. The control plane will read it and start using the new compute version for new starts.

Drawbacks:

1. Helm releases work best if the workload is controlled by the Helm chart itself. Then you can have different
   deployment strategies like rolling update, canary, or blue/green deployments. At Neon, compute starts are controlled
   by the control plane, which makes it much trickier.
2. Release visibility will suffer, i.e., instead of a nice table in the control plane and Admin UI, we would need to use
   the `helm` CLI and/or K8s UIs like K8sLens.
3. We do not restart all computes shortly after the new version release. This means that for some features and compatibility
   purposes (see above), the control plane may need some auxiliary info from the previous releases.