A new push to make openebs compatible with ARM clusters (And security concerns around images) #3789
Quick question. I've noticed a lot of the pods are custom:
Is the code not available for me to attempt to build arm64 versions of these images? I can't find it anywhere!
Hey @solarisfire, thanks for taking an interest in the project! I'm happy to say the aforementioned deprecated bitnami-shell* images are no longer used by the develop branches, so that will be part of the next minor release. As for the "proper" etcd and Loki images, we can certainly consider upgrading to the newest compatible images (i.e. images with no breaking changes). As for the ARM images, there's some previous history here: the main sticking point is that we don't have the hardware to test ARM images, which makes it troublesome to build them and even more so to support any ARM-specific issues that may arise (granted, this would be more of an issue for the data-plane than the control-plane). Maybe we can start tackling this one problem at a time, starting with the build problem first.
I spent the morning creating a build script that runs against Hetzner cloud. It currently builds arm64 variants of all of the pods and pushes them to https://hub.docker.com/repositories/solarisfire. The server used costs €0.046 per hour and the build took 26 minutes, so it only cost me about €0.02.
https://gitlab.solarisfire.com/solarisfire/openebs-arm-image-builder/-/blob/main/build.py
Similar infrastructure could be used to conduct any testing needed; I just need to understand the testing methodology and how the infrastructure needs to be laid out.
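For anyone wanting to reproduce a one-off arm64 build without the Hetzner setup, a minimal sketch using docker buildx might look like the following. The image name and build context are placeholders only; the project's real images are produced by its own (nix-based) tooling, which the linked build.py drives.

```bash
# Hypothetical sketch: cross-build a single image for arm64 and push it to a
# personal registry namespace. Image name and context path are placeholders,
# not the project's actual build tooling.
docker buildx create --name armbuilder --use
docker buildx build \
  --platform linux/arm64 \
  --tag solarisfire/some-openebs-image:develop \
  --push \
  .
```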
So I now have a Talos OS cluster, with 3 control-plane nodes and 3 worker nodes. OpenEBS seems to be running without error. I had to use a lot of overrides in the Helm values file:

```yaml
image:
  repo: solarisfire
nodeSelector:
  kubernetes.io/arch: arm64
csi:
  image:
    registry: docker.io
    repo: solarisfire
mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
  clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false
etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell
engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true
```
Provisioned a test application, and these are the logs:

```
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.843175 1 controller.go:1366] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": started
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.861393 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "stalwart/stalwart-mail-pvc"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.872242 1 provisioner_hostpath.go:77] Creating volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d at node with labels {map[kubernetes.io/hostname:worker-1]}, path:/var/openebs/local/pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d,ImagePullSecrets:[]
+ init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
- init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner 2024-10-07T12:42:41.933Z INFO app/provisioner_hostpath.go:215 {"eventcode": "local.pv.provision.success", "msg": "Successfully provisioned Local PV", "rname": "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d", "storagetype": "hostpath"}
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933346 1 controller.go:1449] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": volume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" provisioned
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933380 1 controller.go:1462] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": succeeded
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933620 1 volume_store.go:212] Trying to save persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941127 1 volume_store.go:219] persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" saved
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941924 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d
```
Interestingly, io-engine pods aren't being started on any of the nodes:

```
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.531457Z ERROR operator_diskpool::context: Unable to find io-engine node worker-1
```
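Not mentioned at this point in the thread, but a common cause of that error is the io-engine DaemonSet selecting on the openebs.io/engine=mayastor node label (the same label the Talos patch discussed below applies). A quick, hedged set of checks, assuming the default label selector is in use:

```bash
# Which nodes currently carry the label the io-engine DaemonSet selects on?
kubectl get nodes -l openebs.io/engine=mayastor

# Does the io-engine DaemonSet report any desired/ready pods at all?
kubectl -n openebs get daemonset | grep io-engine

# Label a worker so io-engine can be scheduled there (node name taken from the log above).
kubectl label node worker-1 openebs.io/engine=mayastor
```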
That's awesome @solarisfire!
For the testing, we have per-repo CI which ensures the code in each repo is correct; we use Jenkins and bors for this. Then we have a more complete e2e suite which was originally in an internal repo, but which we've started to open up:
Btw, would it be possible for you to test the init-images from the develop chart of mayastor, to ensure they are working for ARM?
Yeah, I think you need to add openebs/mayastor to your list of repos.
Easier said than done as I'm running Talos OS and cannot get to a console... I've already run into a few issues with their OpenEBS documentation: https://www.talos.dev/v1.8/kubernetes-guides/configuration/replicated-local-storage-with-openebs/

It has this config patch:

```yaml
machine:
  sysctls:
    vm.nr_hugepages: "1024"
  nodeLabels:
    openebs.io/engine: mayastor
  kubelet:
    extraMounts:
      - destination: /var/local/openebs
        type: bind
        source: /var/local/openebs
        options:
          - rbind
          - rshared
          - rw
```

I've found that directory should be /var/openebs/local. It does force the openebs.io/engine: mayastor label onto all the nodes though, which is handy. It states:
However the openebs-hostpath class is not replicated across all of the nodes as far as I'm aware.
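For anyone following along on Talos, a sketch of applying such a patch to a node (with the extraMounts destination/source corrected to /var/openebs/local as noted above); the node IP and file name are placeholders, and the exact talosctl invocation may differ between Talos versions:

```bash
# Sketch only: apply the machine-config patch above (directories corrected to
# /var/openebs/local) to a single node. 10.0.0.11 and openebs-patch.yaml are
# placeholders.
talosctl --nodes 10.0.0.11 patch machineconfig --patch @openebs-patch.yaml
```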
Hmmmm, this mayastor-io-engine pod is going to be trouble...
If I download and build spdk manually it builds fine... I think something about the nix environment is upsetting it...
Turns out the image builds fine with the code from the release/v2.7.0 branch; however, it fails to build on ARM from either the develop or staging branches (with develop being the default). It does, however, build fine on amd64. I've raised it as a bug in openebs/mayastor#1751, though I believe the issue may lie upstream in spdk-rs (although OpenEBS runs its own fork of spdk-rs?). I had to resize my worker nodes to give them more RAM, but I do now have 3 healthy mayastor-io-engine pods on my ARM cluster.
Fell on this issue while investigating the hard-coded
More or less the same hardware topology on my side: arm64 SBCs (Turing RK1) sporting Talos. Looking forward to having an arm64-compatible release; let me know if I can help.
I have it working with the following overrides:

```yaml
image:
  repo: solarisfire
nodeSelector:
  kubernetes.io/arch: arm64
csi:
  image:
    registry: docker.io
    repo: solarisfire
mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
  clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false
  io_engine:
    nodeSelector: arm64
etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell
engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true
```

Which gets installed with:

```bash
helm upgrade --install openebs --create-namespace --namespace openebs \
  -f openebs-helm-override.yaml openebs/openebs
```

I haven't done an image build in a while though. The biggest issue by far is spdk-rs inside the mayastor engine, which broke compatibility with ARM a while ago, and nobody knowledgeable with that part of the project seems to want to take responsibility for it. I have an old version successfully built and in my Docker Hub, and it works, but people's mileage may vary.
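As a follow-up check (not part of the comment above), after installing with those overrides one might confirm the pods actually land on arm64 nodes:

```bash
# Hedged verification: see where the OpenEBS pods were scheduled and which
# architecture those nodes report.
kubectl -n openebs get pods -o wide
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
```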
Thanks for the heads-up. Although I would rather know first whether the team is OK with having the architecture officially supported (providing arm-compatible images and such). Hacking the deployment is alright for testing purposes, but I don't see myself using OpenEBS in the long term if the arch isn't officially supported. If the team is alright with that, I am more than willing to help attain that goal.
We would like to support ARM, but it would be difficult to manage this with the existing team.
Yes, we'd be very happy for someone to help us there.
@tiagolobocastro I am getting started (joining the Slack channel, forking/fetching the repos, discovering the codebase). (I've wanted to get my hands on Nix for a long time, so two birds, one stone.)
Also interested in this feature. Please let me know where I could help out!
Personal issues are preventing me from really getting started on this topic. I only had time to get through the maze of finding how images were built for
The next thing in my head was getting to know more about the build chain and especially the builder (which seems to be managed by a Jenkins instance, but I couldn't find any information at the time). And then discuss/decide how to approach the multi-arch support (e.g. an all-in-one Nix config or one per arch). On my side, I will probably not be able to do more concrete work until January, but I will remain available to chat/discuss on the topic.
Sorry about the maze @BastienM, I'll add docs for building the entire set of images across all repos. @maxwnewcomer, from what I've been told by ARM users, the ARM build is no longer working.
I've raised a PR with instructions for building all images: openebs/mayastor#1779
Thank you for the extra doc @tiagolobocastro, it makes the overall building workflow easier to grasp. I did join the Slack channel.
Sorry for the late response, just came back from some family time. Wondering if this would fall under the OEP umbrella (something along the lines of "Verify/Add OpenEBS arm64 Support"). I'm new to the community and just listened to @tiagolobocastro on the monthly update, where y'all mentioned you wanted to require OEPs for new features. Although I don't fully know the history of the project, and this may be a bug/regression. Interested in your thoughts.
An OEP would be nice and may be a good place to discuss this. We can also propose some milestones there, e.g. we could start by building control-plane images first, then data-plane images, and then tackle the testing of ARM images (mainly focused on the data-plane).
Hi all,
Firstly, this project looks amazing, and I love the idea of persistent storage distributed across a Kubernetes cluster. However, I'm struggling to see why, at a time when the popularity of ARM servers in the data center is growing, not to mention home labs built on ARM-based SBCs, the project still supports only "x86-64 CPU cores".
I've spent a few days on a personal 5-node ARM-based Talos OS cluster, trying to pick apart why it can't run on ARM and pushing through roadblocks as I hit them.
The first few should be pretty easy to fix, and I'm hoping they're on some sort of roadmap to fix already.
Setting --set nodeSelector={} during helm deployment allows pods to start on ARM workers, so they're not limiting themselves to x86 nodes.
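A hedged example of what that looks like on the command line, using the chart and namespace names that appear later in this thread; the relevant part is clearing the default nodeSelector, and the quoting just keeps some shells from mangling the braces:

```bash
# Clear the default nodeSelector at install time so pods are free to schedule
# onto arm64 workers (as the issue author describes doing).
helm install openebs openebs/openebs \
  --namespace openebs --create-namespace \
  --set 'nodeSelector={}'
```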
The first errors I start seeing through stern are:
So that's 4 pods trying to run x86 images on ARM workers; that's never going to work.
So I go and check what images are being used here:
image: docker.io/bitnami/bitnami-shell:10
image: grafana/loki:2.6.1
image: docker.io/bitnami/bitnami-shell:11-debian-11-r63
image: docker.io/bitnami/etcd:3.5.6-debian-11-r10
So, where to start...
bitnami-shell has been deprecated; its last release was 8 months ago, and this is version 10, which has no arm64 image. If the project moved to bitnami-shell:11, an arm64 image would be available. Better yet, the project should probably transition to bitnami/os-shell, which has arm64 images and isn't scheduled to be removed in the future (bitnami-shell carries a notice saying "already released container images will persist in the registries until July 2024", and it's well past that, so they could be removed at any time).
grafana/loki has arm64 images; however, this project is using 2.6.1, which is 2 years old. I guess this is probably a case of "if it ain't broke, don't fix it", but not moving it along could cause issues down the line.
bitnami-shell again, version 11 this time, but hard-coded to revision r63, which has no arm64 image. Moving to anything above r90 (or removing the hard-coded revision) makes an arm64 version of the image available. This should probably be transitioned to os-shell too, for the reasons stated above.
bitnami/etcd isn't deprecated; however, 3.5.6-debian-11-r10 was released in December 2022. Newer images have arm64 support, and newer revisions have moved to Debian 12 (3.5.16-debian-12-r2 being the latest at the time of writing).
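A quick way to verify whether a given tag actually publishes an arm64 variant before pinning it in the chart, assuming a reasonably recent docker CLI with manifest support (the tags below are just the ones discussed above and may have moved on since):

```bash
# List the architectures a tag's manifest advertises; look for "arm64".
docker manifest inspect docker.io/bitnami/os-shell:latest | grep architecture
docker manifest inspect docker.io/bitnami/etcd:3.5.16-debian-12-r2 | grep architecture
```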
Using such old images exposes users to some fairly serious security concerns.
These images are affected by numerous CVEs of varying severity; what can be done to uplift them to more secure versions?
This is as far as I've gotten. I'm now working through getting etcd to run correctly, and the wife wants me to go out for lunch.