
A new push to make openebs compatible with ARM clusters (And security concerns around images) #3789

Open
solarisfire opened this issue Oct 5, 2024 · 25 comments

@solarisfire

Hi all,

Firstly, this project looks amazing, and I love the idea of persistent storage distributed across a Kubernetes cluster. However, I'm struggling to see why, at a time when the popularity of ARM servers in the data center is growing (not to mention home labs built on ARM-based SBCs), the project still supports only "x86-64 CPU cores".

I've spent a few days on a personal 5-node ARM-based Talos OS cluster, trying to pick apart why it can't run on ARM and pushing through roadblocks as I hit them.

The first few should be pretty easy to fix, and I'm hoping they're already on some sort of roadmap.

Setting --set nodeSelector={} during the Helm deployment allows pods to start on ARM workers, so they're no longer limiting themselves to x86 nodes.
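
For reference, a minimal sketch of such an install (release name, namespace, and chart are assumed to match the openebs/openebs chart used later in this thread; exact quoting of the empty map may vary by shell and Helm version):

helm install openebs openebs/openebs \
  --namespace openebs --create-namespace \
  --set 'nodeSelector={}'   # clear the amd64-only node selector so pods can schedule on arm64 workers
# On Helm >= 3.10, --set-json 'nodeSelector={}' is an unambiguous way to pass the empty map.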

The first errors I start seeing through stern are:

openebs-loki-0 volume-permissions exec /bin/bash: exec format error
openebs-etcd-1 volume-permissions exec /bin/bash: exec format error
openebs-etcd-2 volume-permissions exec /bin/bash: exec format error
openebs-etcd-3 volume-permissions exec /bin/bash: exec format error

So that's 4 pods trying to run x86 images on ARM workers, which is never going to work.

So I go and check what images are being used here:

image: docker.io/bitnami/bitnami-shell:10
image: grafana/loki:2.6.1
image: docker.io/bitnami/bitnami-shell:11-debian-11-r63
image: docker.io/bitnami/etcd:3.5.6-debian-11-r10

So, where to start...

bitnami-shell has been deprecated; its last release was 8 months ago, and this chart pins version 10, which has no arm64 image. If the project moved to bitnami-shell:11 there is an arm64 image. Better yet, the project should probably transition to bitnami/os-shell, which has arm64 images and isn't scheduled for removal (bitnami-shell carries the notice "already released container images will persist in the registries until July 2024", and it's well past that, so they could be removed at any time).

grafana/loki has arm64 images; however, this project is using 2.6.1, which is 2 years old. I guess this is a case of "if it ain't broke don't fix it", but not moving it along could cause issues in the future.

bitnami-shell again, version 11 this time, but hard-coded to revision r63, which has no arm64 image. Move to anything above r90 (or drop the hard-coded revision) and an arm64 version of the image is available. This should probably be transitioned to os-shell too, for the reasons stated above.

bitnami/etcd isn't deprecated, but 3.5.6-debian-11-r10 was released in December 2022; newer images have arm64 support, and newer revisions have moved to debian-12 (3.5.16-debian-12-r2 being the latest).

Using such old images exposes users to some fairly serious security concerns.

These images are affected by numerous CVEs of varying severity; what can be done to uplift them to more secure versions?
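
To put a number on that, an image scanner makes the exposure easy to see; a minimal sketch, assuming Trivy is installed (any scanner will do):

trivy image docker.io/bitnami/etcd:3.5.6-debian-11-r10          # lists known CVEs in the image, grouped by severity
trivy image docker.io/bitnami/bitnami-shell:11-debian-11-r63    # same for the hard-coded volume-permissions image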

This is as far as I've gotten; I'm now working through getting etcd to run correctly, and the wife wants me to go out for lunch.

@solarisfire
Author

Quick question.

I've noticed a lot of the pods are custom:

  • openebs-operator-diskpool
  • openebs-obs-callhome
  • openebs-localpv-provisioner
  • openebs-csi-controller
  • openebs-api-rest
  • openebs-agent-core

Is the code not available for me to attempt to build arm64 versions of these images? I can't find it anywhere!

@tiagolobocastro
Contributor

Hey @solarisfire, thanks for taking an interest in the project!

I'm happy to say the aforementioned deprecated bitnami-shell* images are no longer used on the develop branches, so that change will be part of the next minor release.

As for the "proper" etcd and Loki images we certainly can consider upgrading to newest compatible images (ie images with no breaking changes).
I don't think we have update of etcd/loki planned CC @avishnu
For the moment, granted not ideal but you can override the image via helm values.
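
Something along these lines, for example (exact value paths depend on the chart version; these are the same keys that show up in the override file posted further down in this thread):

helm upgrade openebs openebs/openebs -n openebs --reuse-values \
  --set mayastor.etcd.image.tag=3.5.16 \
  --set mayastor.etcd.volumePermissions.image.repository=bitnami/os-shell \
  --set mayastor.etcd.volumePermissions.image.tag=11-debian-11-r112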

As for the ARM images, there's some previous history; the main sticking point was that we don't have the hardware to test ARM images, which makes it a bit troublesome to build them, and even more so to support any ARM-specific issues that may arise (granted, this would be more of an issue for the data-plane than the control-plane).

Maybe we can start tackling this one problem at a time, and start with the build problem first.

@solarisfire
Author

I spent the morning creating a build script that runs against Hetzner cloud. It currently builds arm64 variants of all of the pods and pushes them to https://hub.docker.com/repositories/solarisfire

The server used costs €0.046 per hour, and the build took 26 minutes, so it only cost me about €0.02.

https://gitlab.solarisfire.com/solarisfire/openebs-arm-image-builder/-/blob/main/build.py

Similar infrastructure could be used to conduct any testing needed, I just need to understand the testing methodology and how the infrastructure needs to be laid out.
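
In the meantime, a quick way to sanity-check that the pushed tags actually carry arm64 manifests (the image name below is just one example from that namespace; swap in whichever component you want to check):

docker buildx imagetools inspect docker.io/solarisfire/mayastor-io-engine:develop
# the Manifests section of the output should include a linux/arm64 platform entry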

@solarisfire
Author

And the build really will use all the cores...

[screenshot]

@solarisfire
Author

So I now have a Talos OS cluster with 3 control plane nodes and 3 worker nodes. OpenEBS seems to be running without error:

[screenshot]

I had to use a lot of overrides in the Helm values file:

image:
  repo: solarisfire

nodeSelector:
  kubernetes.io/arch: arm64

csi:
  image:
    registry: docker.io
    repo: solarisfire

mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
    clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false

etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell

engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true

@solarisfire
Author

Provisioned a test application, and these are the logs:

openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.843175       1 controller.go:1366] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": started
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.861393       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "stalwart/stalwart-mail-pvc"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.872242       1 provisioner_hostpath.go:77] Creating volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d at node with labels {map[kubernetes.io/hostname:worker-1]}, path:/var/openebs/local/pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d,ImagePullSecrets:[]
+ init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
- init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner 2024-10-07T12:42:41.933Z       INFO    app/provisioner_hostpath.go:215         {"eventcode": "local.pv.provision.success", "msg": "Successfully provisioned Local PV", "rname": "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d", "storagetype": "hostpath"}
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933346       1 controller.go:1449] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": volume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" provisioned
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933380       1 controller.go:1462] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": succeeded
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933620       1 volume_store.go:212] Trying to save persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941127       1 volume_store.go:219] persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" saved
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941924       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d

@solarisfire
Author

Interestingly, io-engine pods aren't being started on any of the nodes:

openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.531457Z ERROR operator_diskpool::context: Unable to find io-engine node worker-1
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.536118Z ERROR operator_diskpool::context: Unable to find io-engine node worker-3
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.554351Z ERROR operator_diskpool::context: Unable to find io-engine node worker-2

@tiagolobocastro
Contributor

I spent the morning creating a build script that runs against Hetzner cloud. It currently builds arm64 variants of all of the pods and pushes them to https://hub.docker.com/repositories/solarisfire

That's awesome @solarisfire!
Btw, I think you might be missing the openebs/mayastor image build there?

Similar infrastructure could be used to conduct any testing needed, I just need to understand the testing methodology and how the infrastructure needs to be laid out.

For testing, we have per-repo CI which ensures the code in each repo is correct. We use Jenkins and Bors for this.

Then we have a more complete e2e suite, which was originally in an internal repo but which we've started to open up:
https://github.com/openebs/openebs-e2e
We haven't yet merged the mayastor tests in there, but it's part of the plans, so everything will be available, and anyone should be able to run and contribute to the extended test suite.

Btw, would it be possible for you to test the init images from the develop chart of mayastor, to ensure they work on ARM?
for loki:
docker.io/openebs/alpine-sh:4.1.0
and for etcd:
repository: openebs/alpine-bash
tag: 4.1.0
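
A quick smoke test from an arm64 box with Docker would be enough; a sketch, using the image refs above:

docker manifest inspect docker.io/openebs/alpine-sh:4.1.0      # check whether an arm64 entry is published for the tag
docker manifest inspect docker.io/openebs/alpine-bash:4.1.0
docker run --rm --platform linux/arm64 --entrypoint uname docker.io/openebs/alpine-sh:4.1.0 -m   # should print aarch64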

@tiagolobocastro
Contributor

Interestingly, io-engine pods aren't being started on any of the nodes:

openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.531457Z ERROR operator_diskpool::context: Unable to find io-engine node worker-1
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.536118Z ERROR operator_diskpool::context: Unable to find io-engine node worker-3
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.554351Z ERROR operator_diskpool::context: Unable to find io-engine node worker-2

Yeah I think you need to add openebs/mayastor to your list of repos.
For the dataplane, you need to prepare the nodes:
https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/rs-installation#preparing-the-cluster
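
On a regular Linux node that preparation boils down to roughly the following (the hugepage count and node label are the ones the linked docs ask for; on Talos the same settings go through the machine config instead, as shown in the next comment):

# label each worker that should run the io-engine data-plane
kubectl label node worker-1 openebs.io/engine=mayastor

# reserve 2MiB hugepages for the io-engine (1024 x 2MiB = 2GiB) and persist the setting,
# then restart kubelet (or reboot) so the hugepage resources are advertised
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
echo 'vm.nr_hugepages = 1024' | sudo tee -a /etc/sysctl.d/openebs.conf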

@solarisfire
Author

Easier said than done as I'm running Talos OS and cannot get to a console...

I've already run into a few issues with their openebs documentation:

https://www.talos.dev/v1.8/kubernetes-guides/configuration/replicated-local-storage-with-openebs/

Has this config patch:

machine:
  sysctls:
    vm.nr_hugepages: "1024"
  nodeLabels:
    openebs.io/engine: mayastor
  kubelet:
    extraMounts:
      - destination: /var/local/openebs
        type: bind
        source: /var/local/openebs
        options:
          - rbind
          - rshared
          - rw

I've found that the directory should be /var/openebs/local.

It does force the openebs.io/engine: mayastor label onto all the nodes though, which is handy.

It states:

The storage class named openebs-hostpath is used to create storage that is replicated across all of your nodes.

However, the openebs-hostpath class is not replicated across all of the nodes, as far as I'm aware.

@solarisfire
Author

Hmmmm, this mayastor-io-engine pod is going to be trouble...

build_log.txt

@solarisfire
Author

If I download and build spdk manually it builds fine...

I think something about the nix environment is upsetting it...

@solarisfire
Author

Turns out the image builds fine from the release/v2.7.0 branch; however, it fails to build on ARM from either the develop or staging branches (develop being the default). It does, however, build fine on amd64.

I've raised it as a bug in openebs/mayastor#1751. However, I believe the issue may lie upstream in spdk-rs (although OpenEBS runs its own fork of spdk-rs?).

I had to resize my worker nodes to give them more RAM, but I do now have 3 healthy mayastor-io-engine pods on my arm cluster.

[screenshot]

@BastienM

Stumbled on this issue while investigating the hard-coded kubernetes.io/arch=amd64 selector that prevented pods from being scheduled.

More or less the same hardware topology on my side: arm64 SBCs (Turing RK1) running Talos (v1.8.2).

Looking forward to an arm64-compatible release; let me know if I can help.

@solarisfire
Author

I have it working with the following overrides:

image:
  repo: solarisfire

nodeSelector:
  kubernetes.io/arch: arm64

csi:
  image:
    registry: docker.io
    repo: solarisfire

mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
    clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false
  io_engine:
    nodeSelector: arm64

etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell

engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true

Which gets installed with:

helm upgrade --install openebs --create-namespace --namespace openebs -f openebs-helm-override.yaml openebs/openebs
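
A quick check afterwards, if it helps anyone, is something along these lines:

kubectl get nodes -L kubernetes.io/arch   # confirm which nodes report arm64
kubectl get pods -n openebs -o wide       # everything should be Running on the arm64 workers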

I haven't done an image build in a while though.

The biggest issue by far is spdk-rs inside the mayastor engine, which broke compatibility with ARM a while ago, and nobody who's knowledgeable about that part of the project seems to want to take responsibility for it.

I have an old version successfully built and pushed to my Docker Hub, and it works, but people's mileage may vary.

@BastienM

BastienM commented Nov 11, 2024

Thanks for the heads-up.

Although I'd first like to know whether the team is OK with officially supporting the architecture (providing ARM-compatible images and such).

Hacking the deployment is alright for testing purposes, but I don't see myself using OpenEBS in the long term if the architecture is not officially supported.
The issue with spdk-rs is a good example of what I mean: something could break in the toolchain at any time and block timely maintenance.

If the team is alright with that, I am more than willing to help with attaining that goal.

@tiagolobocastro
Contributor

But I don't see myself using OpenEBS in the long term if the architecture is not officially supported.

We would like to support ARM, but it would be difficult to manage this with the existing team.

If the team is alright with that, I am more than willing to help with attaining that goal.

Yes, we'd be very happy for someone to help us there.
And on a general note, if some of you are willing to be the maintainers of the ARM builds, then there's a chance for us to officially support ARM builds :)

@BastienM

@tiagolobocastro I am getting started (joining the Slack channel, forking/fetching repos, discovering the codebase).

(I've wanted to get my hands on Nix for a long time, so two birds, one stone.)

@maxwnewcomer

Also interested in this feature. Please let me know where I could help out!

@BastienM

BastienM commented Dec 5, 2024

Personal issues are preventing me from really getting started on that topic.

I only had time to get through the maze of finding out how images are built for mayastor, which is handled via openebs/mayastor-dependencies/scripts/release.sh, and more specifically the build_images function.

The next thing on my list was getting to know more about the build chain, and especially the builder (which seems to be managed by a Jenkins instance, but I couldn't find any information at the time), and then discussing/deciding how to approach multi-arch support (e.g. an all-in-one Nix config or one per arch).

On my side, I will probably not be able to do more concrete work until January, but I will remain available to chat/discuss the topic.

@tiagolobocastro
Contributor

Sorry about the maze @BastienM, I'll add docs for building the entire set of images across all repos.
And sure, there's no pressure, thanks for helping out! You can reach us on Slack as well if you're stuck.

@maxwnewcomer, from what I've been told by ARM users, the ARM build is no longer working.
We can break the problem down into:

  1. fix the arm build
  2. build and push ARM images
  3. CI tests on ARM

@tiagolobocastro
Contributor

I've raised a PR with the instructions for building all images: openebs/mayastor#1779
You can view the file here

@BastienM

BastienM commented Dec 6, 2024

Thank you for the extra doc @tiagolobocastro, it makes the overall build workflow easier to grasp.

I did already join the Slack channel (#openebs-dev?) and will probably pour my questions in there once I'm back on my feet.

@maxwnewcomer

Sorry for the late response, I just came back from some family time. I'm wondering if this would fall under the OEP umbrella (something along the lines of "Verify/Add OpenEBS arm64 Support"). I'm new to the community and just listened to @tiagolobocastro on the monthly update, where y'all mentioned you wanted to require OEPs for new features. That said, I don't fully know the history of the project, and this may be a bug/regression. Interested in your thoughts.

@tiagolobocastro
Contributor

An OEP would be nice and may be a good place to discuss this. We can also propose some milestones there, e.g. we can start by building control-plane images first, then data-plane images, and then tackle the testing of ARM images (mainly focused on the data-plane).
I think the local engines (or at least some of them) already have multi-arch images, as they only have control-plane components.
