A new push to make openebs compatible with ARM clusters (And security concerns around images) #3789
Quick question. I've noticed a lot of the pods are custom:
Is the code not available for me to attempt to build arm64 versions of these images? I can't find it anywhere!
Hey @solarisfire, thanks for taking an interest in the project! I'm happy to say the aforementioned deprecated bitnami-shell* images are no longer used by the develop branches, so that will be part of the next minor release. As for the "proper" etcd and Loki images, we can certainly consider upgrading to the newest compatible images (i.e. images with no breaking changes). As for the ARM images, there's some previous history here: the main sticking point is that we don't have the hardware to test ARM images, which makes it troublesome to build them and even more so to support any ARM-specific issues that may arise (granted, this would be more of an issue for the data-plane than the control-plane). Maybe we can start tackling this one problem at a time, starting with the build problem first.
I spent the morning creating a build script that runs against Hetzner cloud. It currently builds arm64 variants of all of the pods and pushes them to https://hub.docker.com/repositories/solarisfire. The server used costs €0.046 per hour and the build took 26 minutes, so it only cost me about €0.02.
https://gitlab.solarisfire.com/solarisfire/openebs-arm-image-builder/-/blob/main/build.py
Similar infrastructure could be used to conduct any testing needed; I just need to understand the testing methodology and how the infrastructure needs to be laid out.
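For anyone wanting to reproduce a one-off arm64 build without the Hetzner setup, a minimal sketch using docker buildx might look like the following. The image name and build context are placeholders only; the project's real images are produced by its own (nix-based) tooling, which the linked build.py drives.

```bash
# Hypothetical sketch: cross-build a single image for arm64 and push it to a
# personal registry namespace. Image name and context path are placeholders,
# not the project's actual build tooling.
docker buildx create --name armbuilder --use
docker buildx build \
  --platform linux/arm64 \
  --tag solarisfire/some-openebs-image:develop \
  --push \
  .
```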
So I now have a Talos OS cluster, with 3 control-plane nodes and 3 worker nodes. OpenEBS seems to be running without error. I had to use a lot of overrides in the Helm values file:

```yaml
image:
  repo: solarisfire
nodeSelector:
  kubernetes.io/arch: arm64
csi:
  image:
    registry: docker.io
    repo: solarisfire
mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
  clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false
etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell
engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true
```
Provisioned a test application, and these are the logs:

```
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.843175 1 controller.go:1366] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": started
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.861393 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "stalwart/stalwart-mail-pvc"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:38.872242 1 provisioner_hostpath.go:77] Creating volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d at node with labels {map[kubernetes.io/hostname:worker-1]}, path:/var/openebs/local/pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d,ImagePullSecrets:[]
+ init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
- init-pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d › local-path-init
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner 2024-10-07T12:42:41.933Z INFO app/provisioner_hostpath.go:215 {"eventcode": "local.pv.provision.success", "msg": "Successfully provisioned Local PV", "rname": "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d", "storagetype": "hostpath"}
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933346 1 controller.go:1449] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": volume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" provisioned
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933380 1 controller.go:1462] provision "stalwart/stalwart-mail-pvc" class "openebs-hostpath": succeeded
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.933620 1 volume_store.go:212] Trying to save persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d"
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941127 1 volume_store.go:219] persistentvolume "pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d" saved
openebs-localpv-provisioner-65db9df479-g2bwc openebs-localpv-provisioner I1007 12:42:41.941924 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"stalwart", Name:"stalwart-mail-pvc", UID:"bbec9410-09f1-4762-9533-2fa47c2b3c7d", APIVersion:"v1", ResourceVersion:"250642", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-bbec9410-09f1-4762-9533-2fa47c2b3c7d
```
Interestingly, io-engine pods aren't being started on any of the nodes:

```
openebs-operator-diskpool-84d5cc6c96-2kthl operator-diskpool 2024-10-07T14:05:02.531457Z ERROR operator_diskpool::context: Unable to find io-engine node worker-1
```
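Not mentioned at this point in the thread, but a common cause of that error is the io-engine DaemonSet selecting on the openebs.io/engine=mayastor node label (the same label the Talos patch discussed below applies). A quick, hedged set of checks, assuming the default label selector is in use:

```bash
# Which nodes currently carry the label the io-engine DaemonSet selects on?
kubectl get nodes -l openebs.io/engine=mayastor

# Does the io-engine DaemonSet report any desired/ready pods at all?
kubectl -n openebs get daemonset | grep io-engine

# Label a worker so io-engine can be scheduled there (node name taken from the log above).
kubectl label node worker-1 openebs.io/engine=mayastor
```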
That's awesome @solarisfire!
For the testing, we have per-repo CI which ensures the code in each repo is correct; we use Jenkins and bors for this. Then we have a more complete e2e suite which was originally in an internal repo, but which we've started to open up:
Btw, would it be possible for you to test the init-images from the develop chart of mayastor, to ensure they are working for ARM?
Yeah, I think you need to add openebs/mayastor to your list of repos.
Easier said than done as I'm running Talos OS and cannot get to a console... I've already run into a few issues with their OpenEBS documentation: https://www.talos.dev/v1.8/kubernetes-guides/configuration/replicated-local-storage-with-openebs/

It has this config patch:

```yaml
machine:
  sysctls:
    vm.nr_hugepages: "1024"
  nodeLabels:
    openebs.io/engine: mayastor
  kubelet:
    extraMounts:
      - destination: /var/local/openebs
        type: bind
        source: /var/local/openebs
        options:
          - rbind
          - rshared
          - rw
```

I've found that directory should be /var/openebs/local. It does force the openebs.io/engine: mayastor label onto all the nodes though, which is handy. It states:
However the openebs-hostpath class is not replicated across all of the nodes as far as I'm aware.
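For anyone following along on Talos, a sketch of applying such a patch to a node (with the extraMounts destination/source corrected to /var/openebs/local as noted above); the node IP and file name are placeholders, and the exact talosctl invocation may differ between Talos versions:

```bash
# Sketch only: apply the machine-config patch above (directories corrected to
# /var/openebs/local) to a single node. 10.0.0.11 and openebs-patch.yaml are
# placeholders.
talosctl --nodes 10.0.0.11 patch machineconfig --patch @openebs-patch.yaml
```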
Hmmmm, this mayastor-io-engine pod is going to be trouble...
If I download and build spdk manually it builds fine... I think something about the nix environment is upsetting it...
Turns out the image builds fine with the code from the release/v2.7.0 branch; however, it fails to build on ARM from either the develop or staging branches (with develop being the default). It does, however, build fine on amd64. I've raised it as a bug in openebs/mayastor#1751, though I believe the issue may lie upstream in spdk-rs (although OpenEBS runs its own fork of spdk-rs?). I had to resize my worker nodes to give them more RAM, but I do now have 3 healthy mayastor-io-engine pods on my ARM cluster.
Fell on this issue while investigating the hard-coded
More or less the same hardware topology on my side: arm64 SBCs (Turing RK1) sporting Talos. Looking forward to having an arm64-compatible release; let me know if I can help.
I have it working with the following overrides:

```yaml
image:
  repo: solarisfire
nodeSelector:
  kubernetes.io/arch: arm64
csi:
  image:
    registry: docker.io
    repo: solarisfire
mayastor:
  image:
    repo: solarisfire
    tag: develop
  loki-stack:
    enabled: false
  nodeSelector:
    kubernetes.io/arch: arm64
  etcd:
    image:
      tag: 3.5.16
    volumePermissions:
      image:
        repository: bitnami/os-shell
        tag: 11-debian-11-r112
  clusterDomain: cluster.solarisfire.com
  csi:
    node:
      initContainers:
        enabled: false
  io_engine:
    nodeSelector: arm64
etcd:
  image:
    repository: bitnami/etcd
    tag: 3.5.16
  replicaCount: 1
  volumePermissions:
    image:
      repository: bitnami/os-shell
engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false
  replicated:
    mayastor:
      enabled: true
```

Which gets installed with:

```bash
helm upgrade --install openebs --create-namespace --namespace openebs \
  -f openebs-helm-override.yaml openebs/openebs
```

I haven't done an image build in a while though. The biggest issue by far is spdk-rs inside the mayastor engine, which broke compatibility with ARM a while ago, and nobody knowledgeable with that part of the project seems to want to take responsibility for it. I have an old version successfully built and in my Docker Hub, and it works, but people's mileage may vary.
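As a follow-up check (not part of the comment above), after installing with those overrides one might confirm the pods actually land on arm64 nodes:

```bash
# Hedged verification: see where the OpenEBS pods were scheduled and which
# architecture those nodes report.
kubectl -n openebs get pods -o wide
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
```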
Thanks for the heads-up. Although I would rather know first whether the team is OK with having the architecture officially supported (providing arm-compatible images and such). Hacking the deployment is alright for testing purposes, but I don't see myself using OpenEBS in the long term if the arch isn't officially supported. If the team is alright with that, I am more than willing to help attain that goal.
We would like to support ARM, but it would be difficult to manage this with the existing team.
Yes, we'd be very happy for someone to help us there.
@tiagolobocastro I am getting started (joining the Slack channel, forking/fetching the repos, discovering the codebase). (I've wanted to get my hands on Nix for a long time, so two birds, one stone.)
Also interested in this feature. Please let me know where I could help out!
Personal issues are preventing me from really getting started on this topic. I only had time to get through the maze of finding how images were built for
The next thing in my head was getting to know more about the build chain and especially the builder (which seems to be managed by a Jenkins instance, but I couldn't find any information at the time). And then discuss/decide how to approach the multi-arch support (e.g. an all-in-one Nix config or one per arch). On my side, I will probably not be able to do more concrete work until January, but I will remain available to chat/discuss on the topic.
Sorry about the maze @BastienM, I'll add docs for building the entire set of images across all repos. @maxwnewcomer, from what I've been told by ARM users, the ARM build is no longer working.
I've raised a PR with instructions for building all images: openebs/mayastor#1779
Thank you for the extra doc @tiagolobocastro, it makes the overall building workflow easier to grasp. I did join the Slack channel.
Sorry for the late response, just came back from some family time. Wondering if this would fall under the OEP umbrella (something along the lines of "Verify/Add OpenEBS arm64 Support"). I'm new to the community and just listened to @tiagolobocastro on the monthly update, where y'all mentioned you wanted to require OEPs for new features. Although I don't fully know the history of the project, and this may be a bug/regression. Interested in your thoughts.
An OEP would be nice and may be a good place to discuss this. We can also propose some milestones there, e.g. we could start by building control-plane images first, then data-plane images, and then tackle the testing of ARM images (mainly focused on the data-plane).
Hi all,
Firstly, this project looks amazing, and I love the idea of persistent storage distributed across a Kubernetes cluster. However, I'm struggling to see why, at a time when the popularity of ARM servers in the data center is growing, not to mention home labs built on ARM-based SBCs, the project still supports only "x86-64 CPU cores".
I've spent a few days on a personal 5-node ARM-based Talos OS cluster, trying to pick apart why it can't run on ARM and pushing through roadblocks as I hit them.
The first few should be pretty easy to fix, and I'm hoping they're on some sort of roadmap to fix already.
Setting --set nodeSelector={} during helm deployment allows pods to start on ARM workers, so they're not limiting themselves to x86 nodes.
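A hedged example of what that looks like on the command line, using the chart and namespace names that appear later in this thread; the relevant part is clearing the default nodeSelector, and the quoting just keeps some shells from mangling the braces:

```bash
# Clear the default nodeSelector at install time so pods are free to schedule
# onto arm64 workers (as the issue author describes doing).
helm install openebs openebs/openebs \
  --namespace openebs --create-namespace \
  --set 'nodeSelector={}'
```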
The first errors I start seeing through stern are:
So that's 4 pods trying to run x86 images on ARM workers; that's never going to work.
So I go and check what images are being used here:
image: docker.io/bitnami/bitnami-shell:10
image: grafana/loki:2.6.1
image: docker.io/bitnami/bitnami-shell:11-debian-11-r63
image: docker.io/bitnami/etcd:3.5.6-debian-11-r10
So, where to start...
bitnami-shell has been deprecated; its last release was 8 months ago, and this is version 10, which has no arm64 image. If the project moved to bitnami-shell:11, an arm64 image would be available. Better yet, the project should probably transition to bitnami/os-shell, which has arm64 images and isn't scheduled to be removed in the future (bitnami-shell carries a notice saying "already released container images will persist in the registries until July 2024", and it's well past that, so they could be removed at any time).
grafana/loki has arm64 images; however, this project is using 2.6.1, which is 2 years old. I guess this is probably a case of "if it ain't broke, don't fix it", but not moving it along could cause issues down the line.
bitnami-shell again, version 11 this time, but hard-coded to revision r63, which has no arm64 image. Moving to anything above r90 (or removing the hard-coded revision) makes an arm64 version of the image available. This should probably be transitioned to os-shell too, for the reasons stated above.
bitnami/etcd isn't deprecated; however, 3.5.6-debian-11-r10 was released in December 2022. Newer images have arm64 support, and newer revisions have moved to Debian 12 (3.5.16-debian-12-r2 being the latest at the time of writing).
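A quick way to verify whether a given tag actually publishes an arm64 variant before pinning it in the chart, assuming a reasonably recent docker CLI with manifest support (the tags below are just the ones discussed above and may have moved on since):

```bash
# List the architectures a tag's manifest advertises; look for "arm64".
docker manifest inspect docker.io/bitnami/os-shell:latest | grep architecture
docker manifest inspect docker.io/bitnami/etcd:3.5.16-debian-12-r2 | grep architecture
```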
Using such old images exposes users to some fairly serious security concerns.
These images are affected by numerous CVEs of varying severity; what can be done to uplift them to more secure versions?
This is as far as I've gotten. I'm now working through getting etcd to run correctly, and the wife wants me to go out for lunch.