[KB]NVMe PCIe - Slow Virtual Machine Performance #3356
Please list more detailed information:
Harvester uses Longhorn as its Kubernetes CSI driver; VM images and volumes are backed by Longhorn volumes (see https://longhorn.io/docs/1.3.2/concepts/). By default each volume has 3 replicas, so every write from the VM is replicated, and throughput depends on CPU, network, and storage speed. With Harvester v1.1.1 you can create a StorageClass with only 1 replica to test the performance (https://docs.harvesterhci.io/v1.1/advanced/storageclass/#creating-a-storage-class); this decreases reliability but improves performance. You can also test on a single-node cluster, for which v1.1.1 is not required. Thanks.
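For reference, a single-replica StorageClass along the lines of the linked Harvester doc might look like the sketch below (the class name is illustrative; the parameter names follow the Longhorn CSI driver):

```yaml
# Sketch only: a test StorageClass with one replica, per the linked docs.
# "single-replica-test" is an illustrative name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: single-replica-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"      # trades reliability for performance
  staleReplicaTimeout: "30"
  migratable: "true"
```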
Hi Jian, many thanks for responding. Here is some of the basic info you requested.

Harvester version: 1.1.1

Trial 1: Windows Server 2022 VM; volume on a single SSD with a single replica; single cluster, single node. VM specs: 4 vCPU, 4 GB RAM, 150 GB disk, using the default Windows template and the VMDP drivers that came with Harvester. Speeds remain poor, 30-50 MB/s when copying between VMs hosted under Harvester. A local copy (local read/write) starts at 200 MB/s, then drops to 50 MB/s.

Trial 2: same as above, but with the volume on a single NVMe drive. Speeds remain poor, much the same as the SSD: 30-50 MB/s between VMs; a local copy starts at 200 MB/s, then drops to 50 MB/s.

I then ran a basic CrystalDiskMark with default settings. Are the above results normal for this current version of Harvester and Longhorn? I also notice heavy VM latency when opening apps; they take several seconds to launch. I feel like I'm missing something in the config, such as cloud-init, or am I not using the right drivers? There doesn't appear to be much to the Harvester storage interface when it comes to configuration, unless something under the CLI needs enabling or editing to help improve VM performance.

Worth noting that I have found a few Longhorn posts which suggest that storage platform stability and recoverability have taken precedence over performance (see the Performance and Scalability Report for Longhorn v1.0). Assuming what we have here today is the best we can get from the backend storage? Very disappointing performance; is this expected? I have not gone down the path of implementing additional nodes in the cluster yet. Would that assist with VM performance, given the Longhorn replica architecture? Maybe not writes, but at least reads, which may reduce latency? Keen to hear your thoughts.
@git8d thanks for your detailed tests and feedback. I will discuss with the Longhorn engineers; it may take some days.
@git8d A question about your environment: does your bare-metal server have only a single CPU socket? Please also observe the CPU usage of the cluster and the VM from the Harvester GUI while running the tests, thanks.
@git8d When possible, please also test the storage performance in a Linux-based VM; then we will have more clues, thanks.
@w13915984028, thanks for coming back to me. Yeah, the bare-metal server is a single socket with 8 cores; I'm able to run 3-4 VMs with 4 vCPUs each. Below are some further tests I have performed after introducing a new Harvester bare-metal server.

Test results below, with Grafana metrics. Surprisingly, the new node performed better than the previous node. Below is the Grafana output for the VM test; notice that CPU usage (shown in yellow) is low. Now back to the original node with the same VM: I ran another test on NVMe, and this one is worse than the previous tests, which is really odd. Shown below is a strange network I/O spike. Why? The storage is local with a single replica. The graph appears to show that the network is a bottleneck, which might explain the slow NVMe I/O. Thoughts?
@w13915984028, what tool do you recommend for testing under a Linux VM?
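One common choice on Linux (and what Longhorn's own performance reports use) is fio. A sketch of a sequential-write test, assuming the Longhorn-backed disk is mounted at /data (an illustrative path):

```shell
# Hedged example: sequential 1M writes with direct I/O (--direct=1
# bypasses the page cache, so results reflect the storage path).
# Adjust --filename and --size for your environment.
fio --name=seq-write --filename=/data/fio.test \
    --rw=write --bs=1M --size=4g \
    --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
```

Swapping --rw=write for --rw=read (or randread/randwrite with a smaller --bs) gives the other CrystalDiskMark-like data points.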
@git8d The network spike looks really tricky. Per your description, the test should not involve the network, but it seems to. Could you add more details about your test case: how it is done in Windows, with which software (CrystalDiskMark, ...), and how the file is copied?
Hi @w13915984028, please see below; hopefully this is enough info. Let me know if you need more. Below are Harvester's local disks. Below is the test VM's disk config. Below you can see two types of disks; both sit on NVMe, but they differ in their bus types. Below is the NVMe storage class. How are the tests performed?
Longhorn SPDK. I know that the underlying storage is run by Longhorn, and I have reference material showing that the design prioritized reliability over performance. I can see that Longhorn SPDK is a project, but it appears to be early days; it may help in this situation. Assuming this will be integrated with Harvester at some point?
Longhorn CLI for SPDK
Improving Longhorn Performance With SPDK - Keith Lucas & David Ko, SUSE
@w13915984028, note that I have since tested on several platforms with a variety of throughput results. The results below are the best outcomes, and are based on using the right controller, bus, and drivers.
@git8d Thanks for all of those further tests and the detailed information. Are the last two tests, Proxmox and XCP-ng, based on the same/similar single-node environment, copying files within the same single VM on those two platforms?
Hi @w13915984028, the tests were performed on the exact same hardware. The VMs were run under each platform (Proxmox / XCP-ng), with CrystalDiskMark run inside each VM for its respective tests. Both Proxmox and XCP-ng were built as standalone servers. All tests were local to each platform, inside the VM itself. Note that none of the tests involved file copies or network transfers. Let me know if you need any more info.
I want to mention that Longhorn is now v1.4.0 and includes this fix: longhorn/longhorn#3957. I was wondering how performance might be with a single replica now that it uses a local socket instead of a TCP connection.
@git8d thanks for your last update. We are discussing in the team and will spend some time analyzing the possible bottleneck.
@w13915984028, thanks, keep us posted.
I am also seeing very similar results; performance is extremely slow. In my case I should be seeing roughly 7 Gb/s read speeds (which is the case if I run an OS directly on the hardware), but in Harvester I only get a fraction of those speeds, as can be seen below. I am using the latest stable viostor disk driver on Windows Server 2022. If it helps, tomorrow I could set up a PV with
I should mention that I've tried using Rook/Ceph with this config, just slightly modified to specify the drives with a deviceFilter, and the speeds there seem more acceptable.
Harvester version: 1.1.1
Can you share the changes you made? I tried with a consumer-grade NVMe and have the same results as below with the VirtIO drivers from VMDP and a single replica configured in Longhorn on a single node. What did you use as the disk type or file system for Proxmox? Comparing direct XFS/ext4 against Longhorn, which has distribution built into its design, may set incorrect expectations. Did you compare this against a single-node Ceph/RBD or another distributed storage?
@abonillabeeche, thanks for responding. "Can you share the changes you made? ... What did you use as the disk type or file system for Proxmox? ... Did you compare this against a single-node Ceph/RBD or another distributed storage?" Perhaps a better approach would be for us to standardise a series of testing scenarios; then we might nail the issue. Would you be able to recommend test cases for Harvester? I will run the tests and supply the output so we can be on the same page regarding setup, config, etc. Let me know how you would like to proceed.
I've just tried Longhorn + KubeVirt on a bare-metal Debian 11 (bullseye) Kubernetes cluster, on the same hardware I tested above, and I get slightly better speeds. I deployed the cluster using kubeadm. Kubernetes version:
Using a single best-effort local Longhorn replica:
Using Longhorn strict-local data locality:
Using Rook Ceph with this config plus a deviceFilter yields WAY faster sequential speeds:
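For anyone reproducing the strict-local case: data locality is a Longhorn StorageClass parameter (strict-local requires exactly one replica and Longhorn >= 1.4). A sketch, with an illustrative class name:

```yaml
# Sketch only: keeps the single replica on the node running the workload,
# avoiding the network hop between the engine and its replica.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"         # strict-local only works with one replica
  dataLocality: "strict-local"
```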
@abonillabeeche , how should we proceed with troubleshooting. Let me know when you can. |
@davidpanic would you mind sharing a little more detail on the Ceph setup? I'd like to try this as a possible interim alternative to the native longhorn. Any help is appreciated! |
@git8d Disks: 2x SAMSUNG MZ1L21T9HCLS-00A07 (1.75 TiB, 1920383410176 bytes) @ /dev/nvme0n1 and /dev/nvme1n1

Cluster manifest:

kind: ConfigMap
apiVersion: v1
metadata:
name: rook-config-override
namespace: rook-ceph # namespace:cluster
data:
config: |
[global]
osd_pool_default_size = 1
mon_warn_on_pool_no_redundancy = false
bdev_flock_retry = 20
bluefs_buffered_io = false
mon_data_avail_warn = 10
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
name: cluster
namespace: rook-ceph # namespace:cluster
spec:
dataDirHostPath: /var/lib/rook
cephVersion:
image: quay.io/ceph/ceph:v17
allowUnsupported: true
mon:
count: 1
allowMultiplePerNode: true
mgr:
count: 1
allowMultiplePerNode: true
dashboard:
enabled: true
crashCollector:
disable: true
storage:
useAllNodes: false
useAllDevices: false
nodes:
- name: n1
devices:
- name: /dev/nvme0n1
- name: /dev/nvme1n1
healthCheck:
daemonHealth:
mon:
interval: 45s
timeout: 600s
priorityClassNames:
all: system-node-critical
mgr: system-cluster-critical
disruptionManagement:
managePodBudgets: true
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: builtin-mgr
namespace: rook-ceph # namespace:cluster
spec:
name: .mgr
replicated:
size: 1
    requireSafeReplicaSize: false

Replica pool and storage class manifest:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: replicapool
namespace: rook-ceph # namespace:cluster
spec:
failureDomain: osd
replicated:
size: 1
# Disallow setting pool with replica 1, this could lead to data loss without recovery.
# Make sure you're *ABSOLUTELY CERTAIN* that is what you want
requireSafeReplicaSize: false
# gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
# for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
#targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com # driver:namespace:operator
parameters:
# clusterID is the namespace where the rook cluster is running
# If you change this namespace, also change the namespace below where the secret namespaces are defined
clusterID: rook-ceph # namespace:cluster
# If you want to use erasure coded pool with RBD, you need to create
# two pools. one erasure coded and one replicated.
# You need to specify the replicated pool here in the `pool` parameter, it is
# used for the metadata of the images.
# The erasure coded pool must be set as the `dataPool` parameter below.
#dataPool: ec-data-pool
pool: replicapool
# RBD image format. Defaults to "2".
imageFormat: "2"
# RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
imageFeatures: layering
# The secrets contain Ceph admin credentials. These are generated automatically by the operator
# in the same namespace as the cluster.
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
# Specify the filesystem type of the volume. If not specified, csi-provisioner
# will set default as `ext4`.
csi.storage.k8s.io/fstype: ext4
# uncomment the following to use rbd-nbd as mounter on supported nodes
#mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete

Keep in mind that you should use other manifests in production; this is for TESTING ONLY, as it runs on one node and without data redundancy! Read the docs for more info on why.
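To exercise the rook-ceph-block class defined above, a minimal PVC sketch (name and size are illustrative) could be:

```yaml
# Hypothetical PVC bound to the rook-ceph-block StorageClass above;
# attach it to a pod or KubeVirt VM and run the benchmark against it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bench-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 10Gi
```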
Similar to the above: I have a dedicated bonded management network at 20 GbE with 9000 MTU across hosts, and I am seeing very poor disk I/O for VMs with the native container storage. The storage is the same on all 3 hosts, using PERC H730 (1 GB cache) RAID5 SSDs that happily exceed 800 MB/s on bare-metal Windows. With the VM on Harvester, there is no difference in speed with a NIC disabled. Other than moving to Rook Ceph storage (https://harvesterhci.io/kb), is there any other way to natively improve VM disk performance with Harvester?
There is a knowledge base article posted here now https://harvesterhci.io/kb/best_practices_for_optimizing_longhorn_disk_performance |
Moving to 1.4.0 to evaluate this with the v2 storage engine.
Hi.
Wondering if someone could please assist with some understanding of Harvester, SSD, and NVMe best practices for a virtual machine guest.
What's the problem?
Windows or Linux virtual machines running under Harvester 1.1.1 have slow disk performance using Gen3 NVMe drives in a PCIe x8 slot.
What is expected?
Virtual machine write speeds hover around 50-100 MB/s, but I am expecting near-hardware performance of 1000+ MB/s.
What version of Windows?
Windows 2022 server and or Linux Ubuntu 22.04.
Harvester - Virtual Volume Bus Type
Using VirtIO or SATA produces the same slow performance.
How were the VM's created?
Using the Harvester VM templates with vmdp:2.5.3.
I have read a number of articles pertaining to slow NVMe performance with KVM, VMware, etc., but found no real solutions. Any help or guidance is appreciated.