[KB]NVMe PCIe - Slow Virtual Machine Performance #3356
Please list more detailed information:
Harvester uses Longhorn as its Kubernetes CSI driver; VM images and volumes are backed by Longhorn volumes (see https://longhorn.io/docs/1.3.2/concepts/). By default each volume has 3 replicas, so every write from the VM is replicated, and throughput depends on CPU, network, and storage speed. With Harvester v1.1.1 you can create a StorageClass with only 1 replica to test the performance (https://docs.harvesterhci.io/v1.1/advanced/storageclass/#creating-a-storage-class); this decreases reliability but improves performance. You can also test on a single-node cluster, for which v1.1.1 is not required. Thanks.
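For reference, a single-replica StorageClass along the lines of the linked Harvester doc might look like the sketch below (the class name is illustrative; the parameter names follow the Longhorn CSI driver):

```yaml
# Sketch only: a test StorageClass with one replica, per the linked docs.
# "single-replica-test" is an illustrative name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: single-replica-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"      # trades reliability for performance
  staleReplicaTimeout: "30"
  migratable: "true"
```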
Hi Jian, many thanks for responding. Here is some of the basic info you requested.

Harvester version: 1.1.1

Trial 1: Windows Server 2022 VM; volume on a single SSD with a single replica; single cluster, single node. VM specs: 4 vCPU, 4 GB RAM, 150 GB disk, using the default Windows template and the VMDP drivers that came with Harvester. Speeds remain poor, 30-50 MB/s when copying between VMs hosted under Harvester. A local copy (local read/write) starts at 200 MB/s, then drops to 50 MB/s.

Trial 2: same as above, but with the volume on a single NVMe drive. Speeds remain poor, much the same as the SSD: 30-50 MB/s between VMs; a local copy starts at 200 MB/s, then drops to 50 MB/s.

I then ran a basic CrystalDiskMark with default settings. Are the above results normal for this current version of Harvester and Longhorn? I also notice heavy VM latency when opening apps; they take several seconds to launch. I feel like I'm missing something in the config, such as cloud-init, or am I not using the right drivers? There doesn't appear to be much to the Harvester storage interface when it comes to configuration, unless something under the CLI needs enabling or editing to help improve VM performance.

Worth noting that I have found a few Longhorn posts which suggest that storage platform stability and recoverability have taken precedence over performance (see the Performance and Scalability Report for Longhorn v1.0). Assuming what we have here today is the best we can get from the backend storage? Very disappointing performance; is this expected? I have not gone down the path of implementing additional nodes in the cluster yet. Would that assist with VM performance, given the Longhorn replica architecture? Maybe not writes, but at least reads, which may reduce latency? Keen to hear your thoughts.
@git8d thanks for your detailed tests and feedback. I will discuss with the Longhorn engineers; it may take some days.
@git8d A question about your environment: does your bare-metal server have only a single CPU socket? Please also observe the CPU usage of the cluster and the VM from the Harvester GUI while running the tests, thanks.
@git8d When possible, please also test the storage performance in a Linux-based VM; then we will have more clues, thanks.
@w13915984028, thanks for coming back to me. Yeah, the bare-metal server is a single socket with 8 cores; I'm able to run 3-4 VMs with 4 vCPUs each. Below are some further tests I have performed after introducing a new Harvester bare-metal server.

Test results below, with Grafana metrics. Surprisingly, the new node performed better than the previous node. Below is the Grafana output for the VM test; notice that CPU usage (shown in yellow) is low. Now back to the original node with the same VM: I ran another test on NVMe, and this one is worse than the previous tests, which is really odd. Shown below is a strange network I/O spike. Why? The storage is local with a single replica. The graph appears to show that the network is a bottleneck, which might explain the slow NVMe I/O. Thoughts?
@w13915984028, what tool do you recommend for testing under a Linux VM?
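One common choice on Linux (and what Longhorn's own performance reports use) is fio. A sketch of a sequential-write test, assuming the Longhorn-backed disk is mounted at /data (an illustrative path):

```shell
# Hedged example: sequential 1M writes with direct I/O (--direct=1
# bypasses the page cache, so results reflect the storage path).
# Adjust --filename and --size for your environment.
fio --name=seq-write --filename=/data/fio.test \
    --rw=write --bs=1M --size=4g \
    --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
```

Swapping --rw=write for --rw=read (or randread/randwrite with a smaller --bs) gives the other CrystalDiskMark-like data points.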
@git8d The network spike looks really tricky. Per your description, the test should not involve the network, but it seems to. Could you add more details about your test case: how it is done in Windows, with which software (CrystalDiskMark, ...), and how the file is copied?
Hi @w13915984028, please see below; hopefully this is enough info. Let me know if you need more. Below are Harvester's local disks. Below is the test VM's disk config. Below you can see two types of disks; both sit on NVMe, but they differ in their bus types. Below is the NVMe storage class. How are the tests performed?
Longhorn SPDK. I know that the underlying storage is run by Longhorn, and I have reference material showing that the design prioritized reliability over performance. I can see that Longhorn SPDK is a project, but it appears to be early days; it may help in this situation. Assuming this will be integrated with Harvester at some point?
Longhorn CLI for SPDK
Improving Longhorn Performance With SPDK - Keith Lucas & David Ko, SUSE
@w13915984028, note that I have since tested on several platforms with a variety of throughput results. The results below are the best outcomes, and are based on using the right controller, bus, and drivers.
@git8d Thanks for all of those further tests and the detailed information. Are the last two tests, Proxmox and XCP-ng, based on the same/similar single-node environment, copying files within the same single VM on those two platforms?
Hi @w13915984028, the tests were performed on the exact same hardware. The VMs were run under each platform (Proxmox / XCP-ng), with CrystalDiskMark run inside each VM for its respective tests. Both Proxmox and XCP-ng were built as standalone servers. All tests were local to each platform, inside the VM itself. Note that none of the tests involved file copies or network transfers. Let me know if you need any more info.
I want to mention that Longhorn is now v1.4.0 and includes this fix: longhorn/longhorn#3957. I was wondering how performance might be with a single replica now that it uses a local socket instead of a TCP connection.
@git8d thanks for your last update. We are discussing in the team and will spend some time analyzing the possible bottleneck.
@w13915984028, thanks, keep us posted.
I am also seeing very similar results; performance is extremely slow. In my case I should be seeing roughly 7 Gb/s read speeds (which is the case if I run an OS directly on the hardware), but in Harvester I only get a fraction of those speeds, as can be seen below. I am using the latest stable viostor disk driver on Windows Server 2022. If it helps, tomorrow I could set up a PV with
I should mention that I've tried using Rook/Ceph with this config, just slightly modified to specify the drives with a deviceFilter, and the speeds there seem more acceptable.
Harvester version: 1.1.1
Can you share the changes you made? I tried with a consumer-grade NVMe and have the same results as below with the VirtIO drivers from VMDP and a single replica configured in Longhorn on a single node. What did you use as the disk type or file system for Proxmox? Comparing direct XFS/ext4 against Longhorn, which has distribution built into its design, may set incorrect expectations. Did you compare this against a single-node Ceph/RBD or another distributed storage?
@abonillabeeche, thanks for responding. "Can you share the changes you made? ... What did you use as the disk type or file system for Proxmox? ... Did you compare this against a single-node Ceph/RBD or another distributed storage?" Perhaps a better approach would be for us to standardise a series of testing scenarios; then we might nail the issue. Would you be able to recommend test cases for Harvester? I will run the tests and supply the output so we can be on the same page regarding setup, config, etc. Let me know how you would like to proceed.
I've just tried Longhorn + KubeVirt on a bare-metal Debian 11 (bullseye) Kubernetes cluster, on the same hardware I tested above, and I get slightly better speeds. I deployed the cluster using kubeadm. Kubernetes version:
Using a single best-effort local Longhorn replica:
Using Longhorn strict-local data locality:
Using Rook Ceph with this config plus a deviceFilter yields WAY faster sequential speeds:
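For anyone reproducing the strict-local case: data locality is a Longhorn StorageClass parameter (strict-local requires exactly one replica and Longhorn >= 1.4). A sketch, with an illustrative class name:

```yaml
# Sketch only: keeps the single replica on the node running the workload,
# avoiding the network hop between the engine and its replica.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"         # strict-local only works with one replica
  dataLocality: "strict-local"
```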
@abonillabeeche , how should we proceed with troubleshooting. Let me know when you can. |
@davidpanic would you mind sharing a little more detail on the Ceph setup? I'd like to try this as a possible interim alternative to the native longhorn. Any help is appreciated! |
@git8d Disks: 2x SAMSUNG MZ1L21T9HCLS-00A07 (1.75 TiB, 1920383410176 bytes) @ /dev/nvme0n1 and /dev/nvme1n1

Cluster manifest:

kind: ConfigMap
apiVersion: v1
metadata:
name: rook-config-override
namespace: rook-ceph # namespace:cluster
data:
config: |
[global]
osd_pool_default_size = 1
mon_warn_on_pool_no_redundancy = false
bdev_flock_retry = 20
bluefs_buffered_io = false
mon_data_avail_warn = 10
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
name: cluster
namespace: rook-ceph # namespace:cluster
spec:
dataDirHostPath: /var/lib/rook
cephVersion:
image: quay.io/ceph/ceph:v17
allowUnsupported: true
mon:
count: 1
allowMultiplePerNode: true
mgr:
count: 1
allowMultiplePerNode: true
dashboard:
enabled: true
crashCollector:
disable: true
storage:
useAllNodes: false
useAllDevices: false
nodes:
- name: n1
devices:
- name: /dev/nvme0n1
- name: /dev/nvme1n1
healthCheck:
daemonHealth:
mon:
interval: 45s
timeout: 600s
priorityClassNames:
all: system-node-critical
mgr: system-cluster-critical
disruptionManagement:
managePodBudgets: true
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: builtin-mgr
namespace: rook-ceph # namespace:cluster
spec:
name: .mgr
replicated:
size: 1
    requireSafeReplicaSize: false

Replica pool and storage class manifest:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: replicapool
namespace: rook-ceph # namespace:cluster
spec:
failureDomain: osd
replicated:
size: 1
# Disallow setting pool with replica 1, this could lead to data loss without recovery.
# Make sure you're *ABSOLUTELY CERTAIN* that is what you want
requireSafeReplicaSize: false
# gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
# for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
#targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com # driver:namespace:operator
parameters:
# clusterID is the namespace where the rook cluster is running
# If you change this namespace, also change the namespace below where the secret namespaces are defined
clusterID: rook-ceph # namespace:cluster
# If you want to use erasure coded pool with RBD, you need to create
# two pools. one erasure coded and one replicated.
# You need to specify the replicated pool here in the `pool` parameter, it is
# used for the metadata of the images.
# The erasure coded pool must be set as the `dataPool` parameter below.
#dataPool: ec-data-pool
pool: replicapool
# RBD image format. Defaults to "2".
imageFormat: "2"
# RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
imageFeatures: layering
# The secrets contain Ceph admin credentials. These are generated automatically by the operator
# in the same namespace as the cluster.
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
# Specify the filesystem type of the volume. If not specified, csi-provisioner
# will set default as `ext4`.
csi.storage.k8s.io/fstype: ext4
# uncomment the following to use rbd-nbd as mounter on supported nodes
#mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete

Keep in mind that you should use other manifests in production; this is for TESTING ONLY, as it runs on one node and without data redundancy! Read the docs for more info on why.
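To exercise the rook-ceph-block class defined above, a minimal PVC sketch (name and size are illustrative) could be:

```yaml
# Hypothetical PVC bound to the rook-ceph-block StorageClass above;
# attach it to a pod or KubeVirt VM and run the benchmark against it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bench-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 10Gi
```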
Similar to the above: I have a dedicated bonded management network at 20 GbE with 9000 MTU across hosts, and I am seeing very poor disk I/O for VMs with the native container storage. The storage is the same on all 3 hosts, using PERC H730 (1 GB cache) RAID5 SSDs that happily exceed 800 MB/s on bare-metal Windows. With the VM on Harvester, there is no difference in speed with a NIC disabled. Other than moving to Rook Ceph storage (https://harvesterhci.io/kb), is there any other way to natively improve VM disk performance with Harvester?
There is a knowledge base article posted here now https://harvesterhci.io/kb/best_practices_for_optimizing_longhorn_disk_performance |
Moving to 1.4.0 to evaluate this with the v2 storage engine.
Hi.
Wondering if someone could please assist with some understanding of Harvester, SSD, and NVMe best practices for a virtual machine guest.
What's the problem?
Windows or Linux virtual machines running under Harvester 1.1.1 have slow disk performance using Gen3 NVMe drives in a PCIe x8 slot.
What is expected?
Virtual machine write speeds hover around 50-100 MB/s, but I am expecting near-hardware performance of 1000+ MB/s.
What version of Windows?
Windows 2022 server and or Linux Ubuntu 22.04.
Harvester - Virtual Volume Bus Type
Using VirtIO or SATA produces the same slow performance.
How were the VM's created?
Using the Harvester VM templates with vmdp:2.5.3.
I have read a number of articles pertaining to slow NVMe performance with KVM, VMware, etc., but found no real solutions. Any help or guidance is appreciated.