[KB]NVMe PCIe - Slow Virtual Machine Performance #3356

Open
git-day opened this issue Jan 11, 2023 · 26 comments
Labels: kind/question, priority/1, require/investigate, require/knowledge-base

Comments

@git-day

git-day commented Jan 11, 2023

Hi.

I'm wondering if someone could please help me understand best practices for Harvester with SSD and NVMe storage under a virtual machine guest.

What's the problem?
Windows or Linux virtual machines running under Harvester 1.1.1 have slow disk performance on Gen3 NVMe drives installed in a PCIe x8 slot.

What is expected?
Virtual machine write speeds hover around 50-100 MB/s, but I am expecting near-hardware performance of 1000+ MB/s.

What version of Windows?
Windows Server 2022 and/or Ubuntu Linux 22.04.

Harvester - Virtual Volume Bus Type
Using Virtio or SATA produces the same slow performance.

How were the VMs created?
Using the Harvester VM templates with vmdp:2.5.3.

I have read a number of articles pertaining to slow NVMe performance with KVM, VMware, etc., but found no real solutions. Any help or guidance is appreciated.

@w13915984028
Member

Please list more detailed information:

Harvester Version
Cluster scale, how many NODEs, bare metal or virtual machine, how those NODEs are connected (physical switch or virtual switch), 
network card speed, 
CPU cores, speed, 
memory size 
etc.

Harvester uses Longhorn as its K8s CSI driver; a VM's image and volumes are backed by Longhorn volumes (see https://longhorn.io/docs/1.3.2/concepts/).

By default, each volume has 3 replicas, so every write from the VM is replicated; write performance therefore depends on CPU, network, and storage speed.

With Harvester v1.1.1, you can try creating a StorageClass with only 1 replica to test the performance (https://docs.harvesterhci.io/v1.1/advanced/storageclass/#creating-a-storage-class); this decreases reliability but improves performance.

You can also test this in a single-node cluster; v1.1.1 is not required.
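
A single-replica StorageClass for such a test might look roughly like the sketch below. This is an illustration under assumptions, not a copy of the Harvester documentation: the name `longhorn-single-replica` is hypothetical, and the `staleReplicaTimeout` and `migratable` values are assumed to mirror what Harvester's built-in Longhorn class uses.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica   # hypothetical name
provisioner: driver.longhorn.io   # Longhorn CSI driver
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "1"           # one copy only: fewer writes, no redundancy
  staleReplicaTimeout: "30"       # minutes; assumed to match the default class
  migratable: "true"              # assumption: matches Harvester's default class
```

Volumes created from this class hold a single copy of the data, so it is suited to benchmarking rather than to workloads that need to survive a disk failure.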

Thanks.

@git-day
Author

git-day commented Jan 13, 2023

Hi Jian,

Many thanks for responding. Here is some of the basic info which you have requested.

Harvester Version: 1.1.1
Cluster scale: Single cluster, single node
How many NODEs: One
Bare metal or virtual machine: Bare metal
How those NODEs are connected (physical switch or virtual switch): Physical Switch
Network card speed: 1Gb
CPU cores, speed: 8 Cores 3.6GHz
Memory size: 32GB

I have tried the following:

Trial No1: Windows 2022 Server VM, Volume Single SSD, Volume Single replica, Single cluster, Single node. VM specs are 4 vCPU, 4GB RAM, 150GB disk. Default Windows template and VMDP drivers which came with Harvester. Speeds remain poor: 30-50MB/s when copying between VMs hosted under Harvester. For a local copy (local read/write), the write starts at 200MB/s and then drops to 50MB/s.

Trial No2: Windows 2022 Server VM, Volume Single NVMe, Volume Single replica, Single cluster, Single node. VM specs are 4 vCPU, 4GB RAM, 150GB disk. Default Windows template and VMDP drivers which came with Harvester. Speeds remain poor, much the same as SSD: 30-50MB/s when copying between VMs hosted under Harvester. For a local copy (local read/write), the write starts at 200MB/s and then drops to 50MB/s.

I then ran a basic CrystalDiskMark test with default settings, with the following results.

SSD VM
image

NVMe VM
image

Are the above results normal for this current version of Harvester and Longhorn? I also notice heavy VM latency when opening apps; they take a number of seconds to launch.

I feel like I'm missing something in the config, such as cloud-init, or am I not using the right drivers? There doesn't appear to be much to the Harvester storage interface when it comes to config, unless there is something under the CLI which needs enabling or editing to help improve VM performance.

Worth noting that I have found a few Longhorn posts which suggest that storage platform stability and recoverability have taken precedence over performance. Am I right to assume that what we have here today is the best we can get from the backend storage?

Performance and Scalability Report for Longhorn v1.0
https://longhorn.io/blog/performance-scalability-report-aug-2020/

Very disappointing performance, is this expected?
longhorn/longhorn#3037

I have not gone down the path of implementing additional nodes in the cluster yet. Would this help VM performance, given the Longhorn replica architecture? Maybe not writes, but at least reads, which may reduce latency?

Keen to hear your thoughts.

@w13915984028
Member

@git8d thanks for your detailed test and feedback. I will discuss this with the Longhorn engineers; it may take some days.

@w13915984028
Member

@git8d I have a question about your environment: does your bare metal server have only 8 CPU cores at 3.6GHz? With so few CPU cores (a blank cluster itself already reserves a certain amount of CPU), is it still OK to deploy a Windows VM with 4 vCPUs? I suspect the performance bottleneck may be related to the limited CPU cores.

Please also watch the CPU usage of the cluster and the VM in the Harvester GUI while running the test, thanks.

@w13915984028
Member

@git8d When possible, please also test the storage performance in a Linux based VM. Then we will have more clues, thanks.

@git-day
Author

git-day commented Jan 19, 2023

@w13915984028, thanks for coming back to me. Yeah, the bare metal server is on a single socket with 8 cores. I'm able to run 3-4 VMs with 4 vCPUs each.

Below are some further tests which I have performed.

Introduced a new Harvester bare metal server

  • Same specs as the previous bare metal server; the exception is the NVMe. This server has an SSD for the OS and 2 x SSDs for use in the storage class.
  • SSDs and node have been tagged accordingly.
  • New storage class created with a single replica on a single SSD
  • New Windows VM created that is pinned to the new node and storage class using a single SSD

Test result below with Grafana metrics.

Surprisingly, the results below show that the new node performed better than the previous node.

image

Below is the Grafana output for the VM test. Notice that the CPU is low, it is shown in the Yellow colour.
image

Now back to the original node with the same VM. I ran another test under NVMe, and now we have a result that is worse than previous tests, which is really odd.
image

Shown below is a strange Network IO spike, why? The storage is local with a single replica. The below appears to show that the network is a bottleneck, which might explain the slow NVMe IO.
image

Thoughts?

@git-day
Author

git-day commented Jan 19, 2023

> @git8d When possible, please also test the storage performance in a Linux based VM. Then we will have more clues, thanks.

@w13915984028, what tool do you recommend to use under a Linux VM?
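
For reference, fio is a widely used disk benchmark on Linux. The cloud-init user data below is a minimal sketch of how an Ubuntu 22.04 guest could install it and run one sequential and one random write test, loosely comparable to the CrystalDiskMark runs above. The file paths, job names, block sizes, queue depths, and test size are all illustrative assumptions, not values recommended anywhere in this thread.

```yaml
#cloud-config
# Assumption: Ubuntu 22.04 guest with access to a package mirror.
package_update: true
packages:
  - fio
write_files:
  # fio job file (INI format); path and job names are hypothetical.
  - path: /root/disk-bench.fio
    content: |
      [global]
      ioengine=libaio
      direct=1
      size=4g
      runtime=60
      time_based
      filename=/root/fio-testfile

      [seq-write-1m]
      rw=write
      bs=1m
      iodepth=8
      stonewall

      [rand-write-4k]
      rw=randwrite
      bs=4k
      iodepth=32
      stonewall
runcmd:
  # Runs both jobs one after the other and saves the report for later review.
  - fio --output=/root/fio-results.txt /root/disk-bench.fio
```

`direct=1` bypasses the guest page cache, so the numbers reflect the virtual disk rather than guest RAM; `stonewall` makes each job wait for the previous one so the two tests do not overlap.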

@w13915984028
Member

@git8d The network spike looks really tricky. Per your description, the test should not be related to the network, but it seems to be.

Could you add more details about your test case: how it is done in Windows, with which software (CrystalDiskMark, ...), and how the file is copied locally? A diagram would be even better; details of what happens under the hood will help us, thanks.

@git-day
Author

git-day commented Jan 28, 2023

Hi @w13915984028, please see below. Hopefully this is enough info. Let me know if you need more.

Below is Harvester local disks.
image

Below is the test VM disk config
image

Below you can see two types of disks, both sit under NVMe, but they differ with their Bus types.
image

Below is the NVMe storage class.
image
image

How are the tests performed?

  • CrystalDiskMark is installed on the VM. It runs using the default settings.
  • All tests are performed under the VM, nothing runs over the network as such.

Physical host diag
image

@git-day
Author

git-day commented Jan 28, 2023

Longhorn SPDK.

I know that the underlying storage is run by Longhorn, and I have reference material showing that its design prioritized reliability over performance. I can see that Longhorn SPDK is a project; it appears to be in its early days, but it may help in this situation. I assume this will be integrated with Harvester at some point?

Longhorn CLI for SPDK
https://github.com/longhorn/longhorn-spdk-engine

Improving Longhorn Performance With SPDK - Keith Lucas & David Ko, SUSE
https://www.youtube.com/watch?v=ve-wXQZNjlg

@git-day
Author

git-day commented Jan 28, 2023

@w13915984028, note that I have since tested on several platforms with a variety of performance throughput. The results below are the best outcomes, and are based on using the right Controller, Bus and drivers.

XCP-ng Single NVMe (VM)
image

Proxmox Single NVMe (VM)
image

@w13915984028
Member

@git8d Thanks for all of those further tests and the detailed information.

Are the last two tests (Proxmox, XCP-ng) based on the same or a similar single-node environment, copying files within the same single VM on those two platforms?

@git-day
Author

git-day commented Jan 31, 2023

Hi @w13915984028, the tests were performed on the exact same hardware. The VMs were run under each platform (Proxmox / XCP-ng), with CrystalDiskMark run inside each VM in their respective tests.

Both Proxmox and XCP-ng were built as standalone servers. All tests were local to each platform, inside the VM itself.

Note that none of the tests involved file copies or network tests.

Let me know if you need any more info.

@TheoBassaw

I want to mention that Longhorn is now at v1.4.0 and includes this fix: longhorn/longhorn#3957. I was wondering how performance might be with a single replica now that it uses a local socket instead of a TCP connection.

@w13915984028
Member

@git8d thanks for your last update, we are discussing in the team, and will spend some time to analyze the possible bottleneck.

@git-day
Author

git-day commented Feb 13, 2023

@w13915984028 , thanks, keep us posted.

@davidpanic

davidpanic commented Feb 15, 2023

I am also seeing very similar results. Performance is extremely slow. In my case I should be seeing about 7ish Gb/s read speeds (which is the case if I run an OS directly on the hardware), but in Harvester I only get a fraction of those speeds, as can be seen below. I am using the latest stable viostor disk driver on Windows Server 2022.

image

If it helps, tomorrow I could set up a PV with volumeMode: Block directly on the drives and run the tests again, to confirm whether Longhorn is the bottleneck.
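
A raw block PV of that kind could be sketched as below, using the standard Kubernetes `local` volume type. The object names, the storage class name, the node name `n1`, and the capacity are assumptions chosen for illustration; the device path is one of the NVMe devices listed later in this thread.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme0-raw-block              # hypothetical name
spec:
  capacity:
    storage: 1788Gi                  # assumption: just under the listed drive size
  volumeMode: Block                  # expose the device raw, with no filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme-block # hypothetical; only used to match the claim
  local:
    path: /dev/nvme0n1
  nodeAffinity:                      # required for local volumes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - n1                 # assumption: the node that owns the drive
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvme0-raw-block-claim        # hypothetical name
spec:
  volumeMode: Block
  accessModes:
    - ReadWriteOnce
  storageClassName: local-nvme-block
  resources:
    requests:
      storage: 1788Gi
```

Because this path bypasses Longhorn entirely, a benchmark against a disk backed by this claim gives a baseline for what the virtualization layer alone costs.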

I should mention that I've tried using Rook/Ceph with this config, just slightly modified to specify the drives with a deviceFilter and the speeds seem more acceptable there.
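
For readers unfamiliar with it, `deviceFilter` is a field in the `storage` section of a Rook `CephCluster` that selects devices by a regular expression on their short kernel names. The fragment below is only a sketch of that idea: the regular expression is an assumption chosen to match two NVMe namespaces named `nvme0n1` and `nvme1n1`, and it is not the exact configuration used for the results here.

```yaml
# Fragment of a CephCluster spec; only the storage section is shown.
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    # Regular expression on short kernel device names (no /dev/ prefix).
    # Assumption: the target drives are nvme0n1 and nvme1n1.
    deviceFilter: "^nvme[01]n1$"
```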

Harvester Version: 1.1.1
Cluster scale: Single cluster, single node
How many NODEs: One
Bare metal or virtual machine: Bare metal
How those NODEs are connected (physical switch or virtual switch): Physical Switch
Network card speed: 20 Gb/s, two direct attach SFP ports, confirmed working with other OS
CPU cores, speed: AMD EPYC 7543 - 32 Cores / 64 Threads @ 3.7GHz
Memory size: 256GB

@guangbochen guangbochen added the require/investigate Identified the issue but require further investigation for resolution (won't be stale) label Feb 17, 2023
@guangbochen guangbochen added this to the v1.2.0 milestone Feb 17, 2023
@guangbochen guangbochen added the priority/1 Highly recommended to fix in this release label Feb 17, 2023
@abonillabeeche

> @w13915984028, note that I have since tested on several platforms with a variety of performance throughput. The results below are the best outcomes, and are based on using the right Controller, Bus and drivers.

Can you share the changes you made? I tried with a consumer-grade NVMe and have the same results as below with the VirtIO drivers from VMDP, a single Replica configured in Longhorn on a single node.

XCP-ng Single NVMe (VM) image

Proxmox Single NVMe (VM) image

What did you use as the disk type or file system for Proxmox? Comparing direct XFS/ext4 with Longhorn, which is distributed by design, may set the wrong expectation. Did you compare this against a single-node Ceph/RBD or another distributed storage?

@git-day
Author

git-day commented Feb 24, 2023

@abonillabeeche, thanks for responding.

> Can you share the changes you made?

When you ask me to share the changes, which are you referring to? Harvester or the other hypervisors?

> I tried with a consumer-grade NVMe and have the same results as below with the VirtIO drivers from VMDP, a single Replica configured in Longhorn on a single node.

Could you be a little more explicit about what you mean by this statement? Where did you try this? Please share the details if you can.

> What did you use as the disk type or file system for Proxmox?

What I used for Proxmox is a mix of ext4 and ZFS; both had differing results, but vastly better performance than those shared from Harvester.

> Comparing direct XFS/ext4 with Longhorn, which is distributed by design, may set the wrong expectation.

I agree, but when I was testing, I kept to the bare minimum, i.e., I enabled only a single replica.

> Did you compare this against a single-node Ceph/RBD or another distributed storage?

Is this question related to Harvester? If so, then no, I did not compare.

Perhaps a better approach would be for us to standardise a series of testing scenarios, then we might nail the issue, potentially. Would you be able to recommend test cases for Harvester? I will run the tests and supply the output so we can try and be on the same page regarding setup, config etc.

Let me know how you would like to proceed.

@davidpanic

I've just tried Longhorn+KubeVirt on a bare metal Debian 11 (bullseye) Kubernetes cluster on the same hardware that I tested above and I get slightly better speeds.

I deployed the cluster using kubeadm.

Kubernetes version: 1.26.1
Longhorn version: 1.4.0
KubeVirt HyperConverged Cluster Operator version: 1.8.0

Using a single best-effort local Longhorn replica:
image

Using Longhorn strict-local data locality:
image
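
For context, the two Longhorn configurations above differ only in the `dataLocality` StorageClass parameter. The sketch below shows what such classes might look like; the class names are hypothetical, and these are not the exact manifests used for the benchmarks.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-best-effort     # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "best-effort"    # try to keep a replica on the workload's node
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict-local    # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"          # strict-local only works with a single replica
  dataLocality: "strict-local"   # replica must stay on the workload's node
```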

Using rook.io ceph with this config + a deviceFilter yields WAY faster sequential speeds:
image

@git-day
Author

git-day commented Mar 8, 2023

@abonillabeeche , how should we proceed with troubleshooting. Let me know when you can.

@git-day
Author

git-day commented Mar 8, 2023

@davidpanic would you mind sharing a little more detail on the Ceph setup? I'd like to try this as a possible interim alternative to the native Longhorn. Any help is appreciated!

@davidpanic

@git8d

Disks: 2x SAMSUNG MZ1L21T9HCLS-00A07 (1.75 TiB, 1920383410176 bytes) @ /dev/nvme0n1 and /dev/nvme1n1

Cluster manifest:
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-config-override
  namespace: rook-ceph # namespace:cluster
data:
  config: |
    [global]
    osd_pool_default_size = 1
    mon_warn_on_pool_no_redundancy = false
    bdev_flock_retry = 20
    bluefs_buffered_io = false
    mon_data_avail_warn = 10
---
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: cluster
  namespace: rook-ceph # namespace:cluster
spec:
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: quay.io/ceph/ceph:v17
    allowUnsupported: true
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
  dashboard:
    enabled: true
  crashCollector:
    disable: true
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
    - name: n1
      devices:
      - name: /dev/nvme0n1
      - name: /dev/nvme1n1
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
        timeout: 600s
  priorityClassNames:
    all: system-node-critical
    mgr: system-cluster-critical
  disruptionManagement:
    managePodBudgets: true
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: builtin-mgr
  namespace: rook-ceph # namespace:cluster
spec:
  name: .mgr
  replicated:
    size: 1
    requireSafeReplicaSize: false

Replica pool and storage class manifest:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph # namespace:cluster
spec:
  failureDomain: osd
  replicated:
    size: 1
    # Disallow setting pool with replica 1, this could lead to data loss without recovery.
    # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
    requireSafeReplicaSize: false
    # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
    # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
    #targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com # driver:namespace:operator
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph # namespace:cluster

  # If you want to use erasure coded pool with RBD, you need to create
  # two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  #dataPool: ec-data-pool
  pool: replicapool

  # RBD image format. Defaults to "2".
  imageFormat: "2"

  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
  imageFeatures: layering

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`.
  csi.storage.k8s.io/fstype: ext4
# uncomment the following to use rbd-nbd as mounter on supported nodes
#mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete

Keep in mind that you should use other manifests in production, this is for TESTING ONLY as it runs on one node and without data redundancy! Read the docs for more info on why.

@guangbochen guangbochen changed the title NVMe PCIe - Slow Virtual Machine Performance [KB] NVMe PCIe - Slow Virtual Machine Performance Jun 20, 2023
@guangbochen guangbochen changed the title [KB] NVMe PCIe - Slow Virtual Machine Performance [KB]NVMe PCIe - Slow Virtual Machine Performance Jun 20, 2023
@guangbochen guangbochen modified the milestones: v1.2.0, v1.2.1 Jun 29, 2023
@Abend1

Abend1 commented Oct 9, 2023

Similar to the above: I have a dedicated bonded MGMT network at 20GbE with 9000 MTU across the hosts' LAN, and I am seeing very poor disk I/O for VMs on the native container storage. The storage is the same on all 3 hosts: PERC H730 (1Gb cache) RAID5 SSDs that, on bare-metal Windows, happily exceed 800MB/s. With the VM on HVR, there is no difference in speed with a NIC disabled. Other than moving to Rook Ceph storage (https://harvesterhci.io/kb), is there any other way to natively improve VM disk performance with Harvester?

image

@markhillgit

There is a knowledge base article posted here now https://harvesterhci.io/kb/best_practices_for_optimizing_longhorn_disk_performance

@markhillgit

moving to 1.4.0 to evaluate this with v2 storage engine

@markhillgit markhillgit modified the milestones: v1.3.0, v1.4.0 Jan 30, 2024
@rebeccazzzz rebeccazzzz moved this to Resolved/Scheduled in Community Issue Review Aug 21, 2024
@bk201 bk201 modified the milestones: v1.4.0, v1.5.0 Sep 10, 2024
Status: Resolved/Scheduled
10 participants