[DNM] Architectural LEP for sharded data layout #5444
base: master
Conversation
Signed-off-by: Lars Marowsky-Bree <lmb@suse.com>
`shard_count` envelope (which might have been over-allocated at
creation time), LH can simply adjust the exposed size and return.

- If `shard_size == stripe_size`, it is straightforward to allocate
You mean `shard_size` can be different from `stripe_size`? Then how does Longhorn allocate RAID0 stripes to the disk shards? Or is a shard only a concept in Longhorn, and Longhorn won't do anything for the actual disk (when setting/changing a shard)?
For example, if `stripe_size = 1.5 * shard_size`, will Longhorn round up `stripe_size` to `2 * shard_size` at the beginning of the allocation? (This is not about volume expansion.)
## Future Work

- **SPDK**: API call to add or remove a member device for `raid0`
Is this a kind of rebalance? And is this allowed for running volumes?
This is not rebalancing. This is about shrinking the volume once the space has been compacted.
- **SPDK**: Is there one SPDK management process per backend
  device/blobstore, or one per node that manages several?
One target per node; each engine/replica process handles one bdev. In this LEP, the replicas would be striped replicas, which adds more complexity to the management.
- If `volume_size % shard_size != 0`, the last shard might only be
  partially used.
Are there any concerns about implementation and maintenance if `volume_size % shard_size != 0`?
I don't see any.
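As a minimal illustration of why a non-aligned `volume_size` is mostly harmless, the arithmetic below (a standalone Go sketch, not Longhorn code; all sizes are made-up) rounds the shard count up and lets the last shard expose only the remainder:

```go
package main

import "fmt"

// Standalone sketch: compute how many shards a volume needs and how much
// of the last shard is actually used when volume_size is not a multiple
// of shard_size. The sizes are illustrative only.
func main() {
	var (
		volumeSize int64 = 10 << 30 // 10 GiB volume
		shardSize  int64 = 3 << 30  // 3 GiB shards
	)

	shardCount := (volumeSize + shardSize - 1) / shardSize // round up
	lastShardUsed := volumeSize - (shardCount-1)*shardSize

	fmt.Printf("shards: %d, last shard uses %d of %d bytes\n",
		shardCount, lastShardUsed, shardSize)
	// Output: shards: 4, last shard uses 1073741824 of 3221225472 bytes
}
```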
  sufficient capacity.
- If not, each shard may need to individually be migrated to a device
  that can hold the new size.
- This somewhat negatively impacts the allocation granularity.
Can you elaborate more on the allocation granularity and the negative impact?
Compare a shard size that's 5% of our backend devices with one that is 10%.
With 5% shards we can still allocate a shard once all our devices are over 90% full, and distribute shards across multiple devices as needed.
With 10% shards we can't.
So smaller shard sizes allow more effective use of space, at the cost of slightly more overhead; the question is always the trade-off.
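To make the trade-off concrete, here is a toy Go sketch (not Longhorn code; the device size and fill level are made-up numbers) showing that once devices are 93% full, a shard sized at 5% of a device still fits while a 10% shard does not:

```go
package main

import "fmt"

// Toy illustration of allocation granularity: smaller shards can still be
// placed in the remaining free space of nearly-full devices.
func main() {
	const deviceSize = 1000 // arbitrary capacity units
	const freePct = 7       // devices are 93% full, so 7% is free
	free := deviceSize * freePct / 100

	for _, shardPct := range []int{5, 10} {
		shardSize := deviceSize * shardPct / 100
		fmt.Printf("a %d%% shard (%d units) fits into %d free units: %v\n",
			shardPct, shardSize, free, shardSize <= free)
	}
	// a 5% shard (50 units) fits into 70 free units: true
	// a 10% shard (100 units) fits into 70 free units: false
}
```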
Growing may be feasible within certain limits:
- if the backing storage devices holding each shard still have
  sufficient capacity.
- If not, each shard may need to individually be migrated to a device
Can it support online shard migration?
Sure, a RAID1 bdev can "migrate" a shard by allocating the new location and then removing the old one, as discussed elsewhere in the document.
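For illustration, a hedged Go sketch of that "allocate new, then remove old" sequence is below. The `SpdkClient` interface and all its method names are hypothetical placeholders, not Longhorn code and not SPDK's actual RPC surface:

```go
package sharding

import (
	"context"
	"fmt"
)

// SpdkClient is a hypothetical wrapper; the method names are placeholders
// and do not claim to match SPDK's real RPC interface.
type SpdkClient interface {
	CreateLvol(ctx context.Context, lvstore, name string, sizeBytes uint64) (bdev string, err error)
	RaidAddBaseBdev(ctx context.Context, raid, bdev string) error
	WaitUntilSynced(ctx context.Context, raid string) error
	RaidRemoveBaseBdev(ctx context.Context, raid, bdev string) error
	DeleteLvol(ctx context.Context, bdev string) error
}

// migrateShard sketches moving one shard (a raid1 member) to a different
// backing device: allocate the new location, let the raid1 sync to it,
// and only then drop and delete the old member.
func migrateShard(ctx context.Context, c SpdkClient, raid1, oldBdev, dstLvstore string, shardSize uint64) error {
	// 1. Allocate the shard's new location on the destination device.
	newBdev, err := c.CreateLvol(ctx, dstLvstore, raid1+"-migrated", shardSize)
	if err != nil {
		return fmt.Errorf("allocate new shard location: %w", err)
	}
	// 2. Attach it as an additional raid1 member and wait for it to sync.
	if err := c.RaidAddBaseBdev(ctx, raid1, newBdev); err != nil {
		return err
	}
	if err := c.WaitUntilSynced(ctx, raid1); err != nil {
		return err
	}
	// 3. Only then remove the old member and reclaim its space.
	if err := c.RaidRemoveBaseBdev(ctx, raid1, oldBdev); err != nil {
		return err
	}
	return c.DeleteLvol(ctx, oldBdev)
}
```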
- At allocation time, Longhorn instantiates the top-level `raid0` as the
  aggregation layer. (See the `Notes` section for a brief discussion of raid0
  vs concatenation, and the `Growing a volume` section for a discussion
A little confused about `shard_size` and `stripe_size`.
Would you be able to provide the definition of `stripe_size` and its relationship to `shard_size`?
`stripe_size` is what RAID0 uses when distributing the data across its member devices.
`shard_size` is the allocation size of the member devices and the raid1s.
They can technically be different, as discussed in the LEP; that choice comes with some trade-offs.
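As a concrete (and purely illustrative) reading of that relationship, the Go sketch below assumes `shard_size` is a whole multiple of `stripe_size` and shows how a standard raid0 layout would map a logical volume offset onto a shard; none of these names are Longhorn code:

```go
package sharding

// Illustrative constants: stripe_size is the raid0 interleave unit,
// shard_size is the size of each raid0 member (one shard / raid1).
const (
	stripeSize uint64 = 64 << 10 // 64 KiB raid0 stripe
	shardSize  uint64 = 1 << 30  // 1 GiB per shard
	shardCount uint64 = 4        // volume size = shardCount * shardSize
)

// locate maps a logical volume offset to (shard index, offset within that
// shard) using the usual raid0 round-robin striping.
func locate(offset uint64) (shard, shardOffset uint64) {
	stripe := offset / stripeSize
	shard = stripe % shardCount
	shardOffset = (stripe/shardCount)*stripeSize + offset%stripeSize
	return shard, shardOffset
}
```

If `stripe_size` did not divide `shard_size` evenly, a stripe row would straddle a shard boundary, which is presumably why the earlier question about rounding `stripe_size` up arises.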
This is an initial draft for discussing the architecture of sharding Longhorn volumes across multiple devices/nodes (on top of SPDK).