[DNM] Architectural LEP for sharded data layout #5444

Draft
wants to merge 1 commit into base: master
Conversation

@l-mb commented on Feb 23, 2023

This is an initial draft for discussing the architecture of sharding Longhorn volumes across multiple devices/nodes (built on top of SPDK).

Signed-off-by: Lars Marowsky-Bree <lmb@suse.com>
@l-mb marked this pull request as draft on February 28, 2023 at 11:09
enhancements/20230221-spdk-sharding.md
shard_count` envelope (which might have been over-allocated at
creation time), LH can simply adjust the exposed size and return.

- If `shard_size == stripe_size`, it is straightforward to allocating

Contributor:

You mean shard_size can be different from stripe_size? Then how does Longhorn allocate RAID0 stripes to the disk shards? Or is a shard only a concept in Longhorn, so Longhorn won't do anything to the actual disk when setting or changing shards?
For example, if stripe_size = 1.5 * shard_size, will Longhorn round up stripe_size to 2 * shard_size at the beginning of the allocation? (This is not about volume expansion.)


## Future Work

- **SPDK**: API call to add or remove a member device for `raid0`

Contributor:

Is this a kind of rebalance? And is this allowed for running volumes?

Author:

This is not rebalancing. This is about shrinking the volume once the space has been compacted.

Comment on lines +362 to +363
- **SPDK**: Is there one SPDK management process per backend
device/blobstore, or one per node that manages several?

Contributor:

One target per node. Each engine/replica process serves one bdev. In this LEP, the replicas would become stripe replicas, which adds more complexity to the management.

Comment on lines +153 to +154
- If `volume_size % shard_size != 0`, the last shard might only be
partially used.

Member:

Are there any concerns about implementation and maintenance if volume_size % shard_size != 0?

Author:

I don't see any.
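
For concreteness, the partially used last shard is just the remainder of a round-up division; a minimal Go sketch with illustrative sizes (not taken from the LEP):

```go
package main

import "fmt"

func main() {
	// Illustrative values only; the LEP does not fix these sizes.
	volumeSize := int64(100 << 30) // 100 GiB volume
	shardSize := int64(8 << 30)    // 8 GiB shards

	// Round up: the last shard exists even when it is only partially used.
	shardCount := (volumeSize + shardSize - 1) / shardSize
	lastShardUsed := volumeSize - (shardCount-1)*shardSize

	fmt.Printf("shards: %d, last shard uses %d of %d bytes\n",
		shardCount, lastShardUsed, shardSize)
	// Output: shards: 13, last shard uses 4294967296 of 8589934592 bytes
}
```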

sufficient capacity.
- If not, each shard may need to individually be migrated to a device
that can hold the new size.
- This somewhat negatively impacts the allocation granularity.

Member:

Can you elaborate more on the allocation granularity and the negative impact?

Author:

Compare a shard size that is 5% of our backend devices with one that is 10%.

We can still allocate 5% shards once all our devices are 90% full, and distribute them across multiple devices as needed.

If the shards are 10% of a device, we can't.

Smaller shard sizes therefore allow more effective use of space, at the cost of slightly more overhead, so the question is always the trade-off.
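
A small sketch of that granularity argument, with hypothetical device sizes and fill levels (slightly past 90% full, so the exact-boundary case doesn't muddy the comparison):

```go
package main

import "fmt"

// fits reports whether a shard of shardGiB still fits on at least one
// device, given each device's remaining free space in GiB.
func fits(shardGiB int64, freeGiB []int64) bool {
	for _, free := range freeGiB {
		if free >= shardGiB {
			return true
		}
	}
	return false
}

func main() {
	// Three hypothetical 1024 GiB devices, each roughly 91% full (90 GiB free).
	free := []int64{90, 90, 90}

	fmt.Println(fits(51, free))  // true:  a ~5% shard (51 GiB) still fits somewhere
	fmt.Println(fits(102, free)) // false: a ~10% shard (102 GiB) no longer fits anywhere
	// Smaller shards keep nearly-full devices usable, at the cost of more
	// shards (and bdevs) to manage per volume.
}
```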

Growing may be feasible within certain limits:
- if the backing storage devices holding each shard still have
sufficient capacity.
- If not, each shard may need to individually be migrated to a device

Member:

Can it support online shard migration?

Author:

Sure, a RAID1 bdev can "migrate" a shard by allocating the new location and then removing the old one, as discussed elsewhere in the document.
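
Sketched as a sequence below; the client type and all method names are hypothetical placeholders (the LEP itself lists the add/remove member-device calls as SPDK future work, so no real SPDK RPC names are implied), and the explicit wait-for-resync step is an assumption implied by raid1 semantics rather than anything stated in the document:

```go
package main

import "fmt"

// shardClient is a hypothetical stand-in for whatever management interface
// ends up driving the SPDK target; none of these methods exist today.
type shardClient struct{}

func (c *shardClient) addRaid1Member(raid1, bdev string) error    { fmt.Println("add", bdev, "to", raid1); return nil }
func (c *shardClient) waitForSync(raid1, bdev string) error       { fmt.Println("wait for", bdev, "in", raid1); return nil }
func (c *shardClient) removeRaid1Member(raid1, bdev string) error { fmt.Println("remove", bdev, "from", raid1); return nil }
func (c *shardClient) deleteBdev(bdev string) error               { fmt.Println("delete", bdev); return nil }

// migrateShard follows the sequence from the comment: allocate the new
// location, let the shard's raid1 resynchronize onto it, then drop the
// old copy and release its allocation.
func migrateShard(c *shardClient, raid1, oldBdev, newBdev string) error {
	if err := c.addRaid1Member(raid1, newBdev); err != nil {
		return err
	}
	if err := c.waitForSync(raid1, newBdev); err != nil {
		return err
	}
	if err := c.removeRaid1Member(raid1, oldBdev); err != nil {
		return err
	}
	return c.deleteBdev(oldBdev) // free the old allocation on its blobstore
}

func main() {
	// Hypothetical names, for illustration only.
	_ = migrateShard(&shardClient{}, "vol1-shard3", "node-a/lvol-old", "node-b/lvol-new")
}
```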


- At allocation time, Longhorn instantiates the top-level `raid0` as the
aggregation layer. (See the `Notes` section for a brief discussion of raid0
vs concatenation, and the `Growing a volume` section for a discussion

Member:

I'm a little confused about shard_size and stripe_size.
Would you be able to provide the definition of stripe_size and its relationship to shard_size?

Author:

`stripe_size` is what RAID0 uses when distributing data across its member devices.

`shard_size` is the allocation size of the member devices and the raid1s.

They can technically differ, as discussed in the LEP; that choice has some trade-offs.
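
To make the distinction concrete, here is a sketch of how a single volume offset maps under each of the two sizes, with purely illustrative values (the LEP leaves both configurable):

```go
package main

import "fmt"

func main() {
	// Illustrative values only.
	const (
		stripeSize  = int64(1 << 20) // raid0 distribution unit: 1 MiB
		shardSize   = int64(8 << 30) // allocation unit per member/raid1: 8 GiB
		memberCount = int64(4)       // raid0 member devices (raid1s)
	)

	offset := int64(10<<30 + 5<<20) // an arbitrary volume offset

	// stripe_size view: raid0 decides which member gets this byte and where.
	chunk := offset / stripeSize
	member := chunk % memberCount
	memberOffset := (chunk/memberCount)*stripeSize + offset%stripeSize

	// shard_size view: which allocation unit on that member the byte lands in.
	shardOnMember := memberOffset / shardSize

	fmt.Println(member, memberOffset, shardOnMember)
	// member 1, ~2.5 GiB into that member, still within its first 8 GiB shard
}
```

When `shard_size == stripe_size`, the raid0 chunks and the allocation units line up one-to-one, which seems to be the straightforward case the LEP line quoted earlier refers to.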

l-mb referenced this pull request in jecluis/longhorn Apr 25, 2023
This proposal focuses on a first pass integration with Longhorn. Further
steps shall be needed to fully integrate s3gw as the S3 Frontend.

Signed-off-by: Joao Eduardo Luis <joao@suse.com>