[DNM] Architectural LEP for sharded data layout #5444
base: master
Conversation
Signed-off-by: Lars Marowsky-Bree <lmb@suse.com>
`shard_count` envelope (which might have been over-allocated at
creation time), LH can simply adjust the exposed size and return.

- If `shard_size == stripe_size`, it is straightforward to allocate
You mean `shard_size` can be different from `stripe_size`? Then how does Longhorn allocate RAID0 stripes to the disk shards? Or is a shard only a concept in Longhorn, and Longhorn won't do anything for the actual disk (when setting/changing a shard)?
For example, if `stripe_size = 1.5 * shard_size`, will Longhorn round up `stripe_size` to `2 * shard_size` at the beginning of the allocation? (This is not about volume expansion.)
## Future Work

- **SPDK**: API call to add or remove a member device for `raid0`
Is this a kind of rebalance? And is this allowed for running volumes?
This is not rebalancing. This is about shrinking the volume once the space has been compacted.
- **SPDK**: Is there one SPDK management process per backend
  device/blobstore, or one per node that manages several?
One target per node; each engine/replica process handles one bdev. In this LEP, the replicas would be striped replicas, which adds more complexity to the management.
- If `volume_size % shard_size != 0`, the last shard might only be
  partially used.
Are there any concerns about implementation and maintenance if `volume_size % shard_size != 0`?
I don't see any.
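As a minimal illustration of why a non-aligned `volume_size` is mostly harmless, the arithmetic below (a standalone Go sketch, not Longhorn code; all sizes are made-up) rounds the shard count up and lets the last shard expose only the remainder:

```go
package main

import "fmt"

// Standalone sketch: compute how many shards a volume needs and how much
// of the last shard is actually used when volume_size is not a multiple
// of shard_size. The sizes are illustrative only.
func main() {
	var (
		volumeSize int64 = 10 << 30 // 10 GiB volume
		shardSize  int64 = 3 << 30  // 3 GiB shards
	)

	shardCount := (volumeSize + shardSize - 1) / shardSize // round up
	lastShardUsed := volumeSize - (shardCount-1)*shardSize

	fmt.Printf("shards: %d, last shard uses %d of %d bytes\n",
		shardCount, lastShardUsed, shardSize)
	// Output: shards: 4, last shard uses 1073741824 of 3221225472 bytes
}
```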
  sufficient capacity.
- If not, each shard may need to individually be migrated to a device
  that can hold the new size.
- This somewhat negatively impacts the allocation granularity.
Can you elaborate more on the allocation granularity and the negative impact?
Compare a shard size that's 5% of our backend devices with one that is 10%.
With 5% shards we can still allocate a shard once all our devices are over 90% full, and distribute shards across multiple devices as needed.
With 10% shards we can't.
So smaller shard sizes allow more effective use of space, at the cost of slightly more overhead; the question is always the trade-off.
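To make the trade-off concrete, here is a toy Go sketch (not Longhorn code; the device size and fill level are made-up numbers) showing that once devices are 93% full, a shard sized at 5% of a device still fits while a 10% shard does not:

```go
package main

import "fmt"

// Toy illustration of allocation granularity: smaller shards can still be
// placed in the remaining free space of nearly-full devices.
func main() {
	const deviceSize = 1000 // arbitrary capacity units
	const freePct = 7       // devices are 93% full, so 7% is free
	free := deviceSize * freePct / 100

	for _, shardPct := range []int{5, 10} {
		shardSize := deviceSize * shardPct / 100
		fmt.Printf("a %d%% shard (%d units) fits into %d free units: %v\n",
			shardPct, shardSize, free, shardSize <= free)
	}
	// a 5% shard (50 units) fits into 70 free units: true
	// a 10% shard (100 units) fits into 70 free units: false
}
```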
Growing may be feasible within certain limits:
- if the backing storage devices holding each shard still have
  sufficient capacity.
- If not, each shard may need to individually be migrated to a device
Can it support online shard migration?
Sure, a RAID1 bdev can "migrate" a shard by allocating the new location and then removing the old one, as discussed elsewhere in the document.
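For illustration, a hedged Go sketch of that "allocate new, then remove old" sequence is below. The `SpdkClient` interface and all its method names are hypothetical placeholders, not Longhorn code and not SPDK's actual RPC surface:

```go
package sharding

import (
	"context"
	"fmt"
)

// SpdkClient is a hypothetical wrapper; the method names are placeholders
// and do not claim to match SPDK's real RPC interface.
type SpdkClient interface {
	CreateLvol(ctx context.Context, lvstore, name string, sizeBytes uint64) (bdev string, err error)
	RaidAddBaseBdev(ctx context.Context, raid, bdev string) error
	WaitUntilSynced(ctx context.Context, raid string) error
	RaidRemoveBaseBdev(ctx context.Context, raid, bdev string) error
	DeleteLvol(ctx context.Context, bdev string) error
}

// migrateShard sketches moving one shard (a raid1 member) to a different
// backing device: allocate the new location, let the raid1 sync to it,
// and only then drop and delete the old member.
func migrateShard(ctx context.Context, c SpdkClient, raid1, oldBdev, dstLvstore string, shardSize uint64) error {
	// 1. Allocate the shard's new location on the destination device.
	newBdev, err := c.CreateLvol(ctx, dstLvstore, raid1+"-migrated", shardSize)
	if err != nil {
		return fmt.Errorf("allocate new shard location: %w", err)
	}
	// 2. Attach it as an additional raid1 member and wait for it to sync.
	if err := c.RaidAddBaseBdev(ctx, raid1, newBdev); err != nil {
		return err
	}
	if err := c.WaitUntilSynced(ctx, raid1); err != nil {
		return err
	}
	// 3. Only then remove the old member and reclaim its space.
	if err := c.RaidRemoveBaseBdev(ctx, raid1, oldBdev); err != nil {
		return err
	}
	return c.DeleteLvol(ctx, oldBdev)
}
```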
- At allocation time, Longhorn instantiates the top-level `raid0` as the
  aggregation layer. (See the `Notes` section for a brief discussion of raid0
  vs concatenation, and the `Growing a volume` section for a discussion
A little confused about `shard_size` and `stripe_size`.
Would you be able to provide the definition of `stripe_size` and its relationship to `shard_size`?
`stripe_size` is what RAID0 uses when distributing the data across its member devices.
`shard_size` is the allocation size of the member devices and the raid1s.
They can technically be different, as discussed in the LEP; that choice comes with some trade-offs.
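As a concrete (and purely illustrative) reading of that relationship, the Go sketch below assumes `shard_size` is a whole multiple of `stripe_size` and shows how a standard raid0 layout would map a logical volume offset onto a shard; none of these names are Longhorn code:

```go
package sharding

// Illustrative constants: stripe_size is the raid0 interleave unit,
// shard_size is the size of each raid0 member (one shard / raid1).
const (
	stripeSize uint64 = 64 << 10 // 64 KiB raid0 stripe
	shardSize  uint64 = 1 << 30  // 1 GiB per shard
	shardCount uint64 = 4        // volume size = shardCount * shardSize
)

// locate maps a logical volume offset to (shard index, offset within that
// shard) using the usual raid0 round-robin striping.
func locate(offset uint64) (shard, shardOffset uint64) {
	stripe := offset / stripeSize
	shard = stripe % shardCount
	shardOffset = (stripe/shardCount)*stripeSize + offset%stripeSize
	return shard, shardOffset
}
```

If `stripe_size` did not divide `shard_size` evenly, a stripe row would straddle a shard boundary, which is presumably why the earlier question about rounding `stripe_size` up arises.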
This is an initial draft for discussing the architecture of sharding Longhorn volumes across multiple devices/nodes (on top of SPDK).