-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). #223
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. Can you also add a unit test for each shuffling algorithm (naive, py1b, py1s, py2s) to validate the output sample order with an expected sample order?
shuffle_seed (int): Seed for Deterministic data shuffling. Defaults to ``9176``. | ||
shuffle_block_size (int): Unit of shuffle. Defaults to ``1 << 18``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add more information here on what shuffle_block_size
do ?
@@ -157,8 +158,9 @@ class StreamingDataset(IterableDataset): | |||
partitioned over the workers. Defaults to ``None``. | |||
shuffle (bool): Whether to iterate over the samples in randomized order. Defaults to | |||
``False``. | |||
shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1s``. | |||
shuffle_algo (str): Which shuffling algorithm to use. Defaults to ``py1b``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a one to line line of brief statement explaining when to use which naive, py1b, py1s, py2s
algorithm ?
If we change the default to |
|
Based on the offline discussion, James will be creating a follow up PR to address the above two comments
|
Description of changes:
py1b
is a shuffling algorithm that performs the final shuffle over fixed-size blocks instead of intra-shard likepy1s
. These units are presumably larger or much larger than single shards, leading to better shuffledness at the cost of having to download more shards to make progress.naive
is a shuffling algorithm that naively shuffles all-to-all. This is useful for single-node training on small data, where you want the most random shuffle possible. Statistically, this algorithm will result in all nodes downloading all shards, with those downloads all happening at the start of the epoch, bringing training to a crawl.Also update the default shuffle algo from py1s to py1b.
Issue #, if available:
Merge Checklist:
Put an
x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
pre-commit
on my change. (check out thepre-commit
section of prerequisites)