Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed predownload value to zero issue #383

Merged
merged 2 commits into from
Aug 16, 2023

Conversation

karan6181
Copy link
Collaborator

Description of changes:

Before

  • Setting predownload=0 results in indefinite hang.

After

  • Change the logic of downloading samples to n + 1 samples as compared to n where n is the predownload value. Hence predownload=0 means that the streaming dataset downloads 1 sample at a time during each iter -> getitem call.
    - Consequences
    • Each worker downloads one extra sample before it waits for sample ahead check. I think downloading one extra sample won't have much impact on the dataset iter time.

Issue #, if available:

STR-121

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure it pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@karan6181 karan6181 requested review from knighton and snarayan21 and removed request for knighton August 15, 2023 17:08
@snarayan21 snarayan21 merged commit b7863a3 into mosaicml:main Aug 16, 2023
@karan6181 karan6181 deleted the fix_predownload_zero branch August 16, 2023 17:58
Copy link
Contributor

@knighton knighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice

Noting some future todos:

  • Stop replicating every streaming dataset arg and corresponding docstring in each of the over 9000 streaming datasets subclasses here (in the preferred one of the various ways to do this)
  • Add some mechanism that prevents the progress of each worker from getting unreasonably out of sync with each other

predownload (int, optional): Target number of samples to download per worker in advance
of current sample. Workers will attempt to download ahead by this many samples during,
but not before, training. Recommendation is to provide a value greater than per device
batch size to ensure at-least per device batch size number of samples cached locally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at-least -> at least

@karan6181
Copy link
Collaborator Author

Add some mechanism that prevents the progress of each worker from getting unreasonably out of sync with each other

Is the progress related to download the shard file or reading the sample or both or anything else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants