Hugging Face Hub Integration! #4193
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@             Coverage Diff              @@
##           develop    #4193       +/-   ##
============================================
+ Coverage    16.00%   99.24%   +83.24%
============================================
  Files          734       35      -699
  Lines        82223    15236    -66987
  Branches      1119        0     -1119
============================================
+ Hits         13159    15121     +1962
+ Misses       69064      115    -68949

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
@jacobmarks this is looking very nice! definitely have my 👍 to add docs to the HF integration page so we can ship this 🚢
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Actionable comments posted: 0
Review Status
Configuration used: CodeRabbit UI
Files selected for processing (1)
- docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
- docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)
Lines 948-1006: Consider adding a note about dataset size limitations or recommendations when pushing datasets to the Hugging Face Hub. While the documentation provides comprehensive instructions on pushing datasets to the Hugging Face Hub, it might be helpful for users to be aware of any size limitations or best practices regarding dataset sizes. This could prevent potential issues or confusion during the upload process.

Lines 1149-1278: Ensure the documentation on loading datasets from the Hugging Face Hub includes information on handling large datasets efficiently. The section on loading datasets from the Hugging Face Hub is detailed and covers a wide range of use cases. However, it would be beneficial to include tips or best practices for efficiently handling large datasets, such as using the max_samples parameter to limit the number of samples loaded initially.

Lines 1410-1449: Highlight the importance of the num_workers parameter for optimizing download speeds. The section on configuring the download process when loading datasets from the Hugging Face Hub is informative. It might be helpful to emphasize the significance of the num_workers parameter more explicitly, as adjusting this parameter can significantly impact download speeds, especially for large datasets.

Lines 1811-1925: Consider adding examples for error handling when loading datasets from the Hugging Face Hub. The advanced examples section is comprehensive and covers a variety of complex scenarios. Including examples or guidance on error handling when issues arise during dataset loading (e.g., due to incorrect configurations or access permissions) would be beneficial for users, providing them with strategies to troubleshoot and resolve common issues.
Actionable comments posted: 0
Review Status
Configuration used: CodeRabbit UI
Files selected for processing (1)
- docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
- docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (1)
docs/source/integrations/huggingface.rst (1)
Lines 911-915: Consider adding a section on troubleshooting common issues when integrating with the Hugging Face Hub. This could include common error messages and their solutions, tips for debugging, and links to relevant Hugging Face documentation or forums for additional support.
…ftyone into hf-hub-integration-v2
LGTM! 🤗
Actionable comments posted: 0
Review Status
Configuration used: CodeRabbit UI
Files selected for processing (1)
- docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
- docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)
Lines 922-944: Ensure the Hugging Face Hub Python client version specified in the installation command is up-to-date and compatible with the latest features used in the integration.

- pip install "huggingface_hub>=0.20.0"
+ pip install "huggingface_hub>=0.21.0"

Also, verify that the instructions for obtaining an access token are still accurate according to the latest Hugging Face website layout and settings.
Lines 951-1006: Confirm that the push_to_hub() function's behavior, especially regarding the auto-generation of the dataset card and the fiftyone.yml config file, aligns with the latest implementation details. Ensure that the examples provided are clear, accurate, and reflect the current best practices for pushing datasets to the Hugging Face Hub.
Lines 1145-1173: Ensure the instructions for loading datasets from the Hugging Face Hub, especially the handling of private or gated datasets, are up-to-date and reflect the current capabilities of the load_from_hub() function. Verify that the examples provided are comprehensive, covering a wide range of use cases, and clearly demonstrate how to use the function effectively.

Verification successful: The shell script output confirms that the HF_TOKEN environment variable is indeed used within fiftyone/utils/huggingface.py. Specifically, it is used to optionally authenticate requests to the Hugging Face Hub, either by directly passing a token parameter or by fetching the token from the HF_TOKEN environment variable if it is set. This aligns with the documentation snippet under review, which mentions the use of the HF_TOKEN environment variable for loading private or gated datasets from the Hugging Face Hub. Given this confirmation, the instructions and examples provided in the documentation for loading datasets from the Hugging Face Hub, including the handling of private or gated datasets using HF_TOKEN, are accurate and reflect the current capabilities of the load_from_hub() function.

Scripts executed for the analysis:

#!/bin/bash
# Verify the handling of private or gated datasets in load_from_hub function
grep -A 10 "def load_from_hub" $(fd --type f "huggingface.py")
Length of output: 297

#!/bin/bash
# Search for references to HF_TOKEN in the codebase to understand its usage in loading datasets
rg "HF_TOKEN" $(fd --type f)
Length of output: 108443
Lines 1773-1882: Review the advanced examples for pushing and loading datasets to ensure they accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub. Confirm that the configurations and parameters mentioned are up-to-date and clearly explained, providing users with the knowledge needed to leverage the integration's full potential.

Verification successful: The extracted content from the docs/source/integrations/huggingface.rst file matches the provided documentation snippet in the review context. This includes detailed examples and configurations for advanced usage of the FiftyOne integration with the Hugging Face Hub, covering aspects such as loading datasets from revisions, handling datasets with multiple media fields, customizing the download process, and loading private or gated datasets. Based on this, the advanced examples for pushing and loading datasets accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub, and the configurations and parameters mentioned are up-to-date and clearly explained.

Scripts executed for the analysis:

#!/bin/bash
# Verify advanced examples and configurations for pushing and loading datasets
grep -A 20 "Advanced examples" $(fd --type f "huggingface.py")
Length of output: 92

#!/bin/bash
# Locate and verify the advanced examples and configurations within RST documentation files
fd --type f --extension rst | xargs grep -l "Advanced examples"
Length of output: 106

#!/bin/bash
# Extract and review the "Advanced examples" section from the identified RST documentation file
sed -n '/Advanced examples/,/^\*\*/p' docs/source/integrations/huggingface.rst
Length of output: 463

#!/bin/bash
# Attempt to extract a more comprehensive portion of the "Advanced examples" section for review
sed -n '/Advanced examples/,/^\*\*\*/p' docs/source/integrations/huggingface.rst
Length of output: 3712
Hugging Face Hub Upgrades
Actionable comments posted: 5
Review Status
Configuration used: CodeRabbit UI
Files ignored due to path filters (1)
- docs/source/images/integrations/hf_data_card_preview.jpg is excluded by !**/*.jpg
Files selected for processing (2)
- docs/source/integrations/huggingface.rst (15 hunks)
- fiftyone/utils/huggingface.py (1 hunks)
Files not summarized due to errors (1)
- docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
def push_to_hub(
    dataset,
    repo_name,
    description=None,
    license=None,
    tags=None,
    private=False,
    exist_ok=False,
    dataset_type=None,
    min_fiftyone_version=None,
    label_field=None,
    frame_labels_field=None,
    token=None,
    preview_path=None,
    **data_card_kwargs,
):
Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.
- def push_to_hub(
+ def push_to_hub(
+ dataset: fod.Dataset,
+ repo_name: str,
+ description: Optional[str] = None,
+ license: Optional[str] = None,
+ tags: Optional[List[str]] = None,
+ private: bool = False,
+ exist_ok: bool = False,
+ dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+ min_fiftyone_version: Optional[str] = None,
+ label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+ frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+ token: Optional[str] = None,
+ preview_path: Optional[str] = None,
+ **data_card_kwargs,
+ ) -> None:
def load_from_hub(
    repo_id,
    revision=None,
    split=None,
    splits=None,
    subset=None,
    subsets=None,
    max_samples=None,
    batch_size=None,
    num_workers=None,
    overwrite=False,
    persistent=False,
    name=None,
    token=None,
    config_file=None,
    **kwargs,
):
Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.
- def load_from_hub(
+ def load_from_hub(
+ repo_id: str,
+ revision: Optional[str] = None,
+ split: Optional[str] = None,
+ splits: Optional[List[str]] = None,
+ subset: Optional[str] = None,
+ subsets: Optional[List[str]] = None,
+ max_samples: Optional[int] = None,
+ batch_size: Optional[int] = None,
+ num_workers: Optional[int] = None,
+ overwrite: bool = False,
+ persistent: bool = False,
+ name: Optional[str] = None,
+ token: Optional[str] = None,
+ config_file: Optional[str] = None,
+ **kwargs,
+ ) -> fod.Dataset:
class HFHubDatasetConfig(Config):
    """Config for a Hugging Face Hub dataset.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
        fiftyone: the fiftyone version requirement of the dataset
    """
Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.
- def __init__(self, **kwargs):
+ def __init__(
+ self,
+ name: Optional[str] = None,
+ repo_type: Optional[str] = None,
+ repo_id: Optional[str] = None,
+ revision: Optional[str] = None,
+ filename: Optional[str] = None,
+ format: Optional[str] = None,
+ tags: Optional[List[str]] = None,
+ license: Optional[str] = None,
+ description: Optional[str] = None,
+ fiftyone: Optional[Dict[str, Any]] = None,
+ **kwargs,
+ ):
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
    """Config for a Hugging Face Hub dataset that is stored as parquet files.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
        fiftyone: the fiftyone version requirement of the dataset
        label_fields: the label fields of the dataset
        media_type: the media type of the dataset
        default_media_fields: the default media fields of the dataset
        additional_media_fields: the additional media fields of the dataset
    """
Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.
- def __init__(self, **kwargs):
+ def __init__(
+ self,
+ media_type: str = DEFAULT_MEDIA_TYPE,
+ label_fields: Optional[Dict[str, List[str]]] = None,
+ default_media_fields: Optional[Dict[str, str]] = None,
+ additional_media_fields: Optional[Dict[str, str]] = None,
+ **kwargs,
+ ):
[HF Hub Integration] Upload and Download without Progress Bars
def push_to_hub(
    dataset,
    repo_name,
    description=None,
    license=None,
    tags=None,
    private=False,
    exist_ok=False,
    dataset_type=None,
    min_fiftyone_version=None,
    label_field=None,
    frame_labels_field=None,
    token=None,
    preview_path=None,
    **data_card_kwargs,
):
    """Push a FiftyOne dataset to the Hugging Face Hub.

    Args:
        dataset: a FiftyOne dataset
        repo_name: the name of the dataset repo to create. The repo ID will be
            ``{your_username}/{repo_name}``
        description (None): a description of the dataset
        license (None): the license of the dataset
        tags (None): a list of tags for the dataset
        private (False): whether the repo should be private
        exist_ok (False): if True, do not raise an error if repo already exists
        dataset_type (None): the type of the dataset to create
        min_fiftyone_version (None): the minimum version of FiftyOne required
            to load the dataset. For example ``"0.23.0"``.
        label_field (None): controls the label field(s) to export. Only
            applicable to labeled datasets. Can be any of the following:

            - the name of a label field to export
            - a glob pattern of label field(s) to export
            - a list or tuple of label field(s) to export
            - a dictionary mapping label field names to keys to use when
              constructing the label dictionaries to pass to the exporter

        frame_labels_field (None): controls the frame label field(s) to export.
            The "frames." prefix is optional. Only applicable to labeled video
            datasets. Can be any of the following:

            - the name of a frame label field to export
            - a glob pattern of frame label field(s) to export
            - a list or tuple of frame label field(s) to export
            - a dictionary mapping frame label field names to keys to use when
              constructing the frame label dictionaries to pass to the exporter

        token (None): a Hugging Face API token to use. May also be provided via
            the ``HF_TOKEN`` environment variable
        preview_path (None): a path to a preview image or video to display on
            the readme of the dataset repo
        data_card_kwargs: additional keyword arguments to pass to the
            `DatasetCard` constructor
    """
Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.
- def push_to_hub(
+ def push_to_hub(
+ dataset: fod.Dataset,
+ repo_name: str,
+ description: Optional[str] = None,
+ license: Optional[str] = None,
+ tags: Optional[List[str]] = None,
+ private: bool = False,
+ exist_ok: bool = False,
+ dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+ min_fiftyone_version: Optional[str] = None,
+ label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+ frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+ token: Optional[str] = None,
+ preview_path: Optional[str] = None,
+ **data_card_kwargs,
+ ) -> None:
def load_from_hub(
    repo_id,
    revision=None,
    split=None,
    splits=None,
    subset=None,
    subsets=None,
    max_samples=None,
    batch_size=None,
    num_workers=None,
    overwrite=False,
    persistent=False,
    name=None,
    token=None,
    config_file=None,
    **kwargs,
):
    """Loads a dataset from the Hugging Face Hub into FiftyOne.

    Args:
        repo_id: the Hugging Face Hub identifier of the dataset
        revision (None): the revision of the dataset to load
        split (None): the split of the dataset to load
        splits (None): the splits of the dataset to load
        subset (None): the subset of the dataset to load
        subsets (None): the subsets of the dataset to load
        max_samples (None): the maximum number of samples to load
        batch_size (None): the batch size to use when loading samples
        num_workers (None): a suggested number of threads to use when
            downloading media
        overwrite (False): whether to overwrite an existing dataset with the
            same name
        persistent (False): whether the dataset should be persistent
        name (None): an optional name to give the dataset
        token (None): a Hugging Face API token to use. May also be provided via
            the ``HF_TOKEN`` environment variable
        config_file (None): the path to a config file on disk specifying how to
            load the dataset if the repo has no ``fiftyone.yml`` file
        **kwargs: keyword arguments specifying config parameters to load the
            dataset if the repo has no ``fiftyone.yml`` file

    Returns:
        a :class:`fiftyone.core.dataset.Dataset`
    """
Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.
- def load_from_hub(
+ def load_from_hub(
+ repo_id: str,
+ revision: Optional[str] = None,
+ split: Optional[str] = None,
+ splits: Optional[List[str]] = None,
+ subset: Optional[str] = None,
+ subsets: Optional[List[str]] = None,
+ max_samples: Optional[int] = None,
+ batch_size: Optional[int] = None,
+ num_workers: Optional[int] = None,
+ overwrite: bool = False,
+ persistent: bool = False,
+ name: Optional[str] = None,
+ token: Optional[str] = None,
+ config_file: Optional[str] = None,
+ **kwargs,
+ ) -> fod.Dataset:
class HFHubDatasetConfig(Config):
    """Config for a Hugging Face Hub dataset.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
        fiftyone: the fiftyone version requirement of the dataset
    """

    def __init__(self, **kwargs):
        ## Internals
        self._repo_type = kwargs.get("repo_type", None)
        self._repo_id = kwargs.get("repo_id", None)
        self._revision = kwargs.get("revision", None)
        self._filename = kwargs.get("filename", None)
        self._format = kwargs.get("format", None)

        ## Dataset metadata
        self.tags = kwargs.get("tags", [])
        if isinstance(self.tags, str):
            self.tags = [t.strip() for t in self.tags.split(",")]
        elif isinstance(self.tags, list):
            self.tags = [t.strip() for t in self.tags]
        self.license = kwargs.get("license", None)
        self.description = kwargs.get("description", None)
        self._get_fiftyone_version(kwargs)
Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.
- def __init__(self, **kwargs):
+ def __init__(
+ self,
+ name: Optional[str] = None,
+ repo_type: Optional[str] = None,
+ repo_id: Optional[str] = None,
+ revision: Optional[str] = None,
+ filename: Optional[str] = None,
+ format: Optional[str] = None,
+ tags: Optional[List[str]] = None,
+ license: Optional[str] = None,
+ description: Optional[str] = None,
+ fiftyone: Optional[Dict[str, Any]] = None,
+ **kwargs,
+ ):
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
    """Config for a Hugging Face Hub dataset that is stored as parquet files.

    Args:
        name: the name of the dataset
        repo_type: the type of the repository
        repo_id: the identifier of the repository
        revision: the revision of the dataset
        filename: the name of the file
        format: the format of the dataset
        tags: the tags of the dataset
        license: the license of the dataset
        description: the description of the dataset
        fiftyone: the fiftyone version requirement of the dataset
        label_fields: the label fields of the dataset
        media_type: the media type of the dataset
        default_media_fields: the default media fields of the dataset
        additional_media_fields: the additional media fields of the dataset
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.media_type = kwargs.get("media_type", DEFAULT_MEDIA_TYPE)

        self._build_name(kwargs)
        self._build_media_fields_dict(kwargs)
        self._build_label_fields_dict(kwargs)
        self._build_allowed_splits(kwargs)
        self._build_allowed_subsets(kwargs)
Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.
- def __init__(self, **kwargs):
+ def __init__(
+ self,
+ media_type: str = DEFAULT_MEDIA_TYPE,
+ label_fields: Optional[Dict[str, List[str]]] = None,
+ default_media_fields: Optional[Dict[str, str]] = None,
+ additional_media_fields: Optional[Dict[str, str]] = None,
+ **kwargs,
+ ):
def _ensure_dataset_compatibility(config):
    req_str = config.version
    if req_str is None:
        return

    try:
        req = Requirement(req_str)
    except:
        logger.warning(
            f"Unable to parse dataset {config.name}'s fiftyone version requirement {req_str}"
        )
        return

    if not req.specifier.contains(foc.VERSION):
        logger.warning(
            f"Dataset {config.name} requires {req_str} but you are running fiftyone=={foc.VERSION}"
        )
Consider logging a clear message or raising a warning when the dataset version requirement is not met, in addition to the current error handling strategy.
if not req.specifier.contains(foc.VERSION):
+ logger.warning(f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which may lead to compatibility issues.")
exception = ImportError(
f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which is not compatible"
)
fou.handle_error(exception, error_level)
Ensure user is on latest version of FO (0.23.7) so the integration works
Actionable comments posted: 0
Review Status
Configuration used: CodeRabbit UI
Files selected for processing (1)
- fiftyone/utils/huggingface.py (1 hunks)
Additional comments not posted (5)
fiftyone/utils/huggingface.py (5)
Lines 62-115: Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking (see the suggested diff above).

Lines 212-254: Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking (see the suggested diff above).

Lines 276-309: Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity (see the suggested diff above).

Lines 479-509: Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking (see the suggested diff above).

Lines 613-630: The addition of a warning log when the dataset version requirement is not met is a good practice for informing users about potential compatibility issues.
Hugging Face Integration — FiftyOne 0.24.0 documentation.pdf
Overview
This integration introduces two functions: load_from_hub() and push_to_hub().

The major architectural changes from the previous version are as follows:
- Loading is now driven by a fiftyone.yml, fiftyone.yaml, or a custom local yaml config file (can be extended to github as well) specifying the format of the dataset to be converted into FiftyOne. This is in contrast to the previous version of the integration, which had a default converter that tried its best but was not guaranteed to succeed, and custom Python loading script execution, which we can still add in at some point, but is not necessarily the lowest barrier to entry.
- Rather than relying on the datasets library (loading the dataset with their load_dataset() function and then converting, which was limiting our flexibility, requiring massive downloads in many cases, and resulting in duplication of a ton of files), this version uses the HF Datasets Server API to directly request and load the relevant data without needing to go through the datasets Python library. This gives a few key additional advantages, as I will document below.
- Repos that do not contain a config can still be loaded by specifying a fiftyone.yml file locally.

Additional improvements:
- description, license, and repo_id from the HF config file are added to the FiftyOne dataset's info dictionary, and all tags listed in the config are added to the dataset's tags.
- The fiftyone version requirement in the config is checked against the user's installed version when the dataset is loaded.
Loading from the Hub
The load_from_hub() utility in the hf_hub utils allows you to load datasets from the Hugging Face Hub that are in either:
- a supported FiftyOne dataset format, or
- Parquet files, as created by the datasets library and accessible via the Datasets Server API

When you use load_from_hub(), you must specify the repo_id, which specifies the organization and repo on Hugging Face Hub where the dataset can be found. This is the only positional argument.

The loading config also needs to be specified, in one of three ways:
- a fiftyone.yml or fiftyone.yaml file in the Hugging Face repo itself
- a config_file keyword argument specifying the location (locally) of the config file to use
- config parameters passed directly into the load_from_hub() call via keywords

The only required element is a format specifier. For Parquet datasets, you can use format="ParquetFilesDataset", or format="parquet" for short. For FiftyOne formats, use the name of the class. For instance, for a dataset in the format fiftyone.types.dataset_types.COCODetectionDataset, use format="COCODetectionDataset".
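For example, a purely keyword-driven call might look like the following sketch (the repo id here is hypothetical):

dataset = load_from_hub(
    "username/my-coco-dataset",  # hypothetical repo id
    format="COCODetectionDataset",
)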
Loading Arguments
Additionally, the user can specify options:
- revision: the revision (or version commit) of the dataset to load
- split or splits: which of the available splits they want to load
- subset or subsets: which of the available subsets they want to load. Many datasets on the hub have multiple subsets. As an example, check out the newyorker_caption_contest, which has 30 subsets.
- max_samples: the maximum number of samples per <split, subset> pair to load. This can be useful if you want to rapidly get a feel for the dataset without downloading 100s of GBs of data.
- batch_size: the batch size to use when requesting data from the datasets server and adding samples to the FiftyOne dataset
- num_workers: thread pool workers to use when downloading media
- overwrite: whether to overwrite existing documents for the dataset
- persistent: whether to persist the loaded dataset to disk
- name: a name to use for the dataset. If included, this will override any name present in the config file.

Example Usage
To illustrate the power, flexibility, and simplicity of this approach, here are a few examples with popular datasets on the Hugging Face Hub.
For all of these examples, we will use the following imports:
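A minimal sketch of such imports, assuming load_from_hub() is imported from the fiftyone.utils.huggingface module reviewed above:

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub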
mnist
Load the test split of the MNIST dataset:
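A sketch of this call, using the parameters described below:

dataset = load_from_hub(
    "mnist",
    format="parquet",
    classification_fields="label",
    split="test",
)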
Here, "mnist" is the repo id, and we are using
classification_fields="label"
to specify that the feature called "label" in the Hugging Face dataset should be converted into a FiftyOneClassification
label.coyo-700m
Load the first 1,000 samples from the COYO-700M dataset:
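A sketch of this call; the repo id kakaobrain/coyo-700m is an assumption:

dataset = load_from_hub(
    "kakaobrain/coyo-700m",  # assumed repo id for COYO-700M
    format="parquet",
    max_samples=1000,
)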
Here we use max_samples to specify that we only want the first 1,000.

cppe-5
Load the CPPE-5 dataset and persist it to the database:
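A sketch, assuming the dataset lives at the repo id cppe-5:

dataset = load_from_hub(
    "cppe-5",  # assumed repo id
    format="parquet",
    detection_fields="objects",
    persistent=True,  # persist the dataset to the database
)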
Here we use detection_fields="objects" to specify that the feature "objects" should be converted into a FiftyOne Detections label field.

scene_parse150
Just load the test split from the scene_parsing subset:
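A sketch; the repo id, the scene_category and annotation feature names, and the mask_fields keyword are assumptions based on the description below:

dataset = load_from_hub(
    "scene_parse150",  # assumed repo id
    format="parquet",
    subset="scene_parsing",
    split="test",
    classification_fields="scene_category",  # assumed feature name
    mask_fields="annotation",  # hypothetical keyword; segmentation masks are saved to disk
)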
"split"
and"subset"
keyword arguments to specify what we want to download. Also note that we are converting multiple features from the Hugging Face dataset into FiftyOne label fields. The segmentation masks are saved to disk.Documentation: For comprehensive coverage of all of these options, supported datasets, and more, see the PDF version of the integration docs, attached.
Pushing to the Hub
If you are working with a dataset in FiftyOne and you want to quickly share it with others, you can do so via the push_to_hub() function, which takes two positional arguments: the FiftyOne sample collection (a Dataset or DatasetView), and the repo_name, which will be combined with your username/org to construct the repo_id where the dataset will be uploaded.

When you push to the hub, a few things happen:
- The dataset is exported and uploaded, by default as a fiftyone.types.dataset_types.FiftyOneDataset, but you can specify the format you want via the dataset_type keyword argument.
- A fiftyone.yml config file for the dataset is generated and uploaded, which contains all of the necessary information so that the dataset can be loaded with load_from_hub().

When you push to the hub, you can specify any/all of the following dataset card and config file attributes:
- description
- license
- tags

push_to_hub() supports the following Hugging Face API arguments:
- private: whether to upload the dataset as private or public
- exist_ok: whether to throw an error if the repo already exists

Example Usage
Push the dataset to the Hub as a COCODetectionDataset with an MIT license:
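A minimal sketch, with a hypothetical repo name:

import fiftyone.types as fot
from fiftyone.utils.huggingface import push_to_hub

push_to_hub(
    dataset,
    "my-coco-dataset",  # hypothetical repo name
    dataset_type=fot.COCODetectionDataset,
    license="mit",
)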
Wish List
- datasets.load_dataset()
- If you loaded a dataset via load_from_hub() with max_samples=1000, but now you want the remaining samples, it shouldn't need to query the server for the first 1000. Currently, it doesn't re-download media files, but it does re-query the server.
- push_to_hub() sets the required FiftyOne version to the user's version on upload, but this is too restrictive...

Release Notes
Is this a user-facing change that should be mentioned in the release notes?
- Yes. Give a description of this change to be included in the release notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?
- fiftyone Python library changes