
Hugging Face Hub Integration! #4193

Merged
merged 69 commits into from
Apr 5, 2024

Conversation

Contributor

@jacobmarks jacobmarks commented Mar 25, 2024

Hugging Face Integration — FiftyOne 0.24.0 documentation.pdf

Overview

This integration introduces two functions: load_from_hub() and push_to_hub().
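
For orientation, here's what a minimal round trip looks like (a sketch; "<your-username>" is a placeholder for your Hugging Face username or org, and "quickstart-demo" is an illustrative repo name):

import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

# push a dataset to the Hub
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(dataset, "quickstart-demo")

# because a fiftyone.yml config is generated on upload, anyone can then
# load the dataset back with a single call
dataset2 = fouh.load_from_hub("<your-username>/quickstart-demo")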

The major architectural changes from the previous version are as follows:

  1. Use a fiftyone.yml, fiftyone.yaml, or custom local YAML config file (this can be extended to GitHub as well) specifying the format of the dataset to be converted into FiftyOne. This contrasts with the previous version of the integration, which relied on a default converter that tried its best but was not guaranteed to succeed, and on custom Python loading script execution, which we can still add at some point but is not necessarily the lowest barrier to entry.
  2. Rather than using HF's datasets library, loading the dataset with its load_dataset() function, and then converting (which limited our flexibility, required massive downloads in many cases, and resulted in duplication of a ton of files), this version uses the HF Datasets Server API to request and load the relevant data directly, without needing to go through the datasets Python library. This gives a few key additional advantages, as documented below.
  3. Config info can be specified exclusively via kwargs, removing the need to explicitly create a fiftyone.yml file locally.
  4. Supports loading datasets from gated repos

Additional improvements:

  • Added logging during the loading and conversion process
  • description, license, and repo_id from the HF config file are added to the FiftyOne dataset's info dictionary, and all tags listed in the config are added to the dataset's tags (see the short sketch after this list)
  • Support for thumbnails in the sample grid, additional media fields, and segmentation mask labels
  • Includes FiftyOne version validation, ensuring that the dataset to be loaded is compatible with the user's current version of fiftyone
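
Here's a quick sketch of what that metadata handling looks like in practice (assuming dataset was loaded via load_from_hub() as in the examples below):

# description, license, and repo_id from the config end up in the info dict,
# and the config's tags end up in the dataset's tags
print(dataset.info.get("description"))
print(dataset.info.get("license"))
print(dataset.info.get("repo_id"))
print(dataset.tags)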

Loading from the Hub

The load_from_hub() utility in the hf_hub utils allows you to load datasets from the Hugging Face Hub that are in either:

  1. Parquet format — compatible with the Hugging Face datasets library, and accessible via the Datasets Server API
  2. Any FiftyOne dataset type, from those listed here

When you use load_from_hub(), you must provide the repo_id, which identifies the organization and repo on the Hugging Face Hub where the dataset can be found. This is the only positional argument.

The loading config also needs to be specified in one of three ways:

  1. Via a fiftyone.yml or fiftyone.yaml file in the Hugging Face repo itself
  2. Using the config_file keyword argument to specify the location (locally) of the config file to use
  3. By passing the config params directly into the load_from_hub() call via keywords.

The only required element is a format specifier. For Parquet datasets, you can use format="ParquetFilesDataset" or format="parquet" for short. For FiftyOne formats, use the name of the class. For instance, for a dataset in the format fiftyone.types.dataset_types.COCODetectionDataset, use format="COCODetectionDataset".
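
For example, loading from a hypothetical repo that stores its data in COCO detection format, passing the format specifier as a kwarg, might look like this (the repo_id is illustrative):

import fiftyone.utils.hf_hub as fouh

dataset = fouh.load_from_hub(
    "my-org/my-coco-dataset",  # illustrative repo_id
    format="COCODetectionDataset",
)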

Loading Arguments

Additionally, the user can specify the following options (a combined sketch follows this list):

  • revision: the revision (or version commit) of the dataset to load
  • split or splits: which of the available splits they want to load.
  • subset or subsets: which of the available subsets they want to load. Many datasets on the hub have multiple subsets. As an example, check out the newyorker_caption_contest, which has 30 subsets.
  • max_samples: the maximum number of samples per <split, subset> pair to load. This can be useful if you want to rapidly get a feel for the dataset without downloading 100s of GBs of data.
  • batch_size: the batch size to use when requesting data from the datasets server and adding samples to the FiftyOne dataset
  • num_workers: the number of thread pool workers to use when downloading media
  • overwrite: whether to overwrite an existing dataset with the same name
  • persistent: whether to persist the loaded dataset to disk
  • name: a name to use for the dataset. If included, this will override any name present in the config file.
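
For instance, a call exercising several of these options at once might look like this (the specific values are illustrative):

dataset = fouh.load_from_hub(
    "mnist",  # the repo_id is the only positional argument
    format="parquet",
    classification_fields="label",
    split="test",
    max_samples=500,
    batch_size=50,
    num_workers=4,
    overwrite=True,
    persistent=True,
    name="mnist-test-500",
)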

Example Usage

To illustrate the power, flexibility, and simplicity of this approach, here are a few examples with popular datasets on the Hugging Face Hub.

For all of these examples, we will use the following imports:

import fiftyone as fo
import fiftyone.utils.hf_hub as fouh

mnist

Load the test split of MNIST dataset:

dataset = fouh.load_from_hub(
   "mnist", 
   split="test", 
   format="parquet", 
   classification_fields="label"
)

session = fo.launch_app(dataset)

Here, "mnist" is the repo id, and we are using classification_fields="label" to specify that the feature called "label" in the Hugging Face dataset should be converted into a FiftyOne Classification label.

coyo-700m

Load the first 1,000 samples from the COYO-700M dataset:

dataset = fouh.load_from_hub(
   "kakaobrain/coyo-700m", 
   format="parquet", 
   max_samples=1000
)

session = fo.launch_app(dataset)

Here we use max_samples to specify that we only want the first 1,000 samples.

cppe-5

Load the CPPE-5 dataset and persist it to the database:

dataset = fouh.load_from_hub(
   "cppe-5", 
   format="parquet", 
   detection_fields="objects",
   persistent=True
)

session = fo.launch_app(dataset)

Here we use detection_fields="objects" to specify that the feature "objects" should be converted into a FiftyOne Detections label field.

scene_parse150

Just load the test split from the scene_parsing subset:

dataset = fouh.load_from_hub(
   "scene_parse150", 
   format="parquet", 
   subset="scene_parsing",
   split="test",
   classification_fields="scene_category",
   mask_fields="annotation"
)

session = fo.launch_app(dataset)

Here we are using the "split" and "subset" keyword arguments to specify what we want to download. Also note that we are converting multiple features from the Hugging Face dataset into FiftyOne label fields. The segmentation masks are saved to disk.

Documentation: For comprehensive coverage of all of these options, supported datasets, and more, see the PDF version of the integration docs, attached.

Pushing to the Hub

If you are working with a dataset in FiftyOne and you want to quickly share it with others, you can do so via the push_to_hub() function, which takes two positional arguments: the FiftyOne sample collection (a Dataset or DatasetView), and the repo_name, which will be combined with your username/org to construct the repo_id where the dataset will be uploaded.

When you push to the hub, a few things happen:

  1. The dataset and its media files are exported/uploaded in a specified format. By default, this format is fiftyone.types.dataset_types.FiftyOneDataset, but you can specify the format you want via the dataset_type keyword argument.
  2. A fiftyone.yml config file for the dataset is generated and uploaded, which contains all of the necessary information so that the dataset can be loaded with load_from_hub().
  3. A Hugging Face Dataset Card for the dataset is auto-generated, providing tags, metadata, license info, and a code snippet illustrating how to load the dataset from the hub (see the sketch after this list).
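
Beyond the card and config attributes listed just below, the push_to_hub() signature (reviewed further down in this thread) also accepts preview_path and min_fiftyone_version. A sketch (the preview path is illustrative):

import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(
    dataset,
    "quickstart-with-preview",
    preview_path="/path/to/preview.jpg",  # image/video shown on the dataset card
    min_fiftyone_version="0.23.0",
)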

When you push to the hub, you can specify any/all of the following dataset card and config file attributes:

  • description
  • license
  • tags

push_to_hub supports the following Hugging Face API arguments:

  • private: whether to upload the dataset as private or public
  • exist_ok: if True, do not raise an error if the repo already exists

Example Usage

  • Upload the Quickstart dataset to a private repo:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(dataset, "quickstart")
  • Upload this dataset as a public COCODetectionDataset with an MIT license:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(
    dataset,
    "quickstart-coco",
    dataset_type=fot.dataset_types.COCODetectionDataset,
    private=False,
    license="mit",
    label_field="*",  # convert all label fields, not just ground truth
)
  • Upload the first 3 samples of a video dataset:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot

dataset = foz.load_zoo_dataset("quickstart-video")

fouh.push_to_hub(
    dataset[:3],
    "video-test",
    private=True,
 )

Wish List

  • Support different detection formats (like VOC, COCO, etc.)
  • Extend support for specifying the location of the config file at remote URLs, such as GitHub repos
  • Extend this "config"-based approach to loading from GitHub, and to working with datasets, models, and plugins
  • URL redirects for anchor links from previous version of the integration which just had transformers
  • Add robustness to the download process — currently, sometimes when sending requests to the Datasets Server it hangs. This is not great.
  • Add support for converting directly from the Hugging Face dataset, assuming it was downloaded with datasets.load_dataset()
  • Add functionality to extend an existing dataset with the next batch of samples. In other words, if you used load_from_hub() with max_samples=1000, but now you want the remaining samples, it shouldn't need to query the server for the first 1000. Currently, it doesn't re-download media files, but it does re-query the server.
  • Handle FiftyOne version compatibility: currently, push_to_hub() sets the required FiftyOne version to the user's version on upload, but this is too restrictive...

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features
    • Enhanced utilities for working with Hugging Face, enabling pushing FiftyOne datasets to the Hugging Face Hub, loading datasets from the Hub into FiftyOne, and managing dataset repositories.
    • Introduced functionalities for handling dataset metadata, field conversions, and configurations based on Hugging Face Hub dataset configurations.
    • Added options for loading datasets with splits, subsets, and batch loading.

Contributor

coderabbitai bot commented Mar 25, 2024

Walkthrough

The new fiftyone/utils/huggingface.py file introduces a comprehensive set of utilities for seamless integration with the Hugging Face ecosystem, enabling tasks like dataset management, configuration handling, and flexible dataset loading options.

Changes

File Change Summary
fiftyone/utils/huggingface.py Introduces utilities for Hugging Face integration, including dataset pushing to the Hub, loading datasets from the Hub, managing repos, uploading datasets, metadata handling, field conversions, and configuration based on Hub dataset settings. Provides flexible dataset loading options.

🐇✨
In the world of code, where changes abound,
A script emerges with features profound.
Hugging Face now closer, datasets in sync,
This utility brings a new wave, not a blink.
So here's to progress, in lines of code we trace,
Evolving tools that find their place.
🌟🐰



codecov bot commented Mar 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.24%. Comparing base (90c8853) to head (fd624fa).
Report is 314 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #4193       +/-   ##
============================================
+ Coverage    16.00%   99.24%   +83.24%     
============================================
  Files          734       35      -699     
  Lines        82223    15236    -66987     
  Branches      1119        0     -1119     
============================================
+ Hits         13159    15121     +1962     
+ Misses       69064      115    -68949     
Flag Coverage Δ
app ?
python 99.24% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


@jacobmarks jacobmarks requested a review from brimoor March 25, 2024 23:23
Contributor

@brimoor brimoor left a comment

@jacobmarks this is looking very nice! definitely have my 👍 to add docs to the HF integration page so we can ship this 🚢

jacobmarks and others added 2 commits April 4, 2024 12:43
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 3951c2c and 8b74e3d.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)

948-1006: Consider adding a note about dataset size limitations or recommendations when pushing datasets to the Hugging Face Hub.

While the documentation provides comprehensive instructions on pushing datasets to the Hugging Face Hub, it might be helpful for users to be aware of any size limitations or best practices regarding dataset sizes. This could prevent potential issues or confusion during the upload process.


1149-1278: Ensure the documentation on loading datasets from the Hugging Face Hub includes information on handling large datasets efficiently.

The section on loading datasets from the Hugging Face Hub is detailed and covers a wide range of use cases. However, it would be beneficial to include tips or best practices for efficiently handling large datasets, such as using the max_samples parameter to limit the number of samples loaded initially.


1410-1449: Highlight the importance of the num_workers parameter for optimizing download speeds.

The section on configuring the download process when loading datasets from the Hugging Face Hub is informative. It might be helpful to emphasize the significance of the num_workers parameter more explicitly, as adjusting this parameter can significantly impact download speeds, especially for large datasets.


1811-1925: Consider adding examples for error handling when loading datasets from the Hugging Face Hub.

The advanced examples section is comprehensive and covers a variety of complex scenarios. Including examples or guidance on error handling when issues arise during dataset loading (e.g., due to incorrect configurations or access permissions) would be beneficial for users, providing them with strategies to troubleshoot and resolve common issues.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 8b74e3d and dc3af96.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (1)
docs/source/integrations/huggingface.rst (1)

911-915: Consider adding a section on troubleshooting common issues when integrating with the Hugging Face Hub.
This could include common error messages and their solutions, tips for debugging, and links to relevant Hugging Face documentation or forums for additional support.

@brimoor brimoor changed the title from "Revamped Hugging Face Hub Integration" to "Hugging Face Hub Integration!" on Apr 4, 2024
Contributor

@brimoor brimoor left a comment

LGTM! 🤗

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between dc3af96 and 4484c2f.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)

922-944: Ensure the Hugging Face Hub Python client version specified in the installation command is up-to-date and compatible with the latest features used in the integration.

-    pip install "huggingface_hub>=0.20.0"
+    pip install "huggingface_hub>=0.21.0"

Also, verify that the instructions for obtaining an access token are still accurate according to the latest Hugging Face website layout and settings.


951-1006: Confirm that the push_to_hub() function's behavior, especially regarding the auto-generation of the dataset card and the fiftyone.yml config file, aligns with the latest implementation details. Ensure that the examples provided are clear, accurate, and reflect the current best practices for pushing datasets to the Hugging Face Hub.


1145-1173: Ensure the instructions for loading datasets from the Hugging Face Hub, especially the handling of private or gated datasets, are up-to-date and reflect the current capabilities of the load_from_hub() function. Verify that the examples provided are comprehensive, covering a wide range of use cases, and clearly demonstrate how to use the function effectively.

Verification successful

The provided context from the shell script output confirms that the HF_TOKEN environment variable is indeed used within the fiftyone.utils.huggingface.py file. Specifically, it is used to optionally authenticate requests to the Hugging Face Hub by either directly passing a token parameter or by fetching the token from the HF_TOKEN environment variable if it's set. This aligns with the documentation snippet under review, which mentions the use of the HF_TOKEN environment variable for loading private or gated datasets from the Hugging Face Hub.

Given this confirmation, the instructions and examples provided in the documentation for loading datasets from the Hugging Face Hub, including the handling of private or gated datasets using the HF_TOKEN, are accurate and reflect the current capabilities of the load_from_hub() function.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the handling of private or gated datasets in load_from_hub function
grep -A 10 "def load_from_hub" $(fd --type f "huggingface.py")

Length of output: 297



Script:

#!/bin/bash
# Search for references to HF_TOKEN in the codebase to understand its usage in loading datasets
rg "HF_TOKEN" $(fd --type f)

Length of output: 108443


1773-1882: Review the advanced examples for pushing and loading datasets to ensure they accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub. Confirm that the configurations and parameters mentioned are up-to-date and clearly explained, providing users with the knowledge needed to leverage the integration's full potential.

Verification successful

The extracted content from the docs/source/integrations/huggingface.rst file matches the provided documentation snippet in the review context. This includes detailed examples and configurations for advanced usage of the FiftyOne integration with the Hugging Face Hub, covering aspects such as loading datasets from revisions, handling datasets with multiple media fields, customizing the download process, and loading private or gated datasets.

Based on this, the advanced examples for pushing and loading datasets accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub, and the configurations and parameters mentioned are up-to-date and clearly explained.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify advanced examples and configurations for pushing and loading datasets
grep -A 20 "Advanced examples" $(fd --type f "huggingface.py")

Length of output: 92



Script:

#!/bin/bash
# Locate and verify the advanced examples and configurations within RST documentation files
fd --type f --extension rst | xargs grep -l "Advanced examples"

Length of output: 106



Script:

#!/bin/bash
# Extract and review the "Advanced examples" section from the identified RST documentation file
sed -n '/Advanced examples/,/^\*\*/p' docs/source/integrations/huggingface.rst

Length of output: 463



Script:

#!/bin/bash
# Attempt to extract a more comprehensive portion of the "Advanced examples" section for review
sed -n '/Advanced examples/,/^\*\*\*/p' docs/source/integrations/huggingface.rst

Length of output: 3712

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 4484c2f and 1179d9a.
Files ignored due to path filters (1)
  • docs/source/images/integrations/hf_data_card_preview.jpg is excluded by !**/*.jpg, !**/*.jpg
Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (15 hunks)
  • fiftyone/utils/huggingface.py (1 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit

Comment on lines +56 to +71
def push_to_hub(
dataset,
repo_name,
description=None,
license=None,
tags=None,
private=False,
exist_ok=False,
dataset_type=None,
min_fiftyone_version=None,
label_field=None,
frame_labels_field=None,
token=None,
preview_path=None,
**data_card_kwargs,
):

Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

Comment on lines +205 to +221
def load_from_hub(
repo_id,
revision=None,
split=None,
splits=None,
subset=None,
subsets=None,
max_samples=None,
batch_size=None,
num_workers=None,
overwrite=False,
persistent=False,
name=None,
token=None,
config_file=None,
**kwargs,
):

Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:

Comment on lines +269 to +283
class HFHubDatasetConfig(Config):
"""Config for a Hugging Face Hub dataset.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
"""

Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

Comment on lines +460 to +478
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
"""Config for a Hugging Face Hub dataset that is stored as parquet files.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
label_fields: the label fields of the dataset
media_type: the media type of the dataset
default_media_fields: the default media fields of the dataset
additional_media_fields: the additional media fields of the dataset
"""

Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 1179d9a and 91f026d.
Files selected for processing (1)
  • fiftyone/utils/huggingface.py (1 hunks)

Comment on lines +62 to +115
def push_to_hub(
dataset,
repo_name,
description=None,
license=None,
tags=None,
private=False,
exist_ok=False,
dataset_type=None,
min_fiftyone_version=None,
label_field=None,
frame_labels_field=None,
token=None,
preview_path=None,
**data_card_kwargs,
):
"""Push a FiftyOne dataset to the Hugging Face Hub.

Args:
dataset: a FiftyOne dataset
repo_name: the name of the dataset repo to create. The repo ID will be
``{your_username}/{repo_name}``
description (None): a description of the dataset
license (None): the license of the dataset
tags (None): a list of tags for the dataset
private (True): whether the repo should be private
exist_ok (False): if True, do not raise an error if repo already exists.
dataset_type (None): the type of the dataset to create
min_fiftyone_version (None): the minimum version of FiftyOne required
to load the dataset. For example ``"0.23.0"``.
label_field (None): controls the label field(s) to export. Only
applicable to labeled datasets. Can be any of the following:

- the name of a label field to export
- a glob pattern of label field(s) to export
- a list or tuple of label field(s) to export
- a dictionary mapping label field names to keys to use when
constructing the label dictionaries to pass to the exporter
frame_labels_field (None): controls the frame label field(s) to export.
The "frames." prefix is optional. Only applicable to labeled video
datasets. Can be any of the following:

- the name of a frame label field to export
- a glob pattern of frame label field(s) to export
- a list or tuple of frame label field(s) to export
- a dictionary mapping frame label field names to keys to use when
constructing the frame label dictionaries to pass to the exporter
token (None): a Hugging Face API token to use. May also be provided via
the ``HF_TOKEN`` environment variable
preview_path (None): a path to a preview image or video to display on
the readme of the dataset repo.
data_card_kwargs: additional keyword arguments to pass to the
`DatasetCard` constructor
"""

Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

Comment on lines +212 to +254
def load_from_hub(
repo_id,
revision=None,
split=None,
splits=None,
subset=None,
subsets=None,
max_samples=None,
batch_size=None,
num_workers=None,
overwrite=False,
persistent=False,
name=None,
token=None,
config_file=None,
**kwargs,
):
"""Loads a dataset from the Hugging Face Hub into FiftyOne.

Args:
repo_id: the Hugging Face Hub identifier of the dataset
revision (None): the revision of the dataset to load
split (None): the split of the dataset to load
splits (None): the splits of the dataset to load
subset (None): the subset of the dataset to load
subsets (None): the subsets of the dataset to load
max_samples (None): the maximum number of samples to load
batch_size (None): the batch size to use when loading samples
num_workers (None): a suggested number of threads to use when
downloading media
overwrite (True): whether to overwrite an existing dataset with the
same name
persistent (False): whether the dataset should be persistent
name (None): an optional name to give the dataset
token (None): a Hugging Face API token to use. May also be provided via
the ``HF_TOKEN`` environment variable
config_file (None): the path to a config file on disk specifying how to
load the dataset if the repo has no ``fiftyone.yml`` file
**kwargs: keyword arguments specifying config parameters to load the
dataset if the repo has no ``fiftyone.yml`` file

Returns:
a :class:`fiftyone.core.dataset.Dataset`

Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:

Comment on lines +276 to +309
class HFHubDatasetConfig(Config):
"""Config for a Hugging Face Hub dataset.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
"""

def __init__(self, **kwargs):
## Internals
self._repo_type = kwargs.get("repo_type", None)
self._repo_id = kwargs.get("repo_id", None)
self._revision = kwargs.get("revision", None)
self._filename = kwargs.get("filename", None)
self._format = kwargs.get("format", None)

## Dataset metadata
self.tags = kwargs.get("tags", [])
if isinstance(self.tags, str):
self.tags = [t.strip() for t in self.tags.split(",")]
elif isinstance(self.tags, list):
self.tags = [t.strip() for t in self.tags]
self.license = kwargs.get("license", None)
self.description = kwargs.get("description", None)
self._get_fiftyone_version(kwargs)


Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

Comment on lines +479 to +509
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
"""Config for a Hugging Face Hub dataset that is stored as parquet files.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
label_fields: the label fields of the dataset
media_type: the media type of the dataset
default_media_fields: the default media fields of the dataset
additional_media_fields: the additional media fields of the dataset
"""

def __init__(self, **kwargs):
super().__init__(**kwargs)

self.media_type = kwargs.get("media_type", DEFAULT_MEDIA_TYPE)

self._build_name(kwargs)
self._build_media_fields_dict(kwargs)
self._build_label_fields_dict(kwargs)
self._build_allowed_splits(kwargs)
self._build_allowed_subsets(kwargs)


Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

Comment on lines +613 to +630
def _ensure_dataset_compatibility(config):
req_str = config.version
if req_str is None:
return

try:
req = Requirement(req_str)
except:
logger.warning(
f"Unable to parse dataset {config.name}'s fiftyone version requirement {req_str}"
)
return

if not req.specifier.contains(foc.VERSION):
logger.warning(
f"Dataset {config.name} requires {req_str} but you are running fiftyone=={foc.VERSION}"
)
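
A standalone sketch of what this check does with the packaging library; the "fiftyone>=0.23.7" requirement string below is only an example, not one taken from the PR.

from packaging.requirements import Requirement

req = Requirement("fiftyone>=0.23.7")
print(req.specifier.contains("0.23.6"))  # False -> the incompatibility warning above would fire
print(req.specifier.contains("0.24.0"))  # True  -> compatible, no warning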

Contributor

Consider logging a clear message or raising a warning when the dataset version requirement is not met, in addition to the current error handling strategy.

if not req.specifier.contains(foc.VERSION):
+   logger.warning(f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which may lead to compatibility issues.")
    exception = ImportError(
        f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which is not compatible"
    )
    fou.handle_error(exception, error_level)


Ensure user is on latest version of FO (0.23.7) so the integration works
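
One way a dataset author could encode that requirement when uploading (a hedged sketch: it assumes min_fiftyone_version accepts a plain version string, consistent with the push_to_hub signature discussed below, and the dataset/repo names are illustrative).

import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fo.Dataset()  # placeholder; a real push needs samples and an authenticated HF token
fouh.push_to_hub(dataset, "my-dataset", min_fiftyone_version="0.23.7")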
Contributor
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Commits: Files that changed from the base of the PR and between 91f026d and fd624fa.
Files selected for processing (1)
  • fiftyone/utils/huggingface.py (1 hunks)
Additional comments not posted (5)
fiftyone/utils/huggingface.py (5)

62-115: Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

212-254: Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:
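
Based on the signature above, a minimal usage sketch (the repo_id is hypothetical, and passing format as a keyword assumes config params can be forwarded via **kwargs as this integration describes).

import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub(
    "username/my-dataset",   # hypothetical repo_id
    format="parquet",        # config param forwarded via **kwargs
    max_samples=100,
    persistent=True,
)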

276-309: Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

479-509: Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

613-630: The addition of a warning log when the dataset version requirement is not met is a good practice for informing users about potential compatibility issues.

@jacobmarks jacobmarks merged commit 81b2545 into develop Apr 5, 2024
10 checks passed
@jacobmarks jacobmarks deleted the hf-hub-integration-v2 branch April 5, 2024 15:17