
Hugging Face Hub Integration! #4193

Merged
merged 69 commits into from
Apr 5, 2024

Conversation

Contributor

@jacobmarks jacobmarks commented Mar 25, 2024

Hugging Face Integration — FiftyOne 0.24.0 documentation.pdf

Overview

This integration introduces two functions: load_from_hub() and push_to_hub().
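
For orientation, here's what a minimal round trip looks like (a sketch; "<your-username>" is a placeholder for your Hugging Face username or org, and "quickstart-demo" is an illustrative repo name):

import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

# push a dataset to the Hub
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(dataset, "quickstart-demo")

# because a fiftyone.yml config is generated on upload, anyone can then
# load the dataset back with a single call
dataset2 = fouh.load_from_hub("<your-username>/quickstart-demo")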

The major architectural changes from the previous version are as follows:

  1. Use a fiftyone.yml, fiftyone.yaml, or custom local YAML config file (this can be extended to GitHub as well) specifying the format of the dataset to be converted into FiftyOne. This contrasts with the previous version of the integration, which relied on a default converter that tried its best but was not guaranteed to succeed, and on custom Python loading script execution, which we can still add at some point but is not necessarily the lowest barrier to entry.
  2. Rather than using HF's datasets library, loading the dataset with its load_dataset() function, and then converting (which limited our flexibility, required massive downloads in many cases, and resulted in duplication of a ton of files), this version uses the HF Datasets Server API to request and load the relevant data directly, without needing to go through the datasets Python library. This gives a few key additional advantages, as documented below.
  3. Config info can be specified exclusively via kwargs, removing the need to explicitly create a fiftyone.yml file locally.
  4. Supports loading datasets from gated repos

Additional improvements:

  • Added logging during the loading and conversion process
  • description, license, and repo_id from the HF config file are added to the FiftyOne dataset's info dictionary, and all tags listed in the config are added to the dataset's tags (see the short sketch after this list)
  • Support for thumbnails in the sample grid, additional media fields, and segmentation mask labels
  • Includes FiftyOne version validation, ensuring that the dataset to be loaded is compatible with the user's current version of fiftyone
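
Here's a quick sketch of what that metadata handling looks like in practice (assuming dataset was loaded via load_from_hub() as in the examples below):

# description, license, and repo_id from the config end up in the info dict,
# and the config's tags end up in the dataset's tags
print(dataset.info.get("description"))
print(dataset.info.get("license"))
print(dataset.info.get("repo_id"))
print(dataset.tags)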

Loading from the Hub

The load_from_hub() utility in the hf_hub utils allows you to load datasets from the Hugging Face Hub that are in either:

  1. Parquet format — compatible with the Hugging Face datasets library, and accessible via the Datasets Server API
  2. Any FiftyOne dataset type, from those listed here

When you use load_from_hub(), you must provide the repo_id, which identifies the organization and repo on the Hugging Face Hub where the dataset can be found. This is the only positional argument.

The loading config also needs to be specified in one of three ways:

  1. Via a fiftyone.yml or fiftyone.yaml file in the Hugging Face repo itself
  2. Using the config_file keyword argument to specify the location (locally) of the config file to use
  3. By passing the config params directly into the load_from_hub() call via keywords.

The only required element is a format specifier. For Parquet datasets, you can use format="ParquetFilesDataset" or format="parquet" for short. For FiftyOne formats, use the name of the class. For instance, for a dataset in the format fiftyone.types.dataset_types.COCODetectionDataset, use format="COCODetectionDataset".
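
For example, loading from a hypothetical repo that stores its data in COCO detection format, passing the format specifier as a kwarg, might look like this (the repo_id is illustrative):

import fiftyone.utils.hf_hub as fouh

dataset = fouh.load_from_hub(
    "my-org/my-coco-dataset",  # illustrative repo_id
    format="COCODetectionDataset",
)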

Loading Arguments

Additionally, the user can specify the following options (a combined sketch follows this list):

  • revision: the revision (or version commit) of the dataset to load
  • split or splits: which of the available splits they want to load.
  • subset or subsets: which of the available subsets they want to load. Many datasets on the hub have multiple subsets. As an example, check out the newyorker_caption_contest, which has 30 subsets.
  • max_samples: the maximum number of samples per <split, subset> pair to load. This can be useful if you want to rapidly get a feel for the dataset without downloading 100s of GBs of data.
  • batch_size: the batch size to use when requesting data from the datasets server and adding samples to the FiftyOne dataset
  • num_workers: the number of thread pool workers to use when downloading media
  • overwrite: whether to overwrite an existing dataset with the same name
  • persistent: whether to persist the loaded dataset to disk
  • name: a name to use for the dataset. If included, this will override any name present in the config file.
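
For instance, a call exercising several of these options at once might look like this (the specific values are illustrative):

dataset = fouh.load_from_hub(
    "mnist",  # the repo_id is the only positional argument
    format="parquet",
    classification_fields="label",
    split="test",
    max_samples=500,
    batch_size=50,
    num_workers=4,
    overwrite=True,
    persistent=True,
    name="mnist-test-500",
)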

Example Usage

To illustrate the power, flexibility, and simplicity of this approach, here are a few examples with popular datasets on the Hugging Face Hub.

For all of these examples, we will use the following imports:

import fiftyone as fo
import fiftyone.utils.hf_hub as fouh

mnist

Load the test split of MNIST dataset:

dataset = fouh.load_from_hub(
   "mnist", 
   split="test", 
   format="parquet", 
   classification_fields="label"
)

session = fo.launch_app(dataset)

Here, "mnist" is the repo id, and we are using classification_fields="label" to specify that the feature called "label" in the Hugging Face dataset should be converted into a FiftyOne Classification label.

coyo-700m

Load the first 1,000 samples from the COYO-700M dataset:

dataset = fouh.load_from_hub(
   "kakaobrain/coyo-700m", 
   format="parquet", 
   max_samples=1000
)

session = fo.launch_app(dataset)

Here we use max_samples to specify that we only want the first 1,000 samples.

cppe-5

Load the CPPE-5 dataset and persist it to the database:

dataset = fouh.load_from_hub(
   "cppe-5", 
   format="parquet", 
   detection_fields="objects",
   persistent=True
)

session = fo.launch_app(dataset)

Here we use detection_fields="objects" to specify that the feature "objects" should be converted into a FiftyOne Detections label field.

scene_parse150

Just load the test split from the scene_parsing subset:

dataset = fouh.load_from_hub(
   "scene_parse150", 
   format="parquet", 
   subset="scene_parsing",
   split="test",
   classification_fields="scene_category",
   mask_fields="annotation"
)

session = fo.launch_app(dataset)

Here we are using the "split" and "subset" keyword arguments to specify what we want to download. Also note that we are converting multiple features from the Hugging Face dataset into FiftyOne label fields. The segmentation masks are saved to disk.

Documentation: For comprehensive coverage of all of these options, supported datasets, and more, see the PDF version of the integration docs, attached.

Pushing to the Hub

If you are working with a dataset in FiftyOne and you want to quickly share it with others, you can do so via the push_to_hub() function, which takes two positional arguments: the FiftyOne sample collection (a Dataset or DatasetView), and the repo_name, which will be combined with your username/org to construct the repo_id where the dataset will be uploaded.

When you push to the hub, a few things happen:

  1. The dataset and its media files are exported/uploaded in a specified format. By default, this format is fiftyone.types.dataset_types.FiftyOneDataset, but you can specify the format you want via the dataset_type keyword argument.
  2. A fiftyone.yml config file for the dataset is generated and uploaded, which contains all of the necessary information so that the dataset can be loaded with load_from_hub().
  3. A Hugging Face Dataset Card for the dataset is auto-generated, providing tags, metadata, license info, and a code snippet illustrating how to load the dataset from the hub (see the sketch after this list).
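
Beyond the card and config attributes listed just below, the push_to_hub() signature (reviewed further down in this thread) also accepts preview_path and min_fiftyone_version. A sketch (the preview path is illustrative):

import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(
    dataset,
    "quickstart-with-preview",
    preview_path="/path/to/preview.jpg",  # image/video shown on the dataset card
    min_fiftyone_version="0.23.0",
)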

When you push to the hub, you can specify any/all of the following dataset card and config file attributes:

  • description
  • license
  • tags

push_to_hub supports the following Hugging Face API arguments:

  • private: whether to upload the dataset as private or public
  • exist_ok: if True, do not raise an error if the repo already exists

Example Usage

  • Upload the Quickstart dataset to a private repo:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(dataset, "quickstart")
  • Upload this dataset as a public COCODetectionDataset with an MIT license:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot

dataset = foz.load_zoo_dataset("quickstart")

fouh.push_to_hub(
    dataset,
    "quickstart-coco",
    dataset_type=fot.dataset_types.COCODetectionDataset,
    private=False,
    license="mit",
    label_field="*",  # convert all label fields, not just ground truth
)
  • Upload the first 3 samples of a video dataset:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.hf_hub as fouh
import fiftyone.types as fot

dataset = foz.load_zoo_dataset("quickstart-video")

fouh.push_to_hub(
    dataset[:3],
    "video-test",
    private=True,
 )

Wish List

  • Support different detection formats (like VOC, COCO, etc.)
  • Extend support for specifying the location of the config file at remote URLs, such as GitHub repos
  • Extend this "config"-based approach to loading from GitHub, and to working with datasets, models, and plugins
  • URL redirects for anchor links from previous version of the integration which just had transformers
  • Add robustness to the download process — currently, sometimes when sending requests to the Datasets Server it hangs. This is not great.
  • Add support for converting directly from the Hugging Face dataset, assuming it was downloaded with datasets.load_dataset()
  • Add functionality to extend an existing dataset with the next batch of samples. In other words, if you used load_from_hub() with max_samples=1000, but now you want the remaining samples, it shouldn't need to query the server for the first 1000. Currently, it doesn't re-download media files, but it does re-query the server.
  • Handle FiftyOne version compatibility: currently, push_to_hub() sets the required FiftyOne version to the user's version on upload, but this is too restrictive...

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release
    notes for FiftyOne users.

(Details in 1-2 sentences. You can just refer to another PR with a description
if this PR is part of a larger change.)

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features
    • Enhanced utilities for working with Hugging Face, enabling pushing FiftyOne datasets to the Hugging Face Hub, loading datasets from the Hub into FiftyOne, and managing dataset repositories.
    • Introduced functionalities for handling dataset metadata, field conversions, and configurations based on Hugging Face Hub dataset configurations.
    • Added options for loading datasets with splits, subsets, and batch loading.

Contributor

coderabbitai bot commented Mar 25, 2024

Walkthrough

The new fiftyone/utils/huggingface.py file introduces a comprehensive set of utilities for seamless integration with the Hugging Face ecosystem, enabling tasks like dataset management, configuration handling, and flexible dataset loading options.

Changes

File Change Summary
fiftyone/utils/huggingface.py Introduces utilities for Hugging Face integration, including dataset pushing to the Hub, loading datasets from the Hub, managing repos, uploading datasets, metadata handling, field conversions, and configuration based on Hub dataset settings. Provides flexible dataset loading options.

🐇✨
In the world of code, where changes abound,
A script emerges with features profound.
Hugging Face now closer, datasets in sync,
This utility brings a new wave, not a blink.
So here's to progress, in lines of code we trace,
Evolving tools that find their place.
🌟🐰



codecov bot commented Mar 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.24%. Comparing base (90c8853) to head (fd624fa).
Report is 314 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #4193       +/-   ##
============================================
+ Coverage    16.00%   99.24%   +83.24%     
============================================
  Files          734       35      -699     
  Lines        82223    15236    -66987     
  Branches      1119        0     -1119     
============================================
+ Hits         13159    15121     +1962     
+ Misses       69064      115    -68949     
Flag Coverage Δ
app ?
python 99.24% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


@jacobmarks jacobmarks requested a review from brimoor March 25, 2024 23:23
Contributor

@brimoor brimoor left a comment

@jacobmarks this is looking very nice! definitely have my 👍 to add docs to the HF integration page so we can ship this 🚢

jacobmarks and others added 2 commits April 4, 2024 12:43
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 3951c2c and 8b74e3d.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)

948-1006: Consider adding a note about dataset size limitations or recommendations when pushing datasets to the Hugging Face Hub.

While the documentation provides comprehensive instructions on pushing datasets to the Hugging Face Hub, it might be helpful for users to be aware of any size limitations or best practices regarding dataset sizes. This could prevent potential issues or confusion during the upload process.


1149-1278: Ensure the documentation on loading datasets from the Hugging Face Hub includes information on handling large datasets efficiently.

The section on loading datasets from the Hugging Face Hub is detailed and covers a wide range of use cases. However, it would be beneficial to include tips or best practices for efficiently handling large datasets, such as using the max_samples parameter to limit the number of samples loaded initially.


1410-1449: Highlight the importance of the num_workers parameter for optimizing download speeds.

The section on configuring the download process when loading datasets from the Hugging Face Hub is informative. It might be helpful to emphasize the significance of the num_workers parameter more explicitly, as adjusting this parameter can significantly impact download speeds, especially for large datasets.


1811-1925: Consider adding examples for error handling when loading datasets from the Hugging Face Hub.

The advanced examples section is comprehensive and covers a variety of complex scenarios. Including examples or guidance on error handling when issues arise during dataset loading (e.g., due to incorrect configurations or access permissions) would be beneficial for users, providing them with strategies to troubleshoot and resolve common issues.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 8b74e3d and dc3af96.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (1)
docs/source/integrations/huggingface.rst (1)

911-915: Consider adding a section on troubleshooting common issues when integrating with the Hugging Face Hub.
This could include common error messages and their solutions, tips for debugging, and links to relevant Hugging Face documentation or forums for additional support.

@brimoor brimoor changed the title from "Revamped Hugging Face Hub Integration" to "Hugging Face Hub Integration!" on Apr 4, 2024
Contributor

@brimoor brimoor left a comment

LGTM! 🤗

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between dc3af96 and 4484c2f.
Files selected for processing (1)
  • docs/source/integrations/huggingface.rst (15 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit
Additional comments not posted (4)
docs/source/integrations/huggingface.rst (4)

922-944: Ensure the Hugging Face Hub Python client version specified in the installation command is up-to-date and compatible with the latest features used in the integration.

-    pip install "huggingface_hub>=0.20.0"
+    pip install "huggingface_hub>=0.21.0"

Also, verify that the instructions for obtaining an access token are still accurate according to the latest Hugging Face website layout and settings.


951-1006: Confirm that the push_to_hub() function's behavior, especially regarding the auto-generation of the dataset card and the fiftyone.yml config file, aligns with the latest implementation details. Ensure that the examples provided are clear, accurate, and reflect the current best practices for pushing datasets to the Hugging Face Hub.


1145-1173: Ensure the instructions for loading datasets from the Hugging Face Hub, especially the handling of private or gated datasets, are up-to-date and reflect the current capabilities of the load_from_hub() function. Verify that the examples provided are comprehensive, covering a wide range of use cases, and clearly demonstrate how to use the function effectively.

Verification successful

The provided context from the shell script output confirms that the HF_TOKEN environment variable is indeed used within the fiftyone.utils.huggingface.py file. Specifically, it is used to optionally authenticate requests to the Hugging Face Hub by either directly passing a token parameter or by fetching the token from the HF_TOKEN environment variable if it's set. This aligns with the documentation snippet under review, which mentions the use of the HF_TOKEN environment variable for loading private or gated datasets from the Hugging Face Hub.

Given this confirmation, the instructions and examples provided in the documentation for loading datasets from the Hugging Face Hub, including the handling of private or gated datasets using the HF_TOKEN, are accurate and reflect the current capabilities of the load_from_hub() function.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the handling of private or gated datasets in load_from_hub function
grep -A 10 "def load_from_hub" $(fd --type f "huggingface.py")

Length of output: 297



Script:

#!/bin/bash
# Search for references to HF_TOKEN in the codebase to understand its usage in loading datasets
rg "HF_TOKEN" $(fd --type f)

Length of output: 108443


1773-1882: Review the advanced examples for pushing and loading datasets to ensure they accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub. Confirm that the configurations and parameters mentioned are up-to-date and clearly explained, providing users with the knowledge needed to leverage the integration's full potential.

Verification successful

The extracted content from the docs/source/integrations/huggingface.rst file matches the provided documentation snippet in the review context. This includes detailed examples and configurations for advanced usage of the FiftyOne integration with the Hugging Face Hub, covering aspects such as loading datasets from revisions, handling datasets with multiple media fields, customizing the download process, and loading private or gated datasets.

Based on this, the advanced examples for pushing and loading datasets accurately represent the capabilities of the FiftyOne integration with the Hugging Face Hub, and the configurations and parameters mentioned are up-to-date and clearly explained.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify advanced examples and configurations for pushing and loading datasets
grep -A 20 "Advanced examples" $(fd --type f "huggingface.py")

Length of output: 92



Script:

#!/bin/bash
# Locate and verify the advanced examples and configurations within RST documentation files
fd --type f --extension rst | xargs grep -l "Advanced examples"

Length of output: 106



Script:

#!/bin/bash
# Extract and review the "Advanced examples" section from the identified RST documentation file
sed -n '/Advanced examples/,/^\*\*/p' docs/source/integrations/huggingface.rst

Length of output: 463



Script:

#!/bin/bash
# Attempt to extract a more comprehensive portion of the "Advanced examples" section for review
sed -n '/Advanced examples/,/^\*\*\*/p' docs/source/integrations/huggingface.rst

Length of output: 3712

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 4484c2f and 1179d9a.
Files ignored due to path filters (1)
  • docs/source/images/integrations/hf_data_card_preview.jpg is excluded by !**/*.jpg, !**/*.jpg
Files selected for processing (2)
  • docs/source/integrations/huggingface.rst (15 hunks)
  • fiftyone/utils/huggingface.py (1 hunks)
Files not summarized due to errors (1)
  • docs/source/integrations/huggingface.rst: Error: Message exceeds token limit

Comment on lines +56 to +71
def push_to_hub(
dataset,
repo_name,
description=None,
license=None,
tags=None,
private=False,
exist_ok=False,
dataset_type=None,
min_fiftyone_version=None,
label_field=None,
frame_labels_field=None,
token=None,
preview_path=None,
**data_card_kwargs,
):

Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

Comment on lines +205 to +221
def load_from_hub(
repo_id,
revision=None,
split=None,
splits=None,
subset=None,
subsets=None,
max_samples=None,
batch_size=None,
num_workers=None,
overwrite=False,
persistent=False,
name=None,
token=None,
config_file=None,
**kwargs,
):

Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:

Comment on lines +269 to +283
class HFHubDatasetConfig(Config):
"""Config for a Hugging Face Hub dataset.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
"""

Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

Comment on lines +460 to +478
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
"""Config for a Hugging Face Hub dataset that is stored as parquet files.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
label_fields: the label fields of the dataset
media_type: the media type of the dataset
default_media_fields: the default media fields of the dataset
additional_media_fields: the additional media fields of the dataset
"""

Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Review Status

Configuration used: CodeRabbit UI

Files that changed from the base of the PR and between 1179d9a and 91f026d.
Files selected for processing (1)
  • fiftyone/utils/huggingface.py (1 hunks)

Comment on lines +62 to +115
def push_to_hub(
dataset,
repo_name,
description=None,
license=None,
tags=None,
private=False,
exist_ok=False,
dataset_type=None,
min_fiftyone_version=None,
label_field=None,
frame_labels_field=None,
token=None,
preview_path=None,
**data_card_kwargs,
):
"""Push a FiftyOne dataset to the Hugging Face Hub.

Args:
dataset: a FiftyOne dataset
repo_name: the name of the dataset repo to create. The repo ID will be
``{your_username}/{repo_name}``
description (None): a description of the dataset
license (None): the license of the dataset
tags (None): a list of tags for the dataset
private (True): whether the repo should be private
exist_ok (False): if True, do not raise an error if repo already exists.
dataset_type (None): the type of the dataset to create
min_fiftyone_version (None): the minimum version of FiftyOne required
to load the dataset. For example ``"0.23.0"``.
label_field (None): controls the label field(s) to export. Only
applicable to labeled datasets. Can be any of the following:

- the name of a label field to export
- a glob pattern of label field(s) to export
- a list or tuple of label field(s) to export
- a dictionary mapping label field names to keys to use when
constructing the label dictionaries to pass to the exporter
frame_labels_field (None): controls the frame label field(s) to export.
The "frames." prefix is optional. Only applicable to labeled video
datasets. Can be any of the following:

- the name of a frame label field to export
- a glob pattern of frame label field(s) to export
- a list or tuple of frame label field(s) to export
- a dictionary mapping frame label field names to keys to use when
constructing the frame label dictionaries to pass to the exporter
token (None): a Hugging Face API token to use. May also be provided via
the ``HF_TOKEN`` environment variable
preview_path (None): a path to a preview image or video to display on
the readme of the dataset repo.
data_card_kwargs: additional keyword arguments to pass to the
`DatasetCard` constructor
"""

Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

Comment on lines +212 to +254
def load_from_hub(
repo_id,
revision=None,
split=None,
splits=None,
subset=None,
subsets=None,
max_samples=None,
batch_size=None,
num_workers=None,
overwrite=False,
persistent=False,
name=None,
token=None,
config_file=None,
**kwargs,
):
"""Loads a dataset from the Hugging Face Hub into FiftyOne.

Args:
repo_id: the Hugging Face Hub identifier of the dataset
revision (None): the revision of the dataset to load
split (None): the split of the dataset to load
splits (None): the splits of the dataset to load
subset (None): the subset of the dataset to load
subsets (None): the subsets of the dataset to load
max_samples (None): the maximum number of samples to load
batch_size (None): the batch size to use when loading samples
num_workers (None): a suggested number of threads to use when
downloading media
overwrite (True): whether to overwrite an existing dataset with the
same name
persistent (False): whether the dataset should be persistent
name (None): an optional name to give the dataset
token (None): a Hugging Face API token to use. May also be provided via
the ``HF_TOKEN`` environment variable
config_file (None): the path to a config file on disk specifying how to
load the dataset if the repo has no ``fiftyone.yml`` file
**kwargs: keyword arguments specifying config parameters to load the
dataset if the repo has no ``fiftyone.yml`` file

Returns:
a :class:`fiftyone.core.dataset.Dataset`

Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:

Comment on lines +276 to +309
class HFHubDatasetConfig(Config):
"""Config for a Hugging Face Hub dataset.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
"""

def __init__(self, **kwargs):
## Internals
self._repo_type = kwargs.get("repo_type", None)
self._repo_id = kwargs.get("repo_id", None)
self._revision = kwargs.get("revision", None)
self._filename = kwargs.get("filename", None)
self._format = kwargs.get("format", None)

## Dataset metadata
self.tags = kwargs.get("tags", [])
if isinstance(self.tags, str):
self.tags = [t.strip() for t in self.tags.split(",")]
elif isinstance(self.tags, list):
self.tags = [t.strip() for t in self.tags]
self.license = kwargs.get("license", None)
self.description = kwargs.get("description", None)
self._get_fiftyone_version(kwargs)


Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

Comment on lines +479 to +509
class HFHubParquetFilesDatasetConfig(HFHubDatasetConfig):
"""Config for a Hugging Face Hub dataset that is stored as parquet files.

Args:
name: the name of the dataset
repo_type: the type of the repository
repo_id: the identifier of the repository
revision: the revision of the dataset
filename: the name of the file
format: the format of the dataset
tags: the tags of the dataset
license: the license of the dataset
description: the description of the dataset
fiftyone: the fiftyone version requirement of the dataset
label_fields: the label fields of the dataset
media_type: the media type of the dataset
default_media_fields: the default media fields of the dataset
additional_media_fields: the additional media fields of the dataset
"""

def __init__(self, **kwargs):
super().__init__(**kwargs)

self.media_type = kwargs.get("media_type", DEFAULT_MEDIA_TYPE)

self._build_name(kwargs)
self._build_media_fields_dict(kwargs)
self._build_label_fields_dict(kwargs)
self._build_allowed_splits(kwargs)
self._build_allowed_subsets(kwargs)


Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

Comment on lines +613 to +630
def _ensure_dataset_compatibility(config):
req_str = config.version
if req_str is None:
return

try:
req = Requirement(req_str)
except:
logger.warning(
f"Unable to parse dataset {config.name}'s fiftyone version requirement {req_str}"
)
return

if not req.specifier.contains(foc.VERSION):
logger.warning(
f"Dataset {config.name} requires {req_str} but you are running fiftyone=={foc.VERSION}"
)
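
A standalone sketch of what this check does with the packaging library; the "fiftyone>=0.23.7" requirement string below is only an example, not one taken from the PR.

from packaging.requirements import Requirement

req = Requirement("fiftyone>=0.23.7")
print(req.specifier.contains("0.23.6"))  # False -> the incompatibility warning above would fire
print(req.specifier.contains("0.24.0"))  # True  -> compatible, no warning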

Contributor

Consider logging a clear message or raising a warning when the dataset version requirement is not met, in addition to the current error handling strategy.

if not req.specifier.contains(foc.VERSION):
+   logger.warning(f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which may lead to compatibility issues.")
    exception = ImportError(
        f"Dataset {config.name} requires {req_str} but you are running {foc.VERSION}, which is not compatible"
    )
    fou.handle_error(exception, error_level)


Ensure user is on latest version of FO (0.23.7) so the integration works
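
One way a dataset author could encode that requirement when uploading (a hedged sketch: it assumes min_fiftyone_version accepts a plain version string, consistent with the push_to_hub signature discussed below, and the dataset/repo names are illustrative).

import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fo.Dataset()  # placeholder; a real push needs samples and an authenticated HF token
fouh.push_to_hub(dataset, "my-dataset", min_fiftyone_version="0.23.7")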
Contributor
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review Status

Configuration used: CodeRabbit UI

Commits: Files that changed from the base of the PR and between 91f026d and fd624fa.
Files selected for processing (1)
  • fiftyone/utils/huggingface.py (1 hunks)
Additional comments not posted (5)
fiftyone/utils/huggingface.py (5)

62-115: Consider adding type hints to the push_to_hub function parameters and return type for improved readability and type checking.

- def push_to_hub(
+ def push_to_hub(
+     dataset: fod.Dataset,
+     repo_name: str,
+     description: Optional[str] = None,
+     license: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     private: bool = False,
+     exist_ok: bool = False,
+     dataset_type: Optional[Type[fot.DatasetFormat]] = None,
+     min_fiftyone_version: Optional[str] = None,
+     label_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     frame_labels_field: Optional[Union[str, List[str], Dict[str, str]]] = None,
+     token: Optional[str] = None,
+     preview_path: Optional[str] = None,
+     **data_card_kwargs,
+ ) -> None:

212-254: Consider adding type hints to the load_from_hub function parameters and return type for improved readability and type checking.

- def load_from_hub(
+ def load_from_hub(
+     repo_id: str,
+     revision: Optional[str] = None,
+     split: Optional[str] = None,
+     splits: Optional[List[str]] = None,
+     subset: Optional[str] = None,
+     subsets: Optional[List[str]] = None,
+     max_samples: Optional[int] = None,
+     batch_size: Optional[int] = None,
+     num_workers: Optional[int] = None,
+     overwrite: bool = False,
+     persistent: bool = False,
+     name: Optional[str] = None,
+     token: Optional[str] = None,
+     config_file: Optional[str] = None,
+     **kwargs,
+ ) -> fod.Dataset:
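
Based on the signature above, a minimal usage sketch (the repo_id is hypothetical, and passing format as a keyword assumes config params can be forwarded via **kwargs as this integration describes).

import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub(
    "username/my-dataset",   # hypothetical repo_id
    format="parquet",        # config param forwarded via **kwargs
    max_samples=100,
    persistent=True,
)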

276-309: Consider adding type hints to the HFHubDatasetConfig class __init__ method parameters for consistency and clarity.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     name: Optional[str] = None,
+     repo_type: Optional[str] = None,
+     repo_id: Optional[str] = None,
+     revision: Optional[str] = None,
+     filename: Optional[str] = None,
+     format: Optional[str] = None,
+     tags: Optional[List[str]] = None,
+     license: Optional[str] = None,
+     description: Optional[str] = None,
+     fiftyone: Optional[Dict[str, Any]] = None,
+     **kwargs,
+ ):

479-509: Consider adding type hints to the HFHubParquetFilesDatasetConfig class __init__ method parameters for improved readability and type checking.

- def __init__(self, **kwargs):
+ def __init__(
+     self,
+     media_type: str = DEFAULT_MEDIA_TYPE,
+     label_fields: Optional[Dict[str, List[str]]] = None,
+     default_media_fields: Optional[Dict[str, str]] = None,
+     additional_media_fields: Optional[Dict[str, str]] = None,
+     **kwargs,
+ ):

613-630: The addition of a warning log when the dataset version requirement is not met is a good practice for informing users about potential compatibility issues.

@jacobmarks jacobmarks merged commit 81b2545 into develop Apr 5, 2024
10 checks passed
@jacobmarks jacobmarks deleted the hf-hub-integration-v2 branch April 5, 2024 15:17