Hugging Face Hub Integration! #4193

Merged
merged 69 commits into from
Apr 5, 2024
Changes from 1 commit
a9eb9c7
initial sketch
jacobmarks Mar 24, 2024
cadc1be
always add filepath to sample dict
jacobmarks Mar 25, 2024
7214599
image shape util usage
jacobmarks Mar 25, 2024
8eaf809
change load_from_hub kwargs to named
jacobmarks Mar 25, 2024
558497a
robust label field kwarg handling
jacobmarks Mar 25, 2024
80b946e
enable config spec through kwargs
jacobmarks Mar 25, 2024
0de7125
Speedup media download with threadpools
jacobmarks Mar 25, 2024
72b381f
adds support for loading from gated repos
jacobmarks Mar 25, 2024
3090f37
use threadpool to download masks
jacobmarks Mar 25, 2024
50d8deb
default batch size config
jacobmarks Mar 25, 2024
7225363
adds push_to_hub()
jacobmarks Mar 25, 2024
3664981
adds loading FiftyOne datasets from hub
jacobmarks Mar 25, 2024
b63cdbf
clean up
jacobmarks Mar 25, 2024
74e214c
handle no-label classification case
jacobmarks Mar 26, 2024
639a7c2
rename hf util file
jacobmarks Mar 28, 2024
a539d76
adds standard module docstring
jacobmarks Mar 28, 2024
8d49821
removes detection_format and mask_targets for now
jacobmarks Mar 28, 2024
cbafc83
lazy import huggingface_hub
jacobmarks Mar 28, 2024
7fd50ba
remove unused import
jacobmarks Mar 28, 2024
45aaefc
use TmpDir eta util
jacobmarks Mar 28, 2024
0463937
fixes docstring indentation
jacobmarks Mar 28, 2024
047947a
fix api docstring bullet list indent
jacobmarks Mar 28, 2024
4e2f674
changing heading levels
jacobmarks Mar 28, 2024
750c75a
change default push to public
jacobmarks Mar 28, 2024
025a61e
basic push_to_hub docs
jacobmarks Mar 28, 2024
97a7241
updating code snippet for loading dataset
jacobmarks Mar 28, 2024
8eb5a0e
pushing to hub advanced usage section
jacobmarks Mar 28, 2024
a9d6801
examples for advanced push section
jacobmarks Mar 28, 2024
cc7f2c0
loading from hub via local config docs
jacobmarks Mar 28, 2024
edcbc4c
hub loading from kwargs docs
jacobmarks Mar 28, 2024
c6fe103
starting supported config fields section
jacobmarks Mar 28, 2024
75b997d
documenting all supported config args
jacobmarks Mar 29, 2024
cc64b7b
documents split and subset config
jacobmarks Mar 29, 2024
fc9cfa8
documents download process params
jacobmarks Mar 29, 2024
4c49b32
documents loading classification datasets
jacobmarks Mar 29, 2024
0404864
documents loading detection datasets
jacobmarks Mar 29, 2024
e45317d
adds note about other detection formats
jacobmarks Mar 29, 2024
b510fff
documents loading segmentation datasets
jacobmarks Mar 29, 2024
b79fdf6
Documents loading unlabeled datasets
jacobmarks Mar 29, 2024
75b60ef
documents advanced loading
jacobmarks Mar 29, 2024
6114d74
adds breadcrumb to config args
jacobmarks Mar 29, 2024
d619c0e
Documents loading gated datasets
jacobmarks Mar 29, 2024
f0e82c2
adds num_workers arg for downloading
jacobmarks Mar 29, 2024
75ee44c
remove unnecessary max_samples
jacobmarks Mar 29, 2024
de658d8
access warning
jacobmarks Mar 29, 2024
c771667
removes globals cache
jacobmarks Mar 29, 2024
b1e3882
implementation/doc updates
brimoor Apr 3, 2024
7801a0a
Merge pull request #4223 from voxel51/hf-hub-integration-v2-brian
jacobmarks Apr 3, 2024
3951c2c
bump huggingface_hub version for token auth
jacobmarks Apr 3, 2024
8b74e3d
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
dc3af96
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
cd72833
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
9318690
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
43e6ce1
docs pass
brimoor Apr 4, 2024
4484c2f
Merge branch 'hf-hub-integration-v2' of https://github.com/voxel51/fi…
brimoor Apr 4, 2024
6cab1d0
makes fiftyone version req optional
jacobmarks Apr 4, 2024
e50e353
document min_fiftyone_version
jacobmarks Apr 4, 2024
9e1e7d5
validation min_fiftyone_version
jacobmarks Apr 4, 2024
df1db86
remove unused import
jacobmarks Apr 4, 2024
e4e78ac
adds preview_path option to preview dataset
jacobmarks Apr 4, 2024
40f7b92
documents preview_path
jacobmarks Apr 4, 2024
d5ea06e
improve name resolution
jacobmarks Apr 4, 2024
b71e3fb
only download necessary media files for FiftyOneDatasets
jacobmarks Apr 4, 2024
e2f55af
removing version validation
jacobmarks Apr 4, 2024
6937faa
improve efficiency
jacobmarks Apr 5, 2024
1179d9a
Merge pull request #4231 from voxel51/hf-hub-upgrades
jacobmarks Apr 5, 2024
a5291a6
Adds context without progress bars for upload/download
jacobmarks Apr 5, 2024
91f026d
Merge pull request #4234 from voxel51/hf-hub-progress-bars
jacobmarks Apr 5, 2024
fd624fa
Update install instructions
jacobmarks Apr 5, 2024
enable config spec through kwargs
jacobmarks committed Mar 25, 2024
commit 80b946e234953d5d87ee56a4b5556cc1bfec435f
37 changes: 31 additions & 6 deletions fiftyone/utils/hf_hub.py
@@ -211,7 +211,14 @@ def _build_allowed_subsets(self, kwargs):
     def _build_media_fields_dict(self, kwargs):
         media_fields_dict = kwargs.get("default_media_fields", {})
         if media_fields_dict.get("filepath", None) is None:
-            media_fields_dict["filepath"] = DEFAULT_IMAGE_FILEPATH_FEATURE
+            media_fields_dict["filepath"] = kwargs.get(
+                "filepath", DEFAULT_IMAGE_FILEPATH_FEATURE
+            )
+        if (
+            media_fields_dict.get("thumbnail_path", None) is None
+            and kwargs.get("thumbnail_path", None) is not None
+        ):
+            media_fields_dict["thumbnail_path"] = kwargs["thumbnail_path"]

         additional_media_fields = kwargs.get("additional_media_fields", {})
         media_fields_dict.update(additional_media_fields)
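The precedence this hunk introduces for media fields can be sketched as a standalone function. This is a minimal illustration, not the module's code: `build_media_fields_dict` is a hypothetical free function, the value `"image"` for `DEFAULT_IMAGE_FILEPATH_FEATURE` is an assumption (the constant's value is not shown in this commit), and the sketch copies the input dict rather than mutating it in place as the diff does.

```python
# Assumed placeholder: the real constant's value is not shown in this commit.
DEFAULT_IMAGE_FILEPATH_FEATURE = "image"


def build_media_fields_dict(kwargs):
    """Resolve media fields from loader kwargs (hypothetical standalone sketch)."""
    # Copy so the caller's dict is left untouched (the diff mutates in place).
    media_fields = dict(kwargs.get("default_media_fields", {}))

    # An entry in default_media_fields wins; otherwise use the top-level
    # "filepath" kwarg, and only then the default feature name.
    if media_fields.get("filepath") is None:
        media_fields["filepath"] = kwargs.get(
            "filepath", DEFAULT_IMAGE_FILEPATH_FEATURE
        )

    # thumbnail_path has no default: it is set only when explicitly passed.
    if (
        media_fields.get("thumbnail_path") is None
        and kwargs.get("thumbnail_path") is not None
    ):
        media_fields["thumbnail_path"] = kwargs["thumbnail_path"]

    # Additional media fields are merged last, so they can override the above.
    media_fields.update(kwargs.get("additional_media_fields", {}))
    return media_fields
```

With this ordering, a caller can pass `filepath="img"` directly instead of nesting it under `default_media_fields`, which appears to be the point of the commit ("enable config spec through kwargs").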
@@ -223,7 +230,10 @@ def _build_label_fields_dict(self, kwargs):
         for label_type in label_types:
             label_fields = kwargs.get(f"{label_type}_fields", None)
             if label_fields is not None:
-                self.label_fields[label_type] = label_fields.split(",")
+                if isinstance(label_fields, str):
+                    self.label_fields[label_type] = label_fields.split(",")
+                elif isinstance(label_fields, list):
+                    self.label_fields[label_type] = label_fields

     # def _build_mask_targets(self, kwargs):
     # self.mask_targets = kwargs.get("mask_targets", None)
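The string-or-list handling added in this hunk can be exercised in isolation with a sketch like the following. The function name `build_label_fields` and the default `label_types` tuple are hypothetical (the real `label_types` list is outside the diff context). As in the diff, values that are neither `str` nor `list` are silently skipped, and comma-separated strings are split without stripping whitespace.

```python
def build_label_fields(kwargs, label_types=("classification", "detection", "mask")):
    """Collect `<label_type>_fields` kwargs into lists (hypothetical sketch)."""
    label_fields_by_type = {}
    for label_type in label_types:
        value = kwargs.get(f"{label_type}_fields")
        if isinstance(value, str):
            # Comma-separated form, e.g. from a YAML config or a CLI-style kwarg
            label_fields_by_type[label_type] = value.split(",")
        elif isinstance(value, list):
            # Already a list, e.g. passed directly from Python
            label_fields_by_type[label_type] = value
        # None and any other type are ignored, mirroring the diff
    return label_fields_by_type
```

A stricter variant could raise `TypeError` for unsupported values instead of dropping them, at the cost of rejecting inputs the current code tolerates.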
@@ -239,8 +249,19 @@ def _build_label_fields_dict(self, kwargs):
     # self.mask_targets = json.load(f)


+def _parse_format_string(format_str):
+    if "parquet" in format_str.lower():
+        return "ParquetFilesDataset"
+    else:
+        return format_str
+
+
 def _build_config(config_dict):
     format = config_dict.get("format", None)
+    if format is None:
+        raise ValueError("Dataset config must have a format key")
+
+    format = _parse_format_string(format)
     if format == "ParquetFilesDataset":
         return HFHubParquetFilesDatasetConfig(**config_dict)
     else:
@@ -274,11 +295,15 @@ def _get_dataset_metadata(repo_id, revision=None, **kwargs):
         config_file = hfh.hf_hub_download(**all_kwargs)
         break

-    if config_file is None:
+    if config_file is None and "format" not in kwargs:
         return None
-
-    with open(config_file, "r") as f:
-        config_dict = yaml.safe_load(f)
+    elif config_file is None:
+        config_dict = kwargs
+        config_dict.update(**common_kwargs)
+        config_dict["repo_id"] = repo_id
+    else:
+        with open(config_file, "r") as f:
+            config_dict = yaml.safe_load(f)
     config_dict.update(**all_kwargs)
     return _build_config(config_dict)
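Taken together, the last two hunks let a dataset be described entirely through kwargs when the repo has no config file, provided a `format` is given. A simplified sketch of that decision, under stated assumptions: `resolve_config_dict` is a hypothetical helper, `file_config` stands in for the already-parsed YAML (or `None` when no file was found), and the merging of `common_kwargs` and `all_kwargs` from the diff is omitted because their contents are not visible in this commit. Unlike `_build_config`, the sketch writes the normalized format back into the returned dict.

```python
def parse_format_string(format_str):
    # Any format name containing "parquet" (case-insensitive) maps to the
    # canonical name; everything else passes through unchanged.
    if "parquet" in format_str.lower():
        return "ParquetFilesDataset"
    return format_str


def resolve_config_dict(file_config, kwargs, repo_id):
    """Pick the config source and validate it (hypothetical, simplified sketch)."""
    if file_config is None and "format" not in kwargs:
        # Neither a config file nor a kwargs-level format: nothing to load.
        return None

    if file_config is None:
        # No config file in the repo: the kwargs themselves act as the config.
        config = dict(kwargs)
    else:
        config = dict(file_config)
    config["repo_id"] = repo_id

    fmt = config.get("format")
    if fmt is None:
        raise ValueError("Dataset config must have a format key")
    config["format"] = parse_format_string(fmt)
    return config
```

For example, `resolve_config_dict(None, {"format": "parquet"}, "user/repo")` yields a config whose format is `"ParquetFilesDataset"`, while a parsed config file that lacks a `format` key raises `ValueError`.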
