Hugging Face Hub Integration! #4193

Merged 69 commits on Apr 5, 2024
a9eb9c7
initial sketch
jacobmarks Mar 24, 2024
cadc1be
always add filepath to sample dict
jacobmarks Mar 25, 2024
7214599
image shape util usage
jacobmarks Mar 25, 2024
8eaf809
change load_from_hub kwargs to named
jacobmarks Mar 25, 2024
558497a
robust label field kwarg handling
jacobmarks Mar 25, 2024
80b946e
enable config spec through kwargs
jacobmarks Mar 25, 2024
0de7125
Speedup media download with threadpools
jacobmarks Mar 25, 2024
72b381f
adds support for loading from gated repos
jacobmarks Mar 25, 2024
3090f37
use threadpool to download masks
jacobmarks Mar 25, 2024
50d8deb
default batch size config
jacobmarks Mar 25, 2024
7225363
adds push_to_hub()
jacobmarks Mar 25, 2024
3664981
adds loading FiftyOne datasets from hub
jacobmarks Mar 25, 2024
b63cdbf
clean up
jacobmarks Mar 25, 2024
74e214c
handle no-label classification case
jacobmarks Mar 26, 2024
639a7c2
rename hf util file
jacobmarks Mar 28, 2024
a539d76
adds standard module docstring
jacobmarks Mar 28, 2024
8d49821
removes detection_format and mask_targets for now
jacobmarks Mar 28, 2024
cbafc83
lazy import huggingface_hub
jacobmarks Mar 28, 2024
7fd50ba
remove unused import
jacobmarks Mar 28, 2024
45aaefc
use TmpDir eta util
jacobmarks Mar 28, 2024
0463937
fixes docstring indentation
jacobmarks Mar 28, 2024
047947a
fix api docstring bullet list indent
jacobmarks Mar 28, 2024
4e2f674
changing heading levels
jacobmarks Mar 28, 2024
750c75a
change default push to public
jacobmarks Mar 28, 2024
025a61e
basic push_to_hub docs
jacobmarks Mar 28, 2024
97a7241
updating code snippet for loading dataset
jacobmarks Mar 28, 2024
8eb5a0e
pushing to hub advanced usage section
jacobmarks Mar 28, 2024
a9d6801
examples for advanced push section
jacobmarks Mar 28, 2024
cc7f2c0
loading from hub via local config docs
jacobmarks Mar 28, 2024
edcbc4c
hub loading from kwargs docs
jacobmarks Mar 28, 2024
c6fe103
starting supported config fields section
jacobmarks Mar 28, 2024
75b997d
documenting all supported config args
jacobmarks Mar 29, 2024
cc64b7b
documents split and subset config
jacobmarks Mar 29, 2024
fc9cfa8
documents download process params
jacobmarks Mar 29, 2024
4c49b32
documents loading classification datasets
jacobmarks Mar 29, 2024
0404864
documents loading detection datasets
jacobmarks Mar 29, 2024
e45317d
adds note about other detection formats
jacobmarks Mar 29, 2024
b510fff
documents loading segmentation datasets
jacobmarks Mar 29, 2024
b79fdf6
Documents loading unlabeled datasets
jacobmarks Mar 29, 2024
75b60ef
documents advanced loading
jacobmarks Mar 29, 2024
6114d74
adds breadcrumb to config args
jacobmarks Mar 29, 2024
d619c0e
Documents loading gated datasets
jacobmarks Mar 29, 2024
f0e82c2
adds num_workers arg for downloading
jacobmarks Mar 29, 2024
75ee44c
remove unnecessary max_samples
jacobmarks Mar 29, 2024
de658d8
access warning
jacobmarks Mar 29, 2024
c771667
removes globals cache
jacobmarks Mar 29, 2024
b1e3882
implementation/doc updates
brimoor Apr 3, 2024
7801a0a
Merge pull request #4223 from voxel51/hf-hub-integration-v2-brian
jacobmarks Apr 3, 2024
3951c2c
bump huggingface_hub version for token auth
jacobmarks Apr 3, 2024
8b74e3d
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
dc3af96
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
cd72833
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
9318690
Update docs/source/integrations/huggingface.rst
jacobmarks Apr 4, 2024
43e6ce1
docs pass
brimoor Apr 4, 2024
4484c2f
Merge branch 'hf-hub-integration-v2' of https://github.com/voxel51/fi…
brimoor Apr 4, 2024
6cab1d0
makes fiftyone version req optional
jacobmarks Apr 4, 2024
e50e353
document min_fiftyone_version
jacobmarks Apr 4, 2024
9e1e7d5
validation min_fiftyone_version
jacobmarks Apr 4, 2024
df1db86
remove unused import
jacobmarks Apr 4, 2024
e4e78ac
adds preview_path option to preview dataset
jacobmarks Apr 4, 2024
40f7b92
documents preview_path
jacobmarks Apr 4, 2024
d5ea06e
improve name resolution
jacobmarks Apr 4, 2024
b71e3fb
only download necessary media files for FiftyOneDatasets
jacobmarks Apr 4, 2024
e2f55af
removing version validation
jacobmarks Apr 4, 2024
6937faa
improve efficiency
jacobmarks Apr 5, 2024
1179d9a
Merge pull request #4231 from voxel51/hf-hub-upgrades
jacobmarks Apr 5, 2024
a5291a6
Adds context without progress bars for upload/download
jacobmarks Apr 5, 2024
91f026d
Merge pull request #4234 from voxel51/hf-hub-progress-bars
jacobmarks Apr 5, 2024
fd624fa
Update install instructions
jacobmarks Apr 5, 2024
only download necessary media files for FiftyOneDatasets
jacobmarks committed Apr 4, 2024
commit b71e3fb58d265af784cb4a7020b1adedbffd5b6d
39 changes: 36 additions & 3 deletions fiftyone/utils/huggingface.py
@@ -1077,6 +1077,12 @@ def _resolve_dataset_name(config, **kwargs):
     return name


+def _get_files_to_download(dataset):
+    filepaths = dataset.values("filepath")
+    filepaths = [fp for fp in filepaths if not os.path.exists(fp)]
+    return filepaths
+
+
 def _load_fiftyone_dataset_from_config(config, **kwargs):
     logger.info("Loading dataset")
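
For readers following the diff, the filter in `_get_files_to_download` can be sketched in isolation. This is a minimal illustration, assuming a plain list of paths stands in for `dataset.values("filepath")`; the helper name `missing_files` and the temporary files are hypothetical and not part of the PR.

```python
import os
import tempfile


def missing_files(filepaths):
    """Return the paths in ``filepaths`` that do not exist on disk."""
    return [fp for fp in filepaths if not os.path.exists(fp)]


# Illustrative setup: one media file is already present, one is not
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "a.jpg")
    absent = os.path.join(tmp, "b.jpg")
    open(present, "w").close()

    result = missing_files([present, absent])
```

Only the path that is absent locally is returned, which is what lets the later download step skip media that is already on disk.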

@@ -1086,12 +1092,27 @@ def _load_fiftyone_dataset_from_config(config, **kwargs):
     splits = _parse_split_kwargs(**kwargs)

     download_dir = _get_download_dir(config._repo_id, **kwargs)
-    hfh.snapshot_download(
-        repo_id=config._repo_id, repo_type="dataset", local_dir=download_dir
-    )
+
+    init_download_kwargs = {
+        "repo_id": config._repo_id,
+        "repo_type": "dataset",
+        "local_dir": download_dir,
+    }

     dataset_type_name = config._format.strip()

+    if dataset_type_name == "FiftyOneDataset":
+        # If the dataset is a FiftyOneDataset, we can smart only download the
+        # necessary files
+        hfh.snapshot_download(
+            **init_download_kwargs,
+            ignore_patterns="data/*",
+        )
+    else:
+        hfh.snapshot_download(
+            **init_download_kwargs,
+        )
+
     dataset_type = getattr(
         __import__("fiftyone.types", fromlist=[dataset_type_name]),
         dataset_type_name,
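
The two-pass idea in this hunk (fetch everything except `data/*` first, fetch selected media later) can be sketched without touching the network. The snippet below is an approximation that assumes the patterns are matched fnmatch-style against repo-relative paths; `select_files` and the file names are hypothetical stand-ins, and the matching inside `huggingface_hub` may differ in details.

```python
from fnmatch import fnmatch


def select_files(repo_files, allow_patterns=None, ignore_patterns=None):
    """Approximate allow/ignore filtering over repo-relative paths."""
    selected = []
    for path in repo_files:
        # If an allow list is given, the path must match at least one entry
        if allow_patterns is not None and not any(
            fnmatch(path, pattern) for pattern in allow_patterns
        ):
            continue
        # Any match against the ignore list excludes the path
        if ignore_patterns is not None and any(
            fnmatch(path, pattern) for pattern in ignore_patterns
        ):
            continue
        selected.append(path)
    return selected


# Hypothetical repo layout for a FiftyOneDataset export
repo_files = [
    "fiftyone.yml",
    "metadata.json",
    "samples.json",
    "data/0001.jpg",
    "data/0002.jpg",
]

# First pass: everything except the media directory
first_pass = select_files(repo_files, ignore_patterns=["data/*"])

# Second pass: only the media files that are still missing locally
second_pass = select_files(repo_files, allow_patterns=["data/0002.jpg"])
```

Under these assumptions the first pass selects only the three metadata files, and the second pass selects just the one requested media file.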
@@ -1110,6 +1131,18 @@ def _load_fiftyone_dataset_from_config(config, **kwargs):
dataset_kwargs["name"] = name

     dataset = fod.Dataset.from_dir(download_dir, **dataset_kwargs)
+
+    if dataset_type_name != "FiftyOneDataset":
+        return dataset
+
+    filepaths = _get_files_to_download(dataset)
+    if filepaths:
+        logger.info(f"Downloading {len(filepaths)} media files...")
+        filenames = [os.path.basename(fp) for fp in filepaths]
+        allowed_globs = ["data/" + fn for fn in filenames]
+        hfh.snapshot_download(
+            **init_download_kwargs, allow_patterns=allowed_globs
+        )
     return dataset
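
The pattern-building step above turns each missing file's basename into a `data/<name>` entry. A possible variation, sketched here under the assumption that the patterns are interpreted fnmatch-style, escapes glob metacharacters so that a filename such as `img[1].jpg` is matched literally. The helper `build_allow_patterns`, its `media_dir` parameter, and the sample paths are hypothetical and not part of the PR.

```python
import glob
import os
from fnmatch import fnmatch


def build_allow_patterns(filepaths, media_dir="data"):
    """Build one literal-match pattern per file under ``media_dir``."""
    patterns = []
    for fp in filepaths:
        # glob.escape neutralizes *, ? and [ so the name matches itself only
        name = glob.escape(os.path.basename(fp))
        patterns.append(f"{media_dir}/{name}")
    return patterns


patterns = build_allow_patterns(
    ["/tmp/ds/data/0002.jpg", "/tmp/ds/data/img[1].jpg"]
)

# Without escaping, "[1]" is read as a character class and the literal
# filename does not match its own pattern
unescaped_matches = fnmatch("data/img[1].jpg", "data/img[1].jpg")
escaped_matches = fnmatch("data/img[1].jpg", patterns[1])
```

Whether escaping is needed in practice depends on the filenames in the dataset; names without `*`, `?`, or `[` pass through unchanged.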

