Implement annif upload and annif download commands for Hugging Face Hub integration #762

Merged on Apr 23, 2024 (39 commits). Showing changes from 1 commit.

Commits
f6d2b7d
Initial functionality for HF Hub upload
juhoinkinen Feb 1, 2024
ab5e4bf
Use tempfile module and file-like objects for uploads
juhoinkinen Feb 5, 2024
d3dd888
Separate files for each project, vocab and config
juhoinkinen Feb 6, 2024
9d030c6
Catch also HFValidationError in HFH uploads
juhoinkinen Feb 6, 2024
3135114
Initial functionality for HF Hub download
juhoinkinen Feb 7, 2024
038d86d
Upgrade to huggingface-hub 0.21.*
juhoinkinen Feb 29, 2024
5afb251
Drop -projects part from upload/download CLI commands
juhoinkinen Feb 29, 2024
13191fc
Speed up CLI startup by moving imports in functions
juhoinkinen Feb 29, 2024
7666de8
Add --force option to allow overwrite local contents on download
juhoinkinen Mar 1, 2024
301d787
Resolve CodeQL complaint about imports
juhoinkinen Mar 1, 2024
d5b4abe
Restore datafile timestamps after unzipping
juhoinkinen Mar 4, 2024
a1e7605
Add comment to zip file with used Annif version
juhoinkinen Mar 4, 2024
25a46dc
Catch HFH Errors in listing files in repo
juhoinkinen Mar 4, 2024
86714d8
Unzip archive contents to used DATADIR
juhoinkinen Mar 6, 2024
6ba1e08
Add tests
juhoinkinen Mar 7, 2024
4d06be6
Create /.cache/huggingface/ with full access rights in Dockerimage
juhoinkinen Mar 7, 2024
a4f0f6f
Merge branch 'update-dependencies-v1.1' into issue760-hugging-face-hu…
juhoinkinen Mar 8, 2024
7575fff
Fix and improve tests and increase coverage
juhoinkinen Mar 8, 2024
16bacfb
Remove todos
juhoinkinen Mar 8, 2024
2952f64
Create /Annif/projects.d/ for tests in Dockerfile
juhoinkinen Mar 8, 2024
ed3cf2c
Refactor to address quality complains; improve names
juhoinkinen Mar 8, 2024
5b16952
Add docstrings
juhoinkinen Mar 12, 2024
c87675c
Add type hints
juhoinkinen Mar 12, 2024
2fe5b73
Update RTD CLI commands page
juhoinkinen Mar 12, 2024
d7be137
Remove --revision option of download command
juhoinkinen Mar 13, 2024
47f7ee4
Upgrade to huggingface-hub 0.22.*
juhoinkinen Mar 25, 2024
a488d07
Revert "Remove --revision option of download command"
juhoinkinen Mar 26, 2024
0c57bf2
Preupload lfs files
juhoinkinen Mar 26, 2024
df105a3
Fix HF Hub caching in Dockerfile
juhoinkinen Mar 27, 2024
d14ff30
Refactor to address quality complains
juhoinkinen Apr 12, 2024
cc0c989
Again: Refactor & simplify to address quality complains
juhoinkinen Apr 12, 2024
9443c8f
Fix typo in mocked filenames in repo
juhoinkinen Apr 19, 2024
156bbf5
Detect projects present in repo by .cfg files, not .zip files
juhoinkinen Apr 19, 2024
3f60456
Add --revision option to upload command
juhoinkinen Apr 19, 2024
2dd359d
Enable completion of project_id argument in upload command
juhoinkinen Apr 19, 2024
63076cd
Adapt test for adding revision option to upload command
juhoinkinen Apr 19, 2024
a0a3850
Move functions for HuggingFaceHub interactions to own file
juhoinkinen Apr 23, 2024
638aa07
Move unit tests for HuggingFaceHub util fns to own file
juhoinkinen Apr 23, 2024
6f35fff
Make io import conditional to TYPE_CHECKING
juhoinkinen Apr 23, 2024
Add --force option to allow overwrite local contents on download
juhoinkinen committed Mar 1, 2024
commit 7666de8a9a0dcb7d575c0ae7ae285f0d94244648
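
As a reading aid, the per-file policy that `--force` introduces on download can be summarised as a three-way decision. The sketch below is an illustration only, not code from the commit; the function name `overwrite_action` and the returned labels are hypothetical.

```python
def overwrite_action(exists: bool, identical: bool, force: bool) -> str:
    """Return what a download should do with one local file.

    Hypothetical helper mirroring the branching in this commit:
    - "write":  the target is missing, or --force was given
    - "skip":   the target exists and matches the downloaded content
    - "refuse": the target exists, differs, and --force was not given
    """
    if exists and not force:
        return "skip" if identical else "refuse"
    return "write"


if __name__ == "__main__":
    cases = [
        (False, False, False),
        (True, True, False),
        (True, False, False),
        (True, True, True),
    ]
    for case in cases:
        print(case, "->", overwrite_action(*case))
```

Note that with `--force` the file is written even when the local copy is already identical, which matches the `else` branch in the diff.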
16 changes: 11 additions & 5 deletions annif/cli.py
@@ -6,7 +6,6 @@
 import json
 import os.path
 import re
-import shutil
 import sys
 from fnmatch import fnmatch

@@ -606,34 +605,34 @@
     with the projects configuration to the specified Hugging Face Hub repository.
     An authentication token and commit message can be given with options.
     """
     projects = [
         proj
         for proj in annif.registry.get_projects(min_access=Access.private).values()
         if fnmatch(proj.project_id, project_ids_pattern)
     ]
     click.echo(f"Uploading project(s): {', '.join([p.project_id for p in projects])}")

     commit_message = (
         commit_message
         if commit_message is not None
         else f"Upload project(s) {project_ids_pattern} with Annif"
     )

     project_dirs = {p.datadir for p in projects}
     vocab_dirs = {p.vocab.datadir for p in projects}
     data_dirs = project_dirs.union(vocab_dirs)

     for data_dir in data_dirs:
         zip_path = data_dir.split(os.path.sep, 1)[1] + ".zip"  # TODO Check this
         fobj = cli_util.archive_dir(data_dir)
         cli_util.upload_to_hf_hub(fobj, zip_path, repo_id, token, commit_message)
         fobj.close()

     for project in projects:
         config_path = project.project_id + ".cfg"
         fobj = cli_util.write_config(project)
         cli_util.upload_to_hf_hub(fobj, config_path, repo_id, token, commit_message)
         fobj.close()


 @cli.command("download")
@@ -651,8 +650,15 @@
     hash.
     """,
 )
+@click.option(
+    "--force",
+    "-f",
+    default=False,
+    is_flag=True,
+    help="Replace an existing project/vocabulary/config with the downloaded one",
+)
 @cli_util.common_options
-def run_download(project_ids_pattern, repo_id, token, revision):
+def run_download(project_ids_pattern, repo_id, token, revision, force):
     """
     Download selected projects and their vocabularies from a Hugging Face Hub repository
     \f
@@ -664,30 +670,30 @@
     be given with options.
     """

     project_ids = cli_util.get_selected_project_ids_from_hf_hub(
         project_ids_pattern, repo_id, token, revision
     )
     click.echo(f"Downloading project(s): {', '.join(project_ids)}")

     if not os.path.isdir("projects.d"):
         os.mkdir("projects.d")
     vocab_ids = set()
     for project_id in project_ids:
         project_zip_local_cache_path = cli_util.download_from_hf_hub(
             f"projects/{project_id}.zip", repo_id, token, revision
         )
-        cli_util.unzip(project_zip_local_cache_path)
+        cli_util.unzip(project_zip_local_cache_path, force)
         local_config_cache_path = cli_util.download_from_hf_hub(
             f"{project_id}.cfg", repo_id, token, revision
         )
         vocab_ids.add(cli_util.get_vocab_id(local_config_cache_path))
-        shutil.copy(local_config_cache_path, "projects.d")  # TODO Disallow overwrite
+        cli_util.move_project_config(local_config_cache_path, force)

     for vocab_id in vocab_ids:
         vocab_zip_local_cache_path = cli_util.download_from_hf_hub(
             f"vocabs/{vocab_id}.zip", repo_id, token, revision
         )
-        cli_util.unzip(vocab_zip_local_cache_path)
+        cli_util.unzip(vocab_zip_local_cache_path, force)


@cli.command("completion")
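
The upload command derives each archive's path in the repository by dropping the first component of the local data directory, while the download command requests `projects/<id>.zip` and `vocabs/<id>.zip`. The sketch below shows how the two line up, assuming a relative data directory of the form `data/projects/<id>`; the wrapper name `repo_zip_path` and the sample directory names are made up for illustration.

```python
import os.path


def repo_zip_path(data_dir: str) -> str:
    """Mirror the expression used in the upload command (hypothetical wrapper).

    Drops the first path component and appends ".zip". An absolute path would
    keep extra leading components, and a path without a separator raises
    IndexError, so this only fits relative directories with a single prefix.
    """
    return data_dir.split(os.path.sep, 1)[1] + ".zip"


if __name__ == "__main__":
    project_dir = os.path.join("data", "projects", "demo-project")
    vocab_dir = os.path.join("data", "vocabs", "demo-vocab")
    print(repo_zip_path(project_dir))
    print(repo_zip_path(vocab_dir))
```

Under that assumption, the uploaded names match the `f"projects/{project_id}.zip"` and `f"vocabs/{vocab_id}.zip"` patterns used when downloading.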
50 changes: 47 additions & 3 deletions annif/cli_util.py
@@ -2,12 +2,14 @@

 from __future__ import annotations

+import binascii
 import collections
 import configparser
 import io
 import itertools
 import os
 import pathlib
+import shutil
 import sys
 import tempfile
 import zipfile
@@ -238,55 +240,55 @@


 def _is_train_file(fname):
     train_file_patterns = ("-train", "tmp-")
     for pat in train_file_patterns:
         if pat in fname:
             return True
     return False


 def archive_dir(data_dir):
     fp = tempfile.TemporaryFile()
     path = pathlib.Path(data_dir)
     fpaths = [fpath for fpath in path.glob("**/*") if not _is_train_file(fpath.name)]
     with zipfile.ZipFile(fp, mode="w") as zfile:
         for fpath in fpaths:
             logger.debug(f"Adding {fpath}")
             zfile.write(fpath)
     fp.seek(0)
     return fp


 def write_config(project):
     fp = tempfile.TemporaryFile(mode="w+t")
     config = configparser.ConfigParser()
     config[project.project_id] = project.config
     config.write(fp)  # This needs tempfile in text mode
     fp.seek(0)
     # But for upload fobj needs to be in binary mode
     return io.BytesIO(fp.read().encode("utf8"))


 def upload_to_hf_hub(fileobj, filename, repo_id, token, commit_message):
     from huggingface_hub import HfApi
     from huggingface_hub.utils import HfHubHTTPError, HFValidationError

     api = HfApi()
     try:
         api.upload_file(
             path_or_fileobj=fileobj,
             path_in_repo=filename,
             repo_id=repo_id,
             token=token,
             commit_message=commit_message,
         )
     except (HfHubHTTPError, HFValidationError) as err:
         raise OperationFailedException(str(err))


 def get_selected_project_ids_from_hf_hub(project_ids_pattern, repo_id, token, revision):
     all_repo_file_paths = _list_files_in_hf_hub(repo_id, token, revision)
     return [
         path.rsplit(".zip")[0].split("projects/")[1]  # TODO Try-catch this
         for path in all_repo_file_paths
         if fnmatch(path, f"projects/{project_ids_pattern}.zip")
@@ -294,39 +296,81 @@


 def _list_files_in_hf_hub(repo_id, token, revision):
     from huggingface_hub import list_repo_files

     return [
         repofile
         for repofile in list_repo_files(repo_id=repo_id, token=token, revision=revision)
     ]


 def download_from_hf_hub(filename, repo_id, token, revision):
     from huggingface_hub import hf_hub_download
     from huggingface_hub.utils import HfHubHTTPError, HFValidationError

     try:
         return hf_hub_download(
             repo_id=repo_id,
             filename=filename,
             token=token,
             revision=revision,
         )
     except (HfHubHTTPError, HFValidationError) as err:
         raise OperationFailedException(str(err))


-def unzip(source_path):
-    with zipfile.ZipFile(source_path, "r") as zfile:
-        zfile.extractall()  # TODO Disallow overwrite
+def unzip(src_path, force):
+    with zipfile.ZipFile(src_path, "r") as zfile:
+        for member in zfile.infolist():
+            if os.path.exists(member.filename) and not force:
+                if _is_existing_identical(member):
+                    logger.debug(
+                        f"Skipping unzip of {member.filename}; already in place"
+                    )
+                else:
+                    click.echo(
+                        f"Not overwriting {member.filename} (use --force to override)"
+                    )
+            else:
+                logger.debug(f"Unzipping {member.filename}")
+                zfile.extract(member)
+
+
+def move_project_config(src_path, force):
+    dst_path = os.path.join("projects.d", os.path.basename(src_path))
+    if os.path.exists(dst_path) and not force:
+        if _compute_crc32(dst_path) == _compute_crc32(src_path):
+            logger.debug(
+                f"Skipping move of {os.path.basename(src_path)}; already in place"
+            )
+        else:
+            click.echo(f"Not overwriting {dst_path} (use --force to override)")
+    else:
+        shutil.copy(src_path, dst_path)
+
+
+def _is_existing_identical(member):
+    file_crc = _compute_crc32(member.filename)
+    return file_crc == member.CRC
+
+
+def _compute_crc32(path):
+    if os.path.isdir(path):
+        return 0
+
+    size = 1024 * 1024 * 10  # 10 MiB chunks
+    with open(path, "rb") as fp:
+        crcval = 0
+        while chunk := fp.read(size):
+            crcval = binascii.crc32(chunk, crcval)
+    return crcval


 def get_vocab_id(config_path):
     config = configparser.ConfigParser()
     config.read(config_path)
     section = config.sections()[0]
     return config[section]["vocab"]


def _get_completion_choices(
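
The identity check added in `cli_util.py` relies on the CRC-32 that the zip format stores for every member: a file on disk is treated as already in place when its CRC-32 equals `ZipInfo.CRC`. The standalone sketch below reproduces that comparison with temporary files. The names `crc32_of_file` and `demo`, the chunk size, and the file names are illustrative choices, not taken from the PR.

```python
import binascii
import os
import tempfile
import zipfile
from typing import Tuple


def crc32_of_file(path: str, chunk_size: int = 64 * 1024) -> int:
    """Compute the CRC-32 of a file in chunks, carrying the running value."""
    crcval = 0
    with open(path, "rb") as fp:
        while chunk := fp.read(chunk_size):
            crcval = binascii.crc32(chunk, crcval)
    return crcval


def demo() -> Tuple[bool, bool]:
    """Return (matches_before_edit, matches_after_edit) for a sample file."""
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "model.bin")
        zip_path = os.path.join(tmp, "archive.zip")
        with open(data_path, "wb") as fp:
            fp.write(b"annif" * 100_000)  # spans several chunks
        with zipfile.ZipFile(zip_path, "w") as zfile:
            zfile.write(data_path, arcname="model.bin")
        with zipfile.ZipFile(zip_path) as zfile:
            member = zfile.getinfo("model.bin")
        before = crc32_of_file(data_path) == member.CRC
        # Simulate a local modification: change the first byte in place.
        with open(data_path, "r+b") as fp:
            fp.write(b"A")
        after = crc32_of_file(data_path) == member.CRC
        return before, after


if __name__ == "__main__":
    print(demo())
```

Reading in chunks keeps memory use flat for large model files, and the comparison avoids decompressing the archive member just to see whether it differs from the local copy.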