Implement `annif upload` and `annif download` commands for Hugging Face Hub integration
#762
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #762      +/-   ##
==========================================
- Coverage   99.65%   99.64%   -0.02%
==========================================
  Files          89       91       +2
  Lines        6404     6768     +364
==========================================
+ Hits         6382     6744     +362
- Misses         22       24       +2
☔ View full report in Codecov by Sentry.
Title changed: "`annif upload-projects` command for Hugging Face Hub integration" → "`annif upload-projects` and `download-projects` commands for Hugging Face Hub integration"
TODO:
@osma could you take a look at the working principles of the commands before I spend more time on this?
The downloaded files now remain in the cache of the huggingface client, e.g. (
Could or should these commands be flagged as experimental for now, to allow changing them?
I took a look at this (sorry for the delay!) and I think this is a really good feature. Some thoughts:
I tested this on my local install by running the command
Isn't the caching standard behaviour of HF Hub operations? I think just keeping the cache would be the least surprising thing to do here. Maybe there could be a
About overwriting vocabularies: I think it would be great if Annif noticed that the downloaded vocabulary differs from what exists locally. It should be no problem to download two different models that use the same vocabulary:
But when you have a different version of the vocabulary already present, Annif could show a message such as
Even better would be to list the projects that may break. But how to detect that the local vocabulary differs from the downloaded one? Calculate checksums on the files and compare them?
Actually, ZipFile objects allow getting the ZipInfo of each member of the archive, which in turn contains a CRC-32 checksum (and the name and modification timestamp) of the uncompressed file. And the ZipInfo.from_file() class method allows getting the CRC from the existing, uncompressed files, so the checksum comparisons should be quite easy to implement. Edit: But CRC of the objects constructed by
If it's the above bug in zipfile, it's claimed to be
Annif just dropped 3.8 support so maybe we're good? Just need to rebase this branch.
The Python versions 3.9.18, 3.10.13 and 3.11.8 all behaved the same. This seems strange: the code

    import zipfile

    zinfo = zipfile.ZipInfo.from_file("data/projects/yake-fi/yake-index")
    print(dir(zinfo))
    print(zinfo.file_size)
    print(zinfo.CRC)

outputs
The CRC attribute is created using Also trying to access
Anyway, the CRC-32 checksums for the existing, uncompressed files could be calculated by
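One way to get CRC-32 checksums for the existing, uncompressed files is to compute them directly with `zlib.crc32` and compare them against the `CRC` field that `ZipFile.infolist()` reports for each archive member. The following is a minimal sketch under that assumption, not necessarily what this PR ends up doing; the helper names `file_crc32` and `changed_members` are made up for illustration.

```python
import zipfile
import zlib
from pathlib import Path


def file_crc32(path, chunk_size=1 << 16):
    """Compute the CRC-32 of an uncompressed file on disk, reading in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def changed_members(archive_path, dest_dir):
    """Return names of archive members whose local copy exists but differs."""
    changed = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            local = Path(dest_dir) / info.filename
            # ZipInfo.CRC of a member read from an archive is the CRC-32
            # of its uncompressed content, so the values are comparable.
            if local.exists() and file_crc32(local) != info.CRC:
                changed.append(info.filename)
    return changed
```

Files that do not exist locally are not reported, since extracting them cannot overwrite anything.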
I added
I did not see a reason to restrict the content comparison to only vocabularies(?) (and it was easier to implement both projects and vocabs 🙂).
Demo with

    #!/bin/bash
    rm -r data/projects/{fasttext-fi,tfidf-fi,yake-fi}
    echo "# Initial download"
    annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
    echo "# Second download with identical content, no-op"
    annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
    echo $RANDOM > data/projects/tfidf-fi/file.txt
    echo $RANDOM >> projects.d/tfidf-fi.cfg
    echo "# Third download over changed content, skips file overwrites with complaints"
    annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing
    echo "# Forced third download over changed content"
    annif download "tfidf-fi" juhoinkinen/Annif-models-upload-testing --force
    tree data

The output:
Title changed: "`annif upload-projects` and `download-projects` commands for Hugging Face Hub integration" → "`annif upload` and `annif download` commands for Hugging Face Hub integration"
I commented on the huggingface-hub repository about git tags, and they promptly created an issue for creating, listing and deleting tags with the huggingface-cli (I also wrote a question on the HF forum). If/when it gets implemented, I think it would be most convenient for Annif models if the tagging were performed after
The
When testing with
While uploading projects to FintoAI-data-YSO with
However, I had about 60 GB of free disk space, and preuploading should free the memory of the uploaded objects, so it should not be a memory issue either. Anyway, this could be circumvented by uploading the projects of only one language per command.
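The per-language workaround could be scripted by deriving one glob pattern per language from the project ids and running one upload per pattern. A rough sketch, assuming project ids end in `-<language>` as in `tfidf-fi`; the helper `patterns_by_language` and the example ids are hypothetical and not part of Annif:

```python
from fnmatch import fnmatchcase


def patterns_by_language(project_ids):
    """Map each glob pattern such as '*-fi' to the project ids it selects.

    Assumes ids end in '-<language>'; ids without a hyphen are skipped.
    """
    languages = sorted({pid.rsplit("-", 1)[1] for pid in project_ids if "-" in pid})
    return {
        f"*-{lang}": [pid for pid in project_ids if fnmatchcase(pid, f"*-{lang}")]
        for lang in languages
    }


batches = patterns_by_language(["tfidf-fi", "yake-fi", "tfidf-sv", "fasttext-en"])
for pattern, ids in batches.items():
    # Each pattern would be passed to its own upload command.
    print(pattern, ids)
```

Splitting by pattern keeps each upload small, at the cost of one commit per language in the Hub repository.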
Looks very good.
The only thing that caught my eye is that this PR adds a lot of new code especially to cli_util.py, but also cli.py and test_cli.py. Would it make sense to put the HF-specific code in separate files instead, at least in the case of cli_util.py?
Quality Gate passed
🎉
The organization of the uploaded files follows option 4 of issue #760:
Each vocabulary zip is placed in `vocabs/<vocab-id>.zip`, and each project configuration is placed in `<project-id>.cfg` in the repo root.
This option is good for caching, because it prevents unnecessary uploads/downloads of the whole projects bundle when only one project has changed, and for visibility of the repo contents.
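The layout described above can be written down as a small path-mapping helper. This is only an illustrative sketch: the function name `repo_paths` and the example ids are hypothetical, and only the two path patterns named in the description are covered.

```python
def repo_paths(project_ids, vocab_ids):
    """List repo file paths for project configs and vocabulary archives."""
    # Project configurations live in the repo root as <project-id>.cfg
    cfg_paths = [f"{pid}.cfg" for pid in project_ids]
    # Vocabulary archives live under vocabs/ as <vocab-id>.zip;
    # deduplicate because several projects may share one vocabulary
    vocab_paths = [f"vocabs/{vid}.zip" for vid in dict.fromkeys(vocab_ids)]
    return cfg_paths + vocab_paths


print(repo_paths(["tfidf-fi", "yake-fi"], ["yso", "yso"]))
```

Because every file has its own path, a change to one project or vocabulary touches only that file in the repository.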
The unzip after download places the data directories directly in `data/projects/` and `data/vocabs/`, and the project configurations as separate `<project-id>.cfg` files in the `projects.d/` directory, so after download the projects are directly usable (if `projects.{cfg,toml}` does not exist, or by using the `--projects` option).
Upload
Push a set of selected projects and vocabularies to a Hugging Face Hub repository as zip files, and the configurations of the projects as cfg/ini files. For example, the following command uploads all projects with ids matching `*-fi` to juhoinkinen/Annif-models-upload-testing:
Download
Download the project and vocabulary archives and the configuration files of the projects that match the given
and unzip the archives to the `data/` directory and place the configuration files in the `projects.d/` directory:
Note that currently the download will overwrite the project and vocab dirs if they already exist.
Edit: By default, overwriting does not happen; it can be enabled by adding the `--force` option to the command.
A git revision (commit hash, branch etc.) can be specified with the `--revision` option.
The downloaded files remain in the huggingface_hub client cache dir.
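The default-versus-`--force` behaviour can be pictured as a per-file decision during extraction. The sketch below is an assumption about how such a check might look, not the code in this PR: `extract_guarded` is a hypothetical helper, the warning text is invented, and comparing with `zlib.crc32` is just one possible way to detect changed content.

```python
import zipfile
import zlib
from pathlib import Path


def extract_guarded(archive_path, dest_dir, force=False):
    """Extract an archive, skipping files whose local content differs.

    Returns the list of skipped member names. With force=True nothing
    is skipped and differing local files are overwritten.
    """
    skipped = []
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = Path(dest_dir) / info.filename
            if target.exists():
                # Reads the whole file into memory; fine for a sketch.
                local_crc = zlib.crc32(target.read_bytes())
                if local_crc == info.CRC:
                    continue  # identical content, nothing to do
                if not force:
                    print(f"Not overwriting {target} (use --force to overwrite)")
                    skipped.append(info.filename)
                    continue
            zf.extract(info, dest_dir)
    return skipped
```

Identical files are never rewritten, so repeating a download over unchanged content is a no-op in this sketch.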
For all uploads and downloads involving private repos, the user needs to have logged in with `huggingface-cli login`, or the HF Hub token can be given with the `--token` option of this Annif command.