Hugging Face integration #760

Closed
juhoinkinen opened this issue Jan 30, 2024 · 1 comment · Fixed by #762

juhoinkinen commented Jan 30, 2024

The 🤗 Hugging Face Hub is intended for hosting and sharing AI models and datasets (as well as demo applications), and NatLibFi now has an organization account on the Hub.

The data (models and datasets) in the HF Hub live in git repositories, so git can be used to handle the data (commit, push, pull, ...). Direct integration of applications with the HF Hub is also supported through the huggingface_hub Python library, which can also be used as a CLI tool.
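As a rough illustration of the direct integration, pushing one file through the huggingface_hub library could look like the sketch below. The helper name `push_file` is hypothetical. `api` is assumed to be an authenticated `huggingface_hub.HfApi` instance, and only its `create_repo` and `upload_file` methods are used; it is passed in as an argument so the function can be tried out with a stand-in object and no network access.

```python
from pathlib import Path


def push_file(api, local_path, repo_id, path_in_repo=None):
    """Upload one local file to a Hub repository (hypothetical helper).

    `api` is assumed to be a huggingface_hub.HfApi instance.
    """
    local_path = Path(local_path)
    if not local_path.is_file():
        raise FileNotFoundError(local_path)
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, exist_ok=True)
    # By default, place the file at the repository root under its own name.
    target = path_in_repo or local_path.name
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=target,
        repo_id=repo_id,
    )
    return target
```

With the real library this would be called as `push_file(HfApi(), "yso-fi.zip", "NatLibFi/FintoAI-data-YSO")` after `from huggingface_hub import HfApi`.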

Annif could have functionality to push (and pull) projects or project sets to (and from) the HF Hub. It should be able to operate on project sets, both because an ensemble project requires its base projects to be available and for convenience.

There could be the following CLI command to push a set of projects to HF Hub:

annif upload-projects <glob-pattern> <username/reponame> [--options]

For example,

annif upload-projects yso-*fi NatLibFi/FintoAI-data-YSO

would upload the specified projects to the NatLibFi/FintoAI-data-YSO repository.

The files and directories that need to be uploaded are:

  • data/projects/<project-id>: the project directories
  • data/vocabs/<vocab-id>: the vocabularies of the projects
  • projects.{cfg,toml,d}: the configurations of the projects
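A minimal sketch of how the glob pattern could be resolved into the directories listed above. It assumes an INI-style projects.cfg in which each section name is a project ID and a `vocab` key holds a plain vocabulary ID; the function name `collect_upload_paths` and the `datadir` default are hypothetical, and TOML or projects.d configurations are not handled.

```python
import configparser
import fnmatch
from pathlib import Path


def collect_upload_paths(config_text, pattern, datadir="data"):
    """Return (project_dirs, vocab_dirs) for projects whose ID matches pattern."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    # Assumption: section names are the project IDs.
    project_ids = fnmatch.filter(config.sections(), pattern)
    project_dirs = [Path(datadir) / "projects" / pid for pid in project_ids]
    # Several projects may share a vocabulary, so collect the IDs in a set.
    # Assumption: the vocab value is a plain vocabulary ID.
    vocab_ids = sorted(
        {config[pid]["vocab"] for pid in project_ids if "vocab" in config[pid]}
    )
    vocab_dirs = [Path(datadir) / "vocabs" / vid for vid in vocab_ids]
    return project_dirs, vocab_dirs
```

Resolving the vocabularies from the configuration, rather than asking the user for them, keeps the upload command down to a pattern and a repository name.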

Options for bundling and uploading

1. Single file

Bundle all files into one zip, e.g. yso-fi.zip (possibly including only the configs of the selected projects), and upload it to the root of the repo.

The filename could be derived from the glob pattern of the projects, or it could be a required argument of the upload command (as the 2nd argument, to be added to the above example).

This option would be the easiest for downloads: just wget one file and unzip it.
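The single-file bundling could be sketched with the standard library alone. The function name `bundle_to_zip` and its parameters are hypothetical; member names are stored relative to `root`, so that unzipping in an Annif working directory would restore the data/projects and data/vocabs layout in place.

```python
import zipfile
from pathlib import Path


def bundle_to_zip(zip_path, paths, root="."):
    """Write the given files and directory trees into one zip archive."""
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for rel in paths:
            path = root / rel
            if not path.exists():
                raise FileNotFoundError(path)
            if path.is_file():
                files = [path]
            else:
                # Walk the directory tree and keep regular files only.
                files = sorted(p for p in path.rglob("*") if p.is_file())
            for file in files:
                # Store names relative to root, with forward slashes.
                zf.write(file, arcname=file.relative_to(root).as_posix())
    return zip_path
```

For example, `bundle_to_zip("yso-fi.zip", ["data/projects/yso-tfidf-fi", "data/vocabs/yso", "projects.cfg"])` would produce one archive for the repo root (the project ID here is only illustrative).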

2. One file for projects and vocab, and one for projects configs

Bundle the project and vocabulary directories into one zip and leave the projects config file uncompressed.

3. One file for projects, one for vocab, and one for projects configs

Bundle the selected projects into one zip (yso-fi.zip) and the vocabularies into another (yso.zip), and leave the projects config file uncompressed. Upload the projects zip to the data/projects directory and the vocab zip to data/vocabs.

4. Separate files for each project, vocab, and projects configs

Compress each project directory into its own zip (<project-id>.zip).

With this option, downloading the projects would require e.g. wget --accept yso*-fi.zip.
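A sketch of the per-project variant, assuming the project directories sit side by side under one directory such as data/projects. The function name `zip_each_project` is hypothetical; `shutil.make_archive` is used so that each archive keeps the project ID as its top-level folder.

```python
import fnmatch
import shutil
from pathlib import Path


def zip_each_project(projects_dir, pattern, out_dir):
    """Create <project-id>.zip in out_dir for every matching project directory."""
    projects_dir = Path(projects_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for project in sorted(projects_dir.iterdir()):
        if not project.is_dir() or not fnmatch.fnmatchcase(project.name, pattern):
            continue
        # base_dir keeps the project ID as the top-level folder inside the zip.
        archive = shutil.make_archive(
            str((out_dir / project.name).resolve()),
            "zip",
            root_dir=str(projects_dir),
            base_dir=project.name,
        )
        archives.append(Path(archive))
    return archives
```

The trade-off compared with a single bundle is finer-grained downloads at the cost of more files to track in the repo.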


Some details and ideas:

Downloading projects

We could also implement a feature to fetch projects from the HF Hub, for example:

annif download-project <username/reponame> <projects-set-file> [--options]

But this is probably best implemented only after the upload functionality, since downloading from the HF Hub can also be done simply with wget or curl. However, if a download function is going to be added, the hierarchy and structure of the data files in the repo should be designed with it in mind.
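For the plain wget/curl style of download, a file in a public Hub repository can be addressed through its resolve URL. The sketch below builds that URL and unpacks a downloaded zip; the function names are hypothetical, the single-zip layout (e.g. yso-fi.zip at the repo root) is an assumption taken from the bundling options above, and private repositories (which need a token) are not covered.

```python
import zipfile
from urllib.parse import quote
from urllib.request import urlretrieve


def hub_file_url(repo_id, filename, revision="main", repo_type="model"):
    """Build the direct download URL of one file in a Hub repository."""
    # Dataset and Space repositories live under their own URL prefix.
    prefix = {"model": "", "dataset": "datasets/", "space": "spaces/"}[repo_type]
    return (
        f"https://huggingface.co/{prefix}{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


def fetch_and_unpack(repo_id, filename, target_dir="."):
    """Download a zip bundle from a public repo and extract it into target_dir."""
    local_path, _ = urlretrieve(hub_file_url(repo_id, filename))
    with zipfile.ZipFile(local_path) as zf:
        zf.extractall(target_dir)
```

The same URL works with `wget` or `curl -L`, so the repo layout chosen for uploads directly determines how simple such manual downloads are.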

@davanstrien

Very excited to see this! Feel free to ping me if you need any support with anything on the HF side :)
