Multilingual vocabularies #559

Closed
osma opened this issue Jan 28, 2022 · 1 comment

osma commented Jan 28, 2022

Introduction

This is a proposed conceptual change to how Annif handles subject vocabularies. Currently, vocabularies are strictly monolingual. The same vocabulary can be used by multiple projects, as long as they share the same language. Typically, vocabularies have identifiers like yso-fi or stw-en, reflecting that they are specific to a single language, even though in reality both YSO and STW are multilingual.

Proposed change

Annif vocabularies should be multilingual; that is, projects with different languages can still share the same vocabulary.

Note that #556 would be a small step in this direction and it would probably make sense to implement that first.

Rationale

Making vocabularies multilingual would have the following benefits:

  1. Enable scenarios where the language of the indexing vocabulary differs from the language of the documents. For example, Finnish-language documents indexed with LCSH (terms in English only), or German-language documents indexed with YSO (Finnish, Swedish and English terms). At present this is not possible, or at least requires some dirty tricks.
  2. Align with SKOS: concepts are language-agnostic, terms can be given in multiple languages.
  3. Simplify configuration: no need to have separate vocabulary IDs for different languages.
  4. Simplify loading vocabularies: no need to load e.g. YSO three times just because you have projects in Finnish, Swedish and English. Loading once is enough.

Changes in Annif usage and implementation

This is somewhat speculative and may change when actually trying to implement this...

Changes to configuration

The main change would be to start using language-agnostic vocabulary IDs. So instead of

vocab=yso-fi

use

vocab=yso

Optionally, it should be possible to override the language of labels using a parameter (the default being the language of the project). For example, in a Finnish-language project where the vocabulary is LCSH, English LCSH labels would be selected like this:

language=fi
vocab=lcsh(en)
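
A rough sketch of how such a vocabulary specification might be parsed. The function name parse_vocab_spec and the regular expression are hypothetical and only illustrate the proposed syntax; they are not existing Annif code.

```python
import re
from typing import Tuple

# Hypothetical pattern: a vocabulary id, optionally followed by "(lang)".
_VOCAB_SPEC = re.compile(r"^(?P<id>[\w-]+)(?:\((?P<lang>[\w-]+)\))?$")


def parse_vocab_spec(spec: str, project_language: str) -> Tuple[str, str]:
    """Return (vocabulary id, label language) for a vocab= setting.

    The label language falls back to the project language when no
    override is given in parentheses.
    """
    match = _VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"cannot parse vocabulary specification: {spec!r}")
    return match.group("id"), match.group("lang") or project_language


# vocab=yso in a Finnish project: labels default to Finnish
print(parse_vocab_spec("yso", "fi"))
# vocab=lcsh(en) in a Finnish project: labels overridden to English
print(parse_vocab_spec("lcsh(en)", "fi"))
```

With this pattern an old-style id such as yso-fi would still parse as a plain vocabulary id, which matches the expectation that old configurations keep working.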

Changes to vocabulary data files

Currently, the vocabulary is stored as three files: subjects (TSV), subjects.ttl (SKOS), subjects.dump.gz (dumped rdflib Graph). The last two don't need to change as they are already multilingual. Instead of a single subjects file, there should be one file per language, e.g. subjects.fi.tsv, subjects.sv.tsv, subjects.en.tsv.

Changes to the loadvoc command

When given a SKOS file, the loadvoc code should figure out which languages are used in the labels and write as many subjects.<lang>.tsv files as there are languages.

When given a TSV file, write just a single subjects.<lang>.tsv file and generate the SKOS and Graph files like before. <lang> would be chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en).
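
The SKOS case could be sketched roughly as follows. To keep the example free of dependencies, the labels are passed in as plain (URI, label, language) tuples standing in for what would be read from the SKOS file; the function name write_subject_files, the example.org URIs and the `<uri>` TAB label line layout are assumptions made for illustration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# (concept URI, label text, language tag); a stand-in for SKOS labels
Label = Tuple[str, str, str]


def write_subject_files(labels: Iterable[Label], vocab_dir: Path) -> List[Path]:
    """Write one subjects.<lang>.tsv file per language found in the labels."""
    by_lang: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for uri, text, lang in labels:
        by_lang[lang].append((uri, text))

    written = []
    for lang in sorted(by_lang):
        path = vocab_dir / f"subjects.{lang}.tsv"
        with path.open("w", encoding="utf-8") as out:
            for uri, text in by_lang[lang]:
                # Assumed TSV layout: URI in angle brackets, tab, label
                out.write(f"<{uri}>\t{text}\n")
        written.append(path)
    return written


# Made-up example data, not taken from any real vocabulary
example_labels = [
    ("http://example.org/c1", "kissa", "fi"),
    ("http://example.org/c1", "katt", "sv"),
    ("http://example.org/c1", "cat", "en"),
    ("http://example.org/c2", "koira", "fi"),
]

with tempfile.TemporaryDirectory() as tmp:
    for path in write_subject_files(example_labels, Path(tmp)):
        print(path.name)
```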

Changes to the vocabulary support code

  • The code needs to be able to load the subjects.<lang>.tsv files instead of the current monolingual subjects file (<lang> chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en)).
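
A minimal sketch of the file selection, assuming a hypothetical helper named subjects_file and a vocabulary directory that already contains the per-language TSV files. It only shows the lookup rule (override first, project language otherwise) and is not Annif's actual vocabulary code.

```python
import tempfile
from pathlib import Path
from typing import Optional


def subjects_file(
    vocab_dir: Path, project_language: str, label_language: Optional[str] = None
) -> Path:
    """Pick the subjects.<lang>.tsv file for a project.

    label_language is the optional override, e.g. "en" for vocab=lcsh(en);
    without it, the project language is used.
    """
    lang = label_language or project_language
    path = vocab_dir / f"subjects.{lang}.tsv"
    if not path.exists():
        raise FileNotFoundError(f"no subject file for language {lang!r} in {vocab_dir}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    vocab_dir = Path(tmp)
    (vocab_dir / "subjects.en.tsv").touch()
    # Finnish project using English labels via the override
    print(subjects_file(vocab_dir, "fi", "en").name)
```

Raising an error for a missing language file is one possible design; falling back to another language would be an alternative, and the proposal does not prescribe either.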

Other changes

This change would affect a lot of the configuration examples in the Wiki documentation, as well as a lot of materials in the Annif tutorial. Simply put, the language part should be stripped from vocabulary IDs. Fortunately, the old examples should still work even though vocabulary IDs that include a language may be slightly misleading.

Unless extra effort is made to support old subjects files, existing vocabularies would have to be reloaded; otherwise they would stop working.

osma commented Sep 23, 2022

Closed by #600 and follow-up PRs.
