Multilingual vocabularies #559

osma · 2022-01-28T13:58:55Z

Introduction

This is a proposed conceptual change to how Annif handles subject vocabularies. Currently, vocabularies are strictly monolingual. The same vocabulary can be used by multiple projects, as long as they share the same language. Typically, vocabularies have identifiers like yso-fi or stw-en, reflecting that they are specific to a single language, even though in reality both YSO and STW are multilingual.

Proposed change

Annif vocabularies should be multilingual; that is, projects with different languages can still share the same vocabulary.

Note that #556 would be a small step in this direction and it would probably make sense to implement that first.

Rationale

Making vocabularies multilingual would have the following benefits:

Enable scenarios where the language of the indexing vocabulary differs from the language of the documents. For example, Finnish-language documents indexed with LCSH (terms in English only), or German-language documents indexed with YSO (Finnish, Swedish and English terms). At present this is not possible, or at least requires some dirty tricks.
Align with SKOS: concepts are language-agnostic, terms can be given in multiple languages.
Simplify configuration: no need to have separate vocabulary ids for different languages.
Simplify loading vocabularies: no need to load e.g. YSO three times just because you have projects in Finnish, Swedish and English. Loading once is enough.

Changes in Annif usage and implementation

This is somewhat speculative and may change when actually trying to implement this...

Changes to configuration

The main change would be to start using language-agnostic vocabulary ids. So instead of

vocab=yso-fi

use

vocab=yso

Optionally, it should be possible to override the language of labels using a parameter (the default being the language of the project). For example, in a Finnish language project where the vocabulary is LCSH, English LCSH labels would be used like this:

language=fi
vocab=lcsh(en)
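
A minimal sketch of how such a vocab value could be parsed, assuming the id(lang) syntax proposed above. The names VocabSpec and parse_vocab_spec and the regular expression are hypothetical and only illustrate the idea; they are not taken from the Annif code base.

```python
import re
from typing import NamedTuple


class VocabSpec(NamedTuple):
    """Hypothetical container for a parsed vocab setting."""
    vocab_id: str
    language: str


# Assumed syntax: a vocabulary id, optionally followed by "(lang)".
_VOCAB_SPEC = re.compile(r"^(?P<id>[\w-]+)(?:\((?P<lang>[A-Za-z-]+)\))?$")


def parse_vocab_spec(spec: str, project_language: str) -> VocabSpec:
    """Split e.g. 'lcsh(en)' into id and label language.

    If no language is given in parentheses, fall back to the
    language of the project.
    """
    match = _VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid vocabulary specification: {spec!r}")
    return VocabSpec(match.group("id"), match.group("lang") or project_language)
```

With this sketch, parse_vocab_spec("yso", "fi") yields the id yso with Finnish labels, while parse_vocab_spec("lcsh(en)", "fi") yields the id lcsh with English labels even though the project language is Finnish.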

Changes to vocabulary data files

Currently, the vocabulary is stored as three files: subjects (TSV), subjects.ttl (SKOS), subjects.dump.gz (dumped rdflib Graph). The last two don't need to change as they are already multilingual. Instead of a single subjects file, there should be one file per language, e.g. subjects.fi.tsv, subjects.sv.tsv, subjects.en.tsv.

Changes to the loadvoc command

When given a SKOS file, the loadvoc code should figure out which languages are used in the labels and write as many subjects.<lang>.tsv files as there are languages.
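
The SKOS case could look roughly like the sketch below. To stay self-contained it does not use rdflib; it assumes the labels have already been extracted from the SKOS file as (URI, language, label) tuples. The function name write_subject_files is hypothetical, and the row layout (URI in angle brackets, a tab, then the label) is an assumption about the TSV format.

```python
from collections import defaultdict
from pathlib import Path


def write_subject_files(labels, datadir):
    """Write one subjects.<lang>.tsv file per language found in the labels.

    labels: iterable of (uri, lang, label) tuples, assumed to have been
    extracted from the SKOS file beforehand.
    Returns a dict mapping each language to the path that was written.
    """
    # Group the labels by language
    by_lang = defaultdict(list)
    for uri, lang, label in labels:
        by_lang[lang].append((uri, label))

    written = {}
    for lang, rows in sorted(by_lang.items()):
        path = Path(datadir) / f"subjects.{lang}.tsv"
        with path.open("w", encoding="utf-8") as out:
            for uri, label in rows:
                # Assumed row layout: <URI>, a tab, then the label
                out.write(f"<{uri}>\t{label}\n")
        written[lang] = path
    return written
```

The number of files written follows from the data: a vocabulary with Finnish, Swedish and English labels produces three files, while one with only English labels produces a single subjects.en.tsv.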

When given a TSV file, write just a single subjects.<lang>.tsv file and generate the SKOS and Graph files as before. <lang> would be chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en).

Changes to the vocabulary support code

The vocabulary support code needs to be able to load the subjects.<lang>.tsv files instead of the current monolingual subjects file. <lang> is chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en).
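
A rough sketch of the loading side, assuming the per-language file naming proposed above and a TSV row layout of URI in angle brackets, a tab, then the label. The function names subjects_path and load_subjects are hypothetical and do not refer to existing Annif code.

```python
from pathlib import Path


def subjects_path(datadir, language):
    """Return the path of the per-language subjects file (assumed naming)."""
    return Path(datadir) / f"subjects.{language}.tsv"


def load_subjects(datadir, language):
    """Load (uri, label) pairs from subjects.<language>.tsv.

    language is the project language, or the override given in the
    vocab setting, e.g. 'en' for vocab=lcsh(en).
    """
    path = subjects_path(datadir, language)
    if not path.exists():
        raise FileNotFoundError(
            f"no subjects file for language {language!r} in {datadir}"
        )
    subjects = []
    with path.open(encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            uri, label = line.split("\t", 1)
            # Remove the surrounding angle brackets from the URI
            subjects.append((uri.strip("<>"), label))
    return subjects
```

Raising an error when the file for the requested language is missing makes a misconfigured label language visible early, instead of silently falling back to another language.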

Other changes

This change would affect a lot of the configuration examples in the Wiki documentation, as well as a lot of materials in the Annif tutorial. Simply put, the language part should be stripped from vocabulary ids. Fortunately, the old examples should still work, even though vocabulary ids that include a language may be slightly misleading.

Unless extra effort is made to support old subjects files, the vocabularies would have to be reloaded; otherwise they would stop working.

osma · 2022-09-23T07:38:09Z

Closed by #600 and follow-up PRs.

osma added the enhancement label Jan 28, 2022

osma added this to the Long term milestone Jan 28, 2022

osma modified the milestones: Long term, Short term Aug 3, 2022

osma self-assigned this Aug 3, 2022

This was referenced Aug 4, 2022

Make vocabularies multilingual #600

Merged

loadvoc command should take a vocabulary id, not project id #602

Closed

osma modified the milestones: Short term, 0.59 Aug 5, 2022

This was referenced Aug 12, 2022

Remove language suffixes from vocabulary ids in example config #607

Merged

multilingual SubjectIndex backed by CSV file #608

Merged

osma closed this as completed Sep 23, 2022

osma mentioned this issue Sep 22, 2023

optimization: load a vocabulary only once even if used in different languages #736

Merged
