This is a proposed conceptual change to how Annif handles subject vocabularies. Currently, vocabularies are strictly monolingual. The same vocabulary can be used by multiple projects, as long as they share the same language. Typically, vocabularies have identifiers like yso-fi or stw-en, reflecting that they are specific to a single language, even though in reality both YSO and STW are multilingual.
Proposed change
Annif vocabularies should be multilingual; that is, projects with different languages can still share the same vocabulary.
Note that #556 would be a small step in this direction and it would probably make sense to implement that first.
Rationale
Making vocabularies multilingual would have the following benefits:
Enable scenarios where the language of the indexing vocabulary differs from the language of the documents. For example, Finnish-language documents indexed with LCSH (terms in English only), or German-language documents indexed with YSO (Finnish, Swedish and English terms). At present this is not possible, or at least requires some dirty tricks.
Align with SKOS: concepts are language-agnostic, terms can be given in multiple languages.
Simplify configuration: no need to have separate vocabulary IDs for different languages.
Simplify loading vocabularies: no need to load e.g. YSO three times just because you have projects in Finnish, Swedish and English. Loading once is enough.
Changes in Annif usage and implementation
This is somewhat speculative and may change when actually trying to implement this...
Changes to configuration
The main change would be to start using language-agnostic vocabulary IDs. So instead of
vocab=yso-fi
use
vocab=yso
Optionally, it should be possible to override the language of labels using a parameter (the default being the language of the project). For example, in a Finnish-language project where the vocabulary is LCSH, English LCSH labels would be selected like this:
language=fi
vocab=lcsh(en)
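The proposed `vocab=lcsh(en)` syntax could be parsed roughly as follows. This is only a sketch: the function name `parse_vocab_spec` and the regular expression are illustrative assumptions, not existing Annif code, and the exact syntax may well change during implementation.

```python
import re

# Hypothetical pattern: a vocabulary ID, optionally followed by "(lang)".
_VOCAB_SPEC = re.compile(r"^(?P<vocab_id>[\w-]+)(?:\((?P<lang>[A-Za-z-]+)\))?$")


def parse_vocab_spec(spec, project_language):
    """Split a vocab setting such as 'lcsh(en)' into (vocab_id, language)."""
    match = _VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid vocabulary specification: {spec!r}")
    # Fall back to the project language when no override is given.
    return match.group("vocab_id"), match.group("lang") or project_language
```

With this sketch, `parse_vocab_spec("yso", "fi")` yields the vocabulary ID `yso` with language `fi`, while `parse_vocab_spec("lcsh(en)", "fi")` yields `lcsh` with language `en`.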
Changes to vocabulary data files
Currently, the vocabulary is stored as three files: subjects (TSV), subjects.ttl (SKOS), subjects.dump.gz (dumped rdflib Graph). The last two don't need to change as they are already multilingual. Instead of a single subjects file, there should be one file per language, e.g. subjects.fi.tsv, subjects.sv.tsv, subjects.en.tsv.
Changes to the loadvoc command
When given a SKOS file, the loadvoc code should figure out which languages are used in the labels and write as many subjects.<lang>.tsv files as there are languages.
When given a TSV file, loadvoc should write just a single subjects.<lang>.tsv file and generate the SKOS and Graph files as before. <lang> would be chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en).
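For the SKOS case, the per-language split might look something like the sketch below. To keep it self-contained, it takes plain `(uri, label, lang)` tuples as a stand-in for the preferred labels that would really be read from the SKOS graph. The function name `write_subject_files` and the row layout (URI in angle brackets, a tab, then the label) are assumptions made for illustration.

```python
import os
from collections import defaultdict


def write_subject_files(labels, datadir):
    """Write one subjects.<lang>.tsv per language found in `labels`.

    `labels` is an iterable of (uri, label, lang) tuples, standing in for
    the preferred labels extracted from a SKOS file.
    Returns a dict mapping each language to the path that was written.
    """
    # Group the labels by their language tag.
    by_lang = defaultdict(list)
    for uri, label, lang in labels:
        by_lang[lang].append((uri, label))

    paths = {}
    for lang, rows in sorted(by_lang.items()):
        path = os.path.join(datadir, f"subjects.{lang}.tsv")
        with open(path, "w", encoding="utf-8") as tsvfile:
            for uri, label in sorted(rows):
                # Assumed row layout: <URI> TAB label
                tsvfile.write(f"<{uri}>\t{label}\n")
        paths[lang] = path
    return paths
```

A concept that lacks a label in some language simply does not appear in that language's file here. Whether that is the desired behaviour is one of the open questions for the implementation.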
Changes to the vocabulary support code
The vocabulary support code needs to be able to load the subjects.<lang>.tsv files instead of the current monolingual subjects file (<lang> chosen based on the project language, unless overridden by a parameter, e.g. vocab=lcsh(en)).
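One possible shape for the file lookup is sketched below. The helper name `subjects_path` is hypothetical, and the fallback to the old `subjects` file is an optional extra assumed here for illustration; since that file carries no language information, its labels may not match the requested language.

```python
import os


def subjects_path(datadir, language):
    """Return the subject TSV file to load for `language`."""
    per_language = os.path.join(datadir, f"subjects.{language}.tsv")
    if os.path.exists(per_language):
        return per_language
    # Assumed backwards-compatibility fallback: the old monolingual file,
    # whose language is unknown and may differ from `language`.
    legacy = os.path.join(datadir, "subjects")
    if os.path.exists(legacy):
        return legacy
    raise FileNotFoundError(
        f"no subjects file for language {language!r} in {datadir}"
    )
```

The `language` argument would be the project language or the override given in the vocab setting.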
Other changes
This change would affect many of the configuration examples in the Wiki documentation, as well as much of the material in the Annif tutorial. Simply put, the language part should be stripped from vocabulary IDs. Fortunately, the old examples should still work, even though vocabulary IDs that include a language may be slightly misleading.
Unless extra effort is made to support old subjects files, existing vocabularies would have to be reloaded; otherwise they would stop working.