Include labels without language tag and concepts without labels in vocabulary #597

Merged
osma merged 5 commits into master from issue556-skos-language-tags
Aug 5, 2022

Conversation

osma
Member

@osma osma commented Aug 2, 2022

Fixes #556 by modifying the way concepts from SKOS vocabularies are loaded. There are two main changes:

  1. If a concept doesn't have a prefLabel in the configured language, but has a prefLabel without a language tag, use that instead.
  2. If a concept doesn't have any suitable prefLabel (neither in the configured language nor without a language tag), generate a pseudo label from the qualified name (e.g. yso:p12345 or lcsh:sh85061212).
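
The two fallback rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: `choose_label`, `pseudo_label` and the `PREFIXES` map are hypothetical names, labels are modelled as `(text, language)` tuples with `None` standing for a missing language tag, and the namespace URIs are assumed for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical prefix map for illustration; real code would take the
# prefixes from the namespaces declared in the loaded SKOS file.
PREFIXES = {
    "yso": "http://www.yso.fi/onto/yso/",
    "lcsh": "http://id.loc.gov/authorities/subjects/",
}


def pseudo_label(uri: str) -> str:
    """Build a qualified name such as 'yso:p12345' from a concept URI."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    # No known prefix: fall back to the full URI.
    return uri


def choose_label(
    uri: str, pref_labels: List[Tuple[str, Optional[str]]], language: str
) -> str:
    """Pick a label: configured language, then untagged, then pseudo label."""
    untagged = None
    for text, lang in pref_labels:
        if lang == language:
            return text  # rule 0: exact match on the configured language
        if lang is None and untagged is None:
            untagged = text  # remember the first label without a language tag
    if untagged is not None:
        return untagged  # rule 1: prefLabel without a language tag
    return pseudo_label(uri)  # rule 2: generated pseudo label


uri = "http://www.yso.fi/onto/yso/p12345"
print(choose_label(uri, [("kissa", "fi"), ("cat", "en")], "en"))
print(choose_label(uri, [("kissa", "fi"), ("Cat", None)], "en"))
print(choose_label(uri, [("kissa", "fi")], "en"))
```

With these inputs the three calls exercise, in order, the exact language match, the untagged fallback, and the pseudo label fallback.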

This should improve support for multilingual vocabularies and handle cases where the SKOS data is missing language tags, which can happen, for example, when converting MARC21 records to SKOS, as @macsag did when reporting #556.

Note that unlike the solution drafted in this comment, there is no BCP 47 style matching of language tag variants (e.g. en in the SKOS file matching the configured language en-US). I considered this out of scope for now (YAGNI principle). It could easily be added later, but it would require a library such as langcodes for the actual language tag matching.
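
For illustration only, the kind of variant matching that was left out could look roughly like the sketch below. It is a deliberately simplified, standard-library-only approximation that truncates the configured tag one subtag at a time; it is not part of this PR, it does not implement the full BCP 47 matching rules, and the function names are hypothetical. A dedicated library such as langcodes would be the more robust choice.

```python
from typing import List


def lookup_candidates(configured: str) -> List[str]:
    """List the configured tag and its progressively shorter prefixes.

    For example, 'en-US' yields ['en-us', 'en']. Tags are lowercased
    because language tags are compared case-insensitively.
    """
    parts = configured.lower().split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def tag_matches(label_tag: str, configured: str) -> bool:
    """Return True if the label's tag equals the configured tag or a prefix of it."""
    return label_tag.lower() in lookup_candidates(configured)


print(tag_matches("en", "en-US"))     # label tagged 'en', configured 'en-US'
print(tag_matches("en-GB", "en-US"))  # different region subtags
```

Under this simplification a label tagged en would be accepted for the configured language en-US, while en-GB would not.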

This PR may change the results for some multilingual corpora, for example the YSO-based corpora used to train and evaluate models for Finto AI, because the vocabulary will now be larger in some cases. YSO usually lacks Swedish and/or English labels for some recently added concepts. These concepts used to be dropped when loading the vocabulary, but will now be included.

@osma
Member Author

osma commented Aug 2, 2022

Hmm, I think at least the MLLM backend, possibly also YAKE and (less likely) STWFSA will need to be changed so that they don't rely on the label stored in the vocabulary. Otherwise they could be confused by the qnames.

@codecov

codecov bot commented Aug 2, 2022

Codecov Report

Merging #597 (3d3fa1d) into master (a6359fa) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head 3d3fa1d differs from pull request most recent head 73176b4. Consider uploading reports for the commit 73176b4 to get more accurate results

@@            Coverage Diff             @@
##           master     #597      +/-   ##
==========================================
+ Coverage   99.52%   99.54%   +0.01%     
==========================================
  Files          86       86              
  Lines        5636     5653      +17     
==========================================
+ Hits         5609     5627      +18     
+ Misses         27       26       -1     
| Impacted Files | Coverage Δ |
|---|---|
| annif/corpus/skos.py | 100.00% <100.00%> (+1.85%) ⬆️ |
| annif/lexical/mllm.py | 100.00% <100.00%> (ø) |
| tests/test_vocab_skos.py | 100.00% <100.00%> (ø) |

@osma osma added this to the 0.59 milestone Aug 3, 2022
@osma
Member Author

osma commented Aug 3, 2022

> Hmm, I think at least the MLLM backend, possibly also YAKE and (less likely) STWFSA will need to be changed so that they don't rely on the label stored in the vocabulary. Otherwise they could be confused by the qnames.

Adjusted the MLLM code so that it reads prefLabels directly. Checked YAKE and STWFSA; they are already OK, as they do not use the labels from the vocabulary either.

@osma osma marked this pull request as ready for review August 3, 2022 07:04
@osma
Member Author

osma commented Aug 3, 2022

Ready for wider testing. Code Climate still has a couple of complaints but I can't figure out how to address them without making the code harder to understand.

@osma osma requested a review from juhoinkinen August 3, 2022 07:06
@osma osma force-pushed the issue556-skos-language-tags branch from 3d3fa1d to 73176b4 Compare August 4, 2022 06:16
@sonarqubecloud

sonarqubecloud bot commented Aug 4, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No coverage information
0.0% duplication

@osma
Member Author

osma commented Aug 4, 2022

Rebased on current master (after the 0.59 release) and force-pushed.

@osma osma merged commit b6a1363 into master Aug 5, 2022
@osma osma deleted the issue556-skos-language-tags branch August 5, 2022 08:27
Successfully merging this pull request may close these issues.

vocabulary in SKOS (Turtle serialization) should be loaded even in case of lacking language tags
2 participants