Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simplemma analyzer #591

Merged
merged 4 commits into from
May 30, 2022
Merged

Add simplemma analyzer #591

merged 4 commits into from
May 30, 2022

Conversation

osma
Copy link
Member

@osma osma commented May 27, 2022

Fixes #590

This PR adds a simplemma analyzer based on the Simplemma lemmatizer. Opening as a draft PR to get feedback from QA tools. The analyzer itself needs more testing, e.g. the quality of results and how it works with multiprocessing.

@osma osma self-assigned this May 27, 2022
@codecov
Copy link

codecov bot commented May 27, 2022

Codecov Report

Merging #591 (c8f94a7) into master (d07243a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #591   +/-   ##
=======================================
  Coverage   99.48%   99.48%           
=======================================
  Files          84       86    +2     
  Lines        5615     5645   +30     
=======================================
+ Hits         5586     5616   +30     
  Misses         29       29           
Impacted Files Coverage Δ
annif/analyzer/__init__.py 93.93% <100.00%> (+0.39%) ⬆️
annif/analyzer/simplemma.py 100.00% <100.00%> (ø)
tests/test_analyzer_simplemma.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d07243a...c8f94a7. Read the comment docs.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@osma
Copy link
Member Author

osma commented May 27, 2022

I did some test runs with MLLM and Parabel. There were no problems with parallellization. The initial version was a bit slow and memory-hungry so I implemented two optimizations, both copied from the Voikko analyzer:

  1. use functools.lru_cache to cache lemmatize lookups, which is a bit faster (negligible memory cost)
  2. avoid pickling the langdata object (happens with multiprocessing), instead load it from scratch using simplemma.load_data every time it is needed (e.g. in subprocesses) - this saves a lot of memory and is also faster

Table of test results:

backend train set lang analyzer train jobs train time train RSS eval set eval jobs eval time eval RSS F1@5 NDCG
mllm kirjastonhoitaja/fulltext-train fi voikko 4 357.78 1123776 kirjastonhoitaja/test 4 192.90 340248 0.3259 0.4368
mllm kirjastonhoitaja/fulltext-train fi simplemma-initial 4 362.77 3456424 kirjastonhoitaja/test 4 221.09 2691512 0.3269 0.4303
mllm kirjastonhoitaja/fulltext-train fi simplemma+lru_cache 4 357.15 3465288 kirjastonhoitaja/test 4 220.12 2691228 0.3217 0.4351
mllm kirjastonhoitaja/fulltext-train fi simplemma+lru_cache+load_data 4 330.41 1900016 kirjastonhoitaja/test 4 213.71 1291668 0.3246 0.4363
omikuji-parabel yso-finna (100k) fi voikko 4 243.78 748772 kirjastonhoitaja/test 4 21.84 557580 0.2033 0.2903
omikuji-parabel yso-finna (100k) fi simplemma-initial 4 305.54 2748572 kirjastonhoitaja/test 4 49.17 2738524 0.2043 0.2890
omikuji-parabel yso-finna (100k) fi simplemma+lru_cache 4 290.90 2775552 kirjastonhoitaja/test 4 41.80 2738080 0.2123 0.2826
omikuji-parabel yso-finna (100k) fi simplemma+lru_cache+load_data 4 233.96 1648580 kirjastonhoitaja/test 4 42.36 1531068 0.2071 0.2970

Observations from the table:

  • with the optimizations, simplemma is approximately as fast as voikko, but needs a bit more memory
  • quality of results seems to be similar

I think this is good enough for review and subsequent merging unless any new problems surface.

@osma osma marked this pull request as ready for review May 27, 2022 11:33
@osma osma requested a review from juhoinkinen May 27, 2022 11:33
@osma osma added this to the 0.58 milestone May 27, 2022
Copy link
Member

@juhoinkinen juhoinkinen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@osma osma merged commit 0593592 into master May 30, 2022
@osma osma deleted the issue590-simplemma-analyzer branch May 30, 2022 13:18
@osma
Copy link
Member Author

osma commented May 30, 2022

I added documentation about the simplemma analyzer in the wiki.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplemma analyzer
2 participants