Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update dependencies v1.0 #726

Merged
merged 11 commits into from
Aug 16, 2023
Merged

Update dependencies v1.0 #726

merged 11 commits into from
Aug 16, 2023

Conversation

juhoinkinen
Copy link
Member

@juhoinkinen juhoinkinen commented Aug 14, 2023

Updates most of the outdated dependencies, and pins Flask and rdflib versions more tightly.

Exceptions to updating to most-recent releases:

Also closes #697 and #532.

@juhoinkinen juhoinkinen added this to the 1.0 milestone Aug 14, 2023
@codecov
Copy link

codecov bot commented Aug 14, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (40cc2fd) 99.67% compared to head (e037b78) 99.67%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #726   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files          89       89           
  Lines        6397     6397           
=======================================
  Hits         6376     6376           
  Misses         21       21           
Files Changed Coverage Δ
annif/backend/ensemble.py 100.00% <ø> (ø)
annif/backend/svc.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@juhoinkinen
Copy link
Member Author

When updating to scikit-learn v1.3.0 old MLLM models cease to work:

...

  File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 142, in suggest
    self.initialize()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 119, in initialize
    self._model = self._load_model()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 102, in _load_model
    return MLLMModel.load(path)
  File "/home/local/jmminkin/git/Annif/annif/lexical/mllm.py", line 366, in load
    return joblib.load(filename)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 658, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 402, in load_build
    Unpickler.load_build(self)
  File "/usr/lib/python3.8/pickle.py", line 1705, in load_build
    setstate(state)
  File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

@osma
Copy link
Member

osma commented Aug 14, 2023

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

@juhoinkinen juhoinkinen linked an issue Aug 15, 2023 that may be closed by this pull request
@juhoinkinen
Copy link
Member Author

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

Also old stwfsa models won't work with updated skikit-learn, the error message is the same as for MLLM models.

@juhoinkinen juhoinkinen marked this pull request as ready for review August 15, 2023 10:02
@osma
Copy link
Member

osma commented Aug 16, 2023

I'm seeing this new TensorFlow/Keras warning for test_backend_nn_ensemble.py:

UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')

It seems there is a new Keras format (.keras) available, which is nowadays the recommended one for saving Keras models. We are using the legacy HDF5 (.h5) format:

MODEL_FILE = "nn-model.h5"

What should we do about this for Annif 1.0?

  1. Nothing, just ignore the warning for now
  2. Switch to the Keras format in a simplistic way (basically just switching the file extension to .keras on the above line)
  3. Switch to .keras but additionally add fallback code that can load existing .h5 models as well
  4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.

I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@juhoinkinen
Copy link
Member Author

What should we do about this for Annif 1.0?

1. Nothing, just ignore the warning for now

2. Switch to the Keras format in a simplistic way (basically just switching the file extension to `.keras` on the above line)

3. Switch to `.keras` but additionally add fallback code that can load existing `.h5` models as well

4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.

I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

Option 3 seems best to me. Better promise as little as possible.

@osma
Copy link
Member

osma commented Aug 16, 2023

Option 3 seems best to me. Better promise as little as possible.

I initially implemented option 3 in PR #730, but the fallback code (especially the unit test) got messy and then I realized that it's not much use supporting old pre-1.0 NN ensemble models when MLLM and STWFSA models will break anyway due to changes in scikit-learn.

@juhoinkinen juhoinkinen merged commit 1c30cd5 into main Aug 16, 2023
@juhoinkinen juhoinkinen deleted the update-dependencies-v1.0 branch August 16, 2023 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pin Flask and rdflib major/minor versions Update to Optuna 3.x to get rid of deprecation message
2 participants