Update spaCy for thinc 8.0.0 #4920

Merged: 254 commits, Jan 29, 2020

svlandeg
Member

@svlandeg svlandeg commented Jan 17, 2020

Update spaCy to use the revamped thinc 8.0.0, which currently lives on thinc's develop branch; see explosion/thinc#143

Description

  • allow training from a config file, cf. the example configurations in https://github.com/explosion/spaCy/tree/feature/config/examples/experiments/ptb-joint-pos-dep
  • Tok2Vec Pipe component and Tok2VecListener Model
  • update layer names according to the thinc refactor, e.g.
    • Pooling(sum_pool) → SumPool
    • with_flatten → with_array
    • with_square_sequences → with_list2padded
    • flatten_add_lengths → list2ragged
    • Linear → SparseLinear
    • Affine → Linear
    • ...
  • dimensions of models (e.g. nI and nO) are queried and set through model.has_dim(), model.get_dim() and model.set_dim(). Similar functions exist for _param (e.g. weights), _attr (e.g. window_size; the data has to be JSON-serializable) and _ref (e.g. a model's "tok2vec" component).
  • quick reimplementations of the previous functions in _ml can now be found in component_models
  • dropout is defined much more consistently across different layers, ultimately translating into a Dropout layer with a certain rate that can be set across the whole model with set_dropout_rate(model, drop)
  • use ml_datasets for loading example datasets
  • added "overfitting" tests for the parser, tagger, NER, etc., to check whether they can quickly converge on just a handful of training examples (taken from the docs)
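
To illustrate the dimension, parameter, attribute, reference and dropout points above, here is a minimal sketch. It assumes a released thinc 8.x install, whose API may differ in detail from the develop branch this PR targeted; in particular, attributes are read through the `attrs` dict here. The three-layer network and the variable names are made up for illustration and are not taken from spaCy's component_models.

```python
from thinc.api import Dropout, Linear, chain, set_dropout_rate

# A tiny network; the first layer stands in for a "tok2vec" sub-model
# (hypothetical, not spaCy's actual Tok2Vec architecture).
tok2vec = Linear(nO=8)        # input width nI is left unset on purpose
dropout = Dropout(0.1)
output = Linear(nO=2, nI=8)
model = chain(tok2vec, dropout, output)

# Dimensions: has_dim() gives True when a value is set and None when the
# dimension is declared but still unset.
print(tok2vec.has_dim("nO"), tok2vec.has_dim("nI"))
tok2vec.set_dim("nI", 4)
print(tok2vec.get_dim("nI"))

# Parameters: the weights are allocated once the layer is initialized.
tok2vec.initialize()
print(tok2vec.get_param("W").shape)

# Attributes: the Dropout layer stores its rate as a plain attribute.
print(dropout.attrs["dropout_rate"])

# References: give a node of the model tree a stable name.
model.set_ref("tok2vec", tok2vec)
print(model.get_ref("tok2vec") is tok2vec)

# Dropout: one call updates every layer in the tree that has a dropout rate.
set_dropout_rate(model, 0.3)
print(dropout.attrs["dropout_rate"])
```

Naming a sub-model through a reference means calling code can fetch the "tok2vec" part without knowing where it sits in the layer tree, and set_dropout_rate avoids threading a drop argument through every layer call.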

Types of change

enhancement / refactor

Tasks

Some are done, some ongoing, some TODO

  • Fix all tests (normal and slow)
  • Run train_from_config with defaults.cfg: runs fine
  • Run train_from_config with various other configs
  • Ensure EL algorithm runs & learns
  • Ensure tagger runs & learns
  • Ensure parser runs & learns
  • Run train script with new branch
  • Run pretrain script with new branch
  • Run other CLI scripts
  • Clean up example configs in code

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg marked this pull request as ready for review January 24, 2020 07:12
@honnibal honnibal merged commit 569cc98 into explosion:develop Jan 29, 2020
@svlandeg svlandeg deleted the feature/config branch March 3, 2020 12:33
Labels
enhancement Feature requests and improvements