
Default settings to configurations #4995

Merged (55 commits, Feb 27, 2020)

Conversation

@svlandeg (Member) commented Feb 10, 2020

[EDITED Feb 27 to reflect changes in response to Matt's comments]

Further refactor of the spaCy code to work with thinc 8 and the config files.

  • Made pretrained vectors work again (see Fixing StaticVectors, thinc#303). spacy.VocabVectors.v1 loads pretrained_vectors from a name or a package.

  • Moved the code from the previous component_models into separate files under ml/models.

  • Defined default configs/models (in the subfolder ml/models/defaults) mimicking the previous hard-coded behaviour in spaCy.

  • pipeline components require a Model at __init__ (no more True or False).

  • nlp.create_pipe can take a "model" argument in its cfg that should contain the model config; this is stored by the Language class to enable restoring from disk. If it's not present, a default one is used and a warning is shown.

  • The pipeline components in pipes.pyx are now Model-agnostic, with the exception of:

    • defining the output dimension by setting nO on the model's layer output_layer (obtained with get_ref()). Note that simply setting self.model.set_dim("nO", ...) does not work, because then some of the layers may not get initialized properly. This may need to be solved more properly in thinc at some point.
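The default-model fallback described above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not the actual spaCy API: the helper name `resolve_model_config` and the `DEFAULT_MODEL_CONFIGS` mapping are both made up for the example.

```python
import warnings

# Hypothetical mapping of component names to default model configs,
# standing in for the defaults under ml/models/defaults.
DEFAULT_MODEL_CONFIGS = {
    "tagger": {"@architectures": "spacy.Tagger.v1", "width": 96},
    "ner": {"@architectures": "spacy.NER.v1", "width": 96},
}

def resolve_model_config(name, cfg):
    """Return the model config from cfg, falling back to a default with a warning."""
    if "model" in cfg:
        return cfg["model"]
    warnings.warn(f"No model config provided for '{name}'; using default.")
    return DEFAULT_MODEL_CONFIGS[name]

# A user-supplied config wins over the default:
cfg = {"model": {"@architectures": "spacy.Tagger.v1", "width": 128}}
assert resolve_model_config("tagger", cfg)["width"] == 128
# An empty cfg falls back to the default and warns:
assert resolve_model_config("ner", {})["@architectures"] == "spacy.NER.v1"
```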

Some conventions I implemented (up for discussion, of course):

  • build_XXX methods: specify no defaults except optionally nO=None --> all parameters should come from the config
  • default_XXX methods: take no parameters and just build the model from the correct XXX_defaults.cfg, to avoid the mess of arbitrary places that set parameters
  • renamed all architectures to spacy.XXX.v1
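The build_XXX vs. default_XXX split can be sketched like this (a minimal stand-in: the function bodies, the `TAGGER_DEFAULTS` dict, and the returned dicts are hypothetical, standing in for real thinc Models and the tagger_defaults.cfg file):

```python
# Stand-in for the values that would live in tagger_defaults.cfg.
TAGGER_DEFAULTS = {"width": 96, "embed_size": 2000}

def build_tagger(width, embed_size, nO=None):
    # Convention: no hidden defaults here except nO=None; every other
    # parameter must be passed in explicitly from the config.
    return {"arch": "spacy.Tagger.v1", "width": width,
            "embed_size": embed_size, "nO": nO}

def default_tagger():
    # Convention: no parameters at all; build the model purely from the
    # defaults config, so parameter-setting happens in exactly one place.
    return build_tagger(**TAGGER_DEFAULTS)

model = default_tagger()
assert model["width"] == 96 and model["nO"] is None
```

The point of the split is that arbitrary call sites never pick their own parameter values: they either pass a full config to build_tagger or take the defaults wholesale via default_tagger.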

TODO:

  • the default textcat does not seem to perform as well as before. We probably need to bring back the build_text_classifier method.
  • I got confused about the morphologizer as I was porting code, e.g. could not find the create_class_map anymore, did this get lost in a refactor? @adrianeboyd
  • We noticed a small performance drop on the Tagger before this refactor (due to PR Update spaCy for thinc 8.0.0 #4920). I checked the numbers after this refactor and they remained unchanged (i.e. still slightly worse). However, this PR now removes the dropout of the HashEmbed layer which seems to result in a slight increase again. Probably good to test the performance across languages when we push this to develop.
  • nn_parser.pyx still has some hardcoded defaults, e.g. cfg.setdefault('min_action_freq', 30), which are outside the Model definition.
  • allow setting the char_embed params in the model configs (nM, nC)
  • further refactor tok2vec.py
  • I had to change test_serialize_pipeline.py, because the comparison bytes1 == bytes2 was taking a ridiculous amount of time. Not sure why this suddenly changed. Fixed!
  • more tests with the Parser's resizing & IO. A few unit tests failed because to_bytes() and from_bytes() on the Parser no longer work exactly as before - a resize() step is now required in between to prevent thinc from throwing an error about changing dimensions.

Notes:

  • I did not look in detail at the multi_task.py methods; not sure whether this was functional or experimental.
  • some of the unit tests got kind of "hacked" with the new format, I think they're OK but review would be good.
  • there may still be some unnecessary vocab/pretrained_vectors clutter, but I haven't looked in detail, as this will probably go once we kill the vocab anyway.

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements v3.0 Related to v3.0 labels Feb 10, 2020
@svlandeg svlandeg changed the title Feature/static vectors Default settings to configurations Feb 10, 2020
@honnibal (Member) left a comment

I've been thinking about this. I think I don't like:

pipeline components don't expect a Model anymore at init but a model configuration in cfg[model]

I think we should pass in a Model instance, and have that be a requirement --- so you can't pass in True or False or None anymore.

The general philosophy with the config is that when we have a tree of objects, the tree should be built bottom-up. So we want the config for the components to be something like this:

[pipeline.parser]
@components = "DependencyParser.v1"
# More settings here

[pipeline.parser.model]
@architectures = "spacy.ParserModel.v1"
# More settings here

If we pass in the config for the model, then the parser will be creating that object, so all of our config functions would be returning config dicts rather than actual objects.

Regarding the serialization, it's not the pipeline component's job to create itself in the correct configuration. By the time you call component.from_bytes(), you expect that the component has a Model object that has been created correctly. So the component doesn't need to know the config file that was used to create it --- you could create the same object in any number of different ways, by referencing different registered functions. It's the user's responsibility to make sure they use the same config file, with the same registered functions available, to create the pipeline and then load it back later. So the config settings are owned at the top-most level, by e.g. by a script like train-from-config, or a function like spacy.load.
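The bottom-up construction honnibal describes can be sketched with a toy registry. Everything here is hypothetical (the `register` helper, the two dicts, and the stub factories are invented for illustration); the point is only that the model block is resolved first and the component factory receives an object, never a config dict:

```python
# Toy registries, standing in for the real @architectures / @components registries.
architectures = {}
components = {}

def register(registry, name):
    def wrap(func):
        registry[name] = func
        return func
    return wrap

@register(architectures, "spacy.ParserModel.v1")
def parser_model(width=64):
    return {"kind": "parser-model", "width": width}

@register(components, "DependencyParser.v1")
def dependency_parser(model):
    # The factory receives a built model object, not a config dict.
    return {"kind": "parser", "model": model}

config = {
    "@components": "DependencyParser.v1",
    "model": {"@architectures": "spacy.ParserModel.v1", "width": 128},
}
# Bottom-up: resolve the innermost block (the model) first...
model = architectures[config["model"]["@architectures"]](width=config["model"]["width"])
# ...then pass the finished object into the component factory.
parser = components[config["@components"]](model=model)
assert parser["model"]["width"] == 128
```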

@svlandeg (Member, Author)

The general philosophy with the config is that when we have a tree of objects, the tree should be built bottom-up.

I agree; this refactor sort of broke the config up into pieces, which felt kind of against the original philosophy.

I think we should pass in a Model instance ...

Ok, can definitely refactor accordingly.

... and have that be a requirement --- so you can't pass in True or False or None anymore.

So does that imply that simply calling nlp.create_pipe("ner") wouldn't work anymore? The user needs to define the model first and provide that as an argument? Do we still want to provide a default model (+ param setting) for the main components (perhaps also as a factory)?

If we have a "default" model available, I don't think it's much additional overhead for the users, but it will still change the user experience slightly.

@honnibal (Member)

So does that imply that simply calling nlp.create_pipe("ner") wouldn't work anymore?

No that should still work: the factory function (currently stuff like Language.Defaults.create_parser) needs to create the Model instance and pass it into the Pipe.__init__. So there's still a place to do all the assembly for the defaults --- it just shouldn't happen within the Pipe.__init__ function.

@svlandeg svlandeg requested a review from honnibal February 27, 2020 08:30
@svlandeg (Member, Author) commented Feb 27, 2020

@honnibal @ines : according to our discussions, this is the current state of the model passing & IO:

Language.create_pipe parses the incoming config looking for a model argument. If it can't find one, it will add a default one according to name. It stores the used config in self.config which gets serialized to file with Config.to_disk() as a plain-text config.cfg file (alongside meta.json).

What is still ugly is the storage of default models, in Language here. I wanted to incorporate this into the component decorator but that didn't work because of circular imports. We should definitely look into this further, but it works for now.

util.py now knows how to load an entire nlp config, as well as the specific config field serialized by Language. If it encounters a config.cfg that contains the root element nlp, it will thus load the Language object from config. Else, it will try to find the root element for each pipeline component (e.g. parser) and load the corresponding config when calling create_pipe().
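The root-element dispatch described above can be sketched with the standard library's configparser, since config.cfg is a plain-text config file. This is a hypothetical simplification (the function name `classify_config` and the exact section contents are invented), not the real util.py logic:

```python
import configparser

def classify_config(text):
    """If the config has an [nlp] root, treat it as a full pipeline config;
    otherwise collect per-component sections (e.g. [parser])."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if parser.has_section("nlp"):
        return ("nlp", dict(parser["nlp"]))
    return ("components", {s: dict(parser[s]) for s in parser.sections()})

# A config with an nlp root loads as a whole Language object:
full_cfg = "[nlp]\nlang = en\npipeline = tagger,parser\n"
kind, data = classify_config(full_cfg)
assert kind == "nlp" and data["lang"] == "en"

# A component-level config is dispatched to create_pipe() per section:
component_cfg = "[parser]\nmin_action_freq = 30\n"
kind, data = classify_config(component_cfg)
assert kind == "components" and "parser" in data
```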

test_serialize_config.py shows some use-cases of how the IO works.

@svlandeg svlandeg changed the title WIP: Default settings to configurations Default settings to configurations Feb 27, 2020
from .tagger import *
from .tensorizer import *
from .textcat import *
from .tok2vec import *
@svlandeg (Member, Author) commented:

This sucks, but I put it in as a quick hack to make sure the @registry annotations were properly registered in those files. Needs a proper solution.
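The reason the wildcard imports are needed: a registry decorator only runs, and therefore only registers its function, when the defining module is actually imported. A minimal sketch of that import-time side effect (the `REGISTRY` dict and `registry` decorator are hypothetical stand-ins for spaCy's real registry):

```python
REGISTRY = {}

def registry(name):
    def wrap(func):
        REGISTRY[name] = func  # side effect: runs at import time
        return func
    return wrap

# In the real codebase this decorated function would live in e.g. tagger.py;
# until that module is imported, its name is simply absent from the registry,
# which is what the `from .tagger import *` hack works around.
@registry("spacy.Tagger.v1")
def build_tagger_stub():
    return "tagger-model"

assert "spacy.Tagger.v1" in REGISTRY
```

A less hacky fix would be to import the submodules explicitly (not via `*`) in a single place, purely for their registration side effects.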

@honnibal honnibal merged commit 06f0a8d into explosion:develop Feb 27, 2020
@svlandeg svlandeg deleted the feature/static-vectors branch February 27, 2020 18:24