
Default settings to configurations #4995

Merged (55 commits, Feb 27, 2020)

Conversation

@svlandeg (Member) commented Feb 10, 2020

[EDITED Feb 27 to reflect changes in response to Matt's comments]

Further refactor of the spaCy code to work with thinc 8 and the config files.

  • Made pretrained vectors work again (see Fixing StaticVectors, thinc#303). spacy.VocabVectors.v1 loads pretrained_vectors from a name or a package.

  • Moved the code from the previous component_models into separate files under ml/models.

  • Defined default configs/models (in the subfolder ml/models/defaults) mimicking the previous hard-coded behaviour in spaCy.

  • pipeline components require a Model at __init__ (no more True or False).

  • nlp.create_pipe can take a "model" argument in its cfg that should contain the model config; this is stored by the Language class to enable restoring from disk. If it's not present, a default one is used and a warning is shown.

  • The pipeline components in pipes.pyx are now Model-agnostic, with the exception of:

    • defining the output dimension by setting nO on the model's layer output_layer (obtained with get_ref()). Note that simply setting self.model.set_dim("nO", ...) does not work, because then some of the layers may not get initialized properly. This may need to be solved more properly in thinc at some point.
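The default-model fallback described above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not the actual spaCy API: the helper name `resolve_model_config` and the `DEFAULT_MODEL_CONFIGS` mapping are both made up for the example.

```python
import warnings

# Hypothetical mapping of component names to default model configs,
# standing in for the defaults under ml/models/defaults.
DEFAULT_MODEL_CONFIGS = {
    "tagger": {"@architectures": "spacy.Tagger.v1", "width": 96},
    "ner": {"@architectures": "spacy.NER.v1", "width": 96},
}

def resolve_model_config(name, cfg):
    """Return the model config from cfg, falling back to a default with a warning."""
    if "model" in cfg:
        return cfg["model"]
    warnings.warn(f"No model config provided for '{name}'; using default.")
    return DEFAULT_MODEL_CONFIGS[name]

# A user-supplied config wins over the default:
cfg = {"model": {"@architectures": "spacy.Tagger.v1", "width": 128}}
assert resolve_model_config("tagger", cfg)["width"] == 128
# An empty cfg falls back to the default and warns:
assert resolve_model_config("ner", {})["@architectures"] == "spacy.NER.v1"
```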

Some conventions I implemented (up for discussion, of course):

  • build_XXX methods: specify no defaults except optionally nO=None --> all parameters should come from the config
  • default_XXX methods: take no parameters and just build the model from the correct XXX_defaults.cfg, to avoid the mess of arbitrary places that set parameters
  • renamed all architectures to spacy.XXX.v1
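The build_XXX vs. default_XXX split can be sketched like this (a minimal stand-in: the function bodies, the `TAGGER_DEFAULTS` dict, and the returned dicts are hypothetical, standing in for real thinc Models and the tagger_defaults.cfg file):

```python
# Stand-in for the values that would live in tagger_defaults.cfg.
TAGGER_DEFAULTS = {"width": 96, "embed_size": 2000}

def build_tagger(width, embed_size, nO=None):
    # Convention: no hidden defaults here except nO=None; every other
    # parameter must be passed in explicitly from the config.
    return {"arch": "spacy.Tagger.v1", "width": width,
            "embed_size": embed_size, "nO": nO}

def default_tagger():
    # Convention: no parameters at all; build the model purely from the
    # defaults config, so parameter-setting happens in exactly one place.
    return build_tagger(**TAGGER_DEFAULTS)

model = default_tagger()
assert model["width"] == 96 and model["nO"] is None
```

The point of the split is that arbitrary call sites never pick their own parameter values: they either pass a full config to build_tagger or take the defaults wholesale via default_tagger.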

TODO:

  • the default textcat does not seem to perform as well as before. We probably need to bring back the build_text_classifier method.
  • I got confused about the morphologizer as I was porting code, e.g. could not find the create_class_map anymore, did this get lost in a refactor? @adrianeboyd
  • We noticed a small performance drop on the Tagger before this refactor (due to PR Update spaCy for thinc 8.0.0 #4920). I checked the numbers after this refactor and they remained unchanged (i.e. still slightly worse). However, this PR now removes the dropout of the HashEmbed layer which seems to result in a slight increase again. Probably good to test the performance across languages when we push this to develop.
  • nn_parser.pyx still has some hardcoded defaults, e.g. cfg.setdefault('min_action_freq', 30), which are outside the Model definition.
  • allow setting the char_embed params in the model configs (nM, nC)
  • further refactor tok2vec.py
  • I had to change test_serialize_pipeline.py, because the comparison bytes1 == bytes2 was taking a ridiculous amount of time. Not sure why this suddenly changed. Fixed!
  • more tests with the Parser's resizing & IO. A few unit tests failed because to_bytes() and from_bytes() on the Parser no longer work exactly as before - a resize() step is now required in between to prevent thinc from throwing an error about changing dimensions.

Notes:

  • I did not look in detail at the multi_task.py methods; not sure whether this was functional or experimental.
  • some of the unit tests got kind of "hacked" with the new format, I think they're OK but review would be good.
  • there may still be some unnecessary vocab/pretrained_vectors clutter, but I haven't looked in detail, as this will probably go once we kill the vocab anyway.

Types of change

enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added enhancement Feature requests and improvements v3.0 Related to v3.0 labels Feb 10, 2020
@svlandeg svlandeg changed the title Feature/static vectors Default settings to configurations Feb 10, 2020
@honnibal (Member) left a comment

I've been thinking about this. I think I don't like:

pipeline components don't expect a Model anymore at init but a model configuration in cfg[model]

I think we should pass in a Model instance, and have that be a requirement --- so you can't pass in True or False or None anymore.

The general philosophy with the config is that when we have a tree of objects, the tree should be built bottom-up. So we want the config for the components to be something like this:

[pipeline.parser]
@components = "DependencyParser.v1"
# More settings here

[pipeline.parser.model]
@architectures = "spacy.ParserModel.v1"
# More settings here

If we pass in the config for the model, then the parser will be creating that object, so all of our config functions would be returning config dicts rather than actual objects.

Regarding the serialization, it's not the pipeline component's job to create itself in the correct configuration. By the time you call component.from_bytes(), you expect that the component has a Model object that has been created correctly. So the component doesn't need to know the config file that was used to create it --- you could create the same object in any number of different ways, by referencing different registered functions. It's the user's responsibility to make sure they use the same config file, with the same registered functions available, to create the pipeline and then load it back later. So the config settings are owned at the top-most level, by e.g. by a script like train-from-config, or a function like spacy.load.
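The bottom-up construction honnibal describes can be sketched with a toy registry. Everything here is hypothetical (the `register` helper, the two dicts, and the stub factories are invented for illustration); the point is only that the model block is resolved first and the component factory receives an object, never a config dict:

```python
# Toy registries, standing in for the real @architectures / @components registries.
architectures = {}
components = {}

def register(registry, name):
    def wrap(func):
        registry[name] = func
        return func
    return wrap

@register(architectures, "spacy.ParserModel.v1")
def parser_model(width=64):
    return {"kind": "parser-model", "width": width}

@register(components, "DependencyParser.v1")
def dependency_parser(model):
    # The factory receives a built model object, not a config dict.
    return {"kind": "parser", "model": model}

config = {
    "@components": "DependencyParser.v1",
    "model": {"@architectures": "spacy.ParserModel.v1", "width": 128},
}
# Bottom-up: resolve the innermost block (the model) first...
model = architectures[config["model"]["@architectures"]](width=config["model"]["width"])
# ...then pass the finished object into the component factory.
parser = components[config["@components"]](model=model)
assert parser["model"]["width"] == 128
```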

@svlandeg (Member, Author)

The general philosophy with the config is that when we have a tree of objects, the tree should be built bottom-up.

I agree; this refactor sort of broke the config up into pieces, which felt kind of against the original philosophy.

I think we should pass in a Model instance ...

Ok, can definitely refactor accordingly.

... and have that be a requirement --- so you can't pass in True or False or None anymore.

So does that imply that simply calling nlp.create_pipe("ner") wouldn't work anymore? The user needs to define the model first and provide that as an argument? Do we still want to provide a default model (+ param setting) for the main components (perhaps also as a factory)?

If we have a "default" model available, I don't think it's much additional overhead for the users, but it will still change the user experience slightly.

@honnibal (Member)

So does that imply that simply calling nlp.create_pipe("ner") wouldn't work anymore?

No that should still work: the factory function (currently stuff like Language.Defaults.create_parser) needs to create the Model instance and pass it into the Pipe.__init__. So there's still a place to do all the assembly for the defaults --- it just shouldn't happen within the Pipe.__init__ function.

@svlandeg svlandeg requested a review from honnibal February 27, 2020 08:30
@svlandeg (Member, Author) commented Feb 27, 2020

@honnibal @ines : according to our discussions, this is the current state of the model passing & IO:

Language.create_pipe parses the incoming config looking for a model argument. If it can't find one, it will add a default one according to name. It stores the used config in self.config which gets serialized to file with Config.to_disk() as a plain-text config.cfg file (alongside meta.json).

What is still ugly is the storage of default models, in Language here. I wanted to incorporate this into the component decorator but that didn't work because of circular imports. We should definitely look into this further, but it works for now.

util.py now knows how to load an entire nlp config, as well as the specific config field serialized by Language. If it encounters a config.cfg that contains the root element nlp, it will thus load the Language object from config. Else, it will try to find the root element for each pipeline component (e.g. parser) and load the corresponding config when calling create_pipe().
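The root-element dispatch described above can be sketched with the standard library's configparser, since config.cfg is a plain-text config file. This is a hypothetical simplification (the function name `classify_config` and the exact section contents are invented), not the real util.py logic:

```python
import configparser

def classify_config(text):
    """If the config has an [nlp] root, treat it as a full pipeline config;
    otherwise collect per-component sections (e.g. [parser])."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if parser.has_section("nlp"):
        return ("nlp", dict(parser["nlp"]))
    return ("components", {s: dict(parser[s]) for s in parser.sections()})

# A config with an nlp root loads as a whole Language object:
full_cfg = "[nlp]\nlang = en\npipeline = tagger,parser\n"
kind, data = classify_config(full_cfg)
assert kind == "nlp" and data["lang"] == "en"

# A component-level config is dispatched to create_pipe() per section:
component_cfg = "[parser]\nmin_action_freq = 30\n"
kind, data = classify_config(component_cfg)
assert kind == "components" and "parser" in data
```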

test_serialize_config.py shows some use-cases of how the IO works.

@svlandeg svlandeg changed the title WIP: Default settings to configurations Default settings to configurations Feb 27, 2020
from .tagger import *
from .tensorizer import *
from .textcat import *
from .tok2vec import *
@svlandeg (Member, Author) commented:

This sucks, but I put it in as a quick hack to make sure the @registry annotations were properly registered in those files. Needs a proper solution.
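The reason the wildcard imports are needed: a registry decorator only runs, and therefore only registers its function, when the defining module is actually imported. A minimal sketch of that import-time side effect (the `REGISTRY` dict and `registry` decorator are hypothetical stand-ins for spaCy's real registry):

```python
REGISTRY = {}

def registry(name):
    def wrap(func):
        REGISTRY[name] = func  # side effect: runs at import time
        return func
    return wrap

# In the real codebase this decorated function would live in e.g. tagger.py;
# until that module is imported, its name is simply absent from the registry,
# which is what the `from .tagger import *` hack works around.
@registry("spacy.Tagger.v1")
def build_tagger_stub():
    return "tagger-model"

assert "spacy.Tagger.v1" in REGISTRY
```

A less hacky fix would be to import the submodules explicitly (not via `*`) in a single place, purely for their registration side effects.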

@honnibal honnibal merged commit 06f0a8d into explosion:develop Feb 27, 2020
@svlandeg svlandeg deleted the feature/static-vectors branch February 27, 2020 18:24