Default settings to configurations #4995
Conversation
I've been thinking about this. I think I don't like:

> pipeline components don't expect a Model anymore at init but a model configuration in cfg["model"]

I think we should pass in a `Model` instance, and have that be a requirement --- so you can't pass in `True`, `False` or `None` anymore.
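For illustration, a minimal sketch of what such a hard requirement could look like; the component class and signature here are hypothetical, not the actual spaCy API:

```python
from thinc.api import Model

class PipeComponent:  # hypothetical component, for illustration only
    def __init__(self, vocab, model, **cfg):
        # A fully constructed Model instance is mandatory: passing True,
        # False or None (the old behaviour) is rejected outright.
        if not isinstance(model, Model):
            raise ValueError(f"Expected a thinc Model instance, got {type(model)}")
        self.vocab = vocab
        self.model = model
        self.cfg = cfg
```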
The general philosophy with the config is that when we have a tree of objects, the tree should be built bottom-up. So we want the config for the components to be something like this:
```ini
[pipeline.parser]
@components = "DependencyParser.v1"
# More settings here

[pipeline.parser.model]
@architectures = "spacy.ParserModel.v1"
# More settings here
```
If we pass in the config for the model, then the parser will be creating that object, so all of our config functions would be returning config dicts rather than actual objects.
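A rough sketch of that bottom-up order, with plain dicts standing in for the real registry machinery (the registered names mirror the config above; everything else is illustrative):

```python
from thinc.api import Linear, Model

# Stand-ins for the real @architectures / @components registries.
architectures = {}
components = {}

def parser_model(hidden_width: int = 64) -> Model:
    return Linear(nO=hidden_width)  # placeholder for the real parser model

def make_parser(model: Model, min_action_freq: int = 30):
    return ("parser", model, min_action_freq)  # placeholder component

architectures["spacy.ParserModel.v1"] = parser_model
components["DependencyParser.v1"] = make_parser

# Bottom-up: resolve the innermost block ([pipeline.parser.model]) into a
# Model first, then hand the finished object to the component factory.
model = architectures["spacy.ParserModel.v1"](hidden_width=64)
parser = components["DependencyParser.v1"](model=model)
```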
Regarding the serialization, it's not the pipeline component's job to create itself in the correct configuration. By the time you call `component.from_bytes()`, you expect that the component has a `Model` object that has been created correctly. So the component doesn't need to know the config file that was used to create it --- you could create the same object in any number of different ways, by referencing different registered functions. It's the user's responsibility to make sure they use the same config file, with the same registered functions available, to create the pipeline and then load it back later. So the config settings are owned at the top-most level, e.g. by a script like train-from-config, or a function like `spacy.load`.
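So deserialization only has to restore state into an already-built model, roughly like this (a simplified sketch, assuming thinc's `Model.from_bytes`):

```python
def from_bytes(self, bytes_data):
    # The component does not rebuild its Model from a config here; it assumes
    # self.model was already constructed (from whatever config the user chose)
    # and only restores the serialized weights into that existing object.
    self.model.from_bytes(bytes_data)
    return self
```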
I agree, this refactor sort of broke the config up into pieces, which felt kind of against the original philosophy.

Ok, can definitely refactor accordingly.
So does that imply that simply calling […] won't work anymore? If we have a "default" model available, I don't think it's much additional overhead for the users, but it will still change the user experience slightly.
No, that should still work: the factory function (currently stuff like […])
@honnibal @ines: according to our discussions, this is the current state of the model passing & IO: […]

What is still ugly is the storage of default models, in […]
```python
from .tagger import *
from .tensorizer import *
from .textcat import *
from .tok2vec import *
```
This sucks, but I put it in as a quick hack to make sure the `@registry` annotations were properly registered in those files. Needs a proper solution.
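The underlying issue is that the `@registry` decorators only run as a side effect of importing each module, so some import has to happen somewhere; an explicit (equivalent) variant of the same hack could look like this:

```python
# ml/models/__init__.py: import each submodule so the @registry decorators
# in them execute and the architectures become resolvable from configs.
# Without these imports, a config referencing one of the tagger or textcat
# architectures would fail to resolve at load time.
from . import tagger, tensorizer, textcat, tok2vec  # noqa: F401
```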
[EDITED 27 feb to reflect changes in response to Matt's comments]
Further refactor of the spaCy code to work with thinc 8 and the config files.
- Making pretrained vectors work again, with "Fixing StaticVectors" (thinc#303): `spacy.VocabVectors.v1` loads `pretrained_vectors` from a name or a package.
- Moved the stuff from the previous `component_models` into a separate file each under `ml/models`.
- Defined default configs/models (in the subfolder `ml/models/defaults`) mimicking the previous hard-coded behaviour in spaCy.
- Pipeline components require a `Model` at `__init__` (no more `True` or `False`).
- `nlp.create_pipe` can take a "model" argument in its `cfg` that should contain the model config and will be stored by the `Language` class to enable restoring from disk. If it's not present, a default one will be used + a warning shown.
- The pipeline components in `pipes.pyx` are now Model-agnostic, with the exception of setting `nO` on the model's layer `output_layer` (obtained with `get_ref()`). Note that simply setting `self.model.set_dim("nO", ...)` does not work, because then some of the layers may not get initialized properly; this may need to be solved more properly in thinc at some point (see the sketch below).
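A short sketch of the exception mentioned above, using thinc 8's `get_ref`/`set_dim`; the ref name "output_layer" is taken from the description, the helper itself is illustrative:

```python
from thinc.api import Model

def set_output_dim(model: Model, nO: int) -> None:
    # Set nO on the referenced output layer rather than via
    # model.set_dim("nO", ...) on the wrapping model, which can leave
    # some inner layers improperly initialized (as noted above).
    output_layer = model.get_ref("output_layer")
    output_layer.set_dim("nO", nO)
```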
Some conventions I implemented (up for discussion, of course):

- `build_XXX` methods: specify no defaults except optionally `nO=None` --> all parameters should come from the config.
- `default_XXX` methods: should take no parameters, just build the model from the correct `XXX_defaults.cfg`, to avoid the mess of having arbitrary places that set parameters.
- Registered architecture names follow the pattern `spacy.XXX.v1`.
TODO:

- `textcat` does not seem to perform as well as before. We probably need to bring back the `build_text_classifier` method.
- `morphologizer`: as I was porting code I could not find the `create_class_map` anymore, did this get lost in a refactor? @adrianeboyd
- `dropout` of the `HashEmbed` layer, which seems to result in a slight increase again. Probably good to test the performance across languages when we push this to `develop`.
- `nn_parser.pyx` still has some hardcoded defaults, e.g. `cfg.setdefault('min_action_freq', 30)`, which are outside the Model definition.
- `char_embed` params in the model configs (`nM`, `nC`): `tok2vec.py` I had to change. Fixed!
- `test_serialize_pipeline.py`: the comparison `bytes1 == bytes2` was taking a ridiculous amount of time. Not sure why all of a sudden?
- `Parser`'s resizing & IO: a few unit tests failed because `to_bytes()` and `from_bytes()` on the `Parser` did not work exactly as before anymore; a `resize()` step is now required in between to prevent thinc from throwing an error about changing dimensions.

Notes:

- `multi_task.py` methods, not sure whether this was functional or experimental.

Types of change
enhancement