
Serialization bug in ContextSpellChecker when training in cluster mode #1018

Open
albertoandreottiATgmail opened this issue Aug 22, 2020 · 1 comment

albertoandreottiATgmail (Contributor) commented Aug 22, 2020

Description

When training the ContextSpellChecker in cluster mode, we get the error below.

vocabTest = ['name1', 'name2', 'name3']
# context dependent spell checker
spell_checker1 = sparknlp.annotator.ContextSpellCheckerApproach()\
    .setInputCols(["token"])\
    .setOutputCol("spell")\
    .setLanguageModelClasses(1400)\
    .addVocabClass(label="_NAME_", vocab=vocabTest)
# pipeline (document_assembler, sentence_detector, tokenizer, and finisher
# are assumed to be defined earlier)
pipeline1 = Pipeline().setStages([document_assembler,
                                  sentence_detector,
                                  tokenizer,
                                  spell_checker1,
                                  finisher])
model = pipeline1.fit(df)
An error occurred while calling o1430.fit.
: java.io.NotSerializableException: com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach$$anon$1
Serialization stack:

Expected Behavior

It should train in the same manner it does when running locally.

Current Behavior

It fails while serializing the ContextSpellCheckerApproach stage.

@albertoandreottiATgmail albertoandreottiATgmail changed the title Potential serialization bug in ContextSpellChecker Serialization bug in ContextSpellChecker when training in cluster mode Aug 24, 2020
albertoandreottiATgmail (Contributor, Author) commented:

Adding more information:

Py4JJavaError: An error occurred while calling o402.fit.
: java.io.NotSerializableException: com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach$$anon$1
Serialization stack:
- object not serializable (class: com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach$$anon$1, value: com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach$$anon$1@59accddd)
- element of array (index: 2)
- array (class [Lcom.johnsnowlabs.nlp.annotators.spell.context.parser.SpecialClassParser;, size 3)
- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(com.johnsnowlabs.nlp.annotators.spell.context.parser.DateToken$@3c712a44, com.johnsnowlabs.nlp.annotators.spell.context.parser.NumberToken$@28800c15, com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach$$anon$1@59accddd))
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scal
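For context, this is a hypothesis rather than something confirmed in the thread: `ContextSpellCheckerApproach$$anon$1` in the trace is a compiler-generated anonymous Scala class, presumably the `SpecialClassParser` instance created for the `_NAME_` vocab class, and it does not implement `java.io.Serializable`. Training locally never ships the stage between JVMs, but in cluster mode Spark serializes it to send to executors, which is where it fails. A minimal, plain-JDK Java sketch of the same failure mode (no Spark or Spark NLP code involved; the interface and names are illustrative only):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class AnonSerialization {
    // An interface that does NOT extend Serializable, standing in for
    // the parser type an anonymous implementation is created for.
    interface VocabParser {
        String label();
    }

    public static void main(String[] args) throws Exception {
        // Anonymous class instance, analogous to Scala's $$anon$1.
        VocabParser nameParser = new VocabParser() {
            public String label() { return "_NAME_"; }
        };

        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            // This is effectively what Spark does when shipping a stage
            // to executors; it throws because the anonymous class is not
            // Serializable.
            out.writeObject(nameParser);
            System.out.println("serialized OK");
        } catch (NotSerializableException e) {
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

If this hypothesis is right, the fix on the library side would be making the anonymous parser class (or a named replacement for it) extend `Serializable`.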
