Tokenizer attribute .tokens_from_list deprecated #152
Comments
Instead of using old_tokenizer.tokens_from_list, you can substitute any custom tokenizer that does the correct input -> Doc conversion with the correct vocab for nlp.tokenizer, starting from: from spacy.tokens import Doc (see the sketch below).
A list is used to store the input string/text.
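A minimal sketch of that idea (my own example, not the book's exact In[39] code): instead of calling the deprecated old_tokenizer.tokens_from_list(word_list), build the Doc directly from the word list with the words keyword, and assign any string -> Doc callable to nlp.tokenizer.

```python
import re
import spacy
from spacy.tokens import Doc

# hypothetical stand-in for the regexp-based tokenization used by CountVectorizer
regexp = re.compile(r'(?u)\b\w\w+\b')

nlp = spacy.load('en_core_web_sm')

# any callable that maps a string to a Doc can be assigned to nlp.tokenizer;
# Doc(vocab, words=...) replaces the old tokens_from_list(word_list) call
nlp.tokenizer = lambda text: Doc(nlp.vocab, words=regexp.findall(text))

doc = nlp("Our meeting today was worse than yesterday")
print([token.lemma_ for token in doc])  # lemmas from the remaining pipeline components
```

The rest of the pipeline (tagger, lemmatizer, etc.) still runs on the custom-built Doc, so token.lemma_ behaves as before.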
Probably similar to @Tanvi09Garg, here is what works for me:

```python
import re
import spacy
from spacy.tokens import Doc

# regexp used in CountVectorizer
# (?u) sets the unicode flag, i.e. patterns are unicode
# \b word boundary: the end of a word is indicated by whitespace or a non-alphanumeric character
# \w alphanumeric: [0-9a-zA-Z_]

class RegexTokenizer:
    """spaCy custom tokenizer.
    Reference: https://spacy.io/usage/linguistic-features#custom-tokenizer
    """
    def __init__(self, vocab, regex_pattern=r'(?u)\b\w\w+\b'):
        self.vocab = vocab
        self.regexp = re.compile(regex_pattern)

    def __call__(self, text):
        words = self.regexp.findall(text)
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False  # no space after the last word
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.tokenizer = RegexTokenizer(nlp.vocab)

def custom_tokenizer(document):
    doc_spacy = nlp(document)
    return [token.lemma_ for token in doc_spacy]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(tokenizer=custom_tokenizer)
```

It runs a bit slowly; any suggestions to speed this up?
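For what it's worth, a small usage check of the snippet above (the example text and the expected lemmas are my own, not from the book):

```python
# hypothetical usage check for the CountVectorizer built above
texts = ["The meetings today were worse than yesterday's meeting"]
vect.fit(texts)

# the vocabulary should now contain lemmas such as 'meeting' and 'bad'
print(vect.get_feature_names_out())  # use get_feature_names() on older scikit-learn
```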
The tokenizer attribute .tokens_from_list has been deprecated in spaCy. It is used in Chapter 7, Section 7.8 "Advanced Tokenization, Stemming and Lemmatization", in block In[39].
I'm using spaCy version 3.0.6, which I am guessing is several versions newer than the one used in the book; I just can't find the version listed in my copy.
Any suggestions on getting around this function? I'm a bit of a newbie, and my searches online have only led to rabbit holes so far.