[tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #5308      +/-   ##
==========================================
- Coverage   79.30%   77.20%     -2.11%
==========================================
  Files         138      138
  Lines       24283    24285       +2
==========================================
- Hits        19258    18749     -509
- Misses       5025     5536     +511
```
Continue to review full report at Codecov.
@sshleifer @patrickvonplaten and @yjernite I updated your examples (seq2seq and eli5). You may want to check them.

lgtm!
```diff
@@ -41,12 +41,12 @@ def encode_file(
     assert lns, f"found empty file at {data_path}"
     examples = []
     for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-        tokenized = tokenizer.batch_encode_plus(
+        tokenized = tokenizer(
```
LGTM, thanks!
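For reference, a minimal self-contained sketch of the equivalence this change relies on. The checkpoint, input lines, and `max_length` are placeholders, not values taken from the PR:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any pretrained tokenizer illustrates the same API change.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

lns = ["first line of the file", "a second, somewhat longer line"]

# Old API (deprecated in transformers v3):
old = tokenizer.batch_encode_plus(lns, max_length=32, pad_to_max_length=True)

# New API: call the tokenizer object directly.
new = tokenizer(lns, max_length=32, padding="max_length", truncation=True)

# Under transformers v3 the two calls should produce the same encodings.
assert old["input_ids"] == new["input_ids"]
```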
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (huggingface#5308)

* remove references to old API in docstring - update data processors
* style
* fix tests - better type checking error messages
* better type checking
* include awesome fix by @LysandreJik for huggingface#5310
* updated doc and examples
Updates the data processors to the new recommended tokenizers API, replacing the old one.
Also updates the docstrings, examples, and model cards that were using the old API.
Supersedes #5310
@sshleifer you have a couple of methods only your models use (bart and marian). I'm not sure about the consequences of updating those APIs, so I'll let you update them. Here is the doc on the new tokenizer API if you need it: https://huggingface.co/transformers/master/preprocessing.html
Recommended updates (see the sketch below):

- `__call__` instead of `encode_plus` and `batch_encode_plus`
- `padding` and `truncation` instead of `max_length` only and `pad_to_max_length`
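As a rough sketch of the recommended style (the model name and `max_length` here are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True pads to the longest sequence in the batch;
# padding="max_length" would pad every sequence to max_length instead.
batch = tokenizer(
    ["Hello world!", "A somewhat longer second sentence."],
    padding=True,
    truncation=True,   # truncate to max_length (or the model max if unset)
    max_length=64,
)

# __call__ also handles single sentences and sentence pairs:
pair = tokenizer("premise goes here", "hypothesis goes here", truncation=True)
```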