[tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #5308      +/-   ##
==========================================
- Coverage   79.30%   77.20%     -2.11%
==========================================
  Files         138      138
  Lines       24283    24285       +2
==========================================
- Hits        19258    18749     -509
- Misses       5025     5536     +511
```
Continue to review full report at Codecov.
@sshleifer @patrickvonplaten and @yjernite I updated your examples (seq2seq and eli5). You may want to check them.

lgtm!
```diff
@@ -41,12 +41,12 @@ def encode_file(
     assert lns, f"found empty file at {data_path}"
     examples = []
     for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-        tokenized = tokenizer.batch_encode_plus(
+        tokenized = tokenizer(
```
LGTM, thanks!
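For reference, a minimal self-contained sketch of the equivalence this change relies on. The checkpoint, input lines, and `max_length` are placeholders, not values taken from the PR:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; any pretrained tokenizer illustrates the same API change.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

lns = ["first line of the file", "a second, somewhat longer line"]

# Old API (deprecated in transformers v3):
old = tokenizer.batch_encode_plus(lns, max_length=32, pad_to_max_length=True)

# New API: call the tokenizer object directly.
new = tokenizer(lns, max_length=32, padding="max_length", truncation=True)

# Under transformers v3 the two calls should produce the same encodings.
assert old["input_ids"] == new["input_ids"]
```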
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (huggingface#5308)

* remove references to old API in docstring - update data processors
* style
* fix tests - better type checking error messages
* better type checking
* include awesome fix by @LysandreJik for huggingface#5310
* updated doc and examples
Updates the data processors to the new recommended tokenizers API, replacing the old one.
Also updates the docstrings, examples, and model cards that were using the old API.
Supersedes #5310
@sshleifer you have a couple of methods only your models use (bart and marian). I'm not sure about the consequences of updating those APIs, so I'll let you update them. Here is the doc on the new tokenizer API if you need it: https://huggingface.co/transformers/master/preprocessing.html
Recommended updates (see the sketch below):

- `__call__` instead of `encode_plus` and `batch_encode_plus`
- `padding` and `truncation` instead of `max_length` only and `pad_to_max_length`
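As a rough sketch of the recommended style (the model name and `max_length` here are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True pads to the longest sequence in the batch;
# padding="max_length" would pad every sequence to max_length instead.
batch = tokenizer(
    ["Hello world!", "A somewhat longer second sentence."],
    padding=True,
    truncation=True,   # truncate to max_length (or the model max if unset)
    max_length=64,
)

# __call__ also handles single sentences and sentence pairs:
pair = tokenizer("premise goes here", "hypothesis goes here", truncation=True)
```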