
[tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308

Merged
merged 6 commits into master on Jun 26, 2020

Conversation

thomwolf
Member

@thomwolf thomwolf commented Jun 26, 2020

Updates the data processors to the new recommended tokenizers API instead of the old one.
Also updates the docstrings, examples, and model cards that were using the old API.

Supersedes #5310

@sshleifer you have a couple of methods that only your models (BART and Marian) use. I'm not sure about the consequences of updating those APIs, so I'll let you update them. Here is the doc on the new tokenizer API if you need it: https://huggingface.co/transformers/master/preprocessing.html
Recommended updates:

  • use __call__ instead of encode_plus and batch_encode_plus
  • use padding and truncation instead of max_length only and pad_to_max_length
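The two recommendations above can be sketched with a toy whitespace tokenizer. `ToyTokenizer` is invented for illustration only; it mimics the shape of the unified `__call__` API (single texts or batches, `padding`/`truncation` keywords), not the real transformers implementation:

```python
class ToyTokenizer:
    """Hypothetical stand-in showing the unified __call__ API surface."""

    def __init__(self):
        self.vocab = {"<pad>": 0}

    def _ids(self, text):
        # Assign ids on first sight; real tokenizers use a fixed vocab.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]

    def __call__(self, texts, padding=False, truncation=False, max_length=None):
        # One entry point accepts a single string or a batch, replacing the
        # old encode_plus / batch_encode_plus split.
        batch = [texts] if isinstance(texts, str) else texts
        ids = [self._ids(t) for t in batch]
        if truncation and max_length is not None:
            ids = [seq[:max_length] for seq in ids]
        if padding:
            # padding="max_length"-style behavior replaces pad_to_max_length=True
            target = max_length or max(len(seq) for seq in ids)
            ids = [seq + [0] * (target - len(seq)) for seq in ids]
        mask = [[1 if tok != 0 else 0 for tok in seq] for seq in ids]
        return {"input_ids": ids, "attention_mask": mask}


tok = ToyTokenizer()
batch = tok(["hello world", "hi"], padding=True, truncation=True, max_length=4)
# input_ids are padded to length 4; attention_mask marks real tokens vs padding
```

With the old API this would have required picking between two entry points (`encode_plus` for a single text, `batch_encode_plus` for a list) plus `pad_to_max_length=True`; the new call folds both into one signature.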

@julien-c julien-c added the model card Related to pretrained model cards label Jun 26, 2020
@codecov

codecov bot commented Jun 26, 2020

Codecov Report

Merging #5308 into master will decrease coverage by 2.10%.
The diff coverage is 58.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5308      +/-   ##
==========================================
- Coverage   79.30%   77.20%   -2.11%     
==========================================
  Files         138      138              
  Lines       24283    24285       +2     
==========================================
- Hits        19258    18749     -509     
- Misses       5025     5536     +511     
Impacted Files Coverage Δ
src/transformers/data/processors/squad.py 28.66% <0.00%> (ø)
src/transformers/file_utils.py 79.59% <ø> (ø)
src/transformers/modeling_albert.py 80.73% <ø> (ø)
src/transformers/modeling_bert.py 87.65% <ø> (ø)
src/transformers/modeling_ctrl.py 96.62% <ø> (-2.54%) ⬇️
src/transformers/modeling_distilbert.py 96.04% <ø> (-1.70%) ⬇️
src/transformers/modeling_electra.py 80.62% <ø> (ø)
src/transformers/modeling_flaubert.py 84.37% <ø> (ø)
src/transformers/modeling_gpt2.py 85.94% <ø> (ø)
src/transformers/modeling_longformer.py 93.11% <ø> (ø)
... and 47 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 135791e...5b1fbb2.

@thomwolf thomwolf requested a review from yjernite June 26, 2020 15:18
@thomwolf
Member Author

@sshleifer @patrickvonplaten and @yjernite I updated your examples (seq2seq and eli5). Maybe you want to check it.

@yjernite
Member

> @sshleifer @patrickvonplaten and @yjernite I updated your examples (seq2seq and eli5). Maybe you want to check it.

lgtm!

@@ -41,12 +41,12 @@ def encode_file(
     assert lns, f"found empty file at {data_path}"
     examples = []
     for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-        tokenized = tokenizer.batch_encode_plus(
+        tokenized = tokenizer(
Contributor

LGTM, thanks!
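For context, the migrated `encode_file` helper might look roughly like this after the change. Only the replaced call is taken from the diff; the surrounding body (and dropping `tqdm` to keep the sketch stdlib-only) are assumptions, not the PR's exact code:

```python
from pathlib import Path


def encode_file(tokenizer, data_path, max_length):
    # Sketch of the migrated helper: the old tokenizer.batch_encode_plus(...)
    # call becomes a direct call with the new padding/truncation keywords.
    lns = Path(data_path).read_text().splitlines()
    assert lns, f"found empty file at {data_path}"
    examples = []
    for text in lns:
        tokenized = tokenizer(
            [text],
            max_length=max_length,
            padding="max_length",  # replaces pad_to_max_length=True
            truncation=True,       # explicit, instead of implied by max_length
        )
        examples.append(tokenized)
    return examples
```

Any tokenizer exposing the new `__call__` signature can be dropped in, which is what makes the migration mechanical.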

@thomwolf thomwolf merged commit 601d4d6 into master Jun 26, 2020
@thomwolf thomwolf deleted the update-data-processors branch June 26, 2020 17:48
jplu pushed a commit to jplu/transformers that referenced this pull request Jun 29, 2020
[tokenizers] Updates data processors, docstring, examples and model cards to the new API (huggingface#5308)

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for huggingface#5310

* updated doc and examples
4 participants