Commit

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
thomwolf authored Jun 26, 2020
1 parent fd405e9 commit 601d4d6
Showing 73 changed files with 180 additions and 138 deletions.
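Every diff below applies the same migration; as a rough before/after sketch of the pattern (an illustrative summary written for this page, not taken from the diff; the checkpoint name and sentences are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Old API (deprecated by this release): one method per input shape,
# plus a boolean pad_to_max_length flag
old_single = tokenizer.encode_plus("A sentence.", max_length=32, pad_to_max_length=True)
old_batch = tokenizer.batch_encode_plus(["One.", "Two."], max_length=32, pad_to_max_length=True)

# New API: call the tokenizer directly; __call__ dispatches on single texts,
# pairs, or batches, with explicit padding/truncation strategies
new_single = tokenizer("A sentence.", max_length=32, padding="max_length", truncation=True)
new_pair = tokenizer("First.", "Second.", max_length=32, padding="max_length", truncation=True)
new_batch = tokenizer(["One.", "Two."], max_length=32, padding="max_length", truncation=True)
```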
4 changes: 2 additions & 2 deletions README.md
@@ -287,8 +287,8 @@ pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
-inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
-inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
+inputs_1 = tokenizer(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
+inputs_2 = tokenizer(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()
2 changes: 1 addition & 1 deletion docs/README.md
@@ -167,7 +167,7 @@ Here's an example showcasing everything so far:
Indices can be obtained using :class:`transformers.AlbertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
```
2 changes: 1 addition & 1 deletion docs/source/main_classes/tokenizer.rst
@@ -11,7 +11,7 @@ The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` impleme
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
- managing special tokens such as mask and beginning-of-sentence tokens (adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization)

-``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token containing a given character, or the span of characters corresponding to a given token).
+``BatchEncoding`` holds the output of the tokenizer's encoding methods (``__call__``, ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token containing a given character, or the span of characters corresponding to a given token).
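To make those alignment methods concrete, a minimal sketch (the checkpoint and sample string are illustrative, not from the diff):

```python
from transformers import AutoTokenizer

# A "Fast" tokenizer backed by the HuggingFace tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

encoding = tokenizer("Hello world!")   # returns a BatchEncoding
print(encoding["input_ids"])           # dict-style access works as usual

# Alignment helpers are only available on "Fast" tokenizers:
print(encoding.char_to_token(6))       # index of the token covering character 6 ('w')
print(encoding.token_to_chars(2))      # character span of the token at index 2
```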

``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
14 changes: 7 additions & 7 deletions docs/source/task_summary.rst
@@ -74,7 +74,7 @@ of each other. The process is the following:
with the weights stored in the checkpoint.
- Build a sequence from the two sentences, with the correct model-specific separators, token type ids
and attention masks (:func:`~transformers.PreTrainedTokenizer.encode` and
-:func:`~transformers.PreTrainedTokenizer.encode_plus` take care of this)
+:func:`~transformers.PreTrainedTokenizer.__call__` take care of this)
- Pass this sequence through the model so that it is classified in one of the two available classes: 0
(not a paraphrase) and 1 (is a paraphrase)
- Compute the softmax of the result to get probabilities over the classes
@@ -95,8 +95,8 @@ of each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
>>> paraphrase_classification_logits = model(**paraphrase)[0]
>>> not_paraphrase_classification_logits = model(**not_paraphrase)[0]
@@ -128,8 +128,8 @@ of each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
>>> paraphrase_classification_logits = model(paraphrase)[0]
>>> not_paraphrase_classification_logits = model(not_paraphrase)[0]
@@ -221,7 +221,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
... ]
>>> for question in questions:
-... inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
... input_ids = inputs["input_ids"].tolist()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
@@ -263,7 +263,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
... ]
>>> for question in questions:
-... inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
+... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
... input_ids = inputs["input_ids"].numpy()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
2 changes: 1 addition & 1 deletion docs/source/training.rst
@@ -77,7 +77,7 @@ other than bias and layer normalization terms:
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
Now we can set up a simple dummy training batch using
-:func:`~transformers.PreTrainedTokenizer.batch_encode_plus`. This returns a
+:func:`~transformers.PreTrainedTokenizer.__call__`. This returns a
:func:`~transformers.BatchEncoding` instance which
prepares everything we might need to pass to the model.
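For reference, such a dummy batch looks roughly like this (a sketch assuming a BERT sequence-classification setup; the sentences and labels are placeholders):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Calling the tokenizer directly returns a BatchEncoding holding input_ids,
# attention_mask, etc., ready to be unpacked into the model
batch = tokenizer(["I love Pixar.", "I don't care for Pixar."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

loss = model(**batch, labels=labels)[0]  # in v3, the loss is the first output
loss.backward()
```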

5 changes: 3 additions & 2 deletions examples/adversarial/utils_hans.py
@@ -298,12 +298,13 @@ def hans_convert_examples_to_features(
if ex_index % 10000 == 0:
logger.info("Writing example %d" % (ex_index))

-inputs = tokenizer.encode_plus(
+inputs = tokenizer(
example.text_a,
example.text_b,
add_special_tokens=True,
max_length=max_length,
-pad_to_max_length=True,
+padding="max_length",
+truncation=True,
return_overflowing_tokens=True,
)
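The same replacement of ``pad_to_max_length`` recurs in most files below; as an equivalence check (illustrative, not part of the diff; the checkpoint and texts are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_a, text_b = "Some premise.", "Some hypothesis."

# Old call (still accepted in v3, with a deprecation warning)
old = tokenizer.encode_plus(text_a, text_b, max_length=128, pad_to_max_length=True)

# New call: explicit padding and truncation strategies
new = tokenizer(text_a, text_b, max_length=128, padding="max_length", truncation=True)

assert old["input_ids"] == new["input_ids"]
```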

12 changes: 6 additions & 6 deletions examples/longform-qa/eli5_utils.py
@@ -193,12 +193,12 @@ def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from
def make_qa_retriever_batch(qa_list, tokenizer, max_len=64, device="cuda:0"):
q_ls = [q for q, a in qa_list]
a_ls = [a for q, a in qa_list]
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
)
-a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
+a_toks = tokenizer(a_ls, max_length=max_len, padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -375,12 +375,12 @@ def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="
def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=360, device="cuda:0"):
q_ls = [q for q, a in qa_list]
a_ls = [a for q, a in qa_list]
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
)
-a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
+a_toks = tokenizer(a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -531,7 +531,7 @@ def qa_s2s_generate(
# ELI5-trained retrieval model usage
###############
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda:0"):
-a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
+a_toks = tokenizer(passages, max_length=max_length, padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -542,7 +542,7 @@ def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=12


def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda:0"):
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=128, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
2 changes: 1 addition & 1 deletion examples/movement-pruning/emmental/modeling_bert_masked.py
@@ -424,7 +424,7 @@ def _init_weights(self, module):
Indices can be obtained using :class:`transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
5 changes: 3 additions & 2 deletions examples/multiple-choice/utils_multiple_choice.py
@@ -510,12 +510,13 @@ def convert_examples_to_features(
else:
text_b = example.question + " " + ending

-inputs = tokenizer.encode_plus(
+inputs = tokenizer(
text_a,
text_b,
add_special_tokens=True,
max_length=max_length,
-pad_to_max_length=True,
+padding="max_length",
+truncation=True,
return_overflowing_tokens=True,
)
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
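Note that ``return_overflowing_tokens=True`` is kept: with a Python (slow) tokenizer, the encoding then reports how many tokens were cut off, which the check above uses. A minimal sketch (hedged; fast tokenizers handle overflow differently, and the checkpoint is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # slow tokenizer
enc = tokenizer("a very long text " * 50, max_length=16,
                truncation=True, return_overflowing_tokens=True)
if "num_truncated_tokens" in enc and enc["num_truncated_tokens"] > 0:
    print("Truncated %d tokens" % enc["num_truncated_tokens"])
```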
6 changes: 3 additions & 3 deletions examples/seq2seq/run_eval.py
@@ -45,9 +45,9 @@ def generate_summaries_or_translations(
for batch in tqdm(list(chunks(examples, batch_size))):
if "t5" in model_name:
batch = [model.config.prefix + text for text in batch]
-batch = tokenizer.batch_encode_plus(
-batch, max_length=1024, return_tensors="pt", truncation=True, pad_to_max_length=True
-).to(device)
+batch = tokenizer(batch, max_length=1024, return_tensors="pt", truncation=True, padding="max_length").to(
+device
+)
summaries = model.generate(**batch, **gen_kwargs)
dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
for hypothesis in dec:
6 changes: 3 additions & 3 deletions examples/seq2seq/utils.py
@@ -41,12 +41,12 @@ def encode_file(
assert lns, f"found empty file at {data_path}"
examples = []
for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-tokenized = tokenizer.batch_encode_plus(
+tokenized = tokenizer(
[text],
max_length=max_length,
-pad_to_max_length=pad_to_max_length,
-add_prefix_space=True,
+padding="max_length" if pad_to_max_length else None,
+truncation=True,
+add_prefix_space=True,
return_tensors=return_tensors,
)
assert tokenized.input_ids.shape[1] == max_length
2 changes: 1 addition & 1 deletion model_cards/SparkBeyond/roberta-large-sts-b/README.md
@@ -40,7 +40,7 @@ def roberta_similarity_batches(to_predict):
return similarity_scores

def similarity_roberta(model, tokenizer, sent_pairs):
-batch_token = tokenizer.batch_encode_plus(sent_pairs, pad_to_max_length=True, max_length=500)
+batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
return res

2 changes: 1 addition & 1 deletion model_cards/a-ware/bart-squadv2/README.md
@@ -60,7 +60,7 @@ tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

2 changes: 1 addition & 1 deletion model_cards/a-ware/xlmroberta-squadv2/README.md
@@ -43,7 +43,7 @@ tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

2 changes: 1 addition & 1 deletion model_cards/google/reformer-enwik8/README.md
@@ -14,7 +14,7 @@ Therefore, this model does not need a tokenizer. The following function can inst
import torch

# Encoding
-def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
+def encode(list_of_strings, pad_token_id=0):
max_length = max([len(string) for string in list_of_strings])

# create empty tensors
@@ -43,7 +43,7 @@ questions = [
]

for question in questions:
-inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
@@ -50,7 +50,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# default is local attention everywhere
2 changes: 1 addition & 1 deletion model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -55,7 +55,7 @@ model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")

def get_answer(question, context):
input_text = "question: %s context: %s </s>" % (question, context)
-features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+features = tokenizer([input_text], return_tensors='pt')

output = model.generate(input_ids=features['input_ids'],
attention_mask=features['attention_mask'])
2 changes: 1 addition & 1 deletion model_cards/oliverguhr/german-sentiment-bert/README.md
@@ -55,7 +55,7 @@ class SentimentModel():
def predict_sentiment(self, texts: List[str])-> List[str]:
texts = [self.clean_text(text) for text in texts]
# Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
-input_ids = self.tokenizer.batch_encode_plus(texts, pad_to_max_length=True, add_special_tokens=True)
+input_ids = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True)
input_ids = torch.tensor(input_ids["input_ids"])

with torch.no_grad():
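One subtlety worth flagging here (editorial note, not part of the diff): this card uses ``padding=True``, which pads to the longest sequence in the batch, whereas most diffs above use ``padding="max_length"``. A quick sketch of the difference (checkpoint and texts are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["short", "a slightly longer example sentence"]

dynamic = tokenizer(texts, padding=True)                       # pad to longest in batch
fixed = tokenizer(texts, padding="max_length", max_length=32)  # pad everything to 32

assert len(dynamic["input_ids"][0]) == len(dynamic["input_ids"][1])
assert all(len(ids) == 32 for ids in fixed["input_ids"])
```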
@@ -50,7 +50,7 @@ tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1
model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

@@ -33,7 +33,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# default is local attention everywhere
2 changes: 1 addition & 1 deletion model_cards/valhalla/t5-base-squad/README.md
@@ -19,7 +19,7 @@ model = AutoModelWithLMHead.from_pretrained("valhalla/t5-base-squad")
def get_answer(question, context):
input_text = "question: %s context: %s </s>" % (question, context)
-features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+features = tokenizer([input_text], return_tensors='pt')
out = model.generate(input_ids=features['input_ids'],
attention_mask=features['attention_mask'])