Commit

[tokenizers] Updates data processors, docstring, examples and model cards to the new API (#5308)

* remove references to old API in docstring - update data processors

* style

* fix tests - better type checking error messages

* better type checking

* include awesome fix by @LysandreJik for #5310

* updated doc and examples
thomwolf authored Jun 26, 2020
1 parent fd405e9 commit 601d4d6
Showing 73 changed files with 180 additions and 138 deletions.
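Every diff below applies the same migration; as a rough before/after sketch of the pattern (an illustrative summary written for this page, not taken from the diff; the checkpoint name and sentences are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

# Old API (deprecated by this release): one method per input shape,
# plus a boolean pad_to_max_length flag
old_single = tokenizer.encode_plus("A sentence.", max_length=32, pad_to_max_length=True)
old_batch = tokenizer.batch_encode_plus(["One.", "Two."], max_length=32, pad_to_max_length=True)

# New API: call the tokenizer directly; __call__ dispatches on single texts,
# pairs, or batches, with explicit padding/truncation strategies
new_single = tokenizer("A sentence.", max_length=32, padding="max_length", truncation=True)
new_pair = tokenizer("First.", "Second.", max_length=32, padding="max_length", truncation=True)
new_batch = tokenizer(["One.", "Two."], max_length=32, padding="max_length", truncation=True)
```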
4 changes: 2 additions & 2 deletions README.md
@@ -287,8 +287,8 @@ pytorch_model = BertForSequenceClassification.from_pretrained('./save/', from_tf
sentence_0 = "This research was consistent with his findings."
sentence_1 = "His findings were compatible with this research."
sentence_2 = "His findings were not compatible with this research."
-inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
-inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')
+inputs_1 = tokenizer(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt')
+inputs_2 = tokenizer(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt')

pred_1 = pytorch_model(inputs_1['input_ids'], token_type_ids=inputs_1['token_type_ids'])[0].argmax().item()
pred_2 = pytorch_model(inputs_2['input_ids'], token_type_ids=inputs_2['token_type_ids'])[0].argmax().item()
2 changes: 1 addition & 1 deletion docs/README.md
@@ -167,7 +167,7 @@ Here's an example showcasing everything so far:
Indices can be obtained using :class:`transformers.AlbertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
```
2 changes: 1 addition & 1 deletion docs/source/main_classes/tokenizer.rst
@@ -11,7 +11,7 @@ The base classes ``PreTrainedTokenizer`` and ``PreTrainedTokenizerFast`` impleme
- adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece...),
- managing special tokens such as mask and beginning-of-sentence tokens (adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization)

-``BatchEncoding`` holds the output of the tokenizer's encoding methods (``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token containing a given character, or the span of characters corresponding to a given token).
+``BatchEncoding`` holds the output of the tokenizer's encoding methods (``__call__``, ``encode_plus`` and ``batch_encode_plus``) and is derived from a Python dictionary. When the tokenizer is a pure Python tokenizer, this class behaves just like a standard Python dictionary and holds the various model inputs computed by these methods (``input_ids``, ``attention_mask``...). When the tokenizer is a "Fast" tokenizer (i.e. backed by the HuggingFace tokenizers library), this class additionally provides several advanced alignment methods which can be used to map between the original string (characters and words) and the token space (e.g. getting the index of the token containing a given character, or the span of characters corresponding to a given token).
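To make those alignment methods concrete, a minimal sketch (the checkpoint and sample string are illustrative, not from the diff):

```python
from transformers import AutoTokenizer

# A "Fast" tokenizer backed by the HuggingFace tokenizers library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

encoding = tokenizer("Hello world!")   # returns a BatchEncoding
print(encoding["input_ids"])           # dict-style access works as usual

# Alignment helpers are only available on "Fast" tokenizers:
print(encoding.char_to_token(6))       # index of the token covering character 6 ('w')
print(encoding.token_to_chars(2))      # character span of the token at index 2
```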

``PreTrainedTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~
14 changes: 7 additions & 7 deletions docs/source/task_summary.rst
@@ -74,7 +74,7 @@ of each other. The process is the following:
with the weights stored in the checkpoint.
- Build a sequence from the two sentences, with the correct model-specific separators, token type ids
and attention masks (:func:`~transformers.PreTrainedTokenizer.encode` and
-:func:`~transformers.PreTrainedTokenizer.encode_plus` take care of this)
+:func:`~transformers.PreTrainedTokenizer.__call__` take care of this)
- Pass this sequence through the model so that it is classified in one of the two available classes: 0
(not a paraphrase) and 1 (is a paraphrase)
- Compute the softmax of the result to get probabilities over the classes
@@ -95,8 +95,8 @@ of each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
>>> paraphrase_classification_logits = model(**paraphrase)[0]
>>> not_paraphrase_classification_logits = model(**not_paraphrase)[0]
@@ -128,8 +128,8 @@ of each other. The process is the following:
>>> sequence_1 = "Apples are especially bad for your health"
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
->>> paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
->>> not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")
+>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
+>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
>>> paraphrase_classification_logits = model(paraphrase)[0]
>>> not_paraphrase_classification_logits = model(not_paraphrase)[0]
@@ -221,7 +221,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
... ]
>>> for question in questions:
-... inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
... input_ids = inputs["input_ids"].tolist()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
@@ -263,7 +263,7 @@ Here is an example of question answering using a model and a tokenizer. The proc
... ]
>>> for question in questions:
-... inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
+... inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
... input_ids = inputs["input_ids"].numpy()[0]
...
... text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
2 changes: 1 addition & 1 deletion docs/source/training.rst
@@ -77,7 +77,7 @@ other than bias and layer normalization terms:
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
Now we can set up a simple dummy training batch using
-:func:`~transformers.PreTrainedTokenizer.batch_encode_plus`. This returns a
+:func:`~transformers.PreTrainedTokenizer.__call__`. This returns a
:func:`~transformers.BatchEncoding` instance which
prepares everything we might need to pass to the model.
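For reference, such a dummy batch looks roughly like this (a sketch assuming a BERT sequence-classification setup; the sentences and labels are placeholders):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Calling the tokenizer directly returns a BatchEncoding holding input_ids,
# attention_mask, etc., ready to be unpacked into the model
batch = tokenizer(["I love Pixar.", "I don't care for Pixar."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

loss = model(**batch, labels=labels)[0]  # in v3, the loss is the first output
loss.backward()
```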

5 changes: 3 additions & 2 deletions examples/adversarial/utils_hans.py
@@ -298,12 +298,13 @@ def hans_convert_examples_to_features(
if ex_index % 10000 == 0:
logger.info("Writing example %d" % (ex_index))

-inputs = tokenizer.encode_plus(
+inputs = tokenizer(
example.text_a,
example.text_b,
add_special_tokens=True,
max_length=max_length,
-pad_to_max_length=True,
+padding="max_length",
+truncation=True,
return_overflowing_tokens=True,
)
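The same replacement of ``pad_to_max_length`` recurs in most files below; as an equivalence check (illustrative, not part of the diff; the checkpoint and texts are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_a, text_b = "Some premise.", "Some hypothesis."

# Old call (still accepted in v3, with a deprecation warning)
old = tokenizer.encode_plus(text_a, text_b, max_length=128, pad_to_max_length=True)

# New call: explicit padding and truncation strategies
new = tokenizer(text_a, text_b, max_length=128, padding="max_length", truncation=True)

assert old["input_ids"] == new["input_ids"]
```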

12 changes: 6 additions & 6 deletions examples/longform-qa/eli5_utils.py
@@ -193,12 +193,12 @@ def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from
def make_qa_retriever_batch(qa_list, tokenizer, max_len=64, device="cuda:0"):
q_ls = [q for q, a in qa_list]
a_ls = [a for q, a in qa_list]
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
)
-a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
+a_toks = tokenizer(a_ls, max_length=max_len, padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -375,12 +375,12 @@ def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="
def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=360, device="cuda:0"):
q_ls = [q for q, a in qa_list]
a_ls = [a for q, a in qa_list]
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=max_len, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
)
-a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
+a_toks = tokenizer(a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -531,7 +531,7 @@ def qa_s2s_generate(
# ELI5-trained retrieval model usage
###############
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda:0"):
-a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
+a_toks = tokenizer(passages, max_length=max_length, padding="max_length", truncation=True)
a_ids, a_mask = (
torch.LongTensor(a_toks["input_ids"]).to(device),
torch.LongTensor(a_toks["attention_mask"]).to(device),
@@ -542,7 +542,7 @@ def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=12


def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda:0"):
-q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
+q_toks = tokenizer(q_ls, max_length=128, padding="max_length", truncation=True)
q_ids, q_mask = (
torch.LongTensor(q_toks["input_ids"]).to(device),
torch.LongTensor(q_toks["attention_mask"]).to(device),
2 changes: 1 addition & 1 deletion examples/movement-pruning/emmental/modeling_bert_masked.py
@@ -424,7 +424,7 @@ def _init_weights(self, module):
Indices can be obtained using :class:`transformers.BertTokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
-:func:`transformers.PreTrainedTokenizer.encode_plus` for details.
+:func:`transformers.PreTrainedTokenizer.__call__` for details.
`What are input IDs? <../glossary.html#input-ids>`__
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
5 changes: 3 additions & 2 deletions examples/multiple-choice/utils_multiple_choice.py
@@ -510,12 +510,13 @@ def convert_examples_to_features(
else:
text_b = example.question + " " + ending

-inputs = tokenizer.encode_plus(
+inputs = tokenizer(
text_a,
text_b,
add_special_tokens=True,
max_length=max_length,
-pad_to_max_length=True,
+padding="max_length",
+truncation=True,
return_overflowing_tokens=True,
)
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
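Note that ``return_overflowing_tokens=True`` is kept: with a Python (slow) tokenizer, the encoding then reports how many tokens were cut off, which the check above uses. A minimal sketch (hedged; fast tokenizers handle overflow differently, and the checkpoint is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # slow tokenizer
enc = tokenizer("a very long text " * 50, max_length=16,
                truncation=True, return_overflowing_tokens=True)
if "num_truncated_tokens" in enc and enc["num_truncated_tokens"] > 0:
    print("Truncated %d tokens" % enc["num_truncated_tokens"])
```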
6 changes: 3 additions & 3 deletions examples/seq2seq/run_eval.py
@@ -45,9 +45,9 @@ def generate_summaries_or_translations(
for batch in tqdm(list(chunks(examples, batch_size))):
if "t5" in model_name:
batch = [model.config.prefix + text for text in batch]
-batch = tokenizer.batch_encode_plus(
-batch, max_length=1024, return_tensors="pt", truncation=True, pad_to_max_length=True
-).to(device)
+batch = tokenizer(batch, max_length=1024, return_tensors="pt", truncation=True, padding="max_length").to(
+device
+)
summaries = model.generate(**batch, **gen_kwargs)
dec = tokenizer.batch_decode(summaries, skip_special_tokens=True, clean_up_tokenization_spaces=False)
for hypothesis in dec:
6 changes: 3 additions & 3 deletions examples/seq2seq/utils.py
@@ -41,12 +41,12 @@ def encode_file(
assert lns, f"found empty file at {data_path}"
examples = []
for text in tqdm(lns, desc=f"Tokenizing {data_path.name}"):
-tokenized = tokenizer.batch_encode_plus(
+tokenized = tokenizer(
[text],
max_length=max_length,
-pad_to_max_length=pad_to_max_length,
-add_prefix_space=True,
+padding="max_length" if pad_to_max_length else None,
+truncation=True,
+add_prefix_space=True,
return_tensors=return_tensors,
)
assert tokenized.input_ids.shape[1] == max_length
2 changes: 1 addition & 1 deletion model_cards/SparkBeyond/roberta-large-sts-b/README.md
@@ -40,7 +40,7 @@ def roberta_similarity_batches(to_predict):
return similarity_scores

def similarity_roberta(model, tokenizer, sent_pairs):
-batch_token = tokenizer.batch_encode_plus(sent_pairs, pad_to_max_length=True, max_length=500)
+batch_token = tokenizer(sent_pairs, padding='max_length', truncation=True, max_length=500)
res = model(torch.tensor(batch_token['input_ids']).cuda(), attention_mask=torch.tensor(batch_token["attention_mask"]).cuda())
return res

2 changes: 1 addition & 1 deletion model_cards/a-ware/bart-squadv2/README.md
@@ -60,7 +60,7 @@ tokenizer = BartTokenizer.from_pretrained('a-ware/bart-squadv2')
model = BartForQuestionAnswering.from_pretrained('a-ware/bart-squadv2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

2 changes: 1 addition & 1 deletion model_cards/a-ware/xlmroberta-squadv2/README.md
@@ -43,7 +43,7 @@ tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

2 changes: 1 addition & 1 deletion model_cards/google/reformer-enwik8/README.md
@@ -14,7 +14,7 @@ Therefore, this model does not need a tokenizer. The following function can inst
import torch

# Encoding
-def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
+def encode(list_of_strings, pad_token_id=0):
max_length = max([len(string) for string in list_of_strings])

# create empty tensors
@@ -43,7 +43,7 @@ questions = [
]

for question in questions:
-inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
+inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
@@ -50,7 +50,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("mrm8488/longformer-base-4

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# default is local attention everywhere
2 changes: 1 addition & 1 deletion model_cards/mrm8488/t5-base-finetuned-squadv2/README.md
@@ -55,7 +55,7 @@ model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-squadv2")

def get_answer(question, context):
input_text = "question: %s context: %s </s>" % (question, context)
-features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+features = tokenizer([input_text], return_tensors='pt')

output = model.generate(input_ids=features['input_ids'],
attention_mask=features['attention_mask'])
2 changes: 1 addition & 1 deletion model_cards/oliverguhr/german-sentiment-bert/README.md
@@ -55,7 +55,7 @@ class SentimentModel():
def predict_sentiment(self, texts: List[str])-> List[str]:
texts = [self.clean_text(text) for text in texts]
# Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
-input_ids = self.tokenizer.batch_encode_plus(texts, pad_to_max_length=True, add_special_tokens=True)
+input_ids = self.tokenizer(texts, padding=True, truncation=True, add_special_tokens=True)
input_ids = torch.tensor(input_ids["input_ids"])

with torch.no_grad():
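One subtlety worth flagging here (editorial note, not part of the diff): this card uses ``padding=True``, which pads to the longest sequence in the batch, whereas most diffs above use ``padding="max_length"``. A quick sketch of the difference (checkpoint and texts are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["short", "a slightly longer example sentence"]

dynamic = tokenizer(texts, padding=True)                       # pad to longest in batch
fixed = tokenizer(texts, padding="max_length", max_length=32)  # pad everything to 32

assert len(dynamic["input_ids"][0]) == len(dynamic["input_ids"][1])
assert all(len(ids) == 32 for ids in fixed["input_ids"])
```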
@@ -50,7 +50,7 @@ tokenizer = BartTokenizer.from_pretrained('valhalla/bart-large-finetuned-squadv1
model = BartForQuestionAnswering.from_pretrained('valhalla/bart-large-finetuned-squadv1')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
-encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
+encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

@@ -33,7 +33,7 @@ model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"
-encoding = tokenizer.encode_plus(question, text, return_tensors="pt")
+encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]

# default is local attention everywhere
2 changes: 1 addition & 1 deletion model_cards/valhalla/t5-base-squad/README.md
@@ -19,7 +19,7 @@ model = AutoModelWithLMHead.from_pretrained("valhalla/t5-base-squad")
def get_answer(question, context):
input_text = "question: %s context: %s </s>" % (question, context)
-features = tokenizer.batch_encode_plus([input_text], return_tensors='pt')
+features = tokenizer([input_text], return_tensors='pt')
out = model.generate(input_ids=features['input_ids'],
attention_mask=features['attention_mask'])