Commit message:

* Convert a few docs
* And another
* Last tutorials
* New syntax for colab links

Showing 18 changed files with 3,608 additions and 3,836 deletions.

@@ -0,0 +1,118 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Multi-lingual models

[[open-in-colab]]

Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
models are available, and they use different mechanisms than mono-lingual models. This page details the usage of these
models.

## XLM

XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
be split into two categories: the checkpoints that make use of language embeddings and those that don't.

### XLM & Language Embeddings

This section concerns the following checkpoints:

- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + Translation, XNLI languages)
- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
- `xlm-clm-ende-1024` (Causal language modeling, English-German)

These checkpoints require language embeddings that specify the language used at inference time. These language
embeddings are represented as a tensor of the same shape as the input ids passed to the model. The values in
these tensors depend on the language used and can be identified using the `lang2id` and `id2lang` attributes of
the tokenizer.

Here is an example using the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French):

```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel
>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
```

The different languages this model/tokenizer handles, as well as the ids of these languages, are visible using the
`lang2id` attribute:

```py
>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}
```

These ids should be used when passing a language parameter during a model forward pass. Let's define our inputs:

```py
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
```

We should now define the language embedding using the previously defined language id. We want to create a tensor
filled with the appropriate language ids, of the same size as `input_ids`. For English, the id is 0:

```py
>>> language_id = tokenizer.lang2id['en']  # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
```

You can then feed it all as input to your model:

```py
>>> outputs = model(input_ids, langs=langs)
```

The example script [run_generation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation/run_generation.py) can generate
text with the CLM checkpoints from XLM, using the language embeddings.

### XLM without Language Embeddings

This section concerns the following checkpoints:

- `xlm-mlm-17-1280` (Masked language modeling, 17 languages)
- `xlm-mlm-100-1280` (Masked language modeling, 100 languages)

These checkpoints do not require language embeddings at inference time. Unlike the previously mentioned XLM checkpoints,
these models are used for generic sentence representations.
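
For example (a minimal sketch: the checkpoint name comes from the list above, and the sample sentence is arbitrary), the
call looks like the earlier one, except that no `langs` tensor is passed:

```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-17-1280")

>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
>>> outputs = model(input_ids)  # no langs tensor is needed for this checkpoint
```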

## BERT

BERT has two checkpoints that can be used for multi-lingual tasks:

- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)

These checkpoints do not require language embeddings at inference time. They should identify the language used in the
context and infer accordingly.
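
For instance, here is a minimal sketch using the `fill-mask` pipeline (the French example sentence is ours, not from the
original documentation); the language is never specified explicitly:

```py
>>> from transformers import pipeline

>>> unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")
>>> predictions = unmasker("Paris est la [MASK] de la France.")  # the language is inferred from the context
```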

## XLM-RoBERTa

XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
labeling and question answering.

Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:

- `xlm-roberta-base` (Masked language modeling, 100 languages)
- `xlm-roberta-large` (Masked language modeling, 100 languages)

@@ -0,0 +1,128 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Perplexity of fixed-length models

[[open-in-colab]]

Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note
that the metric applies specifically to classical language models (sometimes called autoregressive or causal language
models) and is not well defined for masked language models like BERT (see [summary of the models](model_summary)).

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized
sequence \\(X = (x_0, x_1, \dots, x_t)\\), then the perplexity of \\(X\\) is,

$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$

where \\(\log p_\theta (x_i|x_{<i})\\) is the log-likelihood of the \\(i\\)-th token conditioned on the preceding tokens \\(x_{<i}\\) according to our model. Intuitively, it can be thought of as a measure of how well the model predicts each token in a corpus given the tokens that precede it. Importantly, this means that the tokenization procedure has a direct impact on a model's perplexity, which should always be taken into consideration when comparing different models.

This is also equivalent to exponentiating the cross-entropy between the data and the model's predictions. For more
intuition about perplexity and its relationship to Bits Per Character (BPC) and data compression, check out this
[fantastic blog post on The Gradient](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/).
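
To make the formula concrete, here is a toy calculation (a sketch with made-up per-token log-likelihoods rather than the
output of any real model):

```python
import torch

# hypothetical values of log p(x_i | x_{<i}) for a 5-token sequence
log_likelihoods = torch.tensor([-2.3, -0.7, -1.9, -0.2, -1.1])

# PPL(X) = exp(-1/t * sum_i log p(x_i | x_{<i}))
ppl = torch.exp(-log_likelihoods.mean())
print(ppl)  # tensor(3.4556); lower is better
```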

## Calculating PPL with fixed-length models

If we weren't limited by a model's context size, we would evaluate the model's perplexity by autoregressively
factorizing a sequence and conditioning on the entire preceding subsequence at each step, as shown below.

<img width="600" alt="Full decomposition of a sequence with unlimited context length" src="/imgs/ppl_full.gif"/>

When working with approximate models, however, we typically have a constraint on the number of tokens the model can
process. The largest version of [GPT-2](model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we
cannot calculate \\(p_\theta(x_t|x_{<t})\\) directly when \\(t\\) is greater than 1024.

Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max
input size is \\(k\\), we then approximate the likelihood of a token \\(x_t\\) by conditioning only on the
\\(k-1\\) tokens that precede it rather than the entire context. When evaluating the model's perplexity of a
sequence, a tempting but suboptimal approach is to break the sequence into disjoint chunks and add up the decomposed
log-likelihoods of each segment independently.

<img width="600" alt="Suboptimal PPL not taking advantage of full available context" src="/imgs/ppl_chunked.gif"/>

This is quick to compute since the perplexity of each segment can be computed in one forward pass, but it serves as a poor
approximation of the fully-factorized perplexity and will typically yield a higher (worse) PPL because the model will
have less context at most of the prediction steps.

Instead, the PPL of fixed-length models should be evaluated with a sliding-window strategy. This involves repeatedly
sliding the context window so that the model has more context when making each prediction.

<img width="600" alt="Sliding window PPL taking advantage of all available context" src="/imgs/ppl_sliding.gif"/>

This is a closer approximation to the true decomposition of the sequence probability and will typically yield a more
favorable score. The downside is that it requires a separate forward pass for each token in the corpus. A good
practical compromise is to employ a strided sliding window, moving the context by larger strides rather than sliding by
1 token at a time. This allows computation to proceed much faster while still giving the model a large context to make
predictions at each step.

## Example: Calculating perplexity with GPT-2 in 🤗 Transformers

Let's demonstrate this process with GPT-2.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = 'cuda'
model_id = 'gpt2-large'
model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
```

We'll load the WikiText-2 dataset and evaluate the perplexity using a few different sliding-window strategies. Since
this dataset is small and we're just doing one forward pass over the set, we can load and encode the entire
dataset in memory.

```python
from datasets import load_dataset

test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
```

With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative
log-likelihood for each token is returned as the loss. With our sliding window approach, however, there is overlap in
the tokens we pass to the model at each iteration. We don't want the log-likelihood for the tokens we're just treating
as context to be included in our loss, so we can set these targets to `-100` so that they are ignored. The following
is an example of how we could do this with a stride of `512`. This means that the model will have at least 512 tokens
for context when calculating the conditional likelihood of any one token (provided there are 512 preceding tokens
available to condition on).

```python
import torch
from tqdm import tqdm

max_length = model.config.n_positions
stride = 512

nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask out the context-only tokens

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs[0] * trg_len

    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
```

Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction,
and the better the reported perplexity will typically be.

When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same
as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our strided sliding-window
strategy, this value drops to `16.53`. This is not only a more favorable score, but it is calculated in a way that is
closer to the true autoregressive decomposition of a sequence likelihood.