Add model card for Google AI's BERT Miniatures (#3301)
This model card is intended to be shared among all models under `google/bert_uncased_*`. (We'll need some support from HuggingFace to get this card cross-linked from all of those models.)
1 parent 4759176, commit b2028cc. Showing 1 changed file with 70 additions and 0 deletions.
model_cards/iuliaturc/bert_uncased_L-2_H-128_A-2/README.md
BERT Miniatures
===

This is the set of 24 BERT models referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962) (English only, uncased, trained with WordPiece masking).

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
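
As an illustration only (not part of the original card), here is a minimal sketch of what such a distillation setup could look like with `torch` and `transformers`. The checkpoint names are real Hub IDs, but the task, labels, temperature, and optimizer settings are placeholder assumptions, and the teacher is assumed to have already been fine-tuned on the target task:

```python
# Minimal distillation sketch (illustrative, not the paper's exact recipe).
# Checkpoint names are real Hub IDs; num_labels, temperature, learning rate,
# and the data passed to distillation_step are placeholder assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_name = "google/bert_uncased_L-12_H-768_A-12"  # BERT-Base as teacher
student_name = "google/bert_uncased_L-2_H-128_A-2"    # BERT-Tiny as student

tokenizer = AutoTokenizer.from_pretrained(student_name)  # same uncased vocab for all 24 models
# In practice the teacher should already be fine-tuned on the task;
# loading it fresh here only shows the shape of the setup.
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name, num_labels=2).eval()
student = AutoModelForSequenceClassification.from_pretrained(student_name, num_labels=2)

optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)
temperature = 2.0  # assumed softening temperature

def distillation_step(texts, labels):
    """One training step: match the teacher's soft labels plus the hard labels."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # the teacher only provides targets
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # Soft targets from the teacher (temperature-scaled KL divergence)
    # plus the usual hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, torch.tensor(labels))
    loss = soft + hard

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The sketch is only meant to show the shape of a soft-label loss; the paper's own distillation setup may differ in its choice of transfer data and loss weighting.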

Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.

You can download the 24 BERT miniatures either from the [official BERT GitHub page](https://github.com/google-research/bert/) or via HuggingFace from the links below:

| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| **L=2**  |[**2/128 (BERT-Tiny)**][2_128]|[2/256][2_256]|[2/512][2_512]|[2/768][2_768]|
| **L=4**  |[4/128][4_128]|[**4/256 (BERT-Mini)**][4_256]|[**4/512 (BERT-Small)**][4_512]|[4/768][4_768]|
| **L=6**  |[6/128][6_128]|[6/256][6_256]|[6/512][6_512]|[6/768][6_768]|
| **L=8**  |[8/128][8_128]|[8/256][8_256]|[**8/512 (BERT-Medium)**][8_512]|[8/768][8_768]|
| **L=10** |[10/128][10_128]|[10/256][10_256]|[10/512][10_512]|[10/768][10_768]|
| **L=12** |[12/128][12_128]|[12/256][12_256]|[12/512][12_512]|[**12/768 (BERT-Base)**][12_768]|
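
As a quick usage sketch (not part of the original card), any of the checkpoints above can be loaded by name with the `transformers` library; BERT-Tiny is shown here, and its hidden size of 128 is visible in the output shape:

```python
# Illustrative: load BERT-Tiny from the Hub; any checkpoint name from the
# table above works the same way.
from transformers import AutoTokenizer, AutoModel

name = "google/bert_uncased_L-2_H-128_A-2"  # BERT-Tiny (L=2, H=128)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Well-read students learn better.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 128])
```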

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.

Here are the corresponding GLUE scores on the test set:

|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|

For each task, we selected the best fine-tuning hyperparameters from the lists below and trained for 4 epochs (a minimal sweep sketch follows the list):
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
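
A minimal sketch of that grid search, assuming a hypothetical `train_and_evaluate` helper that stands in for a full fine-tuning run (the helper and the chosen checkpoint are assumptions, not part of the release):

```python
# Illustrative grid search over the hyperparameter ranges listed above.
import itertools

def train_and_evaluate(model_name, batch_size, learning_rate, num_epochs):
    """Hypothetical helper: fine-tune `model_name` on one task and return a
    dev-set score (e.g. by wrapping transformers' Trainer). Not provided here."""
    raise NotImplementedError

batch_sizes = [8, 16, 32, 64, 128]
learning_rates = [3e-4, 1e-4, 5e-5, 3e-5]

best_score, best_config = float("-inf"), None
for batch_size, lr in itertools.product(batch_sizes, learning_rates):
    score = train_and_evaluate(
        model_name="google/bert_uncased_L-2_H-128_A-2",  # any miniature
        batch_size=batch_size,
        learning_rate=lr,
        num_epochs=4,
    )
    if score > best_score:
        best_score, best_config = score, (batch_size, lr)

print(f"Best config: batch_size={best_config[0]}, lr={best_config[1]}, score={best_score:.3f}")
```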

If you use these models, please cite the following paper:

```
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```

[2_128]: https://huggingface.co/google/bert_uncased_L-2_H-128_A-2
[2_256]: https://huggingface.co/google/bert_uncased_L-2_H-256_A-4
[2_512]: https://huggingface.co/google/bert_uncased_L-2_H-512_A-8
[2_768]: https://huggingface.co/google/bert_uncased_L-2_H-768_A-12
[4_128]: https://huggingface.co/google/bert_uncased_L-4_H-128_A-2
[4_256]: https://huggingface.co/google/bert_uncased_L-4_H-256_A-4
[4_512]: https://huggingface.co/google/bert_uncased_L-4_H-512_A-8
[4_768]: https://huggingface.co/google/bert_uncased_L-4_H-768_A-12
[6_128]: https://huggingface.co/google/bert_uncased_L-6_H-128_A-2
[6_256]: https://huggingface.co/google/bert_uncased_L-6_H-256_A-4
[6_512]: https://huggingface.co/google/bert_uncased_L-6_H-512_A-8
[6_768]: https://huggingface.co/google/bert_uncased_L-6_H-768_A-12
[8_128]: https://huggingface.co/google/bert_uncased_L-8_H-128_A-2
[8_256]: https://huggingface.co/google/bert_uncased_L-8_H-256_A-4
[8_512]: https://huggingface.co/google/bert_uncased_L-8_H-512_A-8
[8_768]: https://huggingface.co/google/bert_uncased_L-8_H-768_A-12
[10_128]: https://huggingface.co/google/bert_uncased_L-10_H-128_A-2
[10_256]: https://huggingface.co/google/bert_uncased_L-10_H-256_A-4
[10_512]: https://huggingface.co/google/bert_uncased_L-10_H-512_A-8
[10_768]: https://huggingface.co/google/bert_uncased_L-10_H-768_A-12
[12_128]: https://huggingface.co/google/bert_uncased_L-12_H-128_A-2
[12_256]: https://huggingface.co/google/bert_uncased_L-12_H-256_A-4
[12_512]: https://huggingface.co/google/bert_uncased_L-12_H-512_A-8
[12_768]: https://huggingface.co/google/bert_uncased_L-12_H-768_A-12