
Commit

docs: tweak descriptions
leandro committed Mar 30, 2020
1 parent 1460633 commit b5c1df4
Showing 10 changed files with 32 additions and 32 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -1,12 +1,12 @@
# Welcome to trl
> Train transformer language models with Reinforcement Learning.
# Welcome to Transformer Reinforcement Learning (trl)
> Train transformer language models with reinforcement learning.

## What is it?
With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the `transformer` library by 🤗Huggingface. Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GTP2 is implemented.
With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the `transformer` library by 🤗 Hugging Face ([link](https://github.com/huggingface/transformers)). Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GPT2 is implemented.

**Highlights:**
- GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in Reinforcement Learning.
- GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
- PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
- Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
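Taken together, these pieces form a short training loop: load a GPT2 model with a value head plus a frozen reference copy, let it respond to a query, score the response, and pass the (query, response, reward) triplet to the PPO trainer. The sketch below follows the library's quick-start example; the helper names (`GPT2HeadWithValueModel`, `respond_to_batch`) and exact signatures are assumptions about this version of the code.

```python
# Minimal PPO fine-tuning step with trl (a sketch; names follow the quick-start).
import torch
from transformers import GPT2Tokenizer
from trl.gpt2 import GPT2HeadWithValueModel, respond_to_batch
from trl.ppo import PPOTrainer

# GPT2 with a value head, plus a frozen reference copy used for the KL penalty
model = GPT2HeadWithValueModel.from_pretrained('gpt2')
ref_model = GPT2HeadWithValueModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

ppo_trainer = PPOTrainer(model, ref_model, batch_size=1)

# encode a query and let the model continue it
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)

# any scalar signal works as a reward, e.g. a sentiment classifier score
reward = torch.tensor([1.0])

# one optimisation step on the (query, response, reward) triplet
train_stats = ppo_trainer.step(query_tensor, response_tensor, reward)
```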

@@ -108,10 +108,10 @@ This library is built with `nbdev` and as such all the library code as well as e
- `04-gpt2-sentiment-ppo-training.ipynb`: Fine-tune GPT2 with the BERT sentiment classifier to produce positive movie reviews.


## Reference
## References

### Proximal Policy Optimisation
The PPO implementation largely follows the structure introduced in the paper **"Fine-Tuning Language Models from Human Preferences"** by D. Ziegler et al. \[[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].

### Language models
The language models utilize the `transformer` library by 🤗Huggingface.
The language models utilize the `transformer` library by 🤗Hugging Face.
4 changes: 2 additions & 2 deletions docs/01-gpt2-with-value-head.html
@@ -5,8 +5,8 @@
keywords: fastai
sidebar: home_sidebar

summary: "A GPT2 model with a value head built on the transformer library by huggingface."
description: "A GPT2 model with a value head built on the transformer library by huggingface."
summary: "A GPT2 model with a value head built on the `transformer` library by Hugging Face."
description: "A GPT2 model with a value head built on the `transformer` library by Hugging Face."
---
<!--
6 changes: 3 additions & 3 deletions docs/02-ppo.html
@@ -29,7 +29,7 @@

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This follows the language model approach proposed in paper <a href="https://arxiv.org/pdf/1909.08593.pdf">"Fine-Tuning Language Models from Human Preferences"</a> and is similar to the <a href="https://github.com/openai/lm-human-preferences">original implementation</a>. The two main differences are 1) the method is implemented in Pytorch and 2) works with the transformer library by Huggingface.</p>
<p>This follows the language model approach proposed in paper <a href="https://arxiv.org/pdf/1909.08593.pdf">"Fine-Tuning Language Models from Human Preferences"</a> and is similar to the <a href="https://github.com/openai/lm-human-preferences">original implementation</a>. The two main differences are 1) the method is implemented in PyTorch and 2) works with the <code>transformer</code> library by Hugging Face.</p>

</div>
</div>
@@ -187,8 +187,8 @@ <h2 id="FixedKLController" class="doc_header"><code>class</code> <code>FixedKLCo
<span class="sd"> Initialize PPOTrainer.</span>
<span class="sd"> </span>
<span class="sd"> Args:</span>
<span class="sd"> model (torch.model): Huggingface GPT2 model</span>
<span class="sd"> ref_model (torch.model): Huggingface GPT2 refrence model used for KL penalty</span>
<span class="sd"> model (torch.model): Hugging Face transformer GPT2 model with value head</span>
<span class="sd"> ref_model (torch.model): Hugging Face transformer GPT2 reference model used for KL penalty</span>
<span class="sd"> ppo_params (dict or None): PPO parameters for training. Can include following keys:</span>
<span class="sd"> &#39;lr&#39; (float): Adam learning rate, default: 1.41e-5</span>
<span class="sd"> &#39;batch_size&#39; (int): Number of samples per optimisation step, default: 256</span>
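For orientation, below is a minimal sketch of initialising `PPOTrainer` with an explicit `ppo_params` dict. It uses only the keys and defaults visible in the docstring above; the remaining keys are truncated in this view, and the model class name is an assumption taken from the library's quick-start.

```python
# Sketch: constructing PPOTrainer with the documented ppo_params keys.
from trl.gpt2 import GPT2HeadWithValueModel
from trl.ppo import PPOTrainer

model = GPT2HeadWithValueModel.from_pretrained('gpt2')      # policy with value head
ref_model = GPT2HeadWithValueModel.from_pretrained('gpt2')  # frozen copy for the KL penalty

ppo_params = {
    'lr': 1.41e-5,      # Adam learning rate (docstring default)
    'batch_size': 256,  # samples per optimisation step (docstring default)
}
ppo_trainer = PPOTrainer(model, ref_model, **ppo_params)
```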
2 changes: 1 addition & 1 deletion docs/04-gpt2-sentiment-ppo-training.html
@@ -31,7 +31,7 @@
<div class="text_cell_render border-box-sizing rendered_html">
<div style="text-align: center">
{% include image.html max-width="600" file="/trl/images/gpt2_bert_training.png" %}
<p style="text-align: center;"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Huggingface. </p>
<p style="text-align: center;"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face. </p>
</div><p>In this notebook we fine-tune GPT2 (small) to generate positive movie reviews based on the IMDB dataset. The model gets 5 tokens from a real review and is tasked to produce positive continuations. To reward positive continuations we use a BERT classifier to analyse the sentiment of the produced sentences and use the classifier's outputs as reward signals for PPO training.</p>

</div>
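As a rough illustration of the reward described above, the snippet below maps a sentiment classifier's output to one scalar reward per continuation. The notebook uses its own BERT classifier fine-tuned on IMDB; the generic `sentiment-analysis` pipeline here is a stand-in assumption, not the notebook's exact setup.

```python
# Sketch: turning sentiment classifier outputs into PPO reward signals.
import torch
from transformers import pipeline

sentiment_pipe = pipeline("sentiment-analysis")  # stand-in for the notebook's BERT model

def sentiment_rewards(texts):
    """Return one scalar reward per generated continuation."""
    outputs = sentiment_pipe(texts)
    # reward positive sentiment, penalise negative sentiment
    return torch.tensor([o["score"] if o["label"] == "POSITIVE" else -o["score"]
                         for o in outputs])

rewards = sentiment_rewards(["This movie was surprisingly good!"])
```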
14 changes: 7 additions & 7 deletions docs/index.html
@@ -1,12 +1,12 @@
---

title: Welcome to trl
title: Welcome to Transformer Reinforcement Learning (trl)

keywords: fastai
sidebar: home_sidebar

summary: "Train transformer language models with Reinforcement Learning."
description: "Train transformer language models with Reinforcement Learning."
summary: "Train transformer language models with reinforcement learning."
description: "Train transformer language models with reinforcement learning."
---
<!--
@@ -29,10 +29,10 @@

<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="What-is-it?">What is it?<a class="anchor-link" href="#What-is-it?"> </a></h2><p>With <code>trl</code> you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the <code>transformer</code> library by 🤗Huggingface. Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GTP2 is implemented.</p>
<h2 id="What-is-it?">What is it?<a class="anchor-link" href="#What-is-it?"> </a></h2><p>With <code>trl</code> you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the <code>transformer</code> library by 🤗 Hugging Face (<a href="https://github.com/huggingface/transformers">link</a>). Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GPT2 is implemented.</p>
<p><strong>Highlights:</strong></p>
<ul>
<li>GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in Reinforcement Learning.</li>
<li>GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.</li>
<li>PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.</li>
<li>Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.</li>
</ul>
@@ -163,8 +163,8 @@ <h2 id="Notebooks">Notebooks<a class="anchor-link" href="#Notebooks"> </a></h2><
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Reference">Reference<a class="anchor-link" href="#Reference"> </a></h2><h3 id="Proximal-Policy-Optimisation">Proximal Policy Optimisation<a class="anchor-link" href="#Proximal-Policy-Optimisation"> </a></h3><p>The PPO implementation largely follows the structure introduced in the paper <strong>"Fine-Tuning Language Models from Human Preferences"</strong> by D. Ziegler et al. [<a href="https://arxiv.org/pdf/1909.08593.pdf">paper</a>, <a href="https://github.com/openai/lm-human-preferences">code</a>].</p>
<h3 id="Language-models">Language models<a class="anchor-link" href="#Language-models"> </a></h3><p>The language models utilize the <code>transformer</code> library by 🤗Huggingface.</p>
<h2 id="References">References<a class="anchor-link" href="#References"> </a></h2><h3 id="Proximal-Policy-Optimisation">Proximal Policy Optimisation<a class="anchor-link" href="#Proximal-Policy-Optimisation"> </a></h3><p>The PPO implementation largely follows the structure introduced in the paper <strong>"Fine-Tuning Language Models from Human Preferences"</strong> by D. Ziegler et al. [<a href="https://arxiv.org/pdf/1909.08593.pdf">paper</a>, <a href="https://github.com/openai/lm-human-preferences">code</a>].</p>
<h3 id="Language-models">Language models<a class="anchor-link" href="#Language-models"> </a></h3><p>The language models utilize the <code>transformer</code> library by 🤗Hugging Face.</p>

</div>
</div>
2 changes: 1 addition & 1 deletion nbs/01-gpt2-with-value-head.ipynb
@@ -5,7 +5,7 @@
"metadata": {},
"source": [
"# GPT2 with value head\n",
"> A GPT2 model with a value head built on the transformer library by huggingface."
"> A GPT2 model with a value head built on the `transformer` library by Hugging Face."
]
},
{
6 changes: 3 additions & 3 deletions nbs/02-ppo.ipynb
@@ -13,7 +13,7 @@
"metadata": {},
"source": [
"This follows the language model approach proposed in paper [\"Fine-Tuning Language Models from Human Preferences\"](\n",
"https://arxiv.org/pdf/1909.08593.pdf) and is similar to the [original implementation](https://github.com/openai/lm-human-preferences). The two main differences are 1) the method is implemented in Pytorch and 2) works with the transformer library by Huggingface."
"https://arxiv.org/pdf/1909.08593.pdf) and is similar to the [original implementation](https://github.com/openai/lm-human-preferences). The two main differences are 1) the method is implemented in Pytorch and 2) works with the `transformer` library by Hugging Face."
]
},
{
@@ -137,8 +137,8 @@
" Initialize PPOTrainer.\n",
" \n",
" Args:\n",
" model (torch.model): Huggingface GPT2 model\n",
" ref_model (torch.model): Huggingface GPT2 refrence model used for KL penalty\n",
" model (torch.model): Hugging Face transformer GPT2 model with value head\n",
" ref_model (torch.model): Hugging Face transformer GPT2 refrence model used for KL penalty\n",
" ppo_params (dict or None): PPO parameters for training. Can include following keys:\n",
" 'lr' (float): Adam learning rate, default: 1.41e-5\n",
" 'batch_size' (int): Number of samples per optimisation step, default: 256\n",
2 changes: 1 addition & 1 deletion nbs/04-gpt2-sentiment-ppo-training.ipynb
@@ -14,7 +14,7 @@
"source": [
"<div style=\"text-align: center\">\n",
"<img src='images/gpt2_bert_training.png' width='600'>\n",
"<p style=\"text-align: center;\"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Huggingface. </p>\n",
"<p style=\"text-align: center;\"> <b>Figure:</b> Experiment setup to tune GPT2. The yellow arrows are outside the scope of this notebook, but the trained models are available through Hugging Face. </p>\n",
"</div>\n",
"\n",
"\n",
12 changes: 6 additions & 6 deletions nbs/index.ipynb
@@ -4,20 +4,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Welcome to trl\n",
"# Welcome to Transformer Reinforcement Learning (trl)\n",
"\n",
"> Train transformer language models with Reinforcement Learning."
"> Train transformer language models with reinforcement learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is it?\n",
"With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the `transformer` library by 🤗Huggingface. Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GTP2 is implemented.\n",
"With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built with the `transformer` library by 🤗 Hugging Face ([link](https://github.com/huggingface/transformers)). Therefore, pre-trained language models can be directly loaded via the transformer interface. At this point only GTP2 is implemented.\n",
"\n",
"**Highlights:**\n",
"- GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in Reinforcement Learning.\n",
"- GPT2 model with a value head: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.\n",
"- PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.\n",
"- Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier."
]
@@ -162,13 +162,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reference\n",
"## References\n",
"\n",
"### Proximal Policy Optimisation\n",
"The PPO implementation largely follows the structure introduced in the paper **\"Fine-Tuning Language Models from Human Preferences\"** by D. Ziegler et al. \\[[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)].\n",
"\n",
"### Language models\n",
"The language models utilize the `transformer` library by 🤗Huggingface."
"The language models utilize the `transformer` library by 🤗Hugging Face."
]
},
{
4 changes: 2 additions & 2 deletions trl/ppo.py
@@ -77,8 +77,8 @@ def __init__(self, model, ref_model, **ppo_params):
Initialize PPOTrainer.
Args:
model (torch.model): Huggingface GPT2 model
ref_model (torch.model): Huggingface GPT2 refrence model used for KL penalty
model (torch.model): Hugging Face transformer GPT2 model with value head
ref_model (torch.model): Hugging Face transformer GPT2 reference model used for KL penalty
ppo_params (dict or None): PPO parameters for training. Can include following keys:
'lr' (float): Adam learning rate, default: 1.41e-5
'batch_size' (int): Number of samples per optimisation step, default: 256
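The `ref_model` argument above exists so the trainer can penalise divergence from the original model via a KL term, following Ziegler et al. A toy sketch of that reward shaping, written independently of the actual `PPOTrainer` internals (which may differ in detail), might look like this:

```python
# Sketch: per-token KL penalty added to a scalar reward (in the spirit of Ziegler et al.).
import torch

def kl_penalised_rewards(scores, logprobs, ref_logprobs, kl_coef=0.2):
    """Spread a KL penalty over the response tokens and add the scalar reward at the end."""
    kl = logprobs - ref_logprobs   # per-token approximate KL to the reference model
    rewards = -kl_coef * kl        # penalise drifting away from the reference
    rewards[:, -1] += scores       # the scalar reward lands on the final token
    return rewards

# toy example: batch of 1, response of 4 tokens, scalar reward 1.0
logprobs = torch.tensor([[-1.2, -0.8, -1.5, -0.9]])
ref_logprobs = torch.tensor([[-1.0, -0.9, -1.4, -1.1]])
rewards = kl_penalised_rewards(torch.tensor([1.0]), logprobs, ref_logprobs)
```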
