Fix nits & missing things (huggingface#108)
* fix nits & missing things

* Update README.md
younesbelkada authored Jan 25, 2023
1 parent c1b328e commit d4aca61
Showing 2 changed files with 33 additions and 5 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -7,11 +7,11 @@

## What is it?
-With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via `transformers`. At this point only decoder architectures such as GPT2 are implemented.
+With `trl` you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library by 🤗 Hugging Face. Therefore, pre-trained language models can be directly loaded via `transformers`. At this point most decoder and encoder-decoder architectures are supported.

**Highlights:**
-- PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
-- AutoModelForCausalLMWithValueHead: A transformer model with an additional scalar output for each token which can be used as a value function in reinforcement learning.
+- `PPOTrainer`: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
+- `AutoModelForCausalLMWithValueHead` & `AutoModelForSeq2SeqLMWithValueHead`: A transformer model with an additional scalar output for each token, which can be used as a value function in reinforcement learning (a loading sketch follows this list).
- Example: Train GPT2 to generate positive movie reviews with a BERT sentiment classifier.
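As a quick illustration of the two value-head classes named in the highlights, the sketch below simply loads one of each; the checkpoint names (`gpt2`, `t5-small`) are placeholders and are not part of this commit.

```python
# hedged sketch: loading the two value-head model classes from trl
from trl import AutoModelForCausalLMWithValueHead, AutoModelForSeq2SeqLMWithValueHead

# decoder-only LM (e.g. GPT-2) with an extra scalar value head per token
causal_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

# encoder-decoder LM (e.g. T5) with the same kind of value head
seq2seq_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("t5-small")
```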

## How it works
@@ -78,7 +78,6 @@ response_tensor = respond_to_batch(model_ref, query_tensor)

# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
-device = ppo_trainer.accelerator.device

# define a reward for response
# (this could be any reward such as human feedback or output from another model)
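# hedged sketch (not part of this diff): one way to complete the step with a
# placeholder reward; assumes `torch` was imported at the top of this snippet
reward = [torch.tensor(1.0)]

# run one PPO optimisation step on the (query, response, reward) triplet
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)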
31 changes: 30 additions & 1 deletion examples/sentiment/scripts/gpt2-sentiment.py
@@ -59,7 +59,36 @@
# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.

def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
"""
Build dataset for training. This builds the dataset from `load_dataset`, one should
customize this function to train the model on its own dataset.
Args:
dataset_name (`str`):
The name of the dataset to be loaded.
Returns:
dataloader (`torch.utils.data.DataLoader`):
The dataloader for the dataset.
"""
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
# load imdb with datasets
ds = load_dataset(dataset_name, split='train')
ds = ds.rename_columns({'text': 'review'})
ds = ds.filter(lambda x: len(x["review"])>200, batched=False)

input_size = LengthSampler(input_min_text_length, input_max_text_length)

def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[:input_size()]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample

ds = ds.map(tokenize, batched=False)
ds.set_format(type='torch')
return ds

# We retrieve the dataset by calling the `build_dataset` function.
dataset = build_dataset(config)
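# hedged sketch (not part of this diff): quick sanity check of the fields that
# `build_dataset` produces, following the `tokenize` function above
sample = dataset[0]
print(sample["query"])      # the truncated review text used as a query
print(sample["input_ids"])  # the corresponding token ids as a torch tensor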
