xxxForCausalLM support #53

Merged (16 commits) on Dec 27, 2022

Conversation

younesbelkada (Contributor) commented Dec 20, 2022

What does this PR do?

This PR adds support for all xxxForCausalLM models from transformers.

TODO

Draft for now
cc @lvwerra @edbeeching

Comment on lines 31 to 48
if hasattr(config, "summary_use_proj") and config.summary_use_proj:
if hasattr(config, "summary_proj_to_labels") and config.summary_proj_to_labels and config.num_labels > 0:
num_classes = config.num_labels
else:
num_classes = config.hidden_size
self.summary = nn.Linear(config.hidden_size, num_classes)

self.activation = nn.Identity()
if hasattr(config, "summary_activation") and config.summary_activation == "tanh":
self.activation = nn.Tanh()

self.first_dropout = nn.Identity()
if hasattr(config, "summary_first_dropout") and config.summary_first_dropout > 0:
self.first_dropout = nn.Dropout(config.summary_first_dropout)

self.last_dropout = nn.Identity()
if hasattr(config, "summary_last_dropout") and config.summary_last_dropout > 0:
self.last_dropout = nn.Dropout(config.summary_last_dropout)
Contributor (author):
This needs to be simplified in the future

Contributor (author):

From what I can see, we'll probably need an extra arg (v_head_activation) and simplify this, how does that sound @edbeeching @lvwerra ?

Member:

I guess the v_head never has an activation (but maybe @edbeeching knows better), so the activation is actually not needed at all. It's also not clear whether we ever want to apply dropout to the output (last_dropout).

Member:

Also, all the summary attributes are not needed as far as I can tell; maybe all we need is a dropout and then a linear layer. Can you check what the defaults are?

Contributor (author):

Here are the defaults:

(v_head): ValueHead(
    (summary): Linear(in_features=768, out_features=2, bias=True)
    (activation): Identity()
    (first_dropout): Dropout(p=0.1, inplace=False)
    (last_dropout): Identity()
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )

Let's stick to this architecture for now and address potentially adding an activation, etc., in a follow-up PR.
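
For reference, the architecture these defaults reduce to is just a dropout followed by a linear projection. A minimal sketch (hypothetical class and names, not the exact code in this PR; the printed defaults project to num_labels=2, whereas a value head would usually emit a single scalar per token):

import torch.nn as nn

class SimpleValueHead(nn.Module):
    # Hypothetical simplification of the block above: dropout, then a single
    # linear projection; activation and last_dropout stay as identities.
    def __init__(self, hidden_size=768, dropout_prob=0.1, out_features=1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.summary = nn.Linear(hidden_size, out_features)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, out_features)
        return self.summary(self.dropout(hidden_states))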

Contributor (author):

Also, it seems that flatten is never used, so I removed it.

Collaborator:

I have never seen a vhead with an activation. You can imagine situations that might use one, but they are so rare that I agree this is not needed.

Contributor (author):

Perfect, thank you for confirming!

younesbelkada marked this pull request as ready for review on December 21, 2022, 09:48
lvwerra (Member) left a comment:

Looks really good @younesbelkada! Just a few minor comments and nits, and then this is good to go!

tests/test_modeling_vhead.py (outdated, resolved)
tests/utils/testing_utils_common.py (outdated, resolved)
all_model_names = None
trl_model_class = None

def test_from_save(self):
Member:

Another useful test would be whether we can load an LM with a v_head, save it, and then load it back as a pure LM; that's how most people will likely use the model for inference.

Contributor (author):

Great catch!
Added a test, test_from_save_transformers: you can now train a model with trl, save it, and load it back with transformers. Isn't that cool? :D
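
A rough sketch of what such a round-trip test could look like (import paths, model name, and the assertion are illustrative; the actual test_from_save_transformers may differ):

import tempfile

from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

def test_from_save_transformers_sketch():
    # Load a causal LM with a value head and save it to disk...
    model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
    with tempfile.TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        # ...then check that plain transformers can load it back as a pure LM.
        plain_lm = AutoModelForCausalLM.from_pretrained(tmp_dir)
        assert plain_lm is not None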

@@ -0,0 +1,76 @@
import unittest
Member:

I think we can just call this the test_modeling.py file and add the BaseModelTester.

Contributor (author):

Should be addressed now ;)

younesbelkada and others added 6 commits December 21, 2022 11:57
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
- add more tests
- added support for additional keyword arguments
younesbelkada (Contributor, author):

Seems to be converging with the latest changes: https://wandb.ai/distill-bloom/gpt2-test/runs/1sxufahx?workspace=user-younesbelkada

lvwerra (Member) left a comment:

Just two small questions, otherwise this looks good to me :)

@@ -260,6 +260,9 @@ def loss(self, old_logprobs, values, rewards, query, response, model_input):

ratio = torch.exp(logprob - old_logprobs)

if len(ratio.shape) == 2:
Member:

Why is that step needed now?

Contributor (author):

Using the latest version of transformers, it seems that ratio needs a third dimension, otherwise the broadcasting won't work when calling * in pg_losses = -advantages * ratio (for instance with gpt2, advantages.shape == (1, 9, 2) and ratio.shape == (1, 9)).
I hit this issue with the latest versions of transformers, so I suspect the way the outputs are handled changed between versions, but I didn't investigate further.
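
In other words, the check presumably adds the missing trailing dimension so the elementwise product broadcasts. A sketch of the idea (the body of the if branch is cut off in the diff above, so this is an assumption):

# advantages: (batch, seq_len, n), ratio: (batch, seq_len)
if len(ratio.shape) == 2:
    ratio = ratio.unsqueeze(-1)  # -> (batch, seq_len, 1), broadcastable against advantages
pg_losses = -advantages * ratio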

@@ -55,8 +55,8 @@

sentiment_pipe = pipeline("sentiment-analysis","lvwerra/distilbert-imdb", device=pipe_device)

gpt2_model = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(config['model_name'], summary_dropout_prob=0.1)
Member:

shouldn't the dropout still be the same?

Contributor (author):

By default the dropout is set to None (Identity): https://github.com/younesbelkada/trl/blob/8390e3bfabdb70da558361483981199304095f05/trl/models/modeling_vhead.py#L26
I think we have to set it to 0.1 here, since it was set to 0.1 by default in the previous script.

Contributor (author):

As discussed, let's set the dropout to 0.1 by default: cf77820
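
For illustration, with a 0.1 default the two calls below should build the same value head, while callers can still opt out; a usage sketch assuming the default lands as described (import path may differ):

from trl import AutoModelForCausalLMWithValueHead

# Implicit and explicit 0.1 dropout should be equivalent under the new default.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
model_explicit = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", summary_dropout_prob=0.1)

# Passing 0.0 would presumably disable dropout on the value head entirely.
model_no_dropout = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", summary_dropout_prob=0.0)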

younesbelkada merged commit 9ceee31 into huggingface:master on Dec 27, 2022