xxxForCausalLM support #53
Conversation
trl/models/modeling_vhead.py
Outdated
if hasattr(config, "summary_use_proj") and config.summary_use_proj: | ||
if hasattr(config, "summary_proj_to_labels") and config.summary_proj_to_labels and config.num_labels > 0: | ||
num_classes = config.num_labels | ||
else: | ||
num_classes = config.hidden_size | ||
self.summary = nn.Linear(config.hidden_size, num_classes) | ||
|
||
self.activation = nn.Identity() | ||
if hasattr(config, "summary_activation") and config.summary_activation == "tanh": | ||
self.activation = nn.Tanh() | ||
|
||
self.first_dropout = nn.Identity() | ||
if hasattr(config, "summary_first_dropout") and config.summary_first_dropout > 0: | ||
self.first_dropout = nn.Dropout(config.summary_first_dropout) | ||
|
||
self.last_dropout = nn.Identity() | ||
if hasattr(config, "summary_last_dropout") and config.summary_last_dropout > 0: | ||
self.last_dropout = nn.Dropout(config.summary_last_dropout) |
This needs to be simplified in the future
From what I can see, we'll probably need an extra arg (v_head_activation) and simplify this. How does that sound @edbeeching @lvwerra?
I guess the v_head never has an activation, but maybe @edbeeching knows better, so the activation is probably not needed at all. It's also not clear whether we ever want to apply dropout to the output (last_dropout).
Also, all the summary attributes are not needed as far as I can tell; maybe all we need is dropout and then a linear layer. Can you check what the defaults are?
Here are the defaults:
(v_head): ValueHead(
(summary): Linear(in_features=768, out_features=2, bias=True)
(activation): Identity()
(first_dropout): Dropout(p=0.1, inplace=False)
(last_dropout): Identity()
(flatten): Flatten(start_dim=1, end_dim=-1)
)
Let's stick to this architecture for now and address potentially adding an activation etc. in a follow-up PR.
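For reference, here is a rough sketch of the dropout-plus-linear layout discussed above; the class name, the scalar output size, and the 0.1 dropout default are assumptions for illustration, not the final implementation:

import torch
import torch.nn as nn

class ValueHeadSketch(nn.Module):
    # Sketch only: dropout followed by a single linear projection, mirroring
    # the "dropout + linear layer" layout discussed above.
    def __init__(self, hidden_size=768, summary_dropout_prob=0.1):
        super().__init__()
        self.dropout = nn.Dropout(summary_dropout_prob) if summary_dropout_prob else nn.Identity()
        # one scalar value estimate per token (the output size here is an assumption)
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        return self.summary(self.dropout(hidden_states))

# hidden states of shape (batch, seq_len, hidden_size) -> values of shape (batch, seq_len, 1)
values = ValueHeadSketch()(torch.randn(1, 9, 768))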
Also, it seems that flatten is never used, so I removed it.
I have never seen a vhead with an activation. You can imagine situations that might use one, but they are so rare that I agree this is not needed.
Perfect, thank you for confirming!
Looks really good @younesbelkada! Just a few minor comments and nits and then this is good to go!
tests/utils/testing_utils_common.py
Outdated
all_model_names = None
trl_model_class = None

def test_from_save(self):
Another useful test would be whether we can load an LM with a v_head, save it, and then load it back as a pure LM; that's what most people will likely use for inference.
Great catch!
Added a test test_from_save_transformers: you can now train a model with trl, save it, and load it back with transformers. Isn't that cool? :D
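For illustration, a minimal sketch of what such a round-trip could look like; the "gpt2" checkpoint and the exact calls are placeholders, and the real test_from_save_transformers may differ:

import tempfile

from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

# Sketch only: load a causal LM with a value head, save it, then reload the
# checkpoint as a plain transformers model, as one would do for inference.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")

with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    # the saved checkpoint should load as a pure LM (the extra v_head weights
    # are simply ignored by the plain model class)
    lm_model = AutoModelForCausalLM.from_pretrained(tmp_dir)
    print(type(lm_model).__name__)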
tests/test_modeling_vhead.py
Outdated
@@ -0,0 +1,76 @@
import unittest
I think we can just call this the test_modeling.py file and add the BaseModelTester.
Should be addressed now ;)
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
- add more tests
- added support for additional keyword arguments
Seems to be converging with the latest changes: https://wandb.ai/distill-bloom/gpt2-test/runs/1sxufahx?workspace=user-younesbelkada
Just two small questions otherwise looks good to me :)
@@ -260,6 +260,9 @@ def loss(self, old_logprobs, values, rewards, query, response, model_input):

        ratio = torch.exp(logprob - old_logprobs)

        if len(ratio.shape) == 2:
Why is that step needed now?
Using the latest version of transformers, it seems that ratio needs a third dimension, otherwise the broadcasting op won't work when calling * in pg_losses = -advantages * ratio (for instance, with gpt2, advantages.shape == (1, 9, 2) and ratio.shape == (1, 9)).
I got this issue with the latest versions of transformers, so I suspect that the way the outputs are handled changed between versions, but I didn't investigate further.
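To illustrate the mismatch, here is a small standalone reproduction using the shapes quoted above; the unsqueeze shown is just one way to add the missing dimension and is not necessarily the exact fix in the diff:

import torch

# Illustrative shapes taken from the gpt2 example in the comment above
advantages = torch.randn(1, 9, 2)
ratio = torch.randn(1, 9)

# advantages * ratio  # RuntimeError: shapes (1, 9, 2) and (1, 9) do not broadcast
if len(ratio.shape) == 2:
    ratio = ratio.unsqueeze(-1)  # (1, 9, 1) broadcasts against (1, 9, 2)

pg_losses = -advantages * ratio
print(pg_losses.shape)  # torch.Size([1, 9, 2])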
examples/scripts/04-ppo-sentiment.py
Outdated
@@ -55,8 +55,8 @@

 sentiment_pipe = pipeline("sentiment-analysis","lvwerra/distilbert-imdb", device=pipe_device)

-gpt2_model = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
-gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
+gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(config['model_name'], summary_dropout_prob=0.1)
Shouldn't the dropout still be the same?
By default the dropout is set to None (Identity) --> https://github.com/younesbelkada/trl/blob/8390e3bfabdb70da558361483981199304095f05/trl/models/modeling_vhead.py#L26
I think we have to set it to 0.1, since it was set to 0.1 by default in the previous script.
As discussed, let's set the dropout to 0.1 by default, cf. cf77820
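As a usage note, assuming the default becomes 0.1, the explicit keyword and the bare call should then behave the same; this is a sketch only, with the keyword name taken from the diff above:

from trl import AutoModelForCausalLMWithValueHead

# Sketch: explicit dropout vs. relying on the (assumed) 0.1 default
model_explicit = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", summary_dropout_prob=0.1)
model_default = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")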
What does this PR do?
This PR adds support for all xxxForCausalLM models from transformers.
TODO
Draft for now
cc @lvwerra @edbeeching