xxxForCausalLM support #53

Merged (16 commits) on Dec 27, 2022

Conversation

younesbelkada (Contributor) commented Dec 20, 2022

What does this PR do?

This PR adds support for all xxxForCausalLM models from transformers.

TODO

Draft for now
cc @lvwerra @edbeeching

Comment on lines 31 to 48
if hasattr(config, "summary_use_proj") and config.summary_use_proj:
if hasattr(config, "summary_proj_to_labels") and config.summary_proj_to_labels and config.num_labels > 0:
num_classes = config.num_labels
else:
num_classes = config.hidden_size
self.summary = nn.Linear(config.hidden_size, num_classes)

self.activation = nn.Identity()
if hasattr(config, "summary_activation") and config.summary_activation == "tanh":
self.activation = nn.Tanh()

self.first_dropout = nn.Identity()
if hasattr(config, "summary_first_dropout") and config.summary_first_dropout > 0:
self.first_dropout = nn.Dropout(config.summary_first_dropout)

self.last_dropout = nn.Identity()
if hasattr(config, "summary_last_dropout") and config.summary_last_dropout > 0:
self.last_dropout = nn.Dropout(config.summary_last_dropout)
Contributor (author):
This needs to be simplified in the future

Contributor (author):

From what I can see, we'll probably need an extra arg (v_head_activation) and simplify this, how does that sound @edbeeching @lvwerra ?

Member:

I guess the v_head never has an activation (but maybe @edbeeching knows better), so the activation is actually not needed at all. It's also not clear whether we ever want to apply dropout to the output (last_dropout).

Member:

Also, all the summary attributes are not needed as far as I can tell; maybe all we need is a dropout and then a linear layer. Can you check what the defaults are?

Contributor (author):

Here are the defaults:

(v_head): ValueHead(
    (summary): Linear(in_features=768, out_features=2, bias=True)
    (activation): Identity()
    (first_dropout): Dropout(p=0.1, inplace=False)
    (last_dropout): Identity()
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )

Let's stick to this architecture for now and address potentially adding an activation, etc., in a follow-up PR.
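
For reference, the architecture these defaults reduce to is just a dropout followed by a linear projection. A minimal sketch (hypothetical class and names, not the exact code in this PR; the printed defaults project to num_labels=2, whereas a value head would usually emit a single scalar per token):

import torch.nn as nn

class SimpleValueHead(nn.Module):
    # Hypothetical simplification of the block above: dropout, then a single
    # linear projection; activation and last_dropout stay as identities.
    def __init__(self, hidden_size=768, dropout_prob=0.1, out_features=1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)
        self.summary = nn.Linear(hidden_size, out_features)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) -> (batch, seq_len, out_features)
        return self.summary(self.dropout(hidden_states))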

Contributor (author):

Also, it seems that flatten is never used, so I removed it.

Collaborator:

I have never seen a vhead with an activation. You can imagine situations that might use one, but they are so rare that I agree this is not needed.

Contributor (author):

Perfect, thank you for confirming!

younesbelkada marked this pull request as ready for review on December 21, 2022, 09:48
lvwerra (Member) left a comment:

Looks really good @younesbelkada! Just a few minor comments and nits, and then this is good to go!

tests/test_modeling_vhead.py (outdated, resolved)
tests/utils/testing_utils_common.py (outdated, resolved)
all_model_names = None
trl_model_class = None

def test_from_save(self):
Member:

Another useful test would be whether we can load an LM with a v_head, save it, and then load it back as a pure LM; that's how most people will likely use the model for inference.

Contributor (author):

Great catch!
Added a test, test_from_save_transformers: you can now train a model with trl, save it, and load it back with transformers. Isn't that cool? :D
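
A rough sketch of what such a round-trip test could look like (import paths, model name, and the assertion are illustrative; the actual test_from_save_transformers may differ):

import tempfile

from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

def test_from_save_transformers_sketch():
    # Load a causal LM with a value head and save it to disk...
    model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
    with tempfile.TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        # ...then check that plain transformers can load it back as a pure LM.
        plain_lm = AutoModelForCausalLM.from_pretrained(tmp_dir)
        assert plain_lm is not None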

@@ -0,0 +1,76 @@
import unittest
Member:

I think we can just call this the test_modeling.py file and add the BaseModelTester.

Contributor (author):

Should be addressed now ;)

younesbelkada and others added 6 commits December 21, 2022 11:57
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
- add more tests
- added support for additional keyword arguments
younesbelkada (Contributor, author):

Seems to be converging with the latest changes: https://wandb.ai/distill-bloom/gpt2-test/runs/1sxufahx?workspace=user-younesbelkada

lvwerra (Member) left a comment:

Just two small questions, otherwise this looks good to me :)

@@ -260,6 +260,9 @@ def loss(self, old_logprobs, values, rewards, query, response, model_input):

ratio = torch.exp(logprob - old_logprobs)

if len(ratio.shape) == 2:
Member:

Why is that step needed now?

Contributor (author):

Using the latest version of transformers, it seems that ratio needs a third dimension, otherwise the broadcasting won't work when calling * in pg_losses = -advantages * ratio (for instance with gpt2, advantages.shape == (1, 9, 2) and ratio.shape == (1, 9)).
I hit this issue with the latest versions of transformers, so I suspect the way the outputs are handled changed between versions, but I didn't investigate further.
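
In other words, the check presumably adds the missing trailing dimension so the elementwise product broadcasts. A sketch of the idea (the body of the if branch is cut off in the diff above, so this is an assumption):

# advantages: (batch, seq_len, n), ratio: (batch, seq_len)
if len(ratio.shape) == 2:
    ratio = ratio.unsqueeze(-1)  # -> (batch, seq_len, 1), broadcastable against advantages
pg_losses = -advantages * ratio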

@@ -55,8 +55,8 @@

sentiment_pipe = pipeline("sentiment-analysis","lvwerra/distilbert-imdb", device=pipe_device)

gpt2_model = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
gpt2_model_ref = GPT2HeadWithValueModel.from_pretrained(config['model_name'])
gpt2_model = AutoModelForCausalLMWithValueHead.from_pretrained(config['model_name'], summary_dropout_prob=0.1)
Member:

shouldn't the dropout still be the same?

Contributor (author):

By default the dropout is set to None (Identity): https://github.com/younesbelkada/trl/blob/8390e3bfabdb70da558361483981199304095f05/trl/models/modeling_vhead.py#L26
I think we have to set it to 0.1 here, since it was set to 0.1 by default in the previous script.

Contributor (author):

As discussed, let's set the dropout to 0.1 by default: cf77820
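
For illustration, with a 0.1 default the two calls below should build the same value head, while callers can still opt out; a usage sketch assuming the default lands as described (import path may differ):

from trl import AutoModelForCausalLMWithValueHead

# Implicit and explicit 0.1 dropout should be equivalent under the new default.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
model_explicit = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", summary_dropout_prob=0.1)

# Passing 0.0 would presumably disable dropout on the value head entirely.
model_no_dropout = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2", summary_dropout_prob=0.0)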

younesbelkada merged commit 9ceee31 into huggingface:master on Dec 27, 2022