[Feature] RLHF dataloading #1309
Conversation
torchrl/data/rlhf/dataset.py
Outdated
Args:
    split (str): One of ``"train"`` or ``"valid"``.
    dataset_name (str): the name or path of the dataset.
    pre_tokenization_hook (callable): TODO
pre_tokenization_hook is called on the Dataset before tokenization and should return a modified Dataset object. The intended use is for tasks that require modifying the dataset as a whole, as opposed to modifying individual datapoints, for example discarding certain datapoints based on a particular condition. Tokenization and other "elementwise" operations on the data are performed by the process function, which is mapped over the dataset.
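For instance, a minimal sketch of such a hook, assuming a huggingface-style Dataset and reusing the "valid_sample" column mentioned elsewhere in this PR (the condition is illustrative only):

def pre_tokenization_hook(dataset):
    # drop datapoints that fail a whole-dataset condition;
    # "valid_sample" is assumed to be a boolean column
    return dataset.filter(lambda example: example["valid_sample"])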
torchrl/data/rlhf/dataset.py
Outdated
max_length (int): the maximum sequence length.
dataset_name (str): the name of the dataset.
make_process_fn (callable): a preprocess function.
pre_tokenization_hook (callable): TODO
See comment below; this is passed on to the load_dataset function.
Do we want to commit all of these arrow files etc. to the repo?
It's a dummy dataset. I agree it isn't perfect, but I'm not sure what choice we have. Could we zip it?
Vision has something of a similar flavour:
https://github.com/pytorch/vision/tree/main/test/assets
I can put it all in a .pt file.
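For instance, roughly (file name and contents are placeholders):

import torch

# bundle the dummy dataset contents into a single committed .pt file
data = {"train": torch.zeros(10, 5), "valid": torch.zeros(2, 5)}  # placeholder tensors
torch.save(data, "minidata.pt")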
In general I noticed that everything assumes we use gpt2 models, but we never say so explicitly, e.g. in the tokenize function.
LGTM
torchrl/data/rlhf/dataset.py
Outdated
)[prefix]

def load_dataset(
Is this only used internally? Should we put a "_" in front?
Suggested change:
- def load_dataset(
+ def _load_dataset(
Is it? I don't know if we want to expose it or not. To me it's a well self-contained, properly documented function that could be used somewhere else, but I'm happy to make it private if there's no use for it.
I think in the common case the user will want to (possibly) create and then load. Hence my proposal to keep this internal and expose only the (maybe renamed) other function.
torchrl/data/rlhf/dataset.py
Outdated
_has_datasets = importlib.util.find_spec("datasets") is not None

def create_or_load_dataset(
What about calling this just "load_dataset" and maybe adding an argument such as exist_ok in mkdir?
Not sure I see what you mean, but the reason I named it that way is that it can either create a dataset for the first time, or load it if the memmap tensordict copy is already there. load_dataset already exists and does just that, nothing more.
My suggestion/discussion is whether we could rename this to load_dataset without changing its current behaviour, and add an argument to specify the behaviour we want in case the dataset does not exist.
I'm fine with renaming it, though I think users should be informed that this function caches the results, and the name says that quite explicitly.
I'm not sure I see what the extra arg should do. If the cache does not exist we must create it, and if it does we should use it. What extra control should we give?
Just a flag like exist_ok in mkdir that raises if the dataset does not exist, or something similar to from_disk to control whether we want to store the cache or not.
I don't have a strong opinion on this; I am just wondering whether such a long name is really necessary (even if it is definitely more explicit), or whether it would be enough to say in the docs that we cache.
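For concreteness, a sketch of the renamed API being discussed; the argument name, cache layout, and _create_dataset helper are assumptions, not the actual implementation:

from pathlib import Path

from tensordict import TensorDict

def load_dataset(split, dataset_name, root_dir, create_if_missing=True):
    # reuse the memmap TensorDict copy when present, mimicking mkdir's exist_ok flag
    cache = Path(root_dir) / dataset_name / split
    if cache.exists():
        return TensorDict.load_memmap(cache)
    if not create_if_missing:
        raise FileNotFoundError(f"no cached copy of {dataset_name} at {cache}")
    return _create_dataset(split, dataset_name, cache)  # hypothetical helper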
LGTM, a couple of comments on the generated data / zipping + unzipping it.
test/test_rlhf.py
Outdated
lmemmap_save = deepcopy(TensorDict.load_memmap)
mocked_hello = mocker.patch("tensordict.TensorDict.load_memmap")
mocked_hello.side_effect = lmemmap_save
with zipfile.ZipFile(dataset_path, "r") as zip_ref:
Instead of doing this in every test that loads the data, perhaps we should write a conftest.py file that extracts on setup, then cleans up on completion?
Good point, yeah. But then how do we tell conftest to do it only when the test_rlhf.py script is executed?
"But then how do we tell conftest to do it only when the test_rlhf.py script is executed?"

How about, instead of conftest.py, we add a fixture that has session scope and yields a directory which is the location of the extracted data? I think the fixture should only be executed if it is used, and it will only be executed once.
I think fixtures are executed in each test; otherwise the same temporary directories would be reused multiple times.
Yes, that's the default, but you can change that behaviour. Fleshing out the sketch (tmp_path_factory is pytest's built-in session-scoped temp-dir fixture; dataset_path is the archive as in the test above):

@pytest.fixture(scope="session")
def minidata_dir(tmp_path_factory):
    # unzip the committed archive once per test session
    path = tmp_path_factory.mktemp("minidata")
    with zipfile.ZipFile(dataset_path, "r") as zip_ref:
        zip_ref.extractall(path)
    yield path
Oh, now I remember why I did not do that: we have 2 mini datasets, but you can't pass a fixture to parametrize, so I couldn't write 2 different fixtures for the tests.
The "ugly" solution would be:
@pytest.mark.parametrize("dataset", ["comp", "prompt"])
def test_stuff(dataset, comp_fixture, prompt_fixture):
    if dataset == "comp":
        dataset = comp_fixture
    else:
        dataset = prompt_fixture
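A less ugly alternative could be pytest's built-in request fixture, which resolves fixtures by name at test time, so only the requested one is instantiated (fixture names as in the sketch above):

@pytest.mark.parametrize("dataset", ["comp_fixture", "prompt_fixture"])
def test_stuff(request, dataset):
    # look the fixture up by name
    dataset = request.getfixturevalue(dataset)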
Should we add logic here to zip the generated data so that the committed .zip files are reproducible?
I did not use Python for that, but we could do it.
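Something like this stdlib-only sketch could do it (function and path names are placeholders):

import zipfile
from pathlib import Path

def zip_assets(src_dir, zip_path):
    # walk the generated data in sorted order so the archive layout is deterministic
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(Path(src_dir).rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(src_dir))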
Would the tokenization change with another model?
If the tokenizer is huggingface-like, I think the tokenize function should work, provided that we somehow expose it: https://github.com/pytorch/rl/pull/1309/files#r1243540647
torchrl/data/rlhf/dataset.py
Outdated
num_workers = max(os.cpu_count() // 2, 1)
if excluded_features is None:
    excluded_features = {"text", "prompt", "label", "valid_sample"}
tokenizer = AutoTokenizer.from_pretrained("gpt2")
This is the hard-coded gpt2 assumption I mentioned above.
Got it. Let's parametrize it then.
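A minimal sketch of the parametrization, assuming any huggingface-like tokenizer name should be accepted (the helper name is hypothetical):

from transformers import AutoTokenizer

def make_tokenizer(tokenizer_name="gpt2"):
    # gpt2 stays the default, but callers can swap in any huggingface-like tokenizer
    return AutoTokenizer.from_pretrained(tokenizer_name)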
torchrl/data/rlhf/dataset.py
Outdated
return dataset

def tokenize(
Tokenizers already have a method tokenize; maybe it's better to be more explicit about what this does.
Suggested change:
- def tokenize(
+ def tokenize_dataset(
LGTM
LGTM!
cc @tcbegley @apbard
There are 3 TODOs I need help with.