Dolphin gpt 3.5 data mix #3606

Merged
merged 2 commits into from
Jul 25, 2023
@shahules786 shahules786 commented Jul 25, 2023

  • Added a random data mix for Dolphin that forms conversations from the GPT-3.5 file.
  • Only instructions of the same type are considered when picking samples at random to form a conversation.
  • Also ensured that the same sample is not used more than once.
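
The three points above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `build_conversations` is a hypothetical helper, the field names `instruction`, `input` and `output` are assumptions about the record layout, and unlike the PR's loop it closes a conversation before the character budget is exceeded.

```python
import random
from collections import defaultdict


def build_conversations(samples, max_char_len, seed):
    """Chain samples that share an instruction into multi-turn conversations.

    Hypothetical helper: each sample is assumed to be a dict with
    "instruction", "input" and "output" keys. No sample is used twice.
    """
    rng = random.Random(seed)

    # Bucket sample indices by instruction so a conversation only mixes one type.
    by_instruction = defaultdict(list)
    for idx, sample in enumerate(samples):
        by_instruction[sample["instruction"]].append(idx)

    conversations = []
    for indices in by_instruction.values():
        rng.shuffle(indices)  # random order within the group
        turns, conversation_len = [], 0
        for idx in indices:
            turn_len = len(samples[idx]["input"]) + len(samples[idx]["output"])
            # Start a new conversation once the budget would be exceeded.
            if turns and conversation_len + turn_len > max_char_len:
                conversations.append(turns)
                turns, conversation_len = [], 0
            turns.append((samples[idx]["input"], samples[idx]["output"]))
            conversation_len += turn_len
        if turns:
            conversations.append(turns)
    return conversations
```

Because every index is visited exactly once inside its own instruction group, each sample ends up in exactly one conversation.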

Configure

- dolphin-mix:
        num_samples: 100000
        max_char_len: 32000
        seed: 44

self.dataset = load_dataset(
    "ehartford/dolphin", data_files="flan5m-alpaca-uncensored.jsonl", cache_dir=cache_dir
)
self.dataset = self.dataset["train"].shuffle(seed).select(range(num_samples))

I should have used shuffle(seed) too in my code ;-) .. will update later.

conversation_len += len(input) + len(output)
removed_indices.append(idx)
while conversation_len < self.max_char_len:
    indices_to_pick = np.setdiff1d(available_indices, removed_indices)

Regenerating the set of indices to pick from on every loop iteration seems a bit costly .. but since it only happens during startup it might be ok.
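
One possible way to avoid recomputing the set difference, should the startup cost ever matter, is to shuffle the candidate indices once and pop from the end. This is only a sketch using the standard library: `pick_until_full` and the `lengths` mapping are hypothetical names, not part of this PR.

```python
import random


def pick_until_full(available_indices, lengths, max_char_len, seed):
    """Draw indices without replacement until the character budget is reached.

    Hypothetical helper: lengths[idx] is assumed to hold
    len(input) + len(output) for sample idx.
    """
    rng = random.Random(seed)
    pool = list(available_indices)
    rng.shuffle(pool)  # one O(n) shuffle replaces the per-iteration set difference

    picked, conversation_len = [], 0
    # Popping from the end of a shuffled list is an O(1) random draw without reuse.
    while pool and conversation_len < max_char_len:
        idx = pool.pop()
        picked.append(idx)
        conversation_len += lengths[idx]
    return picked, pool  # pool holds the indices that are still unused
```

The loop condition mirrors `while conversation_len < self.max_char_len` from the diff, so the last pick may still push the conversation past the budget.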

@andreaskoepf andreaskoepf left a comment


Thanks a lot .. very pythonic. ;-)

@andreaskoepf andreaskoepf merged commit c2f444d into LAION-AI:main Jul 25, 2023