How to get the exact source of pre-training data? #24

Open
shamanez opened this issue Dec 8, 2023 · 1 comment

Comments

shamanez commented Dec 8, 2023

Amazing work!!! Is there any way I can get access to the pre-training dataset?

If that is not possible, could you please point us to the sources?

chaoyi-wu (Owner) commented

Hello, the list of books we used is at https://github.com/chaoyi-wu/PMC-LLaMA/blob/main/MedicalBook.xlsx and, because of their licenses, I cannot share the exact contents; you will need to collect them online yourself. The other parts of the training data can be obtained from the following links:

Papers: https://github.com/allenai/s2orc
Instruction data: https://huggingface.co/datasets/axiong/pmc_llama_instructions
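
For anyone who wants to fetch these programmatically, here is a minimal sketch rather than an official script. It assumes the book list still lives on the `main` branch at the URL above, and that the instruction data can be loaded by its repo id with the third-party Hugging Face `datasets` package (extra arguments such as a split or `data_files` may be needed). The helper names `github_blob_to_raw`, `download_book_list`, and `load_instruction_data` are made up for illustration.

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Both values are copied from the links in this thread.
BOOK_LIST_URL = "https://github.com/chaoyi-wu/PMC-LLaMA/blob/main/MedicalBook.xlsx"
INSTRUCTION_DATASET = "axiong/pmc_llama_instructions"


def github_blob_to_raw(blob_url: str) -> str:
    """Turn a github.com '/blob/' page URL into a raw.githubusercontent.com file URL."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / branch / path...
    # Assumption: the branch name contains no slash (true for "main").
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, branch, *file_path = parts
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_path]
    )


def download_book_list(dest: str = "MedicalBook.xlsx") -> str:
    """Download the spreadsheet of book titles (needs network access)."""
    urlretrieve(github_blob_to_raw(BOOK_LIST_URL), dest)
    return dest


def load_instruction_data():
    """Load the instruction data (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # lazy import: third-party dependency

    return load_dataset(INSTRUCTION_DATASET)


if __name__ == "__main__":
    # Only prints the direct-download URL; nothing is fetched here.
    print(github_blob_to_raw(BOOK_LIST_URL))
```

Note that the spreadsheet only lists the books; the book texts themselves still have to be collected separately, and the S2ORC papers are obtained by following the instructions in the allenai/s2orc repository.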
