Add korean dataset #1157

Open
jeonsworld opened this issue Feb 5, 2023 · 5 comments

@jeonsworld
Contributor

For Open-Assistant to work in Korean, we are working on adding a Korean dataset.

Creating a dataset from scratch is quite difficult. To add one efficiently, we are proceeding as follows.

  1. Crawl the dataset
    • Sources that include Korean content, such as wikihow, and can be used in "instruction-fulfillment" form.
  2. Machine translation
    • Obtain a Korean dataset by machine-translating the English datasets from the OA Dataset List.
    • Machine translation may not be perfect, but I think poor translations can be filtered out during the labeling task.
    • In addition, if you have a list of datasets that I can refer to, please share!

These datasets are expected to be used for labeling and for RLHF training.
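
A rough sketch of the machine-translation step, to make the filtering idea concrete. Everything here is an assumption for illustration: the `instruction` / `fulfillment` field names, the `translate` callable (a stand-in for whatever MT system ends up being used), and the Hangul-ratio heuristic with its 0.5 threshold. It only flags suspicious rows for the labeling task; it does not judge translation quality.

```python
from typing import Callable, Dict, Iterable, List


def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables (U+AC00..U+D7A3)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")
    return hangul / len(chars)


def translate_pairs(
    pairs: Iterable[Dict[str, str]],
    translate: Callable[[str], str],  # hypothetical: any English-to-Korean MT function
    min_hangul: float = 0.5,          # assumed threshold, not tuned on real data
) -> List[Dict[str, object]]:
    """Translate each pair and flag rows that look untranslated or garbled."""
    out = []
    for pair in pairs:
        ko_instruction = translate(pair["instruction"])
        ko_fulfillment = translate(pair["fulfillment"])
        # A low Hangul ratio in either field suggests the text was left in English.
        lowest = min(hangul_ratio(ko_instruction), hangul_ratio(ko_fulfillment))
        out.append(
            {
                "instruction": ko_instruction,
                "fulfillment": ko_fulfillment,
                "needs_review": lowest < min_hangul,
            }
        )
    return out
```

Rows with `needs_review` set could be prioritized or dropped during labeling, while the rest go through the normal labeling flow.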

@AbdBarho added the data label Feb 5, 2023
@hyunwoongko
Contributor

hyunwoongko commented Feb 6, 2023

Hi. I am a project lead of the EleutherAI polyglot team.
FYI, we have many Korean datasets. If you want, we can help with this.

@jeonsworld
Contributor Author

Hi, @hyunwoongko

Currently, I am working on converting public Korean data into an "instruction-fulfillment" format.
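
To illustrate what that conversion might look like for how-to style data, here is a minimal sketch. The input shape (a title plus a list of steps), the `instruction` / `fulfillment` field names, and the JSON Lines output are all assumptions for illustration, not the actual format used in this work.

```python
import json
from typing import Dict, Iterable, List


def to_instruction_fulfillment(title: str, steps: List[str]) -> Dict[str, str]:
    """Turn a how-to article (title plus ordered steps) into one instruction-fulfillment pair."""
    # Drop empty steps first so the numbering stays contiguous.
    cleaned = [s.strip() for s in steps if s.strip()]
    fulfillment = "\n".join(f"{i}. {s}" for i, s in enumerate(cleaned, start=1))
    return {"instruction": title.strip(), "fulfillment": fulfillment}


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, keeping Korean text readable (no \\u escapes)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

For example, `to_instruction_fulfillment("라면 끓이는 법", ["물을 끓인다.", "면을 넣는다."])` yields a pair whose fulfillment is the two steps as a numbered list.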

I don't know what type of data you can provide, but any data you provide will be of great help.
If possible, please let me know what data you can provide!

Thank you!

@CertifiedJoon
Contributor

@ontocord If this issue is relevant, I would love to take over from here. Would that be possible?

@camsdixon1
Collaborator

@CertifiedJoon if you are still interested in taking on this project, I can assign it to you.

@CertifiedJoon
Contributor

@camsdixon1 yeah! please do :)
