Add korean dataset #1157

jeonsworld · 2023-02-05T07:45:36Z

For Open-Assistant to work in Korean, we are working on adding a Korean dataset.

Creating a dataset from scratch is quite difficult. To add one efficiently, we will proceed as follows.

Current progress is as follows.

Crawl the dataset
- Collect datasets that include Korean, such as wikiHow, and can be used in an "instruction-fulfillment" format.
Machine translation
- Obtain a Korean dataset by machine-translating the English datasets from the OA Dataset List.
- Machine translation may not be perfect, but I think the errors can be filtered out in the labeling task.
- In addition, if you have a list of datasets that I can refer to, please share!
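
A minimal sketch of the two routes above, assuming a simple record layout. The `InstructionPair` class, its field names, the `needs_review` flag, and the `translate` callable are all hypothetical placeholders for illustration; they are not part of the Open-Assistant codebase, and no particular translation service is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class InstructionPair:
    """One "instruction-fulfillment" record (hypothetical layout)."""
    instruction: str
    fulfillment: str
    source: str         # e.g. "crawled" or "machine_translated"
    needs_review: bool  # True when a labeler should check the text


def from_howto(title: str, steps: Iterable[str]) -> InstructionPair:
    """Turn a crawled Korean how-to article into one pair.

    The title becomes the instruction; the steps become a numbered answer.
    Assumption: crawled text is already Korean, so it is not flagged.
    """
    body = "\n".join(f"{i}. {s.strip()}" for i, s in enumerate(steps, start=1))
    return InstructionPair(title.strip(), body, "crawled", needs_review=False)


def translate_pair(
    pair_en: InstructionPair, translate: Callable[[str], str]
) -> InstructionPair:
    """Machine-translate an English pair into Korean.

    `translate` is any English-to-Korean function supplied by the caller.
    The result is always flagged, since imperfect translations are meant
    to be filtered out during the labeling task.
    """
    return InstructionPair(
        instruction=translate(pair_en.instruction),
        fulfillment=translate(pair_en.fulfillment),
        source="machine_translated",
        needs_review=True,
    )
```

Passing the translator in as a plain function keeps the sketch independent of any specific translation backend, and the `needs_review` flag gives the labeling step a way to prioritize machine-translated records.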

These datasets are expected to be used for labeling and for RLHF training.

hyunwoongko · 2023-02-06T09:52:00Z

Hi. I am a project lead of the EleutherAI polyglot team.
FYI, we have many Korean datasets. If you want, we can support this.

jeonsworld · 2023-02-06T13:04:26Z

Hi, @hyunwoongko

Currently, I am working on converting public Korean data into an "instruction-fulfillment" format.

I don't know what type of data you can provide, but any data you provide will be of great help.
If possible, please let me know what data you can provide!

Thank you!

CertifiedJoon · 2023-06-08T16:09:55Z

@ontocord If this issue is relevant, I would love to take over from here. Would that be possible?

camsdixon1 · 2023-07-05T23:32:25Z

@CertifiedJoon if you are still interested in taking on this project, I can assign it to you.

CertifiedJoon · 2023-07-06T00:36:24Z

@camsdixon1 yeah! please do :)

AbdBarho added the data label Feb 5, 2023

huu4ontocord assigned jeonsworld Feb 7, 2023

camsdixon1 assigned camsdixon1 and CertifiedJoon Jul 6, 2023

CertifiedJoon mentioned this issue Jul 6, 2023

Add: Korean QA dataset #3551

Open
