Skip to content
/ CIDAR Public

Instruction dataset for Arabic with 10,000 instruction and output pairs. CIDAR can be used to fine-tune LLMs to follow instructions.

Notifications You must be signed in to change notification settings

ARBML/CIDAR

Repository files navigation

🌴CIDAR

Culturally Relevant Instruction Dataset For Arabic

[ Paper - HuggingFace ]

CIDAR contains 10,000 instructions and their output. The dataset was created by selecting around 9,109 samples from Alpagasus dataset and then translating it to Arabic using ChatGPT. In addition, we append that with around 891 Arabic grammar instructions from the website Ask the teacher. All the 10,000 samples were reviewed by around 12 reviewers.

📝 CIDAR Annotation

Our data annotation platform was built on top of Flask library. You can follow the next steps to run the app.

⚡️To Run The App:

  1. Install the requirements pip install -r requirements.txt.
  2. Run the app using flask flask --app app run .

✨ Demo

The annotation app contains two sections: English and Arabic. Reviewers can make changes to Instruction, Input, and Output to fix mistakes and align data with Arabic culture.

instruction(str): Required instruction.

input(str): [Optional] input for the instruction. For example, when the instruction is to find the mistake in the next sentence, the input is the sentence.

output(str): Response for the given instruction.

Also, reviewers are able to track their contribution record and the number of instructions remaining.

📁 Loading The Dataset

You can download the dataset from HuggingFace - CIDAR.

from datasets import load_dataset
cidar = load_dataset('arbml/CIDAR')

📄 Sample From The Dataset:

This sample shows Arabic-specific instructions related to البلاغة(rhetoric) and الإعراب(grammatical status). The sentence is originally in Arabic.

index: 10030

instruction: لو سمحت اين صيغة المبالغة في " الجاحظ اديب سيارة اراؤه" و ما اعراب الجملة؟

output:

       'صيغة المبالغة في هذه الجملة هي "سيّارة"، فاسم الفاعل من "السير" هو '
       '"سائر/ سائرة" وإذا أردت الدلالة على المبالغة في سير الآراء (أي '
       'شهرتها وكثرة تداولها) قلت "آراء سيّارة".\r\n'
       'أما إعراب الجملة فهو: "الجاحظُ": مبتدأ مرفوع بالضمة. "أديبٌ": خبر '
       'مرفوع بالضمة. "سيارةٌ": خبر متقدم، و"آراؤُه": مبتدأ متأخر، وهو '
       'مضاف والهاء ضمير متصل مضاف إليه في محل جر.  ويمكن اعتبار "سيارة" '
       'مبتدأ وهو وصف يعمل عمل فعله، و"آراؤُه" فاعل سدّ مسدّ الخبر.\r\n'
       'وفي الحالتين فجملة "سيارة آراؤه" جملة اسمية في محل رفع نعت '
       'لـ"أديب".'

🧩 Contributers

There were at least 12 contributors to the annotation of CIDAR. You can check the list here.

⛔️ Limitations and Future Work

CIDAR is intended for research purposes only. The authors disclaim any responsibility for misuse and condemn any use contrary to Arabic culture or Islamic values. Even though subjected to human verification, there is no guarantee that responses are entirely aligned with Arabic culture and Islamic values. Users of the dataset are urged to exercise caution, employ critical thinking, and seek guidance from representative figures when necessary.

🔑 License

CIDAR is intended and licensed for research use only. The dataset and weight diffs are licensed under CC BY NC 4.0 (LIMITED TO NON-COMMERCIAL USE). Models trained using the dataset should not be used outside of research purposes. Creative Commons NonCommercial (CC BY-NC 4.0).

Citation

@misc{alyafeai2024cidar,
      title={CIDAR: Culturally Relevant Instruction Dataset For Arabic}, 
      author={Zaid Alyafeai and Khalid Almubarak and Ahmed Ashraf and Deema Alnuhait and Saied Alshahrani and Gubran A. Q. Abdulrahman and Gamil Ahmed and Qais Gawah and Zead Saleh and Mustafa Ghaleb and Yousef Ali and Maged S. Al-Shaibani},
      year={2024},
      eprint={2402.03177},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

Instruction dataset for Arabic with 10,000 instruction and output pairs. CIDAR can be used to fine-tune LLMs to follow instructions.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published