Keep it Local: Comparing Domain-Specific LLMs in Native Language and Machine Translation using Parallel Corpora
- Javier Osorio, University of Arizona
- Sultan Alsarra, King Saud University
- Amber Converse, University of Arizona
- Afraa Alshammari, University of Texas - Dallas
- Dagmar Heintze, University of Texas - Dallas
- Latifur Khan, University of Texas - Dallas
- Naif Alatrush, University of Texas - Dallas
- Patrick T. Brandt, University of Texas - Dallas
- Vito D'Orazio, West Virginia University
- Niamat Zawad, University of Texas - Dallas
- Mahrusa Billah, University of Texas - Dallas
This repository contains the replication files for the paper "Keep it Local: Comparing Domain-Specific LLMs in Native Language and Machine Translation using Parallel Corpora".
The repository contains the following folders:
- 1_data: includes the raw text data in English, Spanish, and Arabic, as well as the annotations.
- 2_quality_analysis: includes the Python scripts used to generate the translation quality metrics and their corresponding data output.
- 3_downstream_tasks: includes the Python scripts used to fine-tune the different models on the binary and multi-class classification tasks.
- 4_analysis: includes the R scripts used to generate the Figures and Tables reported in the paper.
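Before running any of the scripts, it can help to confirm that the four folders listed above are present in your local copy. The following is a minimal sketch, not part of the replication files: the function name `missing_folders` and the assumption that it is run from the repository root are illustrative only.

```python
from pathlib import Path

# Folder names as listed in this README.
EXPECTED_FOLDERS = [
    "1_data",
    "2_quality_analysis",
    "3_downstream_tasks",
    "4_analysis",
]


def missing_folders(repo_root="."):
    """Return the expected folders that are not present under repo_root.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not what they contain.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_folders()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

The check is deliberately limited to directory names, since those are the only details of the layout stated here.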
The research reported herein was supported in part by NSF awards DMS-1737978, DGE-2039542, OAC-1828467, OAC-1931541, OAC-2311142, and DGE-1906630; ONR awards N00014-17-1-2995 and N00014-20-1-2738; and Army Research Office Contract No. W911NF2110032.
[1] Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B. (2016). The United Nations Parallel Corpus. Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.