NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual Prediction of Human Reading Behavior in Universal Language Space
This repository contains a single Python notebook for extracting features for eye-tracking prediction, such as frequencies, n-grams, information-theoretic predictors, and psycholinguistically motivated predictors. As the title suggests, these feature values were extracted from the converted IPA form of the words.
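To make the idea of IPA-level features concrete, here is a minimal sketch of two of the feature families mentioned above: symbol n-gram counts and a simple information-theoretic measure. The function names, the toy IPA strings, and the exact feature definitions are illustrative assumptions, not the notebook's actual implementation.

```python
import math
from collections import Counter


def char_ngrams(ipa, n):
    """Return all contiguous n-grams of code points in an IPA string."""
    return [ipa[i:i + n] for i in range(len(ipa) - n + 1)]


def ngram_counts(ipa_words, n):
    """Count n-grams over a collection of IPA-transcribed words."""
    counts = Counter()
    for word in ipa_words:
        counts.update(char_ngrams(word, n))
    return counts


def shannon_entropy(ipa):
    """Shannon entropy (in bits) of the symbol distribution within one word."""
    if not ipa:
        return 0.0
    total = len(ipa)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(ipa).values())


# Toy IPA strings, made up for illustration only (not from the shared task data).
ipa_words = ["kæt", "hæt", "dɔɡ"]
bigrams = ngram_counts(ipa_words, 2)
entropies = {w: shannon_entropy(w) for w in ipa_words}
```

Note that this sketch treats each Unicode code point as one symbol, so IPA segments written with combining diacritics would be split into several symbols. A real pipeline may want a proper IPA segmenter instead.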
Paper: https://arxiv.org/abs/2202.10855
- Epitran for converting words to IPA form. The conversion can be done for English, German, Hindi, Dutch, Russian, and Mandarin.
- Imageability and concreteness estimates derived from word embeddings, from the work of Ljubešić et al. (2018). Download the files here.
- If you want to reproduce the results for the crosslingual task, you need phonetic transcriptions of the surprise language (Danish) data. You may subscribe to this paid service, or you may email me for the file 😉.
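The Epitran step in the list above can be sketched as follows. This is a hedged outline rather than the notebook's code: `make_transliterator` and `words_to_ipa` are hypothetical helper names, and the language code `"deu-Latn"` in the usage comment is only an example of Epitran's code format. Epitran is imported lazily so the helpers can also be exercised with any other word-to-IPA callable.

```python
def make_transliterator(lang_code):
    """Build a word -> IPA function backed by Epitran.

    Requires the third-party `epitran` package; imported here so the rest
    of the module works without it.
    """
    import epitran
    return epitran.Epitran(lang_code).transliterate


def words_to_ipa(words, transliterate):
    """Map each word to its IPA form using the given callable."""
    return {word: transliterate(word) for word in words}


# Usage (assumes Epitran is installed):
#   to_ipa = make_transliterator("deu-Latn")
#   ipa_by_word = words_to_ipa(["Haus", "Buch"], to_ipa)
```

Some languages may need extra resources beyond the `epitran` package itself; check the Epitran documentation for the language you are converting.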
Please refer to the official Shared Task website for more information and to get the train/valid/test dataset: https://cmclorg.github.io/shared_task
If you need any help reproducing the results, please don't hesitate to contact me:
Joseph Marvin Imperial
jrimperial@national-u.edu.ph
www.josephimperial.com