ITALIC: An ITALian Intent Classification Dataset

This repository contains the code and the dataset for the paper ITALIC: An ITALian Intent Classification Dataset.

ITALIC is a new intent classification dataset for the Italian language and the first of its kind. It includes spoken and written utterances annotated with 60 intents. The dataset is available on the Hugging Face Hub.

Data collection

The data collection follows the MASSIVE NLU dataset, an annotated textual dataset covering 60 intents. The original collection process is described in the paper Massive Natural Language Understanding.

Starting from the MASSIVE text, a pool of more than 70 volunteers was recruited to record and annotate the dataset. The volunteers recorded themselves reading the utterances (the original text is available in the MASSIVE dataset). Together with the audio, they provided a self-annotated description of the recording conditions (e.g., background noise, recording device). The recordings were then validated and, in case of errors, re-recorded by the volunteers.

Every audio recording included in the dataset has been validated by at least two volunteers. All validators are native Italian speakers (self-annotated).

Dataset

The dataset is available on the Hugging Face Hub. It comes in 3 configurations, each divided into train, validation, and test splits:

easy: all the utterances are randomly shuffled before being divided into the three splits.
speaker: the utterances are divided by speaker. Each split only contains utterances from a pool of speakers that does not overlap with the other splits.
noisy: the utterances are divided by recording conditions. The test split only contains utterances with the highest level of noise.
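
To illustrate how the speaker and noisy setups differ from a random shuffle, the sketch below builds both kinds of split on toy records. It is only one possible reading of the descriptions above, not the script used to build ITALIC: the record layout, the integer `noise` scale, the split ratios, and the helper names `speaker_split` and `noisy_split` are all assumptions made for the example.

```python
import random

def speaker_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole speakers to train/validation/test so no speaker is shared."""
    speakers = sorted({r["speaker"] for r in records})
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * ratios[0])
    n_val = int(len(speakers) * ratios[1])
    groups = {
        "train": set(speakers[:n_train]),
        "validation": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }
    return {
        name: [r for r in records if r["speaker"] in ids]
        for name, ids in groups.items()
    }

def noisy_split(records, val_ratio=0.1, seed=0):
    """Put the noisiest recordings in test; shuffle the rest into train/validation."""
    max_noise = max(r["noise"] for r in records)
    test = [r for r in records if r["noise"] == max_noise]
    rest = [r for r in records if r["noise"] < max_noise]
    random.Random(seed).shuffle(rest)
    n_val = int(len(rest) * val_ratio)
    return {"train": rest[n_val:], "validation": rest[:n_val], "test": test}

# Toy data: 10 hypothetical speakers, 5 utterances each, noise level 0-2 (invented).
toy = [
    {"speaker": f"spk{i}", "utt": f"frase {j}", "noise": (i + j) % 3}
    for i in range(10)
    for j in range(5)
]
by_speaker = speaker_split(toy)
by_noise = noisy_split(toy)
```

In the speaker sketch a model is always evaluated on voices it has never heard, while in the noisy sketch it is evaluated only on the hardest recording conditions.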

Each split contains the following annotations:

utt: the original text of the utterance.
audio: the audio recording of the utterance.
intent: the intent of the utterance.
speaker: the speaker of the utterance. The speaker is identified by a unique identifier and has been anonymized.
age: the age of the speaker.
is_native: whether the speaker is a native Italian speaker or not.
gender: the gender of the speaker (self-annotated).
region: the region of the speaker (self-annotated).
nationality: the nationality of the speaker (self-annotated).
lisp: the kind of lisp of the speaker, if any (self-annotated). It is empty when the speaker has no lisp.
education: the education level of the speaker (self-annotated).
environment: the environment of the recording (self-annotated).
device: the device used for the recording (self-annotated).
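
The self-annotated fields make it possible to slice the data by speaker or recording condition. The sketch below shows one way to do that on plain Python dictionaries. The rows and their values (for example `silent` or `phone`) are invented for illustration, `is_native` is assumed to be a boolean, and `filter_rows` is a hypothetical helper that is not part of the repository.

```python
from collections import Counter

# Toy rows with a subset of the fields listed above; all values are invented.
rows = [
    {"utt": "accendi la luce", "speaker": "spk1", "is_native": True,
     "environment": "silent", "device": "phone"},
    {"utt": "che ore sono", "speaker": "spk2", "is_native": False,
     "environment": "noisy", "device": "laptop"},
    {"utt": "metti una sveglia", "speaker": "spk1", "is_native": True,
     "environment": "noisy", "device": "phone"},
]

def filter_rows(rows, **conditions):
    """Keep the rows whose fields equal every given condition."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]

# Utterances from native speakers only.
native_rows = filter_rows(rows, is_native=True)
# Utterances recorded on a phone in a noisy environment.
noisy_phone_rows = filter_rows(rows, environment="noisy", device="phone")
# Number of utterances per recording environment.
rows_per_environment = Counter(r["environment"] for r in rows)
```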

Usage

The dataset can be loaded using the datasets library:

from datasets import load_dataset
...
# complete information will be provided upon publication

The dataset has been designed for intent classification, but its annotations support other tasks as well:

Intent classification: the intent column can be used as the label.
Speaker identification: the speaker column can be used as the label.
Automatic speech recognition: the utt column can be used as the reference transcription.
Accent identification: the region column can be used as the label.
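
As a rough sketch of how the task list above translates into code, the snippet below picks the target column for a task and, for classification tasks, turns its values into integer ids. The task names in `TASK_TO_LABEL`, the helper `build_label_maps`, and the toy rows are assumptions made for this example; only the column names come from the dataset description.

```python
# Hypothetical task names mapped to the column used as the target.
TASK_TO_LABEL = {
    "intent_classification": "intent",
    "speaker_identification": "speaker",
    "speech_recognition": "utt",
    "accent_identification": "region",
}

def build_label_maps(rows, column):
    """Map each distinct value of `column` to an integer id, in sorted order."""
    labels = sorted({row[column] for row in rows})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

# Toy rows; the values are invented for illustration.
rows = [
    {"utt": "che tempo fa domani", "intent": "weather_query", "speaker": "spk1", "region": "lazio"},
    {"utt": "metti una sveglia alle sette", "intent": "alarm_set", "speaker": "spk2", "region": "veneto"},
    {"utt": "fammi sentire un po' di musica", "intent": "play_music", "speaker": "spk1", "region": "lazio"},
]

column = TASK_TO_LABEL["intent_classification"]
label2id, id2label = build_label_maps(rows, column)
# Integer targets, in the same order as the rows.
targets = [label2id[row[column]] for row in rows]
```

For speech recognition the `utt` column is free text, so it would be used directly as the reference transcription instead of being mapped to ids.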

For more information about the dataset, please refer to the paper.

Models used in the paper

Parameter settings

The training parameters were chosen to allow a fair comparison between the models and to follow the recommendations of the related literature. They are summarized in the following table:

Model	Task	Parameters	Learning rate	Batch size	Max epochs	Warmup	Weight decay	Avg. training time	Avg. inference time
facebook/wav2vec2-xls-r-300m	SLU	300M	1e-4	128	30	0.1 ratio	0.01	9m 35s per epoch	13ms per sample
facebook/wav2vec2-xls-r-1b	SLU	1B	1e-4	32	30	0.1 ratio	0.01	21m 30s per epoch	29ms per sample
jonatasgrosman/wav2vec2-large-xlsr-53-italian	SLU	300M	1e-4	128	30	0.1 ratio	0.01	9m 35s per epoch	13ms per sample
jonatasgrosman/wav2vec2-xls-r-1b-italian	SLU	1B	1e-4	32	30	0.1 ratio	0.01	21m 30s per epoch	29ms per sample
ALM/whisper-it-small-augmented	ASR	224M	1e-5	8	5	500 steps	0.01	XX	XX
EdoAbati/whisper-medium-it-2	ASR	769M	1e-5	8	5	500 steps	0.01	XX	XX
EdoAbati/whisper-large-v2-it	ASR	1.5B	1e-5	8	5	500 steps	0.01	XX	XX
bert-base-multilingual-uncased	NLU	167M	5e-5	8	5	500 steps	0.01	XX	XX
facebook/mbart-large-cc25	NLU	611M	5e-5	8	5	500 steps	0.01	XX	XX
dbmdz/bert-base-italian-xxl-uncased	NLU	110M	5e-5	8	5	500 steps	0.01	XX	XX
morenolq/bart-it	NLU	141M	5e-5	8	5	500 steps	0.01	XX	XX

In all cases we opted for the AdamW optimizer. All experiments were run on a single NVIDIA A6000 GPU.
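
The Warmup column is expressed either as a ratio of the total training steps (0.1 for the SLU models) or as a fixed number of steps (500 for the ASR and NLU models). The sketch below shows how such a setting could translate into a per-step learning rate. It assumes linear warmup followed by linear decay to zero, which the table does not state, and the total of 1000 training steps is a made-up figure; `warmup_steps_from_ratio` and `lr_at_step` are hypothetical helpers.

```python
import math

def warmup_steps_from_ratio(total_steps, ratio):
    """Convert a warmup ratio into a number of optimizer steps."""
    return math.ceil(total_steps * ratio)

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# SLU setting from the table: learning rate 1e-4, warmup ratio 0.1.
total_steps = 1000  # hypothetical; depends on dataset size, batch size, and epochs
warmup = warmup_steps_from_ratio(total_steps, 0.1)

# Learning rate at a few points of the run.
schedule = {s: lr_at_step(s, total_steps, warmup, 1e-4) for s in (0, 50, 100, 550, 1000)}
```

Under these assumptions the learning rate reaches its peak value at the end of warmup and is halved midway through the decay phase.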

SLU intent classification

The models used in the paper are available on the Hugging Face Hub.

🌍 facebook/wav2vec2-xls-r-300m
🌍 facebook/wav2vec2-xls-r-1b
🇮🇹 jonatasgrosman/wav2vec2-xls-r-1b-italian
🇮🇹 jonatasgrosman/wav2vec2-large-xlsr-53-italian

ASR

The models used in the paper are available on the Hugging Face Hub.

🌍 Whisper large (zero-shot ASR): openai/whisper-large-v2
🇮🇹 Whisper small: ALM/whisper-it-small-augmented
🇮🇹 Whisper medium: EdoAbati/whisper-medium-it-2
🇮🇹 Whisper large: EdoAbati/whisper-large-v2-it

NLU intent classification

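The models used in the paper are available on the Hugging Face Hub.

🌍 bert-base-multilingual-uncased
🌍 facebook/mbart-large-cc25
🇮🇹 dbmdz/bert-base-italian-xxl-uncased
🇮🇹 morenolq/bart-it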

Citation

If you use this dataset in your research, please cite the following paper:

TO BE ADDED UPON PUBLICATION

License

The dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
