William N. Havard

The MaSS dataset is a large and clean speech dataset of 8,130 parallel spoken utterances across 8 languages (56 possible language pairs). The languages covered (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) make it possible to carry out research on speech-to-speech alignment as well as on translation for typologically different language pairs. The dataset is described in detail in Zanon Boito et al. (2020).
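The figure of 56 language pairs counts each direction separately (8 × 7 ordered source/target pairs). A minimal sketch of that arithmetic is given here; the variable names are illustrative and are not part of the dataset or its extraction pipeline.

```python
from itertools import combinations, permutations

# The eight languages covered by MaSS, as listed above.
LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]

# Ordered (source, target) pairs: each direction counts once, so 8 * 7.
directed_pairs = list(permutations(LANGUAGES, 2))

# Unordered pairs, if direction does not matter for the task at hand.
undirected_pairs = list(combinations(LANGUAGES, 2))

print(len(directed_pairs))    # 56
print(len(undirected_pairs))  # 28
```

Whether the ordered or the unordered pairs are the right unit depends on the task: translation is directional, while alignment between two languages can be treated as symmetric.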

We are not allowed to share the audio files; however, we provide the extraction pipeline in the following repository. If you have trouble building the dataset yourself, feel free to email me! Inside the dataset folder, for each language we provide:

Alignment textgrids (generated by the MAUS forced aligner)
Final textual output and segments textgrids
Mel filterbank spectrograms (as used in the paper’s experiments)

If you use this data in your own publications, please cite our LREC paper.

Sample

Matt. 1:2	Verse
French	Abraham engendra Isaac Isaac engendra Jacob Jacob engendra Juda et ses frères
English	Abraham was the father of Isaac and Isaac the father of Jacob and Jacob the father of Judah and his brothers
Spanish	Abrahán engendró a Isaac Isaac engendró a Jacob Jacob engendró a Judá y a sus hermanos
Hungarian	Ábrahám fia volt Izsák Izsáké Jákób Jákób fiai pedig Júda és testvérei
Basque	Abrahamek Isaak sortu zuen Isaakek Jakob Jakobek Juda eta honen anaiak
Finnish	Aabrahamille syntyi Iisak Iisakille syntyi Jaakob Jaakobille syntyi Juuda ja tämän veljet
Russian	Авраам родил Исаака Исаак родил Иакова Иаков родил Иуду и братьев его
Romanian	Avraam a născut pe Isaac Isaac a născut pe Iacov Iacov a născut pe Iuda și fraţii lui

References

Zanon Boito, *M., Havard, *W., Garnerin, M., Le Ferrand, É., & Besacier, L. (2020). MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. Proceedings of the 12th Language Resources and Evaluation Conference, 6486–6493. https://aclanthology.org/2020.lrec-1.799