MaSS

MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

Posted: September 29, 2021

The MaSS dataset is a large and clean speech dataset of 8,130 parallel spoken utterances across 8 languages (56 possible language pairs). The languages covered (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) make it possible to carry out research on speech-to-speech alignment as well as on translation for typologically different language pairs. The dataset is described in detail in Zanon Boito et al. (2020).
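The figure of 56 follows from counting ordered (source, target) pairs: with 8 languages there are 8 × 7 = 56 directions, so French to English and English to French count separately. The short sketch below checks that arithmetic. The helper name `language_pairs` is purely illustrative and is not part of the MaSS extraction pipeline.

```python
from itertools import permutations

# The eight languages covered by MaSS, as listed above.
LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def language_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 8 * 7 = 56 ordered pairs
```

If direction is ignored, the same 8 languages give 28 unordered pairs, which `itertools.combinations(LANGUAGES, 2)` would enumerate.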

We are not allowed to share the audio files; however, we provide the extraction pipeline in the following repository. If you have trouble building the dataset yourself, feel free to email me! Inside the dataset folder, for each language we provide:

If you use this data in your own publications, please cite our LREC paper.


Sample

Matt. 1:2
French: Abraham engendra Isaac Isaac engendra Jacob Jacob engendra Juda et ses frères
English: Abraham was the father of Isaac and Isaac the father of Jacob and Jacob the father of Judah and his brothers
Spanish: Abrahán engendró a Isaac Isaac engendró a Jacob Jacob engendró a Judá y a sus hermanos
Hungarian: Ábrahám fia volt Izsák Izsáké Jákób Jákób fiai pedig Júda és testvérei
Basque: Abrahamek Isaak sortu zuen Isaakek Jakob Jakobek Juda eta honen anaiak
Finnish: Aabrahamille syntyi Iisak Iisakille syntyi Jaakob Jaakobille syntyi Juuda ja tämän veljet
Russian: Авраам родил Исаака Исаак родил Иакова Иаков родил Иуду и братьев его
Romanian: Avraam a născut pe Isaac Isaac a născut pe Iacov Iacov a născut pe Iuda și fraţii lui

References

  1. Zanon Boito, M.*, Havard, W.*, Garnerin, M., Le Ferrand, É., & Besacier, L. (2020). MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. Proceedings of the 12th Language Resources and Evaluation Conference, 6486–6493. https://aclanthology.org/2020.lrec-1.799