MaSS
MaSS - Multilingual corpus of Sentence-aligned Spoken utterances
The MaSS dataset is a large and clean speech dataset of 8,130 parallel spoken utterances across 8 languages (56 possible language pairs). The languages covered (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) make it possible to carry out research on speech-to-speech alignment as well as on translation for typologically different language pairs. The dataset is described in detail in the following paper (Zanon Boito et al., 2020).
We are not allowed to share the audio files; however, we provide the extraction pipeline in the following repository. If you have trouble building the dataset yourself, feel free to email me! Inside the dataset folder, for each language we provide:
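The figure of 56 language pairs counts each translation direction separately (for example, French to English and English to French are two pairs). As a quick check, the short sketch below enumerates the pairs from the eight language names listed above; the variable names are illustrative only and are not part of the dataset or its pipeline.

```python
from itertools import combinations, permutations

# The eight languages covered by MaSS, as listed above.
LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]

# Directed pairs: (source, target) with source != target.
directed_pairs = list(permutations(LANGUAGES, 2))

# Undirected pairs: each combination of two languages counted once.
undirected_pairs = list(combinations(LANGUAGES, 2))

print(len(directed_pairs))    # 8 * 7 = 56
print(len(undirected_pairs))  # 8 * 7 / 2 = 28
```

If direction does not matter for a given experiment, the same eight languages yield 28 unordered pairs.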
- Alignment textgrids (generated by the Maus forced aligner)
- Final textual output and segments textgrids
- Mel Filterbank Spectrograms (as used in the paper’s experiments)
If you use this data in your own publications, please cite our LREC paper.
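As an illustration of how the alignment textgrids might be read, here is a minimal sketch that extracts labelled intervals from a TextGrid. It assumes the files use Praat's long text format; the inline sample, the tier name `words`, and the helper `read_intervals` are hypothetical and only stand in for a real file from the dataset folder. It does not handle escaped quotes inside labels.

```python
import re

# Hypothetical sample in Praat's long TextGrid text format (not taken from MaSS).
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = ""
        intervals [2]:
            xmin = 0.4
            xmax = 0.9
            text = "Abraham"
        intervals [3]:
            xmin = 0.9
            xmax = 1.5
            text = "engendra"
'''

# An interval is an xmin/xmax pair immediately followed by a text label.
INTERVAL_RE = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "(.*)"')


def read_intervals(textgrid_text):
    """Return (start, end, label) tuples for all non-empty intervals."""
    intervals = []
    for start, end, label in INTERVAL_RE.findall(textgrid_text):
        if label:  # skip silences, which have an empty label
            intervals.append((float(start), float(end), label))
    return intervals


words = read_intervals(SAMPLE_TEXTGRID)
for start, end, label in words:
    print(f"{start:.2f}-{end:.2f}  {label}")
```

For real files, a dedicated TextGrid library is likely a more robust choice than a regular expression; the sketch is only meant to show the structure of the data.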
Sample
Matt. 1:2 | Verse | Audio |
---|---|---|
French | Abraham engendra Isaac Isaac engendra Jacob Jacob engendra Juda et ses frères | |
English | Abraham was the father of Isaac and Isaac the father of Jacob and Jacob the father of Judah and his brothers | |
Spanish | Abrahán engendró a Isaac Isaac engendró a Jacob Jacob engendró a Judá y a sus hermanos | |
Hungarian | Ábrahám fia volt Izsák Izsáké Jákób Jákób fiai pedig Júda és testvérei | |
Basque | Abrahamek Isaak sortu zuen Isaakek Jakob Jakobek Juda eta honen anaiak | |
Finnish | Aabrahamille syntyi Iisak Iisakille syntyi Jaakob Jaakobille syntyi Juuda ja tämän veljet | |
Russian | Авраам родил Исаака Исаак родил Иакова Иаков родил Иуду и братьев его | |
Romanian | Avraam a născut pe Isaac Isaac a născut pe Iacov Iacov a născut pe Iuda și fraţii lui | |