Personalized feedback based on the automated analysis of audio samples could be useful in a wide range of intervention contexts, from early childhood to neurodegenerative programs, that target behaviors with vocal correlates. In this paper, we describe an automated pipeline that provides personalized feedback based on audio samples of caregiver-child conversations captured with a smartphone. The pipeline relies on open-source packages and Amazon Web Services (AWS) to provide a cheap, reproducible, and highly scalable solution for researchers and practitioners interested in early childhood development and caregiver-child interaction, and one that could be adapted to other use cases. It processes conversation files 1-10 minutes long at a cost of US$0.20 per hour of audio analyzed. It is currently operational in one large-scale experiment in Uruguay, where audio files are collected through a chatbot, whose implementation is not covered in this paper. Finally, we lay out the limitations of our approach and potential improvements.
Speech Maturity Dataset: A cross-cultural corpus of naturalistic child and adult vocalizations
Kasia Hitczenko, Loann Peurey, William N. Havard, Kai Jia Tey, Amanda Seidl, Chiara Semenzin, Camila Scaff, Marvin Lavechin, Bridgette Kelleher, Lisa Hamrick, Lucas Gautheron, Margaret Cychosz, Marisa Casillas, and Alejandrina Cristia.
Over the first years of life, children’s spontaneous vocal productions become increasingly adult-like, both in their shape and phonetic properties, and lay the foundation for later phonetic and phonological development. Yet, research in this area has been limited to a narrow set of languages and communities, mainly Indo-European languages from Western(ised) speaker communities, and focused on a narrow age range (0-24 mo).
We present a new publicly-available dataset, the Speech Maturity Dataset (SMD), consisting of 258,914 clips manually labelled for speaker and vocalisation type extracted from the long-form recordings of 398 children (209 male, 186 female) from 2 months to 6 years of age from 14 communities (ranging from rich industrialised societies to farmer-forager speaker communities) in 25+ languages. Albeit already massive, our dataset represents the first version of an ongoing and collaborative effort between field linguists, psycholinguists, and citizen scientists. The data set is expected to be expanded on a regular basis, since the project is still live (LINK REMOVED FOR DOUBLE-BLIND REVIEW).
SMD is a superset of the already existing BabbleCor dataset (Cychosz et al., 2019), which originally consisted of 15k vocalisations. We followed the same methodology to construct our dataset, whereby all the clips received a label based on the majority vote of at least 3 citizen scientists (i.e., non-scientific volunteers who devote time to annotate and label scientific data). Unlike BabbleCor, which used the smaller and closed iHEARu-PLAY platform, we turned to the world’s largest open citizen science platform, Zooniverse, as it had a larger and more diverse pool of citizen scientists. Citizen scientists labelled vocalisations taken from naturalistic long-form recordings with their vocalisation type: laughing, crying, canonical (speech-like vocalisation containing an adjacent consonant and vowel), non-canonical (speech-like vocalisation without an adjacent consonant and vowel), or junk (silence or non-human sounds). For a subset of the clips (N=110,577), citizen scientists also labelled the speaker type: baby (younger than 3 years), child (3-12 years), female/male adolescent (12-18 years), or female/male adult.
SMD, which includes a wealth of metadata (child’s age/sex, linguistic environment, normativity, etc.), lends itself to several use cases. It can be used to study child vocalisation development at an unprecedented scale in a wide variety of communities, by computing indices of vocal development such as canonical proportion (i.e. the proportion of speech-like vocalizations that contain an adjacent consonant and vowel – regardless of whether they are in babble or meaningful speech) or linguistic proportion (i.e. the proportion of vocalizations that are speech-like). This dataset can also be used to train vocalisation-type classifiers in an effort to make software dedicated to the study of child language acquisition free, open-source, and reproducible.
We showcase a potential use of this dataset by presenting a preliminary analysis of canonical proportion and linguistic proportion. We fitted two linear mixed-effects models predicting canonical proportion and, separately, linguistic proportion from the child’s age, sex, and monolingualism as fixed effects, with child ID nested in corpus as a random effect to account for individual variation. While both models show a statistically significant positive effect of age (as expected, since these proportions should increase with age), we observe no significant effect of monolingualism or sex, suggesting that children follow a similar developmental trajectory. Results like these promise to allow researchers to significantly expand their knowledge of early vocal development.
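To make this kind of analysis concrete, the sketch below computes the two proportions from per-clip majority labels and fits a mixed model broadly of the kind described above with statsmodels. It is a minimal sketch, not the study's analysis code: the file name and column names (label, child_id, corpus, age_months, sex, monolingual) are illustrative assumptions, and the exact model specification used in the abstract may differ.

```python
# Minimal sketch: canonical and linguistic proportion per child, plus a linear
# mixed model with child nested in corpus (column names are assumptions).
import pandas as pd
import statsmodels.formula.api as smf

clips = pd.read_csv("smd_clips.csv")  # hypothetical export: one row per labelled clip


def proportions(g):
    speechlike = g["label"].isin(["canonical", "non_canonical"])
    vocalisations = g["label"] != "junk"
    return pd.Series({
        # canonical proportion: canonical clips / speech-like clips
        "canonical_prop": (g["label"] == "canonical").sum() / max(speechlike.sum(), 1),
        # linguistic proportion: speech-like clips / all non-junk clips
        "linguistic_prop": speechlike.sum() / max(vocalisations.sum(), 1),
    })


per_child = (
    clips.groupby(["corpus", "child_id", "age_months", "sex", "monolingual"])
    .apply(proportions)
    .reset_index()
)

# Random intercept per corpus plus a variance component for child within corpus,
# i.e. child ID nested in corpus.
model = smf.mixedlm(
    "canonical_prop ~ age_months + sex + monolingual",
    data=per_child,
    groups=per_child["corpus"],
    vc_formula={"child": "0 + C(child_id)"},
)
print(model.fit().summary())
```

The same formula with linguistic_prop as the outcome would give the second model; interactions or a different link function could be substituted without changing the overall shape of the analysis.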
Journals
2024
Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline
Alejandrina Cristia, Lucas Gautheron, Zixing Zhang, Björn Schuller, Camila Scaff, Caroline Rowland, Okko Räsänen, Loann Peurey, Marvin Lavechin, William Havard, Caitlin M. Fausey, Margaret Cychosz, Elika Bergelson, Heather Anderson, Najla Al Futaisi, and Melanie Soderstrom.
The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words from them, but rather adopt a top-down approach, first segmenting word-like units and then breaking them down into smaller units. This suggests that the ideal way of learning a language is to start from full semantic units. In this paper, we investigate whether this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluated how well such a network is able to learn a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient. We also explore at which level of the network’s architecture such information should be introduced so as to maximise performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, in which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.
LREC
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
*Marcely Zanon Boito, *William Havard, Mahault Garnerin, Éric Le Ferrand, and Laurent Besacier.
In Proceedings of the 12th Language Resources and Evaluation Conference
May
2020
The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. This article therefore proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation across typologically different language pairs. The quality of the final corpus is attested by a human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.
2019
CoNLL
Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech
William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier.
In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Nov
2019
In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description into a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned by neural networks – the gating paradigm – and show that the correct representation of a word is only activated if the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded representation of a given word; rather, some frames have a crucial effect on it. Finally, we suggest that word representations could be activated through a process of lexical competition.
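As a schematic illustration of how the gating paradigm can be applied to a trained model, the sketch below feeds increasingly long onsets ("gates") of a word's acoustic frames to an encoder and tracks the similarity of the resulting embedding to the word's visual referent. This is a hedged sketch, not the paper's code: encode_speech, image_embedding, and the frame array are placeholders for whatever model and features are in use.

```python
# Sketch of the gating paradigm: activate a word's representation from
# increasingly long onsets and track similarity to its visual referent.
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))


def gating_curve(word_frames, image_embedding, encode_speech, n_gates=10):
    """word_frames: (T, d) array of acoustic frames (e.g. MFCCs) for one word.
    encode_speech: callable mapping a (t, d) prefix to an embedding vector.
    Returns (prefix length, similarity to the referent) for each gate."""
    T = word_frames.shape[0]
    gate_lengths = np.linspace(1, T, n_gates, dtype=int)
    return [
        (int(t), cosine(encode_speech(word_frames[:t]), image_embedding))
        for t in gate_lengths
    ]

# If similarity only rises once the frames covering the first phoneme are
# included, the network is relying on the word onset rather than on a global
# acoustic pattern, which is the diagnostic used in the abstract above.
```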
ICASSP
Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese
William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier.
In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
May
2019
We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research.
Domestic Conferences
2024
TALN
Technologies de la parole et données de terrain : le cas du créole haïtien
William N. Havard, Renauld Govain, Daphne Gonçalves Teixeira, Benjamin Lecouteux, and Emmanuel Schang.
In Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
Jul
2024
We use fieldwork data in Haitian Creole, collected 40 years ago on cassette tapes and subsequently digitised, to train a native self-supervised learning (SSL) speech model (Wav2Vec2) for Haitian. We use a continued pre-training (CPT) approach on pre-trained SSL models from two foreign languages: the lexifier language (French) and an unrelated language (English). We compare the performance of these three SSL models, and of two other foreign SSL models that were directly fine-tuned, on a speech recognition task. Our results show that the best-performing model is the one trained with a CPT approach on the lexifier language, followed by the native model. We conclude that the "mobilising the archive" approach advocated by Bird (2020) is a promising way forward for designing speech technologies for new languages.
2021
TALN
Contribution d’informations syntaxiques aux capacités de généralisation compositionelle des modèles seq2seq convolutifs
Diana Nicoleta Popa, William N. Havard, Maximin Coavoux, Laurent Besacier, and Eric Gaussier.
In Traitement Automatique des Langues Naturelles
2021
FR : Les modèles neuronaux de type seq2seq manifestent d’étonnantes capacités de prédiction quand ils sont entraînés sur des données de taille suffisante. Cependant, ils échouent à généraliser de manière satisfaisante quand la tâche implique d’apprendre et de réutiliser des règles systématiques de composition et non d’apprendre simplement par imitation des exemples d’entraînement. Le jeu de données SCAN, constitué d’un ensemble de commandes en langage naturel associées à des séquences d’action, a été spécifiquement conçu pour évaluer les capacités des réseaux de neurones à apprendre ce type de généralisation compositionnelle. Dans cet article, nous nous proposons d’étudier la contribution d’informations syntaxiques sur les capacités de généralisation compositionnelle des réseaux de neurones seq2seq convolutifs.
EN: Classical sequence-to-sequence neural network architectures demonstrate astonishing prediction skills when trained on a sufficient amount of data. However, they fail to generalize when the task involves learning and reusing systematic composition rules rather than simply imitating training examples. The SCAN dataset, which consists of a set of mappings between natural language commands and action sequences, was specifically introduced to assess the ability of neural networks to learn this type of compositional generalization. In this paper, we investigate to what extent the use of syntactic features helps convolutional seq2seq models learn systematic compositionality.
International Workshops
2025
Speech Technologies with Fieldwork Recordings: the Case of Haitian Creole
William N. Havard, Renauld Govain, Benjamin Lecouteux, and Emmanuel Schang.
In Proceedings of the 8th Workshop on Computational Methods for Endangered Languages (ComputEL-8)
Mar
2025
This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600h) paired with images. Disfluencies and speed perturbation are added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing exact timecodes for each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks including speech input or output instead of text. Investigating multimodal learning schemes for unsupervised speech pattern discovery is also possible with this corpus, as demonstrated by a preliminary study conducted on a subset of the corpus (10h, 10k spoken captions).
Domestic Workshops
2023
Outiller la documentation des langues créoles
Eric Le Ferrand, Claudel Pierre-Louis, Ruoran Dong, Benjamin Lecouteux, Daphné Gonçalves-Teixeira, William N. Havard, and Emmanuel Schang.
In LIFT 2023: Journées scientifiques du GdR Linguistique Informatique, Formelle et de Terrain
Nov
2023
A study of the production and perception of ’ in Tsimane’
William Havard, Camila Scaff, Loann Peurey, and Alejandrina Cristia.
In Journées Jointes des Groupements de Recherche Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement Automatique des Langues (TAL)
Nov
2022
Tsimane’ is a language spoken in Bolivia by several thousand people, and yet its phonology has not been described in detail. With this project, we want to take a step towards a better description by focusing on an aspect of the language that we find particularly unusual: the sound represented in spelling with 〈’〉, usually analyzed as a glottal stop /ʔ/. We hypothesized that 〈’〉 is a glottal flap. We recorded two adult speakers of Tsimane’ producing (near-)minimal pairs involving this sound. In this paper, we present analyses focused on a syllable extracted from six minimal pairs: /ki-kiʔ/. Analyses of the spectrograms suggested that one speaker consistently used vowel glottalization and, to a lesser extent, closure, whereas these cues were ambiguous in our other informant. However, presentation of the key syllables to these two informants and two other adult Tsimane’ listeners revealed that they could reliably recover the intended syllable. Together, these data suffice to rule out our initial hypothesis of a glottal flap, since a closure was never obvious in one of the speakers, and suggest instead that a more complex set of acoustic cues may be at listeners’ disposal.
2018
Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation
Xuanli He, Quan Tran, William Havard, Laurent Besacier, Ingrid Zukerman, and Gholamreza Haffari.
In Proceedings of the Australasian Language Technology Association Workshop 2018
Dec
2018
In spite of the recent success of Dialogue Act (DA) classification, the majority of prior work focuses on text-based classification with oracle (i.e., human) transcriptions rather than Automatic Speech Recognition (ASR) transcriptions. Moreover, the performance of this classification task may deteriorate because of speaker domain shift. In this paper, we explore the effectiveness of using both acoustic and textual signals, with either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves superior to the unimodal models, particularly when oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.
Peer-reviewed Abstracts
2025
Questioning Morphological Boundaries with the Help of Automatic Speech Recognition
Advancements in machine learning techniques open new avenues for investigating the identification of morphological boundaries in spoken Creole languages (Ferrand et al. 2023). While it is well established that Automatic Speech Recognition models can deliver accurate transcriptions even when trained on small corpora, the units discovered by such models often differ from those expected by linguists (Adda et al. 2018; Scharenborg et al. 2018; Bartelds et al. 2023). Language models favor units optimized for information compression (e.g., Byte Pair Encoding) over linguistic units, despite using terms like ‘word’ and ‘subword’, raising questions about how ASR-derived segmentations align with linguistic expectations (cf. ANR DeepTypo project).
This study explores how automatic transcription can shed light on morphological boundaries based on an analysis of approximately 1,400 hours of Haitian Creole recordings, reporting findings from the ANR CREAM project on the automatic transcription of spoken Kreyòl. Among our findings, we show that automatic transcription, although generally consistent across the corpus, reveals non-uniform segmentation for the word-initial attachment of ’la’, such as la limyè/lalimyè/limyè (‘light’). Our study also shows that ’la’ attachment in Haitian Creole displays more diverse patterns than in Martinican Creole, where l(a)- and lé form ’semantically definite’ DPs denoting specific items in the manner of proper names, while forms without la imply indefiniteness or non-singularity (Zribi-Hertz & Jean-Louis 2013). In our corpus, however, la cannot be reduced to a fixed role: la kilti popilè a ‘the popular culture’; lakilti popilè ‘the popular culture’; and kilti oksidantal ‘Western culture’.
Additionally, we discuss the challenges of using ASR to segment morphological boundaries in spoken language by analysing suffix attachment errors (e.g., lang aj for langaj ‘language’ or siperyè ite for siperyorite ‘superiority’) on forms unseen in the ASR fine-tuning data.
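The contrast between compression-driven subwords and linguistically motivated units can be illustrated with a toy byte-pair-encoding loop, sketched below: merges are chosen purely by frequency, so whether la surfaces as a separate unit depends on corpus statistics rather than on its morphosyntactic status. The word list is invented for illustration and does not reflect the CREAM corpus.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
# Frequency, not morphology, decides whether "la" stays a separate unit.
from collections import Counter


def bpe_merges(words, n_merges=10):
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append(a + b)
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab


# Invented forms, for illustration only.
merges, segmented = bpe_merges(["lalimyè", "limyè", "lakilti", "kilti", "langaj"], n_merges=8)
print(merges)
print(list(segmented))
```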
2024
Mobilising the Archive: Training Modern Speech Technology Models with Digitalised Fieldwork Recordings
William Havard, Emmanuel Schang, and Benjamin Lecouteux.
In Recent Advances in Language Documentation and Archiving (LD&A’24)
Sep
2024
Over the years, community members and linguists have recorded speakers and peers in the field to formally study their languages, write grammars, and preserve cultural knowledge. To date, most of the recordings gathered in this way are archived but remain untranscribed. They are therefore impossible to index and navigate, as indexing and navigation rely on the existence of transcriptions, and they remain unsearchable (and potentially unusable) for both community members and linguists.
In our work, we leverage the power of modern self-supervised speech-processing tools (wav2vec, Baevski et al. 2020) and the existence of archival material. We pre-trained self-supervised speech-processing models on digitalised fieldwork recordings (350h) in Haitian Creole, collected 40 years ago in Haiti and digitalised by the French National Library. We further trained the models on a speech recognition task and obtained competitive results on fieldwork material (24.1% character error rate, CER) and read speech (15.2% CER), with models requiring only 40 minutes of transcribed speech to be trained.
To the best of our knowledge, our work is the first to use only fieldwork recordings at every step of the training process of state-of-the-art speech-processing models. We show that old fieldwork recordings, which were not collected for computational applications, can be repurposed and used to train speech recognition models. We conclude that the ‘mobilising the archive’ approach advocated by Bird (2020) is a promising way forward to design speech technologies for new languages and to make archival material accessible to both community members and linguists. In future work, we would like to explore query-by-example approaches that would bypass the need for transcriptions altogether and allow users to query and navigate the archive by simply pronouncing a key word.
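The character error rates quoted above are standard Levenshtein-based scores (character edit distance normalised by the reference length); a minimal reference implementation is sketched below. The example strings are invented for illustration and are not taken from the corpus.

```python
# Character error rate: Levenshtein distance between reference and hypothesis,
# normalised by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)


# Invented example, for illustration only:
print(round(cer("mwen pale kreyòl", "mwen palé kreyol"), 3))
```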
IASCL
Automated Pipeline Provides Personalized Feedback on Short Caregiver-Child Audio Conversations
Alejandrina Cristia, Loann Peurey, William Havard, Gwendal Virlet, Xuan-Nga Cao, Juanita Bloomfield Lescarboura, Ana Balsa, Alejandro Cid, Martín Ottavianelli, José Luis Horta Brasil, Camila Scaff, and Kai Jia Tey.
In Poster presentation at the International Association for the Study of Child Language (IASCL) Conference
Jul
2024
Automated analysis of short caregiver-child conversations could be useful in a wide range of research and intervention contexts. In this work, we describe an automated pipeline developed in a tripartite collaboration: 1) economists deploying a randomized controlled trial (RCT) among low socio-economic status families in Uruguay; 2) a tech company that implemented a WhatsApp chatbot for use in the RCT; and 3) a research team specialized in the intersection of speech technology and developmental psychology, whose contribution involves an open-source analysis pipeline and the online platform Amazon Web Services (AWS).
Caregivers in the RCT record themselves in interaction with their infant (3-36 months) using WhatsApp’s audio recording feature. The audio file is uploaded to our pipeline, at a cost of US$0.20 per hour of audio analyzed. After fully automated processing, our pipeline calculates a series of metrics, including the number of caregiver and child vocalizations, pitch for both speaker types, and the number of words in the caregivers’ vocalizations. These metrics are then integrated into the feedback the chatbot provides to parents the following day.
The accuracy of the automated metrics was established through comparison against human annotations of the same files for a subset of 20 files, selected from a variety of contexts (meal time, bath time) and annotated using ELAN. Correlations for key metrics were very high (adult vocalization duration, child vocalization duration, and child-adult turn counts: r > .9; pitch mean and range: r > .8). Given that all parts of the pipeline are open source, we trust that our pipeline can provide an economical, reproducible, and scalable solution for researchers and practitioners interested in caregiver-child interaction.
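A hedged sketch of the reliability check described above: per-file metrics from the automated pipeline are merged with the human (ELAN-derived) metrics and compared with Pearson correlations. The file names, metric names, and column layout are illustrative assumptions, not the pipeline's actual outputs.

```python
# Sketch: correlate automated metrics with human annotations, per metric.
import pandas as pd
from scipy.stats import pearsonr

auto = pd.read_csv("automated_metrics.csv")   # hypothetical: one row per audio file
human = pd.read_csv("human_metrics.csv")      # hypothetical: same files, ELAN-derived
merged = auto.merge(human, on="file_id", suffixes=("_auto", "_human"))

for metric in ["adult_voc_dur", "child_voc_dur", "turn_count", "pitch_mean", "pitch_range"]:
    r, p = pearsonr(merged[f"{metric}_auto"], merged[f"{metric}_human"])
    print(f"{metric}: r = {r:.2f} (p = {p:.3f}, n = {len(merged)})")
```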
IASCL
Presenting LongFoRMer: A package to organize and analyze long-form recordings
Loann Peurey, Lucas Gautheron, William Havard, Camila Scaff, Shuvayanti Das, Kai Jia Tey, and Alejandrina Cristia.
In Poster presentation at the International Association for the Study of Child Language (IASCL) Conference
Jul
2024
Long-form recordings (LFR) collected via child-worn devices are becoming increasingly common in the study of children’s input and production. This technique poses several technical and usability challenges, especially because of the sensitivity of the data and their sheer volume. Many researchers adapt working practices developed for smaller datasets, which leads to painful situations, including having multiple copies of the same large audio files and maintaining divergent spreadsheets describing samples from the audio files (e.g., one spreadsheet describes annotations done on a subset of the files with one annotation scheme, another holds the automated counts at the whole-file level, etc.).
We have developed LongFoRMer (Long-form Recording Manager, formerly ChildProject), a package that allows researchers using LFR to organize their files in a standardized way, facilitating the management of these data. The package also provides procedures to import annotations from a wide range of existing formats (LENA’s .its, the ACLEW annotation structure in ELAN, Praat) into standardized .csv files. It includes clever solutions for the above-mentioned problems, such as annotations covering only sections of the audio and/or subsets of the participants. Through this standardized organization, researchers can also follow streamlined instructions to apply free, open-source automated algorithms that return adult word counts and child vocalization counts. The package further includes procedures to evaluate the reliability of automated annotations against their human equivalents. After accompanying several labs in their exploration of our package, we have developed improved tutorials and troubleshooting sessions. Finally, the package relies on open-source tools that facilitate other aspects of work with LFR, namely datalad, which allows lighter versions of the data (by not including the recordings), and GIN, which keeps track of dataset versions and controls sharing and collaboration.
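To make the standardized-CSV idea concrete, here is a generic pandas illustration, not LongFoRMer's actual API, of how whole-file adult vocalization and word counts could be aggregated from a per-segment annotation table of the kind the package produces; the file path and column names are assumptions.

```python
# Generic illustration of working with a standardized per-segment annotation CSV
# (not LongFoRMer's API; the path and column names are assumptions).
import pandas as pd

segments = pd.read_csv("annotations/converted/rec001_segments.csv")
# assumed columns: recording_id, segment_onset, segment_offset (ms),
#                  speaker_type (e.g. FEM/MAL/CHI/OCH), words (automated count)

adult = segments[segments["speaker_type"].isin(["FEM", "MAL"])].copy()
adult["duration_ms"] = adult["segment_offset"] - adult["segment_onset"]

summary = adult.groupby("recording_id").agg(
    adult_word_count=("words", "sum"),
    adult_voc_count=("speaker_type", "size"),
    adult_voc_dur_ms=("duration_ms", "sum"),
)
print(summary)
```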
Presenting LongFoRMer: A package to organize and analyze long-form recordings
Loann Peurey, Lucas Gautheron, William Havard, Camila Scaff, Shuvayanti Das, Kai Jia Tey, and Alejandrina Cristia.
In ColDoc 2024: Linguistics in a New Era: Discourse, Methods, and Technologies in the Contemporary Landscape
Jul
2024
Long-form recordings (LFR) collected via child-worn devices are becoming increasingly common in the study of children’s input and production. This technique poses several technical and usability challenges, especially because of the sensitivity of the data and their sheer volume. Many researchers adapt working practices developed for smaller datasets, which leads to painful situations, including having multiple copies of the same large audio files and maintaining divergent spreadsheets describing samples from the audio files (e.g., one spreadsheet describes annotations done on a subset of the files with one annotation scheme, another holds the automated counts at the whole-file level, etc.).
We have developed LongFoRMer (Long-form Recording Manager, formerly ChildProject), a package that allows researchers using LFR to organize their files in a standardized way, facilitating the management of these data. The package also provides procedures to import annotations from a wide range of existing formats (LENA’s .its, the ACLEW annotation structure in ELAN, Praat) into standardized .csv files. It includes clever solutions for the above-mentioned problems, such as annotations covering only sections of the audio and/or subsets of the participants. Through this standardized organization, researchers can also follow streamlined instructions to apply free, open-source automated algorithms that return adult word counts and child vocalization counts. The package further includes procedures to evaluate the reliability of automated annotations against their human equivalents. After accompanying several labs in their exploration of our package, we have developed improved tutorials and troubleshooting sessions. Finally, the package relies on open-source tools that facilitate other aspects of work with LFR, namely datalad, which allows lighter versions of the data (by not including the recordings), and GIN, which keeps track of dataset versions and controls sharing and collaboration.
Introducing the Speech Maturity Dataset: Research opportunities for speech scientists and linguistic fieldworkers
Margaret Cychosz, Kasia Hitczenko, William N. Havard, Loann Peurey, Madurya Suresh, Theo Zhang, and Alex Cristia.
Over the first years of life, children’s spontaneous vocal productions become increasingly adult-like in their shape and phonetic properties, laying the foundation for later phonological development (Oller, 2000). Yet, as in language development at large, research in this area has been limited to a narrow set of languages and communities, mainly Indo-European from Western(ized) speaker communities, limiting our understanding of cross-linguistic and cross-cultural variation in speech development (Kidd & Garcia, 2022). To address this issue, we introduce a new publicly-available corpus, the Speech Maturity Dataset (SMD), consisting of 258,914 labeled audio clips extracted from child-centered, long-form audio recordings (~8 continuous hours/child). Recordings came from 398 children (209 male, 186 female), aged 2 months to 6 years, from 14 communities (ranging from rich industrialized societies to farmer-forager speaker communities) learning 25+ languages. All clips were manually labeled for speaker and vocalization type by at least 3 citizen scientists (i.e., non-scientific volunteers who devote time to annotate and label scientific data) on Zooniverse, the world’s largest citizen science platform. Citizen scientists labeled each clip by vocalization type: laughing, crying, canonical (speech-like vocalization containing an adjacent consonant and vowel), non-canonical (speech-like vocalization without an adjacent consonant and vowel), or junk (silence or non-human sounds). For a subset of the clips (N=110,577), citizen scientists also labeled the speaker type: baby (younger than 3 years), child (3-12 years), female/male adolescent (12-18 years), or female/male adult.
This demonstration and walk-about has two goals. First, albeit already massive, SMD represents the first version of an ongoing collaborative effort between field linguists, phoneticians, and developmental scientists. SMD continues to grow: the citizen science project is still live (LINK REMOVED FOR REVIEW) and we continue to accept new data for annotation into the dataset. The first objective of our demonstration is therefore to illustrate several case studies of how we helped traditional documentary field linguists, with no background in child language research or large-scale speech corpora, collect and contribute data to SMD, resulting in several large-scale research collaborations.
The second objective of our demonstration is to illustrate how SMD, which includes a wealth of metadata (child’s age, gender, linguistic environment, etc.), lends itself to the development of new tools to automate the processing of large-scale, spontaneous speech recordings. We will illustrate how SMD is already used to study child speech development at an unprecedented scale in a wide variety of communities, by computing indices of children’s vocal development such as canonical proportion (i.e., the proportion of speech-like vocalizations that contain an adjacent consonant and vowel) or linguistic proportion (i.e., the proportion of vocalizations that are speech-like) (Hitczenko et al., 2023). We will end by showcasing how we used SMD to train supervised vocalization-type classifiers in an effort to make software dedicated to large-scale speech corpus processing free, open-source, and reproducible.
Exploring the Impact of Syllable Complexity on Canonical Proportion in Children: Insights from a Multilingual and Cross-cultural Study
Kai Jia Tey, Sarah Walker, Amanda Seidl, Camila Scaff, Loann Peurey, Bridgette L. Kelleher, Kasia Hitczenko, William N. Havard, Lisa R. Hamrick, Pauline Grosjean, Margaret Cychosz, Heidi Colleran, Marisa Casillas, Elika Bergelson, and Alejandrina Cristia.
In Proceedings of the Workshop on Infant Language Development (WILD)
2024
One sign of early phonological development is the increasing prevalence of canonical syllables (consonant+vowel; Oller et al., 1998). A recently proposed metric is canonical proportion (CP): the proportion of a child’s speech-like vocalisations containing clear consonant-vowel transitions (Cychosz et al., 2021). Initial analyses of 129 children suggested that CP relates to age non-linearly, continuing to develop well beyond the appearance of children’s first words, and that it varies as a function of the ambient language structure (Hitczenko et al., 2023). Here we investigate CP further, considering potential effects of multilingualism (i.e., being exposed to 2+ languages). With the help of citizen scientists, we crowdsourced the annotation of 256,842 clips extracted from speech-like vocalisations by 371 children (2-77 months; 178 boys). The resulting dataset represents children from Bolivia (n=44), France (10), Mexico (10), Papua New Guinea (46), Solomon Islands (198), Vanuatu (40), and the USA (California 3, Indiana 10, New York 10). Children’s CP appears to depend on age, mono-/multilingualism, and ambient language complexity. First, a generalised linear model was fit to the monolingual data, declaring age in interaction with complexity (as in Hitczenko et al., languages were categorised as allowing only simple, moderately complex, or complex syllables, following Maddieson, 2013). Second, a Levene test confirmed a significant difference in the variance of CP between monolinguals and multilinguals (F = 11.47, p < .005). Our first analysis confirmed Hitczenko et al.’s observation that CP develops more slowly in languages that allow more complex syllables, possibly due to the challenge posed by learning those complex syllables. Previous research suggests that children often mirror the characteristics of their ambient language in their canonical babbling (Andruski et al., 2014). In the subsequent analysis, significant differences in variance between monolinguals and multilinguals were observed; the reasons for these differences will be discussed.
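A sketch of the two analyses described above: a linear model of CP with an age-by-syllable-complexity interaction fit to the monolingual children, and a Levene test comparing the variance of CP across monolinguals and multilinguals. The file name and column names are illustrative assumptions, not the study's actual data layout.

```python
# Sketch of the two analyses: age x syllable-complexity model (monolinguals only)
# and a Levene test on CP variance for monolinguals vs. multilinguals.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import levene

cp = pd.read_csv("cp_by_child.csv")  # hypothetical: one row per child
# assumed columns: canonical_prop, age_months, multilingual (bool),
#                  complexity in {"simple", "moderate", "complex"} (Maddieson 2013)

mono = cp[~cp["multilingual"]]
fit = smf.glm("canonical_prop ~ age_months * C(complexity)", data=mono).fit()
print(fit.summary())

stat, p = levene(cp.loc[~cp["multilingual"], "canonical_prop"],
                 cp.loc[cp["multilingual"], "canonical_prop"])
print(f"Levene: F = {stat:.2f}, p = {p:.4f}")
```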
2023
The Speech Maturity Dataset
William N. Havard, Loann Peurey, Kasia Hitczenko, and Alejandrina Cristia.
In Proceedings of the Many Paths to Language (MPaL) Workshop
Nov
2023
Over the first years of life, children’s spontaneous vocal productions become increasingly adult-like, both in their shape and phonetic properties, and lay the foundation for later phonetic and phonological development. Yet, research in this area has been limited to a narrow set of languages and communities, mainly Indo-European languages from Western(ised) speaker communities, and focused on a narrow age range (0-24 mo).
We present a new publicly-available dataset, the Speech Maturity Dataset (SMD), consisting of 258,914 clips manually labelled for speaker and vocalisation type extracted from the long-form recordings of 398 children (209 male, 186 female) from 2 months to 6 years of age from 14 communities (ranging from rich industrialised societies to farmer-forager speaker communities) in 25+ languages. Albeit already massive, our dataset represents the first version of an ongoing and collaborative effort between field linguists, psycholinguists, and citizen scientists. The data set is expected to be expanded on a regular basis, since the project is still live (https://www.zooniverse.org/projects/laac-lscp/maturity-of-baby-sounds).
SMD is a superset of the already existing BabbleCor dataset (Cychosz et al., 2019), which originally consisted of 15k vocalisations. We followed the same methodology to construct our dataset, whereby all the clips received a label based on the majority vote of at least 3 citizen scientists (i.e., non-scientific volunteers who devote time to annotate and label scientific data). Unlike BabbleCor, which used the smaller and closed iHEARu-PLAY platform, we turned to the world’s largest open citizen science platform, Zooniverse, as it had a larger and more diverse pool of citizen scientists. Citizen scientists labelled vocalisations taken from naturalistic long-form recordings with their vocalisation type: laughing, crying, canonical (speech-like vocalisation containing an adjacent consonant and vowel), non-canonical (speech-like vocalisation without an adjacent consonant and vowel), or junk (silence or non-human sounds). For a subset of the clips (N=110,577), citizen scientists also labelled the speaker type: baby (younger than 3 years), child (3-12 years), female/male adolescent (12-18 years), or female/male adult.
SMD, which includes a wealth of metadata (child’s age/sex, linguistic environment, normativity, etc.), lends itself to several use cases. It can be used to study child vocalisation development at an unprecedented scale in a wide variety of communities, by computing indices of vocal development such as canonical proportion (i.e. the proportion of speech-like vocalizations that contain an adjacent consonant and vowel – regardless of whether they are in babble or meaningful speech) or linguistic proportion (i.e. the proportion of vocalizations that are speech-like). This dataset can also be used to train vocalisation-type classifiers in an effort to make software dedicated to the study of child language acquisition free, open-source, and reproducible.
We showcase a potential use of this dataset by presenting a preliminary analysis of canonical proportion and linguistic proportion. We fitted two linear mixed-effects models predicting canonical proportion and, separately, linguistic proportion from the child’s age, sex, and monolingualism as fixed effects, with child ID nested in corpus as a random effect to account for individual variation. While both models show a statistically significant positive effect of age (as expected, since these proportions should increase with age), we observe no significant effect of monolingualism or sex, suggesting that children follow a similar developmental trajectory. Results like these promise to allow researchers to significantly expand their knowledge of early vocal development.
2022
ESCOP
Lexical Acquisition: Start Small and Build up or Start Big and Break Down? A Study on Lexical Acquisition Using Visually Grounded Artificial Neural Networks
William N. Havard
In European Society for Cognitive Psychology
Aug
2022
Visually grounded speech (VGS) models are artificial neural networks (ANN) trained to retrieve an image given its spoken description. These models thus have to implicitly segment the speech signal into sub-units and associate the discovered items with their visual referents. In this experiment, instead of letting the VGS model latently infer boundaries by itself, we give the ANN the position of boundaries corresponding to units of different sizes: phones, syllables, or words. We study how well (in terms of recall@1) the network retrieves the target image as a function of the size of the units provided. Our results show that the VGS network retrieves the target image better when the speech signal is broken down into words than when it is broken down into smaller units such as phones or syllables. Our results agree with the child acquisition literature suggesting that children segment large units first.
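recall@1 in this speech-image retrieval setting can be computed from the similarity matrix between paired utterance and image embeddings; a minimal numpy sketch is below. The embedding matrices are placeholders, not outputs of the actual VGS model.

```python
# recall@k for speech-to-image retrieval from paired embedding matrices.
import numpy as np


def recall_at_k(speech_emb, image_emb, k=1):
    """speech_emb, image_emb: (N, d) arrays; row i of each forms a ground-truth pair."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = s @ v.T                                  # (N, N) cosine similarities
    ranks = np.argsort(-sims, axis=1)               # best-matching images first
    hits = (ranks[:, :k] == np.arange(len(s))[:, None]).any(axis=1)
    return hits.mean()


# Placeholder embeddings; in the experiments these come from the trained model.
rng = np.random.default_rng(0)
print(recall_at_k(rng.normal(size=(100, 64)), rng.normal(size=(100, 64)), k=1))
```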
ESCOP
Modeling and Measuring Children’s Language Development Using Language Models
Yaya Sy, William N. Havard, and Alejandrina Cristia.
In European Society for Cognitive Psychology
Aug
2022
Although research suggests children’s language develops fast, that work is based on a biased sample covering less than 1% of the world’s languages (Kidd & Garcia, 2022). To measure development in many more languages reliably, we assess a potentially scalable method: language models, i.e., computational models trained to predict characters in a string. We assessed this for 12 languages for which there were conversational data for training (OpenSubtitles) and test data from the major child language development archive of adult-child interactions (CHILDES), which were phonemized. Results for most languages show that adults’ utterances have low perplexity (indicating that strings of characters are predicted well), which is stable as a function of child age, whereas perplexity for children’s utterances at about 1 year of age is much higher and decreases to converge towards the adults’ by 5 years. This approach can help researchers measure language development, provided there are transcripts of adult-child interactions.
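Character-level perplexity, as used above, is the exponential of the average per-character negative log-likelihood under the model; a minimal sketch follows, with char_log_probs standing in for whatever trained character language model is used.

```python
# Perplexity of an utterance under a character-level language model:
# exp of the mean negative log-likelihood per predicted character.
import math


def perplexity(utterance, char_log_probs):
    """char_log_probs(prefix, char) -> log P(char | prefix) under the model
    (a placeholder for the trained character LM)."""
    nll = 0.0
    for i, ch in enumerate(utterance):
        nll -= char_log_probs(utterance[:i], ch)
    return math.exp(nll / max(len(utterance), 1))


# Toy uniform model over a 30-symbol phoneme inventory -> perplexity of 30.
print(perplexity("mamapapa", lambda prefix, ch: math.log(1 / 30)))
```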
2018
Emergence of Attention in a neural model of Visually Grounded Speech
William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier.
Context, be it visual, haptic, or auditory, provides children with all the information necessary to build a coherent mental representation of the world. While acquiring their native language, children learn to map portions of this mental representation to whole or partial acoustic realisations they perceive in surrounding speech. The process of extracting meaningful units from a continuous flow of speech is known as lexical segmentation. These units can then be used by children to analyse novel acoustic realisations and adjust the segmentation of new stimuli. Language acquisition is thus a dynamic process in which learners constantly re-evaluate their segmentation according to what they perceive, and their perception according to what they segment. Context therefore acts as a weak supervision signal that makes the segmentation process easier.
Children are not born with a fully-fledged representation of the world they could use to help them detect and understand acoustic patterns. Rather, speech pattern detection and world understanding are two processes that occur simultaneously and both processes start from scratch. That means the extracted patterns and the mental representation of the world evolve during learning, and that the final extracted patterns will not necessarily be the same and may become more and more specific.
A few studies have tried to emulate these processes using computer programs for pattern matching (Roy and Pentland, 2002) and, more recently, using deep-learning technologies such as end-to-end neural architectures (Harwath and Glass, 2017; Chrupała et al., 2017). The latter study is interesting as its architecture takes a speech signal and an image as inputs and projects both into a common representation space. Chrupała et al. (2017) analysed the linguistic representations that were learnt and discovered that the first layers tend to encode acoustic information while higher levels encode semantic information. However, this study focused on the representation learnt once all the training data had been seen multiple times and did not explicitly analyse segmentation as a by-product of the original task. Rather than assessing the representation learnt by the neural network once the training stage is completed, our work (in progress) analyses how this representation changes over time. More specifically, we focus on the attention model of the network (a vector representing the focus of the system on different parts of the speech input) and how it correlates with true unit boundaries (at word and chunk level). As far as we know, no other work has analysed the representations learnt by neural networks in such a diachronic fashion.
Non-peer-reviewed Abstracts
2024
Corpus francophones et créolophones à La Réunion
William Havard, and Gudrun Ledegen.
In Corpus et méthodes pour l’étude de la variation dans l’espace francophone et au-delà (CoMeVar)
Nov
2024
We propose to present our project on French and Creole corpora in Réunion. In this sociolinguistic context, marked by strong permeability between the two languages, reflected in particular in the still-frequent production of interlectal utterances (Prudent, 1981) and of stretches open to a double interpretation (French and Creole), called "floating zones" (given a double, so-called "floating", transcription) (Ledegen 2012), it is essential to study French and Creole together and to track their joint evolution (Ledegen & Simonin 2010; Ledegen 2017).
The synchronic Valirun corpus (Variétés linguistiques de La Réunion) (Ledegen 2000-2011), which documents "ordinary" language practices, particularly those of young speakers and the media, will thus be compared with the conversational recordings complementing the surveys of the Atlas linguistique de La Réunion and with those of Nicole Gueunier's surveys from the 1960s-70s. The preliminary work of AI-based automatic transcription (Havard et al. 2024), whose procedures will be presented, is thereby confronted with this additional complexity, specific to this language-contact situation yet representative of any contact situation.
Speech transcription models are mostly monolingual and trained on so-called "well-resourced" languages, for which many hours of transcribed speech are available. They therefore run into difficulties when used in multilingual or diglossic contexts, where code-switching is pervasive. To address this task, we will explore the use of transcription models pre-trained either on French (LeBenchmark, Evain et al. 2021), on Haitian (Havard et al., 2024), or on a closely related creole, Mauritian (Havard et al. 2024, in preparation). This will allow us to explore open questions empirically, such as the proximity between a creole and its lexifier language (here, French) or the proximity between two different creoles (Réunion Creole vs. Mauritian or Haitian), by measuring the degree of adaptation and the performance of a pre-trained automatic transcription model.
Thesis
2021
Lexical emergence from context : exploring unsupervised learning approaches on large multimodal language corpora
FR : Ces dernières années, les méthodes d’apprentissage profond ont permis de créer des modèles neuronaux capables de traiter plusieurs modalités à la fois. Les modèles neuronaux de traitement de la Parole Visuellement Contextualisée (PVC) sont des modèles de ce type, capables de traiter conjointement une entrée vocale et une entrée visuelle correspondante. Ils sont couramment utilisés pour résoudre une tâche de recherche d’image à partir d’une requête vocale: c’est-à-dire qu’à partir d’une description orale, ils sont entraînés à retrouver l’image correspondant à la description orale passée en entrée. Ces modèles ont suscité l’intérêt des linguistes et des chercheurs en sciences cognitives car ils sont capables de modéliser des interactions complexes entre deux modalités — la parole et la vision — et peuvent être utilisés pour simuler l’acquisition du langage chez l’enfant, et plus particulièrement l’acquisition lexicale. Dans cette thèse, nous étudions un modèle récurrent de PVC et analysons les connaissances linguistiques que de tels modèles sont capables d’inférer comme sous-produit de la tâche principale pour laquelle ils sont entraînés. Nous introduisons un nouveau jeu de données qui convient à l’entraînement des modèles de PVC. Contrairement à la plupart des jeux de données qui sont en anglais, ce jeu de données est en japonais, ce qui permet d’étudier l’impact de la langue d’entrée sur les représentations apprises par les modèles neuronaux. Nous nous concentrons ensuite sur l’analyse des mécanismes d’attention de deux modèles de PVC, l’un entraîné sur le jeu de données en anglais, l’autre sur le jeu de données en japonais, et montrons que les modèles ont développé un comportement général, valable quelle que soit la langue utilisée, en utilisant leur poids d’attention pour se focaliser sur des noms spécifiques dans la chaîne parlée. Nos expériences révèlent que ces modèles sont également capables d’adopter un comportement spécifique à la langue en prenant en compte les particularités de la langue d’entrée afin de mieux résoudre la tâche qui leur est donnée. Nous étudions ensuite si les modèles de PVC sont capables d’associer des mots isolés à leurs référents visuels. Cela nous permet d’examiner si le modèle a implicitement segmenté l’entrée parlée en sous-unités. Nous étudions ensuite comment les mots isolés sont stockés dans les poids des réseaux en empruntant une méthodologie issue de la linguistique, le paradigme de gating, et nous montrons que la partie initiale du mot joue un rôle majeur pour une activation réussie. Enfin, nous présentons une méthode simple pour introduire des informations sur les frontières des segments dans un modèle neuronal de traitement de la parole. Cela nous permet de tester si la segmentation implicite qui a lieu dans le réseau est aussi efficace qu’une segmentation explicite. Nous étudions plusieurs types de frontières, allant des frontières de phones aux frontières de mots, et nous montrons que ces dernières donnent les meilleurs résultats. Nous observons que donner au réseau plusieurs frontières en même temps est bénéfique. Cela permet au réseau de prendre en compte la nature hiérarchique de l’entrée linguistique.
EN: In recent years, deep learning methods have allowed the creation of neural models that are able to process several modalities at once. Neural models of Visually Grounded Speech (VGS) are models of this kind, able to jointly process a spoken input and a matching visual input. They are commonly used to solve a speech-image retrieval task: given a spoken description, they are trained to retrieve the closest image that matches the description. Such models have sparked interest among linguists and cognitive scientists as they are able to model complex interactions between two modalities — speech and vision — and can be used to simulate child language acquisition and, more specifically, lexical acquisition. In this thesis, we study a recurrent model of VGS and analyse the linguistic knowledge such models are able to derive as a by-product of the main task they are trained to solve. We introduce a novel dataset that is suitable for training models of visually grounded speech. Contrary to most datasets, which are in English, this dataset is in Japanese and allows us to study the impact of the input language on the representations learnt by the neural models. We then focus on the analysis of the attention mechanisms of two VGS models, one trained on the English dataset, the other on the Japanese dataset, and show that the models have developed a language-general behaviour by using their attention weights to focus on specific nouns in the spoken input. Our experiments reveal that such models are also able to adopt a language-specific behaviour by taking into account particularities of the input language so as to better solve the task they are given. We then study whether VGS models are able to map isolated words to their visual referents. This allows us to investigate whether the model has implicitly segmented the spoken input into sub-units. We further investigate how isolated words are stored in the weights of the network by borrowing a methodology stemming from psycholinguistics, the gating paradigm, and show that word onset plays a major role in successful activation. Finally, we introduce a simple method for adding segment boundary information to a neural model of speech processing. This allows us to test whether the implicit segmentation that takes place in the network is as effective as an explicit segmentation. We investigate several types of boundaries, ranging from phone to word boundaries, and show that the latter yield the best results. We observe that giving the network several boundaries at the same time is beneficial, as it allows the network to take into account the hierarchical nature of the linguistic input.
2017
Découverte non supervisée de lexique à partir d’un corpus multimodal pour la documentation des langues en danger
FR : De nombreuses langues disparaissent tous les ans et ce à un rythme jamais atteint auparavant. Les linguistes de terrain manquent de temps et de moyens afin de pouvoir toutes les documenter et décrire avant qu’elles ne disparaissent à jamais. L’objectif de notre travail est donc de les aider dans leur tâche en facilitant le traitement des données. Nous proposons dans ce mémoire des méthodes d’extraction non supervisées de lexique à partir de corpus multimodaux incluant des signaux de parole et des images. Nous proposons également une méthode issue de la recherche d’information afin d’émettre des hypothèses de signification sur les éléments lexicaux découverts. Ce mémoire présente en premier lieu la constitution d’un corpus multimodal parole-image de grande taille. Ce corpus simulant une langue en danger permet ainsi de tester les approches computationnelles de découverte non supervisée de lexique. Dans une seconde partie, nous appliquons un algorithme de découverte non supervisée de lexique utilisant de l’alignement dynamique temporel segmental (S-DTW) sur un corpus multimodal synthétique de grande taille ainsi que sur un corpus multimodal d’une vraie langue en danger, le Mboshi.
EN : Many languages are on the brink of extinction, and many disappear every year at a rate never seen before. Field linguists lack the time and the means to document and describe all of them before they die out. The goal of our work is to help them in this task by making data processing and annotation easier and faster. In this dissertation, we propose methods that use an unsupervised term discovery (UTD) system to extract a lexicon from multimodal corpora consisting of speech and images. We also propose a method using information retrieval techniques to hypothesise the meaning of the discovered lexical items. First, this dissertation presents the creation of a large multimodal corpus that includes speech and images. This corpus, simulating that of an endangered language, allows us to evaluate the performance of an unsupervised term discovery system. Second, we apply an unsupervised term discovery system based on segmental dynamic time warping (S-DTW) to a large synthetic multimodal corpus as well as to the multimodal corpus of a real endangered language, Mboshi, spoken in Congo-Brazzaville.
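The term discovery approach described above rests on (segmental) dynamic time warping; as a rough illustration of the core idea, the sketch below computes a plain DTW alignment cost between two feature sequences. S-DTW additionally searches over matching sub-sequences, which this simplified sketch does not do, and the random arrays stand in for real MFCC matrices.

```python
# Plain DTW cost between two feature sequences (e.g. MFCC matrices).
# S-DTW extends this by also searching over matching sub-sequences.
import numpy as np


def dtw_cost(x, y):
    """x: (n, d), y: (m, d) arrays of acoustic frames; returns a length-normalised cost."""
    n, m = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # (n, m) frame distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m] / (n + m)


# Placeholder "MFCC" matrices for two speech fragments.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(40, 13)), rng.normal(size=(55, 13))
print(dtw_cost(a, b))
```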