Publications
Publications by category and year, in reverse chronological order. * signals equal contribution.
Conference Articles
2022
-
A study of the production and perception of ’ in Tsimane’. William Havard, Camila Scaff, Loann Peurey, and Alejandrina Cristia. In Journées Jointes des Groupements de Recherche Linguistique Informatique, Formelle et de Terrain (LIFT) et Traitement Automatique des Langues (TAL), Nov 2022
A study of the production and perception of 〈’〉 in Tsimane’. Tsimane’ is a language spoken in Bolivia by several thousand people, yet its phonology has not been described in detail. With this project, we want to take a step towards a better description by focusing on an aspect of the language that we find particularly unusual: the sound represented in spelling with 〈’〉, usually analyzed as a glottal stop /ʔ/. We hypothesized that 〈’〉 is a glottal flap. We recorded two adult speakers of Tsimane’ producing (near-)minimal pairs involving this sound. In this paper, we present analyses focused on a syllable extracted from six minimal pairs: /ki-kiʔ/. Analyses of the spectrograms suggested one speaker consistently used vowel glottalization and, to a lesser extent, closure, whereas these cues were ambiguous in our other informant. However, presentation of the key syllables to these two informants and two other adult Tsimane’ listeners revealed clear evidence that they could recover the intended syllable. Together, these data suffice to rule out our initial hypothesis of a glottal flap, since a closure was never obvious in one of the speakers, and suggest instead that a more complex set of acoustic cues may be at listeners’ disposal.
@inproceedings{havard-etal-2022-lift, title = {{A study of the production and perception of ' in Tsimane'}}, author = {Havard, William and Scaff, Camila and Peurey, Loann and Cristia, Alejandrina}, year = {2022}, month = nov, booktitle = {{Journ{\'e}es Jointes des Groupements de Recherche \emph{Linguistique Informatique, Formelle et de Terrain} (LIFT) et \\ \emph{Traitement Automatique des Langues} (TAL)}}, publisher = {{CNRS}}, address = {Marseille, France}, pages = {1--8}, url = {https://hal.archives-ouvertes.fr/hal-03846840}, editor = {Becerra, Leonor and Favre, Beno{\^i}t and Gardent, Claire and Parmentier, Yannick}, keywords = {phonology ; perception ; production ; adapted lab experiments.}, hal_id = {hal-03846840}, hal_version = {v1} }
2021
-
Contribution d’informations syntaxiques aux capacités de généralisation compositionelle des modèles seq2seq convolutifs. Diana Nicoleta Popa, William N. Havard, Maximin Coavoux, Laurent Besacier, and Eric Gaussier. In Traitement Automatique des Langues Naturelles, 2021

FR: Les modèles neuronaux de type seq2seq manifestent d’étonnantes capacités de prédiction quand ils sont entraînés sur des données de taille suffisante. Cependant, ils échouent à généraliser de manière satisfaisante quand la tâche implique d’apprendre et de réutiliser des règles systématiques de composition et non d’apprendre simplement par imitation des exemples d’entraînement. Le jeu de données SCAN, constitué d’un ensemble de commandes en langage naturel associées à des séquences d’action, a été spécifiquement conçu pour évaluer les capacités des réseaux de neurones à apprendre ce type de généralisation compositionnelle. Dans cet article, nous nous proposons d’étudier la contribution d’informations syntaxiques sur les capacités de généralisation compositionnelle des réseaux de neurones seq2seq convolutifs.
EN: Classical sequence-to-sequence neural network architectures demonstrate astonishing prediction skills when they are trained on a sufficient amount of data. However, they fail to generalize when the task involves learning and reusing systematic rules rather than learning through imitation from examples. The SCAN dataset consists of a set of mappings between natural language commands and action sequences and was specifically introduced to assess the ability of neural networks to learn this type of compositional generalization. In this paper, we investigate to what extent the use of syntactic features helps convolutional seq2seq models to better learn systematic compositionality.

@inproceedings{popa-etal-2021, title = {{Contribution d'informations syntaxiques aux capacit{\'e}s de g{\'e}n{\'e}ralisation compositionelle des mod{\`e}les seq2seq convolutifs}}, author = {Popa, Diana Nicoleta and Havard, William N. and Coavoux, Maximin and Besacier, Laurent and Gaussier, Eric}, year = {2021}, booktitle = {{Traitement Automatique des Langues Naturelles}}, publisher = {{ATALA}}, address = {Lille, France}, pages = {134--141}, url = {https://hal.archives-ouvertes.fr/hal-03265890}, editor = {Denis, Pascal and Grabar, Natalia and Fraisse, Amel and Cardon, R{\'e}mi and Jacquemin, Bernard and Kergosien, Eric and Balvet, Antonio}, keywords = {Compositionnalit{\'e} ; mod{\`e}le convolutionnel seq2seq ; jeux de donn{\'e}es SCAN.} }
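As a quick illustration of the SCAN task mentioned in both abstracts, here are a few command-to-action pairs of the kind the models must learn to compose (shown with simplified action tokens; the released data files use labels such as I_JUMP and I_TURN_LEFT):

```python
# A few SCAN-style command -> action-sequence pairs (simplified action tokens).
# Compositional generalization means predicting e.g. "jump twice" correctly
# after having seen "jump" and "walk twice", but never "jump twice" itself.
scan_pairs = {
    "walk":            "WALK",
    "jump":            "JUMP",
    "walk twice":      "WALK WALK",
    "jump twice":      "JUMP JUMP",
    "jump left":       "LTURN JUMP",
    "walk after jump": "JUMP WALK",   # "after" reverses execution order
}

for command, actions in scan_pairs.items():
    print(f"{command!r:>20} -> {actions}")
```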
2020
-
CoNLL · Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech. William Havard, Laurent Besacier, and Jean-Pierre Chevrot. In Proceedings of the 24th Conference on Computational Natural Language Learning, Nov 2020
The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words from them, but rather adopt a top-down approach and start by segmenting word-like units and then break them down into smaller units. This suggests that the ideal way of learning a language is by starting from full semantic units. In this paper, we investigate if this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluated how well such a network is able to learn a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient. We also explore at which level of the network’s architecture such information should be introduced so as to maximise its performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, by which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.
@inproceedings{havard-etal-2020-catplayinginthesnow, title = {{C}atplayinginthesnow: {I}mpact of {P}rior {S}egmentation on a {M}odel of {V}isually {G}rounded {S}peech}, author = {Havard, William and Besacier, Laurent and Chevrot, Jean-Pierre}, year = {2020}, month = nov, booktitle = {Proceedings of the 24th Conference on Computational Natural Language Learning}, publisher = {Association for Computational Linguistics}, address = {Online}, pages = {291--301}, doi = {10.18653/v1/2020.conll-1.22}, url = {https://www.aclweb.org/anthology/2020.conll-1.22} }
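The way prior segmentation can be injected into such a model is easiest to picture as pooling frame-level features over the provided segments; the sketch below only illustrates that general idea (toy shapes, not the paper's actual architecture), with the hierarchical variant obtained by applying the same pooling again to recompose larger segments from smaller ones.

```python
import torch

def pool_segments(frames, boundaries):
    """Average frame-level features within each provided segment.

    frames:     (n_frames, dim) acoustic features or RNN states.
    boundaries: list of (start, end) frame indices for phones, syllables or words.
    Returns one vector per segment, shape (n_segments, dim).
    """
    return torch.stack([frames[start:end].mean(dim=0) for start, end in boundaries])

# Toy example: 20 frames of 13-dim features, segmented into three "word" spans.
frames = torch.randn(20, 13)
word_spans = [(0, 6), (6, 13), (13, 20)]
print(pool_segments(frames, word_spans).shape)  # torch.Size([3, 13])
```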
-
LREC · MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. *Marcely Zanon Boito, *William Havard, Mahault Garnerin, Éric Le Ferrand, and Laurent Besacier. In Proceedings of the 12th Language Resources and Evaluation Conference, May 2020
The CMU Wilderness Multilingual Speech Dataset (Black, 2019) is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for typologically different language pairs. The quality of the final corpus is attested by a human evaluation performed on a corpus subset (100 utterances, 8 language pairs). Lastly, we showcase the usefulness of the final product on a bilingual speech retrieval task.
@inproceedings{zanon-boito-etal-2020-mass, title = {{M}a{SS}: {A} {L}arge and {C}lean {M}ultilingual {C}orpus of {S}entence-aligned {S}poken {U}tterances {E}xtracted from the {B}ible}, author = {Zanon Boito, *Marcely and Havard, *William and Garnerin, Mahault and Le Ferrand, Éric and Besacier, Laurent}, year = {2020}, month = may, booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference}, publisher = {European Language Resources Association}, address = {Marseille, France}, pages = {6486--6493}, isbn = {979-10-95546-34-4}, url = {https://aclanthology.org/2020.lrec-1.799}, language = {English} }
2019
-
CoNLL · Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech. William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Nov 2019
In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description into a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned by neural networks – the gating paradigm – and show that the correct representation of a word is only activated if the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded representation of a given word; some frames have a crucial effect on it. Finally, we suggest that word representations could be activated through a process of lexical competition.
@inproceedings{havard-etal-2019-word, title = {{W}ord {R}ecognition, {C}ompetition, and {A}ctivation in a {M}odel of {V}isually {G}rounded {S}peech}, author = {Havard, William N. and Chevrot, Jean-Pierre and Besacier, Laurent}, year = {2019}, month = nov, booktitle = {Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)}, publisher = {Association for Computational Linguistics}, address = {Hong Kong, China}, pages = {339--348}, doi = {10.18653/v1/K19-1032}, url = {https://www.aclweb.org/anthology/K19-1032} }
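The gating paradigm referred to above amounts to feeding the model ever longer prefixes of a word and checking when its visual referent becomes activated; a rough sketch, where `encode_speech` and `score_referent` are placeholders standing in for the model's encoder and its speech-image similarity, not functions from the paper:

```python
import numpy as np

def gating_curve(word_frames, phone_offsets, encode_speech, score_referent):
    """Referent activation score after each additional phone of the word.

    word_frames:    (n_frames, dim) acoustic features of an isolated word.
    phone_offsets:  increasing frame indices marking the end of each phone.
    encode_speech:  placeholder for the model's speech encoder.
    score_referent: placeholder for the similarity between a speech embedding
                    and the target image embedding.
    """
    scores = []
    for end in phone_offsets:          # gate 1: first phone, gate 2: first two phones, ...
        prefix = word_frames[:end]
        scores.append(score_referent(encode_speech(prefix)))
    return np.array(scores)
```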
-
ICASSP · Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese. William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019
We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns and this behaviour holds true for two very typologically different languages. We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research.
@inproceedings{havard-etal-2019-vgs-attention, title = {{M}odels of {V}isually {G}rounded {S}peech {S}ignal {P}ay {A}ttention to {N}ouns: {A} {B}ilingual {E}xperiment on {E}nglish and {J}apanese}, author = {Havard, William N. and Chevrot, Jean-Pierre and Besacier, Laurent}, year = {2019}, month = may, booktitle = {ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages = {8618--8622}, doi = {10.1109/ICASSP.2019.8683069}, issn = {2379-190X}, url = {https://ieeexplore.ieee.org/document/8683069} }
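The attention mechanism analysed in this paper is, in essence, a learned weighting of the recurrent states that is used to pool them into a single utterance embedding; a minimal self-attention pooling sketch (illustrative only, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scalar self-attention pooling over recurrent states, as commonly used in VGS models."""

    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, states):                                # states: (time, dim)
        weights = torch.softmax(self.scorer(states), dim=0)   # (time, 1), sums to 1
        return (weights * states).sum(dim=0), weights.squeeze(-1)

pool = AttentionPooling(dim=512)
states = torch.randn(120, 512)                                # one utterance, 120 time steps
embedding, attention = pool(states)
# 'attention' is the per-frame distribution inspected in the paper to see
# which parts of the utterance (typically nouns) the model focuses on.
```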
2018
-
Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation. Xuanli He, Quan Tran, William Havard, Laurent Besacier, Ingrid Zukerman, and Gholamreza Haffari. In Proceedings of the Australasian Language Technology Association Workshop 2018, Dec 2018
In spite of the recent success of Dialogue Act (DA) classification, the majority of prior work focuses on text-based classification with oracle (i.e. human) transcriptions rather than Automatic Speech Recognition (ASR) transcriptions. Moreover, the performance of this classification task may deteriorate because of speaker domain shift. In this paper, we explore the effectiveness of using both acoustic and textual signals, with either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves superior to the unimodal models, particularly when oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.
@inproceedings{he-etal-2018-exploring, title = {Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation}, author = {He, Xuanli and Tran, Quan and Havard, William and Besacier, Laurent and Zukerman, Ingrid and Haffari, Gholamreza}, year = {2018}, month = dec, booktitle = {Proceedings of the Australasian Language Technology Association Workshop 2018}, address = {Dunedin, New Zealand}, pages = {61--65}, url = {https://www.aclweb.org/anthology/U18-1007} }
2017
-
SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set. William Havard, Laurent Besacier, and Olivier Rosec. In Proc. GLU 2017 International Workshop on Grounding Language Understanding, 2017
This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600 h) paired with images. Disfluencies and speed perturbation are added to the signal so that it sounds more natural. Each speech signal (WAV) is paired with a JSON file containing exact timecodes for each word/syllable/phoneme in the spoken caption. Such a corpus could be used for Language and Vision (LaVi) tasks including speech input or output instead of text. Investigating multimodal learning schemes for unsupervised speech pattern discovery is also possible with this corpus, as demonstrated by a preliminary study conducted on a subset of the corpus (10 h, 10k spoken captions).
@inproceedings{havard-etal-2017-speech-coco, title = {{SPEECH-COCO}: 600k {V}isually {G}rounded {S}poken {C}aptions {A}ligned to {MSCOCO} {D}ata {S}et}, author = {Havard, William and Besacier, Laurent and Rosec, Olivier}, year = {2017}, booktitle = {Proc. GLU 2017 International Workshop on Grounding Language Understanding}, pages = {42--46}, doi = {10.21437/GLU.2017-9}, url = {http://dx.doi.org/10.21437/GLU.2017-9} }
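To give an idea of how such word-level alignments can be consumed, the snippet below iterates over per-word timecodes; the field names are purely illustrative and do not reproduce the corpus's actual JSON schema:

```python
# Hypothetical alignment structure (field names are assumptions, not the
# actual SPEECH-COCO keys): each spoken caption comes with timecodes for
# every word, syllable and phoneme of the synthesised audio.
alignment = {
    "caption": "a cat playing in the snow",
    "words": [
        {"word": "a",       "start": 0.00, "end": 0.08},
        {"word": "cat",     "start": 0.08, "end": 0.43},
        {"word": "playing", "start": 0.43, "end": 0.95},
    ],
}

for token in alignment["words"]:
    print(f'{token["word"]:>8}  {token["start"]:.2f}-{token["end"]:.2f} s')
```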
Miscellaneous
2022
-
Lexical Acquisition/ Start Small and Build up or Start Big and Break Down? A Study on Lexical Acquisition Using Visually Grounded Artificial Neural Networks. William N. Havard. Talk at the European Society for Cognitive Psychology, Aug 2022
Visually grounded speech (VGS) models are artificial neural networks (ANNs) trained to retrieve an image given its spoken description. These models thus have to implicitly segment the speech signal into sub-units and associate the discovered items with their visual referents. In this experiment, instead of letting the VGS model infer boundaries latently by itself, we give the ANN the position of boundaries corresponding to units of different sizes: phones, syllables, or words. We study how well (in terms of recall@1) the network is able to retrieve the target image depending on the size of the units provided. Our results show that the VGS network is better able to retrieve the target image when the speech signal is broken down into words than when it is broken down into smaller units such as phones or syllables. Our results agree with the child language acquisition literature suggesting that children segment large units first.
@misc{havard-2022-escop, title = {{Lexical Acquisition/ Start Small and Build up or Start Big and Break Down? A Study on Lexical Acquisition Using Visually Grounded Artificial Neural Networks}}, author = {Havard, William N.}, year = {2022}, month = aug, note = {talk}, howpublished = {{European Society for Cognitive Psychology}} }
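Recall@1, the metric mentioned in the abstract above, is simply the proportion of spoken queries whose correct image is ranked first; a minimal sketch on a toy score matrix:

```python
import numpy as np

def recall_at_k(similarities, k=1):
    """similarities: (n_queries, n_images) scores, where the correct image for
    query i is image i (diagonal ground truth). Returns recall@k."""
    ranking = (-similarities).argsort(axis=1)                       # best image first
    hits = (ranking[:, :k] == np.arange(len(similarities))[:, None]).any(axis=1)
    return hits.mean()

scores = np.random.randn(100, 100)        # toy speech-image similarity matrix
print(recall_at_k(scores, k=1))           # ~0.01 for random scores
```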
-
Modeling and Measuring Children’s Language Development Using Language Models. Yaya Sy, William N. Havard, and Alejandrina Cristia. Talk at the European Society for Cognitive Psychology, Aug 2022
Although research suggests children’s language develops fast, that work is based on a biased sample covering less than 1% of the world’s languages (Kidd & Garcia, 2022). To measure development in many more languages reliably, we assess a potentially scalable method: language models, computational models trained to predict characters in a string. We assessed this for 12 languages for which there was conversational data for training (OpenSubtitles) and test data from the major child language development archive of adult-child interactions (CHILDES), which were phonemized. Results for most languages show that adults’ utterances have low perplexity (indicating that strings of characters are predicted well), which is stable as a function of child age, whereas perplexity for children’s utterances at about 1 year of age is much higher and decreases to converge towards the adults’ by 5 years. This approach can help researchers measure language development, provided there are transcripts of adult-child interactions.
@misc{sy-2022-escop, title = {{Modeling and Measuring Children's Language Development Using Language Models}}, author = {Sy, Yaya and Havard, William N. and Cristia, Alejandrina}, year = {2022}, month = aug, note = {talk}, howpublished = {{European Society for Cognitive Psychology}} }
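The quantity tracked in this study is the perplexity a character-level language model assigns to each (phonemized) utterance; with any autoregressive character model standing in for the placeholder `char_lm`, the computation looks roughly like this:

```python
import math

def perplexity(utterance, char_lm):
    """Perplexity of one utterance under a character-level language model.

    char_lm(prefix, next_char) is a placeholder returning P(next_char | prefix);
    any autoregressive character model can play this role.
    """
    log_prob = sum(math.log(char_lm(utterance[:i], ch)) for i, ch in enumerate(utterance))
    return math.exp(-log_prob / len(utterance))

# Sanity check: under a uniform model over 30 phoneme symbols, perplexity is 30.
uniform = lambda prefix, ch: 1 / 30
print(perplexity("aku si", uniform))   # 30.0 (up to floating-point error)
```

Lower perplexity means an utterance is more predictable to the model, which is how adult and child utterances are compared across ages in the abstract above.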
2018
-
Emergence of Attention in a neural model of Visually Grounded Speech. William N. Havard, Jean-Pierre Chevrot, and Laurent Besacier. Poster at the Learning Language in Humans and in Machines 2018 conference, Jul 2018

Context, be it visual, haptic, or auditory, provides children with all the necessary information to build a coherent mental representation of the world. While acquiring their native language, children learn to map portions of this mental representation to whole or parts of the acoustic realisations they perceive from surrounding speech. The process of extracting meaningful units from a continuous flow of speech is known as lexical segmentation. These units can then further be used by children to analyse novel acoustic realisations and adjust the segmentation of new stimuli. Thus, language acquisition is a dynamic process in which one constantly re-evaluates one's segmentation according to what is perceived, or one's perception according to what is segmented. Context therefore acts as a weak supervision signal which makes the segmentation process easier.
Children are not born with a fully fledged representation of the world that they could use to help them detect and understand acoustic patterns. Rather, speech pattern detection and world understanding are two processes that occur simultaneously, and both start from scratch. This means that the extracted patterns and the mental representation of the world evolve during learning, and that the patterns ultimately extracted will not necessarily remain the same and may become more and more specific.
A few studies try to emulate these processes using computer programs for pattern matching (Roy and Pentland, 2002) and, more recently, deep-learning technologies such as end-to-end neural architectures (Harwath and Glass, 2017; Chrupała et al., 2017). The latter study is interesting as its architecture takes a speech signal and an image as inputs and projects both into a common representation space. Chrupała et al. (2017) analysed the linguistic representations that were learnt and discovered that the first layers tend to encode acoustic information while higher layers encode semantic information. However, this study focused on the representation learnt once all the training data had been seen multiple times and did not explicitly analyse segmentation as a by-product of the original task. Rather than assessing the representation learnt by the neural network once the training stage is completed, our work (in progress) analyses how this representation changes over time. More specifically, we focus on the attention model of the network (a vector representing the focus of the system on different parts of the speech input) and how it correlates with true unit boundaries (at word and chunk level). As far as we know, no other work has analysed the representations learnt by neural networks in such a diachronic fashion.

@misc{havard-etal-2018-emergence, title = {{Emergence of Attention in a neural model of Visually Grounded Speech}}, author = {Havard, William N. and Chevrot, Jean-Pierre and Besacier, Laurent}, year = {2018}, month = jul, url = {https://hal.archives-ouvertes.fr/hal-01970514}, note = {Poster}, howpublished = {{Learning Language in Humans and in Machines 2018 conference}} }
Theses
2021
-
Lexical emergence from context : exploring unsupervised learning approaches on large multimodal language corpora. William N. Havard. PhD thesis, Université Grenoble Alpes, 2021

FR: Ces dernières années, les méthodes d’apprentissage profond ont permis de créer des modèles neuronaux capables de traiter plusieurs modalités à la fois. Les modèles neuronaux de traitement de la Parole Visuellement Contextualisée (PVC) sont des modèles de ce type, capables de traiter conjointement une entrée vocale et une entrée visuelle correspondante. Ils sont couramment utilisés pour résoudre une tâche de recherche d’image à partir d’une requête vocale : c’est-à-dire qu’à partir d’une description orale, ils sont entraînés à retrouver l’image correspondant à la description orale passée en entrée. Ces modèles ont suscité l’intérêt des linguistes et des chercheurs en sciences cognitives car ils sont capables de modéliser des interactions complexes entre deux modalités — la parole et la vision — et peuvent être utilisés pour simuler l’acquisition du langage chez l’enfant, et plus particulièrement l’acquisition lexicale. Dans cette thèse, nous étudions un modèle récurrent de PVC et analysons les connaissances linguistiques que de tels modèles sont capables d’inférer comme sous-produit de la tâche principale pour laquelle ils sont entraînés. Nous introduisons un nouveau jeu de données qui convient à l’entraînement des modèles de PVC. Contrairement à la plupart des jeux de données qui sont en anglais, ce jeu de données est en japonais, ce qui permet d’étudier l’impact de la langue d’entrée sur les représentations apprises par les modèles neuronaux. Nous nous concentrons ensuite sur l’analyse des mécanismes d’attention de deux modèles de PVC, l’un entraîné sur le jeu de données en anglais, l’autre sur le jeu de données en japonais, et montrons que les modèles ont développé un comportement général, valable quelle que soit la langue utilisée, en utilisant leurs poids d’attention pour se focaliser sur des noms spécifiques dans la chaîne parlée. Nos expériences révèlent que ces modèles sont également capables d’adopter un comportement spécifique à la langue en prenant en compte les particularités de la langue d’entrée afin de mieux résoudre la tâche qui leur est donnée. Nous étudions ensuite si les modèles de PVC sont capables d’associer des mots isolés à leurs référents visuels. Cela nous permet d’examiner si le modèle a implicitement segmenté l’entrée parlée en sous-unités. Nous étudions ensuite comment les mots isolés sont stockés dans les poids des réseaux en empruntant une méthodologie issue de la linguistique, le paradigme de gating, et nous montrons que la partie initiale du mot joue un rôle majeur pour une activation réussie. Enfin, nous présentons une méthode simple pour introduire des informations sur les frontières des segments dans un modèle neuronal de traitement de la parole. Cela nous permet de tester si la segmentation implicite qui a lieu dans le réseau est aussi efficace qu’une segmentation explicite. Nous étudions plusieurs types de frontières, allant des frontières de phones aux frontières de mots, et nous montrons que ces dernières donnent les meilleurs résultats. Nous observons que donner au réseau plusieurs frontières en même temps est bénéfique. Cela permet au réseau de prendre en compte la nature hiérarchique de l’entrée linguistique.
EN: In recent years, deep learning methods have allowed the creation of neural models that are able to process several modalities at once. Neural models of Visually Grounded Speech (VGS) are such models, able to jointly process a spoken input and a matching visual input. They are commonly used to solve a speech-image retrieval task: given a spoken description, they are trained to retrieve the closest image that matches the description. Such models have sparked the interest of linguists and cognitive scientists as they are able to model complex interactions between two modalities — speech and vision — and can be used to simulate child language acquisition and, more specifically, lexical acquisition. In this thesis, we study a recurrent-based model of VGS and analyse the linguistic knowledge such models are able to derive as a by-product of the main task they are trained to solve. We introduce a novel data set that is suitable for training models of visually grounded speech. Contrary to most data sets, which are in English, this data set is in Japanese and allows us to study the impact of the input language on the representations learnt by the neural models. We then focus on the analysis of the attention mechanisms of two VGS models, one trained on the English data set, the other on the Japanese data set, and show that the models have developed a language-general behaviour by using their attention weights to focus on specific nouns in the spoken input. Our experiments reveal that such models are also able to adopt a language-specific behaviour by taking into account particularities of the input language so as to better solve the task they are given. We then study whether VGS models are able to map isolated words to their visual referents. This allows us to investigate whether the model has implicitly segmented the spoken input into sub-units. We further investigate how isolated words are stored in the weights of the network by borrowing a methodology stemming from psycholinguistics, the gating paradigm, and show that word onset plays a major role in successful activation. Finally, we present a simple method to introduce segment boundary information into a neural model of speech processing. This allows us to test whether the implicit segmentation that takes place in the network is as effective as an explicit segmentation. We investigate several types of boundaries, ranging from phone to word boundaries, and show that the latter yield the best results. We observe that giving the network several boundaries at the same time is beneficial. This allows the network to take into account the hierarchical nature of the linguistic input.

@phdthesis{havard-these, title = {{Lexical emergence from context : exploring unsupervised learning approaches on large multimodal language corpora}}, author = {Havard, William N.}, year = {2021}, url = {https://tel.archives-ouvertes.fr/tel-03355571}, school = {{Universit{\'e} Grenoble Alpes}}, month = jul }
2017
-
Découverte non supervisée de lexique à partir d’un corpus multimodal pour la documentation des langues en danger. William N. Havard. Master's thesis, May 2017

FR: De nombreuses langues disparaissent tous les ans et ce à un rythme jamais atteint auparavant. Les linguistes de terrain manquent de temps et de moyens afin de pouvoir toutes les documenter et décrire avant qu’elles ne disparaissent à jamais. L’objectif de notre travail est donc de les aider dans leur tâche en facilitant le traitement des données. Nous proposons dans ce mémoire des méthodes d’extraction non supervisées de lexique à partir de corpus multimodaux incluant des signaux de parole et des images. Nous proposons également une méthode issue de la recherche d’information afin d’émettre des hypothèses de signification sur les éléments lexicaux découverts. Ce mémoire présente en premier lieu la constitution d’un corpus multimodal parole-image de grande taille. Ce corpus simulant une langue en danger permet ainsi de tester les approches computationnelles de découverte non supervisée de lexique. Dans une seconde partie, nous appliquons un algorithme de découverte non supervisée de lexique utilisant de l’alignement dynamique temporel segmental (S-DTW) sur un corpus multimodal synthétique de grande taille ainsi que sur un corpus multimodal d’une vraie langue en danger, le Mboshi.
EN: Many languages are on the brink of extinction and disappear each year at a rate never seen before. Field linguists lack the time and the means to document and describe all of them before they die out. The goal of our work is to help them in this task by making data processing and annotation easier and faster. In this dissertation, we propose methods that use an unsupervised term discovery (UTD) system to extract a lexicon from multimodal corpora consisting of speech and images. We also propose a method using information retrieval techniques to hypothesise the meaning of the discovered lexical items. First, this dissertation presents the creation of a large multimodal corpus which includes speech and images. This corpus, simulating that of an endangered language, allows us to evaluate the performance of an unsupervised term discovery system. Second, we apply an unsupervised term discovery system based on segmental dynamic time warping (S-DTW) to a large synthetic multimodal corpus and also to the multimodal corpus of a real endangered language called Mboshi, spoken in Congo-Brazzaville.

@mastersthesis{havard_master_2017, title = {D{\'e}couverte non supervis{\'e}e de lexique {\`a} partir d'un corpus multimodal pour la documentation des langues en danger}, author = {Havard, William N.}, year = {2017}, month = may, pages = {146}, url = {https://dumas.ccsd.cnrs.fr/dumas-01562024}, keywords = {NLP ; Multimodal corpus ; Unsupervised term discovery ; UTD ; Endangered languages ; Documentation ; TAL ; Corpus multimodal ; D{\'e}couverte non supervis{\'e}e ; Langues en danger ; Lexique} }
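Segmental DTW, the core of the term-discovery system applied here, builds on ordinary dynamic time warping between frame sequences; a compact (non-segmental) DTW cost computation, for illustration only:

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local distance between frames
            cost[i, j] = d + min(cost[i - 1, j],          # skip a frame of a
                                 cost[i, j - 1],          # skip a frame of b
                                 cost[i - 1, j - 1])      # align the two frames
    return cost[n, m]

# Two toy MFCC-like sequences of different lengths:
x, y = np.random.randn(40, 13), np.random.randn(55, 13)
print(dtw_cost(x, y))
```

S-DTW applies this alignment locally, along diagonal bands, so that recurring sub-sequences (candidate lexical items) can be discovered inside longer utterances rather than aligning whole utterances end to end.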