Monday 5th July 2021
Lexical emergence from context: exploring unsupervised learning approaches on large multimodal language corpora
In recent years, deep learning methods allowed the creation of neural models that are able to process several modalities at once. Neural models of Visually Grounded Speech (VGS) are such kind of models and are able to jointly process a spoken input and a matching visual input. They are commonly used to solve a speech-image retrieval task: given a spoken description, they are trained to retrieve the closest image that matches the description. Such models sparked interest in linguists and cognitive scientists as they are able to model complex interactions between two modalities — speech and vision — and can be used to simulate child language acquisition and, more specifically, lexical acquisition.  
In this thesis, we study a recurrent-based model of VGS and analyse the linguistic knowledge such models are able to derive as a by-product of the main task they are trained to solve. We introduce a novel data set that is suitable to train models of visually grounded speech. Contrary to most data sets that are in English, this data set is in Japanese and allows us to study the impact of the input language on the representations learnt by the neural models. We then focus on the analysis of the attention mechanisms of two VGS models, one trained on the English data set, the other on the Japanese data set, and show the models have developed a language-general behaviour by using their attention weights to focus on specific nouns in the spoken input. Our experiments reveal that such models are able to adopt a language-specific behaviour by taking into account particularities of the input language so as to better solve the task they are given. We then study if VGS models are able to map isolated words to their visual referents. This allows us to investigate if the model has implicitly segmented the spoken input into sub-units. We further investigate how isolated words are stored in the weights of the network by borrowing a methodology stemming from psycholinguistics, the gating paradigm, and show that word onset plays a major role in successful activation. Finally, we introduce a simple method to introduce segment boundary information in a neural model of speech processing. This allows us to test if the implicit segmentation that takes place in the network is as effective as an explicit segmentation. We investigate several types of boundaries, ranging from phone to word boundaries, and show the latter yield the best results. We observe that giving the network several boundaries at the same time is beneficial. This allows the network to take into account the hierarchical nature of the linguistic input.
Mis à jour le 29 June 2021