A spoken extension of the MSCOCO dataset (English)

Posted: September 29, 2021 · 2 min read

Speech-COCO is an extension of the MSCOCO image recognition and captioning dataset. We used Voxygen's text-to-speech system to synthesise the written captions. The addition of speech as a new modality enables MSCOCO to be used for research on language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.

The dataset contains 616,767 spoken captions (roughly 600 hours of speech) synthesised with eight different voices, with British and American accents. It includes the audio recordings and their alignments at the phoneme, syllable, and word levels, but it does NOT include the original images, which must be downloaded separately from the MSCOCO website.

The dataset is fully described in Havard et al. (2017) and can be downloaded from Zenodo. If you use these data in your own publications, please cite our GLU paper.


Image 100008 from MSCOCO.
Original photography by Shelly Sim BY-NC-ND 2.0.
| Voice | Caption |
| --- | --- |
| Elizabeth | a cat resting on the ground on top of various shoes. |
| Bruce | a cat resting on the ground on a pile of shoes. |
| Elizabeth | a brown and white cat is laying on a pile of shoes |
| Bruce | a cat laying on top of shoes on a floor. |
| Paul | a cat laying on top of some shoes |


  1. Havard, W., Besacier, L., & Rosec, O. (2017). SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set. Proc. GLU 2017 International Workshop on Grounding Language Understanding, 42–46. https://doi.org/10.21437/GLU.2017-9