Speech-COCO is an extension of the MSCOCO image captioning dataset. We used Voxygen's text-to-speech system to synthesise the written captions. The addition of speech as a new modality enables MSCOCO to be used for research in areas such as language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.
The dataset contains 616,767 spoken captions (roughly 600+ hours of speech) synthesised with 8 different voices (British and American accents). It includes the audio recordings and their alignments at the phoneme, syllable and word level, but it DOES NOT include the original images, which must be downloaded separately from the MSCOCO website.
| Voice | Caption |
|-------|---------|
| Elizabeth | a cat resting on the ground on top of various shoes. |
| Bruce | a cat resting on the ground on a pile of shoes. |
| Elizabeth | a brown and white cat is laying on a pile of shoes |
| Bruce | a cat laying on top of shoes on a floor. |
| Paul | a cat laying on top of some shoes |
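As a minimal sketch of how the word-level alignments can be used, the snippet below cuts one WAV file per word out of a synthesised caption. The file names (`000042_Elizabeth.json`, `000042_Elizabeth.wav`) and the alignment schema (a `"words"` list of `{"word", "start", "end"}` entries with times in seconds) are assumptions for illustration only; check the distributed alignment files for the actual layout and field names.

```python
import json
import wave

# Hypothetical file names; the actual dataset layout may differ.
ALIGNMENT_FILE = "000042_Elizabeth.json"   # per-caption word alignment (assumed JSON)
AUDIO_FILE = "000042_Elizabeth.wav"        # corresponding synthesised caption


def extract_word_segments(alignment_path, audio_path, out_prefix="word"):
    """Write one WAV file per aligned word, using assumed start/end times in seconds."""
    with open(alignment_path, encoding="utf-8") as f:
        alignment = json.load(f)

    with wave.open(audio_path, "rb") as wav:
        params = wav.getparams()
        frames = wav.readframes(params.nframes)

    bytes_per_frame = params.sampwidth * params.nchannels
    # Assumed schema: alignment["words"] is a list of {"word", "start", "end"} dicts.
    for i, entry in enumerate(alignment["words"]):
        begin = int(entry["start"] * params.framerate) * bytes_per_frame
        end = int(entry["end"] * params.framerate) * bytes_per_frame
        with wave.open(f"{out_prefix}_{i:03d}_{entry['word']}.wav", "wb") as out:
            out.setparams(params)
            out.writeframes(frames[begin:end])


if __name__ == "__main__":
    extract_word_segments(ALIGNMENT_FILE, AUDIO_FILE)
```

The same slicing approach applies to the phoneme- and syllable-level alignments, provided the corresponding time stamps are read from the alignment files instead.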
- Havard, W., Besacier, L., & Rosec, O. (2017). SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set. Proc. GLU 2017 International Workshop on Grounding Language Understanding, 42–46. https://doi.org/10.21437/GLU.2017-9