Synthetically Spoken STAIR

A spoken extension of the STAIR dataset (Japanese)

Posted: September 29, 2021 · 2 min read

This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (2017), we generated speech for each caption of the STAIR dataset using Google’s Text-to-Speech API.

The dataset contains 616,767 spoken captions, representing roughly 790+ hours of speech. It includes the audio recordings but does not include the original images, which must be downloaded separately from the MSCOCO website.
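As a back-of-the-envelope check on these figures, the reported totals imply an average clip length of about 4.6 seconds per caption (a quick calculation for illustration, not part of the released data):

```python
# Sanity check of the reported dataset statistics:
# 616,767 captions and roughly 790 hours of speech.
num_captions = 616_767
total_hours = 790  # the post says "790+ hrs", so this is a lower bound

total_seconds = total_hours * 3600
avg_clip_seconds = total_seconds / num_captions
print(f"average clip length: {avg_clip_seconds:.2f} s")  # ~4.61 s
```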

The dataset is fully described in our paper (Havard et al., 2019) and can be downloaded from Zenodo. If you use this data in your own publications, please cite our ICASSP paper.


Image 100008 from MSCOCO.
Original photograph by Shelly Sim (CC BY-NC-ND 2.0).
Captions (English translations):

- The cat is sleeping soundly in his shoes
- A cat sits on the front door shoes and bites the shoes
- Cat playing with shoes and sandals
- Cat is relaxing on sandals and shoes
- Cat is lying on shoes or sandals

See the paired image on the MSCOCO website (Image ID: 100008). The translations above are given for illustration purposes only and are not included in the dataset (they were generated by Google’s MT system). The Japanese captions are not included either; they can be downloaded from the STAIR website.


  1. Chrupała, G., Gelderloos, L., & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 613–622.
  2. Havard, W. N., Chevrot, J.-P., & Besacier, L. (2019). Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8618–8622.