Synthetically Spoken STAIR

A spoken extension of the STAIR dataset (Japanese)

Posted: September 29, 2021 · 2 min read

This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (2017), we generated speech for each caption of the STAIR dataset using Google’s Text-to-Speech API.

The dataset contains 616,767 spoken captions, representing roughly 790+ hours of speech. It includes the audio recordings but does not include the original images, which must be downloaded separately from the MSCOCO website.
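As a back-of-the-envelope check on these figures, the reported totals imply an average clip length of about 4.6 seconds per caption (a quick calculation for illustration, not part of the released data):

```python
# Sanity check of the reported dataset statistics:
# 616,767 captions and roughly 790 hours of speech.
num_captions = 616_767
total_hours = 790  # the post says "790+ hrs", so this is a lower bound

total_seconds = total_hours * 3600
avg_clip_seconds = total_seconds / num_captions
print(f"average clip length: {avg_clip_seconds:.2f} s")  # ~4.61 s
```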

The dataset is fully described in our paper (Havard et al., 2019) and can be downloaded from Zenodo. If you use this data in your own publications, please cite our ICASSP paper.


Image 100008 from MSCOCO.
Original photograph by Shelly Sim (CC BY-NC-ND 2.0).
Captions (English translations):

- The cat is sleeping soundly in his shoes
- A cat sits on the front door shoes and bites the shoes
- Cat playing with shoes and sandals
- Cat is relaxing on sandals and shoes
- Cat is lying on shoes or sandals

See the paired image on the MSCOCO website (Image ID: 100008). The translations above are given for illustration purposes only and are not included in the dataset (they were generated by Google’s MT system). The Japanese captions are not included either; they can be downloaded from the STAIR website.


  1. Chrupała, G., Gelderloos, L., & Alishahi, A. (2017). Representations of language in a model of visually grounded speech signal. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 613–622.
  2. Havard, W. N., Chevrot, J.-P., & Besacier, L. (2019). Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8618–8622.