Synthetically Spoken STAIR

A spoken extension of the STAIR dataset (Japanese)

Posted: September 29, 2021

This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (2017), we generated speech for each caption of the STAIR dataset using Google’s Text-to-Speech API.
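As a rough illustration of the generation step, the sketch below builds the request body for one caption and decodes the audio from a response. It assumes the public Google Cloud Text-to-Speech REST endpoint (`text:synthesize`) and a `LINEAR16` output encoding. Neither choice is confirmed by this page, and the helper names `build_tts_request` and `decode_tts_response` are hypothetical. No request is actually sent, so no credentials are needed to run it.

```python
import base64
import json

# Assumption: the public Cloud Text-to-Speech REST endpoint. The page only says
# "Google's Text-to-Speech API", so the exact client used may have differed.
TTS_ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_tts_request(caption: str, language_code: str = "ja-JP") -> dict:
    """Build the JSON body for one synthesis request (nothing is sent)."""
    return {
        "input": {"text": caption},
        "voice": {"languageCode": language_code},
        # Assumption: uncompressed PCM output; the encoding actually used is not stated.
        "audioConfig": {"audioEncoding": "LINEAR16"},
    }


def decode_tts_response(response_body: str) -> bytes:
    """Return the raw audio bytes from a response whose audio is base64-encoded."""
    return base64.b64decode(json.loads(response_body)["audioContent"])


# Print the body that would be POSTed to TTS_ENDPOINT for one sample caption.
body = build_tts_request("靴やサンダルにじゃれている猫")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Keeping request construction separate from the network call makes it easy to loop over all captions and to retry failed requests without rebuilding anything.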

The dataset contains 616,767 spoken captions (representing roughly 790+ hours of speech). It includes the audio recordings but DOES NOT include the original images, which should be downloaded separately from the MSCOCO website.
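The two figures above give a quick sanity check on utterance length. The short calculation below treats 790 hours as a lower bound, so the result is an approximate minimum average duration, not a measured statistic.

```python
NUM_CAPTIONS = 616_767
TOTAL_HOURS = 790  # lower bound: the page states "790+" hours


def mean_caption_seconds(total_hours: float, num_captions: int) -> float:
    """Average duration of one spoken caption, in seconds."""
    return total_hours * 3600 / num_captions


print(f"{mean_caption_seconds(TOTAL_HOURS, NUM_CAPTIONS):.2f} s per caption")
# prints "4.61 s per caption"
```

An average of a little over 4.6 seconds is consistent with short, single-sentence image captions.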

The dataset is fully described in Havard et al. (2019) and may be downloaded from Zenodo. If you use this data in your own publications, please cite our ICASSP paper.


Sample

Image 100008 from MSCOCO (https://cocodataset.org/#explore?id=100008).
Original photography by Shelly Sim (https://www.flickr.com/photos/shelly/3846434688/), BY-NC-ND 2.0.
Caption Audio
靴に入って猫がぐっすりと寝ている
The cat is sleeping soundly in his shoes
玄関の靴の上に猫が座って靴をかじっている
A cat sits on the front door shoes and bites the shoes
靴やサンダルにじゃれている猫
Cat playing with shoes and sandals
猫がサンダルや靴の上で寛いでいる
Cat is relaxing on sandals and shoes
猫が靴やサンダルの上に乗って寝転んでいる
Cat is lying on shoes or sandals

See the paired image on the MSCOCO website (Image ID: 100008). The translations are given for illustration only and are not included in the dataset; they were generated by Google’s MT system. The Japanese captions are not included either and must be downloaded from the STAIR website.
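Because the images and the Japanese captions are distributed separately, they have to be joined on the MSCOCO image ID. The sketch below groups captions by image ID, assuming the caption file follows the MSCOCO annotation layout (a list of records with `image_id`, `id`, and `caption` fields). The caption `id` values shown are invented for illustration, and the naming of the audio files is not covered here because this page does not describe it.

```python
from collections import defaultdict


def captions_by_image(annotations: list[dict]) -> dict[int, list[str]]:
    """Group caption strings by the MSCOCO image ID they describe."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)


# Two of the sample captions for image 100008; the "id" values are hypothetical.
sample_annotations = [
    {"image_id": 100008, "id": 1, "caption": "靴に入って猫がぐっすりと寝ている"},
    {"image_id": 100008, "id": 2, "caption": "靴やサンダルにじゃれている猫"},
]

grouped = captions_by_image(sample_annotations)
print(len(grouped[100008]))  # prints 2
```

With a real annotation file, the same function could be applied to the list loaded from the JSON file's annotations, assuming that layout holds.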

