Showa Speech Corpus (SSC)

▶Japanese

Regarding the full-fledged publication of SSC

In FY2016, the Chronological Change Team from the National Institute for Japanese Language and Linguistics' Everyday Conversation Corpus project began to collect audio data recorded by the institute from the 1950s through the 1970s, and has been organizing the data into the Showa Speech Corpus (SSC). It was completed in March 2021 as a corpus comprising approximately 44 hours of audio materials (about 530,000 words), which has been made available to the public. In addition, all related data, including the audio data, transcripts and morphological information, became downloadable in March 2022.

The Showa Speech Corpus is available through two channels: Chunagon and distribution of related data.

ninjal 1955

The National Institute for Japanese Language in 1955

Background of the corpus

The National Institute for Japanese Language and Linguistics, which was established in 1948, began to carry out research on spoken language in the early 1950s. They used reel-to-reel recorders, which had just entered the market at the time, to record conversations in many different situations of everyday life, as well as monologues such as public speeches, lectures and addresses. The recorded sounds were transcribed (into text) and the transcripts were utilized to analyze the characteristics of spoken language. The results of these research activities were compiled into reports, such as "Danwago no Jittai (Realities of Discourse Language)" (1955), " Hanashi Kotoba no Bunkei 1: Taiwa Shiryo niyoru Kenkyu (Structure Patterns of Spoken Language (1): Study Based on Conversations)" (1960), and "Hanashi Kotoba no Bunkei 2: Dokuwa Shiryo niyoru Kenkyu (Structure Patterns of Spoken Language (2): Study Based on Monologues)" (1963).

recording 1955

Recording scenes in 1950's
(National Institute for Japanese Language and Linguistics : Survey and guide 1955)

The collection of recorded audio materials was subsequently stored in the resource library of the National Institute for Japanese Language and Linguistics. Although the institute began to digitize its reel-to-reel audio materials in 1990s, the audio data remained unpublicized. However, the Chronological Change Team from the Institute's Everyday Conversation Corpus project, which was launched in 2016, decided to reorganize the data recorded at that time into the Showa Speech Corpus and release it to the public.

The Showa Speech Corpus comprises approximately 44 hours of audio data (17 hours of monologues and 27 hours of conversations) recorded from the 1950s through the 1970s, and related data (including the transcripts, morphological information on about 530,000 words, and meta-data).

Details of the corpus

The data size of the Showa Speech Corpus is as follows:

Number of files: 123 files (50 monologue files and 73 conversation files)
Number of recording hours: About 44 hours (17 hours of monologues and 27 hours of conversations)
Total number of words (excluding symbols): 528,589 words (180,272 words of monologues and 348,317 words of conversations)
Number of different speakers: 393 people (50 monologue speakers and 343 conversation speakers)

For lists of files included in the Showa Speech Corpus, see the PDF files below.

Here is the breakdown of recorded words, including symbols, by type (monologue or conversation) and gender (male or female):

Type／Gender	Male	Female	Unknown	Total
Monologue	177,656	2,616	---	180,272
Conversation	219,740	128,464	113	348,317

How to use the corpus

The Showa Speech Corpus can be utilized by the following two methods:

Perform searches on Chunagon
If you have a Chunagon account, you can apply for additional corpus use via this page to use the corpus. If you do not have a Chunagon account, you can register yourself as a user via this page to use the corpus.
Download Showa Speech Corpus-related data

You can start to use the corpus by downloading Showa Speech Corpus-related data on Chunagon. If you have a Chunagon account, you can apply for additional corpus use via this page to use the corpus.

The Showa Speech Corpus-related data comprises the following:
- Audio files (WAV format, 16 bit, 44.1 kHz): 123 files
- Transcripts (TSV format, with time information, UTF-8): 123 files
- Transcripts (TextGrid format): 123 files
- Morphological information data (TSV format, UniDic short unit): 534,128 words
- Search environment supported by the Himawari full-text search system
- Meta-data (information about the recorded materials, speakers, etc.)

Reference literature

To publish your research using the Showa Speech Corpus, it is necessary to list the following literature in the reference section (which may be subject to updates in the future):

Maruyama, Takehiko, Hanae Koiso, Ken'ya Nishikawa (2022) [Design and Construction of the Showa Speech Corpus] Syowa hanashi kotoba corpus no sekkei to kouchiku (in Japanese), NINJAL Research Papers 22. pp.197-221. [link]
Maruyama, Takehiko (2020) ``On the Possibility of a Diachronic Speech Corpus of Japanese'', Andrej Bekeš and Irena Srdanović (Ed.), Japanese Language from Empirical Perspective: Corpus-based studies and studies on discourse. pp.219-234. Ljubljana: Znanstvena založba FF. [link] / [PDF]

Contact information

A Comprehensive Study of Spoken Language Using a Multi-Generational Corpus of Japanese Conversation