Address: 10-2 Midori-cho, Tachikawa City, Tokyo 190-8561 Tel. +81-42-540-4300 / Fax +81-42-540-4333
© National Institute for Japanese Language and Linguistics
In FY2016, the Chronological Change Team from the National Institute for Japanese Language and Linguistics' Everyday Conversation Corpus project began to collect audio data recorded by the institute from the 1950s through the 1970s, and has been organizing the data into the Showa Speech Corpus (SSC). It was completed in March 2021 as a corpus comprising approximately 44 hours of audio materials (about 530,000 words), which has been made available to the public. In addition, all related data, including the audio data, transcripts and morphological information, became downloadable in March 2022.
The Showa Speech Corpus is available through two channels: Chunagon and distribution of related data.
The National Institute for Japanese Language in 1955
The National Institute for Japanese Language and Linguistics, which was established in 1948, began to carry out research on spoken language in the early 1950s. They used reel-to-reel recorders, which had just entered the market at the time, to record conversations in many different situations of everyday life, as well as monologues such as public speeches, lectures and addresses. The recorded sounds were transcribed (into text) and the transcripts were utilized to analyze the characteristics of spoken language. The results of these research activities were compiled into reports, such as "Danwago no Jittai (Realities of Discourse Language)" (1955), " Hanashi Kotoba no Bunkei 1: Taiwa Shiryo niyoru Kenkyu (Structure Patterns of Spoken Language (1): Study Based on Conversations)" (1960), and "Hanashi Kotoba no Bunkei 2: Dokuwa Shiryo niyoru Kenkyu (Structure Patterns of Spoken Language (2): Study Based on Monologues)" (1963).
Recording scenes in 1950's
(National Institute for Japanese Language and Linguistics : Survey and guide 1955)
The collection of recorded audio materials was subsequently stored in the resource library of the National Institute for Japanese Language and Linguistics. Although the institute began to digitize its reel-to-reel audio materials in 1990s, the audio data remained unpublicized. However, the Chronological Change Team from the Institute's Everyday Conversation Corpus project, which was launched in 2016, decided to reorganize the data recorded at that time into the Showa Speech Corpus and release it to the public.
The Showa Speech Corpus comprises approximately 44 hours of audio data (17 hours of monologues and 27 hours of conversations) recorded from the 1950s through the 1970s, and related data (including the transcripts, morphological information on about 530,000 words, and meta-data).
The data size of the Showa Speech Corpus is as follows:
For lists of files included in the Showa Speech Corpus, see the PDF files below.
Here is the breakdown of recorded words, including symbols, by type (monologue or conversation) and gender (male or female):
Type/Gender | Male | Female | Unknown | Total |
---|---|---|---|---|
Monologue | 178,407 | 2,621 | --- | 180,028 |
Conversation | 219,552 | 133,401 | 147 | 353,100 |
The Showa Speech Corpus can be utilized by the following two methods:
Perform searches on Chunagon
If you have a Chunagon account, you can apply for additional corpus use via this page to use the corpus. If you do not have a Chunagon account, you can register yourself as a user via this page to use the corpus.
Download Showa Speech Corpus-related data
You can start to use the corpus by downloading Showa Speech Corpus-related data on Chunagon. If you have a Chunagon account, you can apply for additional corpus use via this page to use the corpus.
The Showa Speech Corpus-related data comprises the following:
To publish your research using the Showa Speech Corpus, it is necessary to list the following literature in the reference section (which may be subject to updates in the future):