Spontaneous Speech: Data and Analysis

Announcement of the Workshop "Spontaneous Speech: Data and Analysis"

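In recent years, serious work on spontaneous speech has begun in many parts of the world. As part of this fiscal year's international symposium, the National Institute for Japanese Language will hold the workshop below on the theme of spontaneous speech. We warmly invite everyone to attend.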

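○Title: Spontaneous Speech: Data and Analysis
○Date and time: Thursday, August 29, 2002
　9:30 a.m. to 5:30 p.m.
○Venue: Auditorium, National Institute for Japanese Language (Building 1, 5th floor)
　3-9-14 Nishigaoka, Kita-ku, Tokyo 115-8620
○Language: English (no interpretation will be provided)
○Admission: Free, but advance registration is requested.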
--------------------------------------------------------------------------------
○Program

## The order of the talks and other details are subject to change.
## The final program will be announced at a later date.
## Abstracts of the talks follow the program.

SESSION 1 (9:30-11:40)

Yasuo Horiuchi (Chiba University) "Annotation of Gesture in Speech Dialogue"
Syun Tutiya (Chiba University) "Referring in Spontaneous Speech in a Language that Lacks Articles"
Janice Fon (Dept. English, National Taiwan Normal University) "A Cross-Linguistic Study of Discourse and Syntactic Boundary Cues in Spontaneous Speech"

SESSION 2 (13:00-15:00)

Keith Johnson (Dept. Linguistics, The Ohio State University) "Choices and Strategies in the Construction of a New Spontaneous Speech Corpus"
William Raymond (Dept. Linguistics, The Ohio State University) "Coding Consistency in the Transcription of Spontaneous Speech from the Buckeye Corpus"
Shu-Chuan Tseng (Institute of Linguistics, Academia Sinica) "Annotation and Database Construction of Spontaneous Mandarin Data"

SESSION 3 (15:15-16:35)

Kikuo Maekawa (National Inst. for Japanese Language) "Outline of the Corpus of Spontaneous Japanese Project"
Hideaki Kikuchi (National Inst. for Japanese Language) "Segmental and Prosodic Labeling of the Corpus of Spontaneous Japanese"

GENERAL DISCUSSION (16:45-17:30)
--------------------------------------------------------------------------------

ABSTRACTS

Referring in Spontaneous Speech in a Language that Lacks Articles Syun Tutiya
The act of referring to an object which has been introduced to the universe of discourse has been discussed in philosophy and linguistics. Uses of definite and indefinite articles and their interpretations have been analyzed fruitfully with linguistic data from English and other modern European languages. Japanese lacks articles and requires a different approach to the analysis of referring. We will first describe how speakers of the language refer to objects already introduced, secondly characterize the uses of accompanying adnominal expressions, thirdly discuss the frequency of occurrence of such expressions in the Japanese Map Task Corpus, and finally suggest a generalized description of the referring act in Japanese.

Annotation of Gesture in Speech Dialogue Yasuo Horiuchi
A new method for annotating speakers' gestures in speech dialogue is proposed. This method, developed on the basis of the annotation methods of the Japanese Map Task Dialogue Corpus, can represent the temporal relationship between gestures and utterances. Annotations produced by the proposed method can be translated into the TEI P4 format with equivalent information for data exchange among researchers. Preliminary application of the method to some selected dialogues revealed its effectiveness for the analysis of speakers' gestures.

A Cross-Linguistic Study of Discourse and Syntactic Boundary Cues in Spontaneous Speech Janice Fon
This study focuses on the relationship between discourse and syntactic boundaries and acoustic and prosodic cues in four divergent languages: English, Guoyu, Putonghua, and Japanese. Speech was elicited by having talkers describe the events in The Pear Story film. Recorded data were transcribed and segmented into discourse and syntactic units, and measurements of F0, syllable duration, syllable onset intervals (SOIs), and peak syllable RMS amplitude were taken. The two dimensions (discourse/syntax and acoustics) were then compared in order to examine boundary and hierarchy cues in speech.
Results showed that both language-universal and language-specific cues exist. Final lengthening and initial strengthening are the most universal cues for signaling boundary. Pitch reset is also prevalent and is found in all languages but English. However, there are also language-specific cues. Final strengthening exists in English and Japanese while final weakening is found in Guoyu and Putonghua. English also has an initial lengthening effect for pitch-accented syllables. Mandarin is interesting in that it starts the final lengthening process early.
Boundary cues are often modulated by hierarchy, although modulation differs with cues across languages. Syllable duration reflects discourse boundary strength in English, Guoyu, and Japanese, but not Putonghua.
Utterance-initial syllable duration correlates positively with hierarchy in English while utterance-final syllables correlate negatively in Guoyu and Japanese. In all four languages, boundary SOI is lengthened as boundary strength increases and is considered the most universal cue. In terms of peak syllable RMS amplitude, English, Guoyu, and Putonghua correlate initial amplitude patterning with hierarchy positively while Guoyu, Putonghua, and Japanese correlate final amplitude patterning in a negative fashion. In Guoyu, Putonghua, and Japanese, the magnitude of pitch reset is reflective of hierarchy in a positive manner.

Choices and Strategies in the Construction of a New Spontaneous Speech Corpus Keith Johnson
The work that I will describe in this paper was done by a team, as is all corpus work. Our team includes Mark Pitt, Beth Hume, Scott Kiesling, Bill Raymond, and Jen Muller.
The ViC corpus is a new corpus that we are in the process of constructing at Ohio State University. ViC stands for "Variation in Conversation", and the corpus is designed to be used as a tool to study the phonetics, phonology, and psycholinguistics of spoken language communication. This paper is about (1) the choices that we made in constructing the ViC corpus (the speakers, interview format, audio quality, etc.) and (2) the strategies that we are using to label the corpus. Because of our research interests, the ViC corpus was designed to provide high-quality audio signals from a rather homogeneous group of speakers who were speaking completely naturally. Our choices led to very natural, unmonitored speech, and this success actually complicates the corpus labelling work quite a bit.

Coding Consistency in the Transcription of Spontaneous Speech from the Buckeye Corpus William Raymond
We present an analysis of transcription consistency in lexical and phonetic labeling and segmentation of the spontaneous speech from the Buckeye speech corpus. The corpus consists of recorded interviews with over 40 speakers and is currently being transcribed at the Ohio State University. Our research goal is to create a corpus of informal, spontaneous speech and ultimately to use the data to investigate the role of pronunciation variation in production and comprehension.
A test of inter-transcriber agreement across four transcribers was conducted using samples of speech from four speakers in the corpus. We find that there are patterns of disagreement among transcribers on the identification of phones by segment manner and place classes that are similar to consistency patterns found in read speech. Production fluency and prosodic factors also play a role in both phone identification consistency and consistency in label placement. A small amount of disagreement is even seen in word identification, but word disagreement also exhibits consistent patterns.
Measures of inter-transcriber agreement are being used to help maximize uniform coding of the corpus. In addition, consistency measures are helping our lab to identify phenomena that pose inherent coding difficulties in this type of speech, which may be a source of measured pronunciation variation.
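The abstract does not say which agreement measures were used. As a generic illustration only, the sketch below computes two common ones, raw percent agreement and Cohen's kappa, for every pair of transcribers. It assumes the label sequences have already been aligned segment by segment; the function names and the toy labels are hypothetical and do not come from the Buckeye corpus.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of aligned positions where two transcribers chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    p_expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both transcribers used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def pairwise_kappa(labels_by_transcriber):
    """Kappa for every pair of transcribers, keyed by the pair of names."""
    return {
        (t1, t2): cohens_kappa(labels_by_transcriber[t1], labels_by_transcriber[t2])
        for t1, t2 in combinations(sorted(labels_by_transcriber), 2)
    }


if __name__ == "__main__":
    # Toy, made-up phone labels for four aligned segments (illustration only).
    toy = {
        "T1": ["t", "d", "t", "n"],
        "T2": ["t", "t", "t", "n"],
        "T3": ["t", "d", "t", "n"],
    }
    for pair, kappa in pairwise_kappa(toy).items():
        print(pair, round(kappa, 3))
```

Grouping the same comparison by manner or place class would show where disagreement concentrates, which is the kind of pattern the abstract reports.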

Annotation and Database Construction of Spontaneous Mandarin Data Shu-Chuan Tseng
This talk is concerned with the collection and annotation of the Mandarin Conversational Dialogue Corpus (MCDC). MCDC consists of 30 digitized spontaneous Mandarin conversations, with a total length of approximately 26 hours. In order to mark up spontaneous speech phenomena in various linguistic aspects, the MCDC annotation system takes into account socio-linguistic phenomena such as code switching and pronunciation variation, disfluent sequences such as pauses, repairs, repetitions, and word fragments, as well as other paralinguistic features such as non-speech verbal sounds. For transcribing and annotating the speech data, a working interface, TransList, has been developed 1) to note down the spoken content orthographically in both Pinyin and Chinese characters, 2) to include speaker and transcriber information, 3) to locate and operate audio files containing the speech data, 4) to mark the start and end time of each speaker turn, and 5) to insert annotation tags.
Subsequently, the transcribed and annotated data contents are transformed into a lexical database form. For constructing the lexical database, "character" is taken to be the basic unit instead of "word". The reason is two-fold. First, word segmentation principles in Chinese vary from one definition to the other. Second, by using the character as the construction unit, we are free to undertake queries and analyses on the data. In our database, each row represents a specific character, with all its related information transformed from the annotation of the transcript as attributes. Queries can be done with respect to specific constraints such as speaker, annotated tag, pronunciation, or character. Our database allows a high degree of flexibility in investigating spontaneous speech occurrences. The final part of this talk will take syllable contraction as an example to illustrate how users can make use of our database. In Mandarin, a syllable is quasi-equivalent to a morpheme. Even though the mapping is not exactly one-to-one, the interaction between syllabic structure and the morpho-syntactic composition of syllables is worth investigating. This issue is not only interesting for purely linguistic studies: in other research fields, such as automatic recognition systems, knowledge of syllabic structure and its realization in spontaneous speech is urgently needed, because syllables that are deleted or contracted in spontaneous speech often lead to serious problems in recognizing the correct lexical words. Results of analyses on our data marked as "syllable contraction" by the MCDC annotation system will be discussed in detail.
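The character-per-row design described above can be sketched with an in-memory SQLite table. This is only an illustration under stated assumptions: the table name, the column names, the tag string, and the toy rows are all hypothetical and are not taken from the actual MCDC database or the TransList tool.

```python
import sqlite3

# Hypothetical schema: one row per character, annotation stored as attributes.
COLUMNS = ("speaker", "turn", "position", "character", "pinyin", "tag")


def build_db(rows):
    """Create an in-memory table with one row per transcribed character."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chars (speaker TEXT, turn INTEGER, position INTEGER, "
        "character TEXT, pinyin TEXT, tag TEXT)"
    )
    conn.executemany("INSERT INTO chars VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def query(conn, **constraints):
    """Return rows matching every given column=value constraint."""
    unknown = set(constraints) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = "SELECT speaker, turn, position, character, pinyin, tag FROM chars"
    if constraints:
        # Column names are checked against COLUMNS above; values are bound safely.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in constraints)
    sql += " ORDER BY speaker, turn, position"
    return conn.execute(sql, tuple(constraints.values())).fetchall()


if __name__ == "__main__":
    # Toy, made-up rows for illustration only (not MCDC data).
    toy_rows = [
        ("A", 1, 1, "這", "zhe4", "syllable_contraction"),
        ("A", 1, 2, "樣", "yang4", "syllable_contraction"),
        ("B", 2, 1, "好", "hao3", None),
    ]
    conn = build_db(toy_rows)
    for row in query(conn, tag="syllable_contraction"):
        print(row)
```

Because every character carries its own speaker, tag, and pronunciation attributes, constraints can be combined freely, for example `query(conn, speaker="A", character="這")`, without committing to any particular word segmentation.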

Outline of the Corpus of Spontaneous Japanese Project Kikuo Maekawa
The Corpus of Spontaneous Japanese, or CSJ, is Japan's first full-fledged corpus of spontaneous speech. We have been compiling it since 1999, aiming at a public release in the spring of 2004.
CSJ can be called full-fledged in terms of its size (over 7,000,000 morphemes) and richness of annotation. The target variety of the CSJ is so-called monologue in Standard Japanese. More concretely, CSJ contains both more than 300 hours of 'academic presentation speech' (APS, 1000 speakers) and more than 300 hours of 'simulated public speech' (SPS, 500 speakers balanced in both sex and age). The former consists of live recordings of academic presentations given at nine different academic societies, and the latter of studio-recorded speech on everyday subjects in front of a small audience. In addition to these two main speech categories, a small amount (less than 20 hours) of dialogue speech by participants in APS and/or SPS will also be provided.
Speech sounds are recorded using close-talking directional microphones and digital tape recorders (DAT). Recorded speech is transcribed and annotated using a tag set of about 20 tags covering both linguistic (filled pauses, word fragments, meta-linguistic expressions, etc.) and non-linguistic (noises, laughter, etc.) information. Morphological information (i.e., word boundary and part-of-speech information) is provided for all transcription texts.
In addition to the above, segmental and intonation labels are provided for about 40 hours of speech which is a true subset of the CSJ. This subset is called the Core.
In this talk, I will touch upon topics such as design issues, including the development of various annotation schemes, the current status of the compilation work, and some preliminary results of corpus evaluation.

Segmental and Prosodic Labeling of the Corpus of Spontaneous Japanese Hideaki Kikuchi
In this talk, I will present some guidelines for, and problems in, segmental and prosodic labeling in the Corpus of Spontaneous Japanese (CSJ). Since 1999, we have been involved in the construction of a large-scale corpus of spontaneous speech known as the CSJ. This corpus comprises digitized speech, transcriptions, and POS annotation for about 650 hours of spontaneous speech, corresponding to about 7,000,000 morphemes. In addition, we will provide segment labels and intonation labels for a true subset of the CSJ, called the Core, that contains about 45 hours of speech, or 500,000 morphemes.
For segmental labeling, we prepared an inventory of segmental labels that are not purely phonemic but phonetic, or sub-phonemic. We also prepared some symbols for labeling events such as the closure of stop consonants and the continuation of formants after the end of phonation. For intonation labeling, we proposed the new X-JToBI prosodic labeling scheme, an extended version of J_ToBI that has grown out of our work on annotating prosodic features of spontaneous speech. Among the new characteristics of X-JToBI are 1) exact match between the time-stamp of tone labels and the timing of physical events, 2) enlargement of the inventory of boundary pitch movements, 3) extension and ramification of the usage of break indices, and 4) newly defined labels for filled pauses and non-lexical prominence.
The results of preliminary analyses of inter-transcriber reliability in segmental and prosodic labeling will also be discussed.

[REFERENCE]

Venditti, J. (1997). "Japanese ToBI Labeling Guidelines." OSU Working Papers in Linguistics, 50, 127-162. (http://www.ling.ohio-state.edu/phonetics/J_ToBI/)