International Symposium


International Symposium on Diachronic Speech Corpora

Date: 4th September, 2017
Venue: The National Institute for Japanese Language and Linguistics (NINJAL) [Access]

Abstracts are available [here].

Aims & Scope

We are pleased to announce that the first International Symposium on Diachronic Speech Corpora will be held at NINJAL on 4 September 2017. Many types of corpora have been developed since the late 1990s, and diachronic speech corpora are one of the next targets for corpus linguistics. By gathering recorded materials, from the earliest available sources to the latest, such corpora will serve as effective language resources for examining how speech behavior in a given language has changed.

In this symposium five researchers will present their studies on diachronic speech corpora, corpus design, annotations, and diachronic change in spoken languages, including English, Finnish, Italian, French and Japanese.

Program

9:30-10:00　Registration
10:00-10:15　Opening Remarks
10:15-11:15　Bas Aarts (University College London, UK)
　　　　　　　 "Exploring the grammar of spoken English using the Diachronic Corpus of
　　　　　　　 Present-Day Spoken English"
　　　　　　　 [abstract]
11:15-12:15　Marja-Liisa Helasvuo (University of Turku, Finland)
　　　　　　　 "Finnish spoken corpora: A diachronic perspective"
　　　　　　　 [abstract]
13:15-14:15　Takehiko Maruyama (Senshu University / NINJAL, Japan)
　　　　　　　 "What's left for diachronic research of Japanese Speech?"
　　　　　　　 [abstract]
14:15-15:15　Alessandro Panunzi (University of Florence, Italy)
　　　　　　　 "The LABLITA Corpus of spoken Italian in diachrony: Theoretical framework,
　　　　　　　 corpus design, and a lexical comparison"
　　　　　　　 [abstract]
15:15-15:30　Break
15:30-16:30　Marie Skrovec (University of Orleans, France)
　　　　　　　 "A diachronic spoken corpus for French: ESLO, a variationist survey"
　　　　　　　 [abstract]
16:30-17:00　Commentaries and discussion

Pre-registration

Pre-registration is not necessary to attend the symposium (maximum of 150 seats).

Organiser

Takehiko Maruyama (Senshu University / NINJAL, Japan)
maruyama <at> isc.senshu-u.ac.jp

Poster [download]


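International Symposium "Diachronic Speech Corpora"

Date and time: Monday, 4 September 2017, 10:00-17:00
Venue: National Institute for Japanese Language and Linguistics (NINJAL), 2nd-floor Auditorium [Access]

Abstracts of the talks are available here.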

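Aims of the Symposium

Since the 1990s, a wide variety of corpora have been built around the world. As corpora diversify into written-language corpora, spoken-language corpora, learner corpora, parallel corpora, and others, the "diachronic speech corpus" is regarded as one of the next targets. By collecting old sound recordings, compiling them into a corpus, and comparing and contrasting them with recent speech materials, it should be possible to show empirically how spoken language has changed over time (in accent, intonation, vocabulary, grammar, and so on).

In this symposium, we welcome guests from the United Kingdom, Finland, Italy, and France, and will present, with demonstrations, how diachronic speech corpora are being developed and analyzed in five countries, including Japan.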

Program

Note: The symposium will be held in English.

Registration

No application or pre-registration is required. Please come directly to the venue. (Capacity: 150)

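Organiser and Contact

Takehiko Maruyama (Senshu University / NINJAL)
maruyama <at> isc.senshu-u.ac.jp

Note: This symposium is jointly organized by the NINJAL Spoken Language Division collaborative research project "A Multifaceted Study of Spoken Language Based on a Large-Scale Corpus of Everyday Conversation" and JSPS KAKENHI Grant Number 16H03426, "An Empirical Study of Diachronic Change in Spoken Language through the Construction of the 'Showa Speech Corpus'" (Grant-in-Aid for Scientific Research (B), Principal Investigator: Takehiko Maruyama).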

Abstracts

Bas Aarts (UCL)

Exploring the grammar of spoken English using the Diachronic Corpus of Present-Day Spoken English

In the first part of my talk, I will present the corpus exploration software ICECUP (International Corpus of English Corpus Utility Program), which we developed at the Survey of English Usage (SEU) at UCL. This software can be used to explore the two corpora that we compiled, namely the British Component of the International Corpus of English (ICE-GB) and the Diachronic Corpus of Present-Day Spoken English (DCPSE). Both are fully tagged and parsed corpora of British English. I will demonstrate the functionality of the software and, specifically, show how the innovative Fuzzy Tree Fragment facility allows users to search for grammatical patterns in the corpora.

In the second part of my talk, I will discuss some of the SEU's recent linguistic research on changes in the grammar of Present-Day English using DCPSE, with special attention to the use of the progressive construction and the core modal verbs.

Ferdinand de Saussure famously said:

"The contrast between the two points of view, synchronic and diachronic, is absolute and allows no compromise." (Cours de Linguistique Générale)

In my talk, I will argue that the research we carried out at the SEU demonstrates that this view is contestable.

Links:
Survey of English Usage / ICECUP / ICE-GB / DCPSE

Marja-Liisa Helasvuo (University of Turku)

Finnish spoken corpora: A diachronic perspective

In Finnish studies, there is a long tradition of research on the spoken varieties of Finnish. The orientation was first dialectological: the research focused on areal characteristics and differences between dialects. The earliest studies used direct observation of spontaneous speech as their data: the examples were written down as soon as they were heard.

However, there are also collections of spoken narratives from the late 19th century that have been published and used for research. These could be considered the first corpora of the spoken language. As recording equipment developed, more sophisticated data collection methods emerged. In 1967, the first electronic corpus of spoken Finnish was started (project leader Prof. Osmo Ikola, University of Turku).

In my presentation, I will give an overview of Finnish spoken corpora and discuss the possibilities they offer and their limitations.

Takehiko Maruyama (Senshu University / NINJAL)

What's left for diachronic research of Japanese Speech?

In this talk I will investigate how a diachronic speech corpus of Japanese can be realized and how it should be analyzed.

A diachronic speech corpus must be a collection of recorded speech across multiple time periods. It should be carefully designed and systematically organized for analyzing diachronic changes in speech. The recorded data must be digitized to enable playback and listening with the best possible sound quality. Rich annotation is also needed, such as transcriptions, POS tagging, parsing, and various metadata such as speaker information, recording date, speaking situation, and so on. The problem is that old recordings are far scarcer and more limited than written texts.
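
As a rough illustration of the annotation layers and metadata listed above, the following minimal Python sketch shows one way a single recording could be represented and then grouped by decade for diachronic comparison. All class, field, and function names, as well as the sample values, are hypothetical placeholders chosen for illustration; they do not describe the actual design of any NINJAL corpus.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One transcribed word with its part-of-speech tag (hypothetical layout)."""
    surface: str
    pos: str


@dataclass
class Recording:
    """Metadata and annotation for one recorded speech sample (hypothetical layout)."""
    recording_id: str
    year: int        # recording date, reduced to a year for simplicity
    speaker: str     # speaker information
    situation: str   # speaking situation, e.g. "lecture"
    tokens: List[Token] = field(default_factory=list)


def group_by_decade(recordings: List[Recording]) -> Dict[int, List[Recording]]:
    """Group recordings by the decade in which they were made."""
    by_decade: Dict[int, List[Recording]] = defaultdict(list)
    for rec in recordings:
        decade = (rec.year // 10) * 10
        by_decade[decade].append(rec)
    return dict(by_decade)


# Placeholder entries only; these are not real corpus data.
sample = [
    Recording("r1", 1935, "speaker_a", "political speech"),
    Recording("r2", 1958, "speaker_b", "daily conversation"),
    Recording("r3", 1952, "speaker_c", "lecture"),
]
groups = group_by_decade(sample)
```

Grouping by decade is only one possible choice; a real corpus design might instead compare fixed sampling periods or match recordings by speaking situation.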

In this talk I will illustrate what kinds of recorded materials are available for us to compile into a diachronic speech corpus of Japanese. These include political speeches recorded from the 1910s to the 1940s, NINJAL's pioneering recordings of Japanese daily conversations and lectures from the 1950s to the 1960s, and contemporary large corpora of spoken Japanese built at NINJAL after 2000.

In addition, I will present some pilot studies analyzing these spoken data from the point of view of diachronic change, such as changes in intonation patterns and grammatical forms over the last 80 years.

Alessandro Panunzi (University of Florence)

The LABLITA Corpus of spoken Italian in diachrony: Theoretical framework, corpus design, and a lexical comparison

The LABLITA Linguistic Laboratory of the University of Florence has collected a large corpus of spontaneous spoken Italian, transcribed and analyzed on the basis of Language into Act Theory. This theory assumes that spoken language is governed by pragmatic principles, whose main features (illocutionary values and information structure) are conveyed by prosody.

The talk focuses on the description of two sub-corpora of the LABLITA collection, namely the Stammerjohann corpus (recorded in 1965 in Florence) and a comparable corpus mainly derived from a sampling of the C-ORAL-ROM Italian corpus (texts collected in the Florence area in the years 1990-2002). The two resources share a common design, specifically adopted to ensure maximum comparability. The lexical comparison shows that the regional lexicon in the spontaneous speech of the Florentine area decreased by roughly 20%, but also that a high-frequency Tuscan lexical core remains in active use today.

Marie Skrovec (University of Orleans)

A diachronic spoken corpus for French: ESLO, a variationist survey

At the LLL (Orléans, France), researchers are building a reference corpus of spoken French, the ESLO corpus (Enquête Sociolinguistique à Orléans: Socio-Linguistic Survey in Orléans), which takes into account sociolinguistic variation with a micro-diachronic span, since ESLO contains two sets of data (ESLO 1 and ESLO 2).

The first survey (ESLO 1) was undertaken from 1968 to 1971 by British scholars. Their aim was to record spontaneous interactions for teaching French as a foreign language at secondary school level. The data gathered constitute an important spoken corpus of about 300 hours of speech (4,500,000 words), with interviews and other recordings. A new survey, ESLO 2, has been undertaken by the LLL since 2008 in order to build, forty years on, a corpus that is comparable in terms of data gathering and archiving. The objective was set at 400 hours of speech data, that is, about 6,000,000 words. Taken together, ESLO 1 and 2 now form a collection of 700 hours of recordings and about 8 million words, which is today considered a reference value for the processing and investigations planned.

In this presentation, I will first give an overview of the origin of the project in the 1960s and the current corpus design, and then address some diachronic studies investigating linguistic variation and change over the last 40 years at different linguistic levels, such as phonology, morphosyntax, and discourse. Particular attention will be given to the special case of the future tense in modern French.