Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing

Dates

9 (Sat) – 10 (Sun) December 2017

Venue

National Institute for Japanese Language and Linguistics (NINJAL), Tokyo

Aims

Over recent decades, corpora with comprehensive syntactic annotation (i.e., treebanks) have been created in various formats for the major languages of the world (e.g., Sampson 1995, Bies et al. 1995, Chen et al. 1999, TIGER Project 2003). As certain modes of annotation have become more linguistically sophisticated, they have also become more relevant for linguistics in general, providing sources of insight into factors that only become visible through analysis generalized over structures: phenomena in co-occurrence, frequency, constituency, embeddability, scope, agreement, dependency, and so on. These insights are spurring new research and refinements in both corpus techniques and theoretical understanding.

While much research has concentrated on challenges inherent in the creation and correction of annotated corpora (e.g., Hovy and Lavid 2010, Kulick et al. 2013), the availability of digitized data on a large scale and of parsed corpora as resources has opened up new challenges in making use of corpus-building technologies and the resulting data in subsequent research. Examples include linking corpora to external resources such as lexical databases, abstracting the contents sufficiently to be of use to non-experts, and exploring cross-linguistic patterns.

Given the fast pace of development in the field, a survey of recent work applying corpora to problems both academic and practical marks the state of the art and suggests directions for the future. We intend to explore the potential benefits of parsed annotation for descriptive and theoretical linguistics, as well as for other application domains, such as automated systems of data extraction, the development of resources for educational purposes, and the evaluation of machine translation and dialogue systems.

We will hold an international symposium entitled “Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing” at the National Institute for Japanese Language and Linguistics (NINJAL) on 9–10 December 2017.

References

Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre (1995). Bracketing guidelines for Treebank II style Penn Treebank project. Tech. Rep. MS-CIS-95-06, LINC LAB 281, University of Pennsylvania, Computer and Information Science Department.
Chen, Keh-Jiann, Chi-Ching Luo, Zhao-Ming Gao, Ming-Chung Chang, Feng-Yi Chen, Chao-Jan Chen, and Chu-Ren Huang (1999). “The CKIP Chinese Treebank: Guidelines for Annotation”, ATALA Workshop IV Treebanks, Paris, June 18-19, 1999: pp. 85-96.
Hovy, Eduard and Julia Lavid (2010). “Towards a ‘Science’ of Corpus Annotation: A New Methodological Challenge for Corpus Linguistics.” International Journal of Translation, Vol. 22, No. 1, Jan-Jun 2010.
Kulick, Seth, Ann Bies, Justin Mott, Mohamed Maamouri, Beatrice Santorini, and Anthony Kroch (2013). “Using Derivation Trees for Informative Treebank Inter-Annotator Agreement Evaluation”, NAACL 2013: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, June 9-15.
Sampson, Geoffrey (1995). English for the Computer: The SUSANNE Corpus and analytic scheme. Clarendon Press, Oxford.
TIGER Project (2003). TIGER Annotationsschema. Manuscript. Universität des Saarlandes, Universität Stuttgart, and Universität Potsdam. July 2003.

Important Dates

1 November 2017: Abstract submission deadline for poster presentation [closed]
30 November 2017: Deadline for preregistration [extended]
9-10 December 2017: Symposium

Invited Speakers

Liesbeth Augustinus (University of Leuven)
Anthony Kroch (University of Pennsylvania)
Susan Pintzuk (University of York)
Beatrice Santorini (University of Pennsylvania)
Sean Wallis (University College London)
Nianwen Xue (Brandeis University)

Organizing Committee

Prashant Pardeshi (NINJAL)
Alastair Butler (NINJAL)
Stephen Wright Horn (NINJAL)
Hideki Kishimoto (Kobe University)
Yusuke Kubota (University of Tsukuba)
Iku Nagasaki (NINJAL)
Kei Yoshimoto (Tohoku University)

Hosted by

Collaborative Research Project: Development of and Linguistic Research with a Parsed Corpus of Japanese

Funding Bodies

National Institute for Japanese Language and Linguistics (NINJAL)
