About this project – NPCMJ – Ninjal Parsed Corpus of Modern Japanese

In recent years, syntactically annotated corpora (i.e., treebanks) have been created for the major languages of the world, stimulating remarkable progress in linguistics and language processing. To make it possible to search for and extract a wide variety of Japanese function words, phrase structures, clause types and grammatical patterns from a large amount of language data, we have started building the NINJAL Parsed Corpus of Modern Japanese (NPCMJ), a syntactically and semantically annotated corpus of both written and spoken Modern Japanese. (The NPCMJ is an extension of parts of the Keyaki Treebank. The Keyaki Treebank can be accessed here.)

Prioritizing general versatility, we adopt the annotation policy of the Penn Historical Corpus (Santorini 2010), which is a part of the Penn Treebank family.

The current release of the corpus is publicly available from the web page of the National Institute for Japanese Language and Linguistics, along with an interface that anyone can use without technical skills. We plan to update the corpus periodically.

We would appreciate it if you could let us know about any research or project that uses our corpus. Please notify us at the contact e-mail address.

Project Leader

Prashant Pardeshi (Professor, Theory and Typology Division, National Institute for Japanese Language and Linguistics)