Introducing the NPCMJ
For the world's major languages, corpora annotated with syntactic information (treebanks) are well established, and research drawing on them has produced significant results in linguistics and language processing. For Japanese, the National Institute for Japanese Language and Linguistics (NINJAL) launched the Collaborative Research Project “Development of and Linguistic Research with a Parsed Corpus of Japanese” in 2016, which built the NPCMJ (NINJAL Parsed Corpus of Modern Japanese). The project aimed to annotate written and spoken Contemporary Japanese texts with syntactic and semantic information, so that a rich inventory of function words, phrase structures, clause types, and complex constructions can be searched for and extracted from the data, and the results put to active use in research. The project ended in March 2022, and its results are publicly available as approximately 90,000 sentences (90,000 trees).
Source | Number of Trees | Word Count |
Aozora Bunko | 12,810 | 246,568 |
Bible | 1,664 | 26,089 |
Blog | 219 | 3,218 |
Book | 553 | 10,992 |
Dictionary | 26,279 | 141,297 |
Diet | 1,698 | 32,715 |
Essay | 3,264 | 70,167 |
Fiction | 7,597 | 84,169 |
Law | 337 | 6,943 |
News | 5,979 | 90,570 |
Nonfiction | 234 | 4,118 |
Patent | 261 | 8,636 |
Spoken | 2,382 | 12,720 |
TED Talk | 1,453 | 21,420 |
Textbook | 6,950 | 63,952 |
Whitepaper | 13,433 | 398,347 |
Wikipedia | 2,745 | 59,833 |
Misc. | 2,211 | 22,754 |
Total | 90,069 | 1,304,508 |
Online Tool for using the NPCMJ
The Kainoki Treebank Homepage
This site has been maintained continuously under the name “Kainoki Treebank” since the research project was completed. Its powerful search interface lets you query almost every aspect of the annotations. We hope you will make use of it.
The Kainoki Treebank Homepage (External Link)
Annotation Manual
NPCMJ Annotation Manual
Download NPCMJ Annotation Manual
Full Download
Bracketed tree file format
This is a zip archive containing all the sample files of the NPCMJ in bracketed tree format.
Download bracketed kana tree files
Download bracketed romaji tree files
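For readers who want to work with the downloaded files programmatically, the following is a minimal sketch of reading bracketed trees in Python. It assumes the files follow the Penn Treebank-style convention of nested `(LABEL child ...)` brackets with `(TAG word)` pairs at the leaves; the exact layout of the NPCMJ files should be checked against the Annotation Manual. The sample tree, its labels, and the function names `parse_trees` and `leaves` are illustrative inventions, not taken from the corpus or from any NPCMJ tool.

```python
from typing import List, Union

# A tree is either a string (label or word) or a list of subtrees.
Tree = Union[str, List["Tree"]]


def tokenize(text: str) -> List[str]:
    # Pad parentheses with spaces so split() separates them from labels and words.
    # Assumption: words themselves never contain literal parentheses.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_trees(text: str) -> List[Tree]:
    """Parse every top-level bracketed tree in `text` into nested lists."""
    stack: List[List[Tree]] = [[]]  # stack[0] collects finished top-level trees
    for tok in tokenize(text):
        if tok == "(":
            stack.append([])            # open a new node
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()          # close the node and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)       # a label or a word
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def leaves(tree: Tree) -> List[str]:
    """Collect terminal words, i.e. the second item of each (TAG word) pair."""
    if isinstance(tree, str):
        return []
    if len(tree) == 2 and isinstance(tree[0], str) and isinstance(tree[1], str):
        return [tree[1]]
    words: List[str] = []
    for child in tree:
        words.extend(leaves(child))
    return words


# Hypothetical romaji example; the labels are for illustration only.
SAMPLE = "(IP-MAT (PP (NP (N inu)) (P ga)) (VB hasiru) (PU .))"
trees = parse_trees(SAMPLE)
print(len(trees), leaves(trees[0]))
```

If the files wrap each tree in an extra pair of brackets or attach metadata nodes such as sentence identifiers, `leaves` as written would return those values as words too, so they would need to be filtered out first.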
Exercises for Syntax Textbook
Exercises for Analyzing Japanese Syntax: A Generative Perspective
Analyzing Japanese Syntax: A Generative Perspective is an introductory syntax textbook that explains the basic ideas of generative grammar and uses concrete examples to show how Japanese syntax can be analyzed. The exercises in this textbook were developed in cooperation with the NPCMJ project. You can download the exercises for the beginner and intermediate levels. The advanced-level exercises are no longer available because the project has ended.
Download Exercises