【Agency for Cultural Affairs commissioned project】 BCCWJ2: Balanced Corpus of Contemporary Written Japanese

Design policy

We maintain continuity with BCCWJ1 while simplifying certain aspects to ensure smoother construction.
BCCWJ2 focuses on published books from 2006–2025 and will be built and released step by step (on an annual basis).
To improve construction efficiency, we prioritize metadata that is essential for search and browsing (e.g., items directly displayed in the search interface).

In BCCWJ1, samples were drawn by extracting a random page and then fixing a sampling reference point (a single character) on that page.
In BCCWJ2, the population is defined as a set of books. We stratify by NDC (Nippon Decimal Classification) within each publication year and sample at the book level.
Because we cannot assume that bibliographic information for 2006–2025 will be fully available at the start of the project, we conduct stratified sampling year by year and set an annual acquisition target (5 million words per year).
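
The year-by-year, book-level stratified sampling described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual procedure: the record fields (`id`, `ndc`, `words`), the function name `sample_year`, the use of the top-level NDC class as the stratum, and the allocation of each stratum's quota in proportion to its share of words are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def sample_year(books, target_words, seed=0):
    """Pick whole books for one publication year, stratified by NDC class.

    `books` is a list of dicts with hypothetical keys:
    'id' (str), 'ndc' (str such as "913.6"), 'words' (int).
    """
    rng = random.Random(seed)

    # Group the year's books by top-level NDC class (first digit, 0-9).
    strata = defaultdict(list)
    for book in books:
        strata[book["ndc"][0]].append(book)

    total_words = sum(b["words"] for b in books)
    selected = []
    for ndc_class in sorted(strata):
        members = strata[ndc_class]
        # Assumption: each stratum's word quota is proportional to its share
        # of all words published that year.
        quota = target_words * sum(b["words"] for b in members) / total_words
        rng.shuffle(members)
        acquired = 0
        for book in members:
            if acquired >= quota:
                break
            selected.append(book)  # sampling unit is the whole book
            acquired += book["words"]
    return selected
```

Calling `sample_year(catalogue_2006, 5_000_000)` once per publication year would mirror the annual target mentioned above; because whole books are taken, each stratum may slightly overshoot its quota.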

We perform morphological analysis using the latest UniDic.
As in BCCWJ1, we exclude materials that are not primarily textual (e.g., manga, photo books, maps).
To keep the design consistent with a no-copyright-clearance policy, we do not include short works such as haiku, tanka, and poems.
We will also apply the outcomes of BCCWJ2 development back to BCCWJ1, in preparation for integrated use of the two corpora.

The 2018 revision of the Japanese Copyright Act introduced flexible limitations and exceptions that cover corpus construction and search services.
However, when content is made available without copyright clearance, the context length displayed in online search services must remain within the scope of “minor use.”
Since BCCWJ2 is designed without copyright clearance, publication and display formats (e.g., context length in search results) are constrained and must be designed in line with actual usage and demand.
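
To make the constraint on displayed context concrete, the following is a minimal sketch of a keyword-in-context (KWIC) lookup that caps how many characters are shown on each side of a hit. The function name `kwic`, the parameter `max_context`, and its default of 20 characters are assumptions for illustration only; the value is not a legal threshold for "minor use" and is not taken from the project.

```python
def kwic(text, keyword, max_context=20):
    """Return (left, keyword, right) tuples with bounded context.

    `max_context` is a hypothetical display limit in characters,
    applied independently to the left and right context.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # Slice at most `max_context` characters on each side of the hit.
        left = text[max(0, start - max_context):start]
        right = text[end:end + max_context]
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)
    return hits
```

Keeping the limit as a single parameter means the display format can be tightened or relaxed later without changing the search logic.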

Books: The core of BCCWJ2. We build it based on the design policy for published books, using NDC-stratified sampling.
Textbooks: A planned subcorpus consisting of full texts of authorized textbooks (one per subject per grade) across elementary, junior high, and high school, spanning multiple target years (e.g., 2005, 2014, 2025).
SNS: Positioned as a register that complements contemporary written Japanese. We organize key issues for collection and corpus construction (e.g., post types, how non-standard orthography affects processing, and which metadata should be retained) while developing the subcorpus.