Gen-Nichi-Ken
Corpus of Workplace Conversation

What is the Gen-Nichi-Ken Corpus of Workplace Conversation?

The Gen-Nichi-Ken Corpus of Workplace Conversation was published on August 20, 2018 via the online corpus search application Chunagon. The corpus is based on the transcripts obtained through the two research projects described below, namely "Josei no Kotoba: Shokuba Hen (Language of Women at Work)" and "Dansei no Kotoba: Shokuba Hen (Language of Men at Work)."
These projects were pioneering attempts made in the 1990s, in which research cooperators recorded their own workplace conversations in order to collect natural speech; the resulting data have been highly praised as groundbreaking. A total of 22 research articles utilizing these corpora are published in "Gappon Josei no Kotoba/Dansei no Kotoba: Shokuba Hen (Language of Women at Work and Language of Men at Work in One Volume)" (edited by Gendai Nihongo Kenkyukai and published by Hituzi Syobo in 2011). These research projects have revealed that gender-specific expressions, often referred to as "men's language" and "women's language," and words used more frequently by men or by women, have become less and less apparent. At the same time, the realities of spoken Japanese, which differ from those of written Japanese, have been clarified in many different ways. For the Corpus of Everyday Japanese Conversation, which is now being developed, everyday conversations have likewise been recorded by the research cooperators themselves, with reference to how earlier studies such as these were carried out. Gendai Nihongo Kenkyukai has also continued to collect and study everyday conversations, the results of which have been published in "Danwa Shiryo: Nichijo Seikatsu no Kotoba (Transcripts and Analysis: Japanese Daily Interaction)" (edited by Gendai Nihongo Kenkyukai and published by Hituzi Syobo in 2016).

〇 "Josei no Kotoba: Shokuba Hen (Language of Women at Work)"

Gendai Nihongo Kenkyukai carried out a research project in September and October 1993, in which 19 working women (in their 20s to 50s) in the Tokyo metropolitan area participated as research cooperators. The participants recorded their natural conversations in their respective workplaces. These conversations were recorded with recorders hung around the necks of cooperators or placed near them. These 19 cooperators were all working in different workplaces. Each person recorded one hour of conversation in the morning after arrival in their workplace, one hour of meetings, and one hour of break time, from each of which about 10 minutes of consecutive conversation was extracted and transcribed. "Josei no Kotoba: Shokuba Hen (Language of Women at Work)" (edited by Gendai Nihongo Kenkyukai), published by Hituzi Syobo, includes a CD-ROM containing the transcripts and 10 research articles based on them. (This book is now out of print. Consult the combined edition described below.)

〇 "Dansei no Kotoba: Shokuba Hen (Language of Men at Work)"

Gendai Nihongo Kenkyukai carried out a research project from October 1999 through December 2000, in which 21 working men (in their 20s to 50s) in the Tokyo metropolitan area participated as research cooperators. The participants recorded their natural conversations in their respective workplaces. These conversations were recorded with recorders hung around the necks of cooperators or placed near them. These 21 cooperators were all working in different workplaces. Each person recorded one hour of conversation in the morning after arrival in their workplace, one hour of meetings, and one hour of break time, from each of which about 10 minutes of consecutive conversation was extracted and transcribed. "Dansei no Kotoba: Shokuba Hen (Language of Men at Work)" (edited by Gendai Nihongo Kenkyukai), published by Hituzi Syobo, includes a CD-ROM containing the transcripts and 12 research articles based on them. (This book is now out of print. Consult the combined edition described below.) This research project received financial assistance from the Faculty of Language and Literature, Bunkyo University, in the form of joint research funding, from FY1999 through FY2001.

The above two books, including the CD-ROM data, were subsequently combined into "Gappon Josei no Kotoba/Dansei no Kotoba: Shokuba Hen (Language of Women at Work and Language of Men at Work in One Volume)" (edited by Gendai Nihongo Kenkyukai), which was published by Hituzi Syobo in 2011. This combined edition is still in print.

These transcripts were provided to the National Institute for Japanese Language and Linguistics for this release through the understanding and courtesy of Gendai Nihongo Kenkyukai and Isao Matsumoto of Hituzi Syobo.

◆ Regarding the name of the corpus ◆

The transcripts offered to the National Institute for Japanese Language and Linguistics have been analyzed using MeCab and UniDic, and the results have been published under the new name "Gen-Nichi-Ken Corpus of Workplace Conversation."

Publication of the corpus

In this project, the transcripts of the Gen-Nichi-Ken Corpus of Workplace Conversation, accompanied by morphological information (short-unit information), are published via the online corpus search application Chunagon.

The Gen-Nichi-Ken Corpus of Workplace Conversation is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).


Reference literature

When you use the Gen-Nichi-Ken Corpus of Workplace Conversation in publishing research results or for other publication purposes, you must cite the following literature:

"Gappon Josei no Kotoba/Dansei no Kotoba: Shokuba Hen (Language of Women at Work and Language of Men at Work in One Volume)" (edited by Gendai Nihongo Kenkyukai)


Outline of the data organization process for publication in the search system

  • The morphological information was assigned automatically using the morphological analyzer MeCab (ver. 0.98) and the morphological analysis dictionary UniDic. Some analysis results were also corrected manually.
  • The original data are provided as two separate files, one for women's language and one for men's language. In this corpus, those files are divided into separate files named according to the file-naming rules shown below.

    Example: M01A011
    [Figure: syokuba-file.png, illustrating the file-name structure]

    Number  Content          Possible values  Notes
    (1)     Male/Female      M, F             M: Data sourced from "Dansei no Kotoba: Shokuba Hen (Language of Men at Work)"; F: Data sourced from "Josei no Kotoba: Shokuba Hen (Language of Women at Work)"
    (2)     Cooperator code  01, 02, ...      The same identification codes as those of the research cooperators in the original data
    (3)     Scene 1          A, K, Q          "Asa (Morning)," "Kaigi (Meeting)," and "Kyukei (Break)" in the original data
    (4)     Scene 2          01, 02, ...      New serial numbers
  • Of the meta-information items given in the original data, the following items have been extracted and provided.
    Scene 1, Scene 2, Date of research, Location, Number of conversation participants, Cooperator code, Speaker code, Gender, Age group, Occupation, Job category, Title, Home prefecture, Place of longest residence (No other meta-information is provided.)

    Scene 1: "Morning," "Meeting," or "Break."
    Scene 2: Subcategories of <Scene 1>.
    Cooperator code: Identification codes of research cooperators (who recorded the conversations).
    Speaker code: Identification codes of speakers.
    Gender: Genders of speakers. "Male," "Female," or "*." "*" means "unknown" or "no information." (The same applies to other items as well.)
    Age group: Age groups the speakers were in at the time of the research, in 10-year increments.
    Occupation: Occupations of speakers. May be indistinguishable from "job category" in some cases.
    Job category: Job categories of speakers. May be indistinguishable from "occupation" in some cases.
    Title: Job titles of speakers. "(None)" is entered when the face sheet says that the speaker does not have a job title.
    Home prefecture: Prefectures from which the speakers come.
    Place of longest residence: Prefectures in which the speakers lived longest between the ages of 4 and 15 years (which are not necessarily the prefectures in which they spent their formative years for language learning).
  • Elements that are withheld, for example [Surname] in "[Surname]-san," are categorized into a word class named "withheld information."
  • Non-linguistic information included in the source materials, for example "<laugh>," is excluded from searches.
  • Inserted elements, such as back-channel sounds, are displayed separately from the statements that contain them, rather than at the positions where they were originally spoken.
  • In the original data, both one-byte and two-byte characters are used in speaker codes, like "01A" in "Dansei no Kotoba: Shokuba Hen (Language of Men at Work)" and "01A" in "Josei no Kotoba: Shokuba Hen (Language of Women at Work)." In this corpus, however, only one-byte characters are used in speaker codes. In addition, M (indicating that the data is sourced from "Dansei no Kotoba: Shokuba Hen") or F (indicating that the data is sourced from "Josei no Kotoba: Shokuba Hen") is added to the beginning of each speaker code, like "M01A" and "F01A," to distinguish data from one source from that of another. Note that "M" and "F" do not represent the genders of speakers; instead, they show which source the original data comes from. As exceptions, two-byte question marks included in speaker codes in the original data remain two-byte.
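
The speaker-code and file-naming rules above can be illustrated with a short Python sketch. It is not the procedure actually used to build the corpus; the function names, the dictionary keys, and the regular expression are hypothetical. Two assumptions are made: the cooperator code is always two digits (as in "01, 02, ..."), and all digits after the scene letter are kept as a single string, because the example "M01A011" has three digits there and the table lists two-digit Scene 2 values without saying how the digits divide.

```python
import re
import unicodedata

# Two-byte (full-width) question mark, which the corpus leaves unchanged.
FULLWIDTH_QUESTION_MARK = "\uff1f"

# Assumed layout: source letter, two-digit cooperator code, scene letter,
# then one or more digits (kept together; their split is not specified).
FILE_NAME_PATTERN = re.compile(r"([MF])([0-9]{2})([AKQ])([0-9]+)")

SCENE_1 = {"A": "Asa (Morning)", "K": "Kaigi (Meeting)", "Q": "Kyukei (Break)"}


def normalize_speaker_code(raw_code, source):
    """Convert an original speaker code to the corpus style, e.g. "M01A".

    source is "M" or "F" and marks the source book, not the speaker's gender.
    """
    if source not in ("M", "F"):
        raise ValueError("source must be 'M' or 'F'")
    converted = []
    for ch in raw_code:
        if ch == FULLWIDTH_QUESTION_MARK:
            converted.append(ch)  # exception: stays two-byte
        else:
            # NFKC maps full-width letters and digits to one-byte forms.
            converted.append(unicodedata.normalize("NFKC", ch))
    return source + "".join(converted)


def split_file_name(name):
    """Split a corpus file name such as "M01A011" into its labeled parts."""
    match = FILE_NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError("unexpected file name: " + name)
    source, cooperator, scene, digits = match.groups()
    return {
        "source": source,
        "cooperator_code": cooperator,
        "scene_1": SCENE_1[scene],
        "scene_2_digits": digits,
    }


if __name__ == "__main__":
    print(normalize_speaker_code("\uff10\uff11\uff21", "M"))  # full-width input
    print(split_file_name("M01A011"))
```

Converting character by character keeps the full-width question mark out of the NFKC step, which would otherwise turn it into a one-byte "?".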