FJU Test Collection for Evaluation of
Chinese Text Categorization

By Yuen-Hsien Tseng, Feb. 1, 2004
http://www.lins.fju.edu.tw/~tseng/Collections/Chinese_TC.html

Introduction

Text categorization (or document classification) is the process of assigning labels to documents according to their contents or topics. The labels (classes or categories) usually come from one or more predefined sets that reflect the knowledge structures used to organize the documents.

Traditionally, text categorization has been carried out by human experts, as it requires a certain level of vocabulary recognition and knowledge processing. As the number of full-text documents increases rapidly in this digital age, automatic ways of assigning labels to documents to assist human experts have become, in some cases, unavoidable. Examples are news dispatching and spam email filtering. Both require labeling bulk messages in a short period of time according to the message contents, a task that is not easily done by hand.

Automated text categorization also helps human categorizers in traditional tasks. By suggesting possible classes for each unlabelled document, a machine classifier relieves the burdens of reading full-text documents and memorizing every class definition in a knowledge structure, both of which a categorizer needs for the classification task. For novices in such tasks, having a machine classifier is like having an experienced colleague as a guide. Thus the training cost can be reduced and the time needed to become familiar with the task can be shortened, allowing more people to take part in this knowledge-organization and value-adding work.

But how well can a machine classifier perform? Evaluating the effectiveness of automated text classification is both a research interest and a real need, in order to justify its advantages.

Several test collections already exist for evaluating automatic text categorization technologies, for example Reuters 21578, OHSUMED, and 20NG. All of these test collections are in English. Chinese test collections are needed for direct evaluation of Chinese processing techniques in automatic text categorization.

Sources of the Test Collection

The test collection provided here originates from a multi-year digitization project of the SCRC (Socio-Cultural Research Center) at Fu Jen Catholic University. The digitization project is mostly sponsored by the National Science Council, Republic of China.

The test collection is drawn from the news broadcasts of Mainland China's radio stations between 1966 and 1982. These broadcasts were transcribed and labeled by hand on manuscript paper, either on-site or by first recording the broadcasts and transcribing them afterwards.

SCRC, which was formerly located in Hong Kong and published a well-known periodical, the CHINA NEWS ANALYSIS, used this material to reveal what happened in Mainland China during the Cultural Revolution.

In 2000-2001, under the digitization project, SCRC had 42371 manuscripts keyed in manually to preserve this material and make better use of it. Among them, 30710 manuscripts have category labels and dates.

Only some of the manuscripts are included in this test collection, according to the following guidelines:

Contents of the Test Collection

Download

To get this collection, click here; you will be prompted with a form asking for the following information:

Once this request is received, a username and password for downloading the test collection will be sent in a reply message.

Request List

Those who have requested this test collection will be listed here for two reasons:

Here are the requests:

Acknowledgement