Content Analysis Toolkit for Academic Research (CATAR)

Yuen-Hsien Tseng, 2022/07/18 (last version: 2020/03/12)

1. Introduction

CATAR is a software toolkit for users who would like to analyze a set of documents (semi-structured or unstructured free-text data), especially publication records from Web of Science (WoS), full-text technical patent documents (in HTML) from USPTO, or any other texts worth analyzing (Web pages, interview records, etc.), for the purpose of strategic reading, planning, or research.

The content analysis that CATAR provides can be used to:

summarize the background of a research field (or industrial domain);
obtain an overview of the research topics and technical development;
get a breakdown analysis of various actors (authors, institutes, countries);
track trends of the knowledge development;
suggest hypotheses for further exploration;
offer evidence-based, data-driven, bottom-up information for panel discussion, strategic planning, or decision making.

For a more concrete idea of what CATAR can do, please see the following examples:

Yuan Chih Fu, Marcelo Marques, Yuen-Hsien Tseng, Justin J.W. Powell, David P. Baker, An Evolving International Research Collaboration Network: Spatial and Thematic Developments in Co-Authored Higher Education Research, 1998–2018, Scientometrics, 2022. (SSCI, Scopus)
Yulan Yuan, Yuen-Hsien Tseng, Chaang-Iuan Ho, (2019) "Tourism information technology research trends: 1990-2016", Tourism Review, Vol. 74 Issue: 1, pp.5-19, https://doi.org/10.1108/TR-08-2017-0128. (SSCI, Scopus)
Yulan Yuan, Ulrike Gretzel, and Yuen-Hsien Tseng, "Revealing the Nature of Contemporary Tourism Research: Extracting Common Subject Areas through Bibliographic Coupling", International Journal of Tourism Research, Vol. 17, No. 5, pp. 417-431, Sep./Oct. 2015, DOI: 10.1002/jtr.2004. (SSCI)
Yuen-Hsien Tseng, Chun-Yen Chang, M. Shane Tutwiler, Ming-Chao Lin, and James Barufaldi, "A Scientometric Analysis of the Effectiveness of Taiwan's Educational Research Projects", Scientometrics, Vol. 95, No. 3, pp. 1141-1166, June 2013. (SCI, SSCI)
Yuen-Hsien Tseng and Ming-Yueh Tsay, "Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. (SCI, SSCI)
曾元顯, 林瑜一, "內容探勘技術在教育評鑑研究發展趨勢分析之應用", 教育科學研究期刊, 56(1), 129-166.
Yueh-Hsia Chang, Chun-Yen Chang, and Yuen-Hsien Tseng, "Trends of Science Education Research: An Automatic Content Analysis", Journal of Science Education and Technology, Vol. 19, No. 4, 2010, pp. 315-331.

Please check the English Tutorial or Chinese tutorial for a quick overview.

2. CATAR Installation

CATAR was developed in Perl on MS Windows, with MS Excel for data tabulation and MS Access as its database, so you should have MS Windows and MS Office on your computer.
Note: Since 2017/10/14, the CATAR version for Windows 10 has used SQLite as its database. For USPTO documents, MS Access is still used because many queries for patent analyses are written as MS Access queries, so that once the USPTO patent documents are downloaded, around 20 overview graphs are ready for review.

In addition to MS Excel, OpenOffice can be used to open the XLS files produced by CATAR. SQLite Browser can be used to open the SQLite databases.

Below are the steps to prepare your computer to run CATAR:

(1) Download and Install Perl:

First, download the latest Perl interpreter for your computer, e.g., Strawberry Perl at http://strawberryperl.com/. Note: the 32-bit version is used to analyze USPTO patents.
Then install the downloaded Perl interpreter on your computer by following the installation steps and choosing the default settings.

(2) Download CATAR:

Download CATAR from: https://github.com/SamTseng/CATAR (for Windows 10).

Note: CATAR is free to use only for individuals in educational and non-profit organizations (or institutes).

(3) Decompress CATAR to a target folder:

Decompress the downloaded CATAR file to the folder C:\CATAR.
After decompression, you should find the following sub-folders under C:\CATAR\

src : the folder where the Perl source code of CATAR is located.
Source_Data : the folder to store your data for CATAR to analyze (see later examples).
Result : the folder to hold the resulting files produced by CATAR.
doc : the folder where CATAR stores intermediate data during analysis. Do not store your own data here (it could be deleted by CATAR).

(4) Install Perl Modules (Packages):

Make sure your computer is connected to the Internet.

If you get CATAR from https://github.com/SamTseng/CATAR, then double click C:\CATAR\src\install.bat to execute package installation.
If there is no error report, you have installed CATAR successfully. You can proceed to the next section: "3. Data Preparation"

If the above .bat file did not work, do this step manually with the following actions:

Open a command prompt (search for cmd.exe on your computer and run it), and then execute the following commands:

cpanm Encode::Detect::Detector
cpanm Statistics::Regression
cpanm Math::MatrixReal
cpanm Win32::ODBC

Install SAMtool: Decompress C:\CATAR\src\Perl_Module\SAMtool.rar to C:\Strawberry\perl\site\lib so that you see many Perl files under C:\Strawberry\perl\site\lib\SAMtool\
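
To confirm that the modules listed above are visible to Perl, a small check along the following lines may help. This is only a sketch in Python and is not part of CATAR. It assumes that perl is on the PATH and uses the common "perl -MModule -e 1" idiom, which exits with a non-zero status when a module cannot be loaded. The helper name missing_modules is hypothetical.

```python
import subprocess

# Modules named in the manual installation step above.
REQUIRED_MODULES = (
    "Encode::Detect::Detector",
    "Statistics::Regression",
    "Math::MatrixReal",
    "Win32::ODBC",
)

def missing_modules(modules=REQUIRED_MODULES, run=subprocess.run):
    """Return the modules that Perl fails to load (assumes `perl` is on PATH)."""
    missing = []
    for name in modules:
        # `perl -MName -e 1` loads the module and then runs an empty program.
        result = run(["perl", "-M" + name, "-e", "1"], capture_output=True)
        if result.returncode != 0:
            missing.append(name)
    return missing

# Usage (on a machine with Perl installed):
#   print(missing_modules() or "All listed modules can be loaded.")
```

An empty list means that every listed module could be loaded. SAMtool is not covered here because its module names are not given in this document.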

3. Data Preparation

CATAR can directly read the paper records downloaded from Web of Knowledge (WoK). (Note: WoS is a database on the WoK platform, so hereafter WoS and WoK are used interchangeably.) That is, CATAR knows how to read those downloaded files and can analyze them right away without any further file conversion. To see the record format that CATAR expects, please look at the example files under Source_Data\sam\data. The following figure shows the options for saving records from WoK. (Note: WoK changed its interface in 2014, so the screenshots below may not match the interface you are using now, but the download process remains similar.)
Please check WoS_Record_Download.ppt for instructions to download a set of records from WoS.

When downloading data from WoK, choose these options for correct content and format.

CATAR is also ready for the patent files downloaded from USPTO. Just supply a list of patent numbers (patent IDs), and CATAR will download them from USPTO (PatFT and AppFT only) and analyze them for you. For details, please refer to: https://github.com/SamTseng/CATAR.

For other types of documents you would like to analyze, please copy the file src\Paper_org.db to a new file of your own and insert your documents into the table TPaper in that new file. (You will need some familiarity with SQL commands.) CATAR starts its various analyses from this table in the new file. For an example, see Source_Data\movie\movie.db.
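
The copy-and-insert step could be scripted roughly as follows. This is a minimal sketch in Python, assuming that Paper_org.db is an SQLite file. The column names of TPaper are not listed in this document, so the sketch reads them from the database instead of guessing, and the function name insert_documents is hypothetical.

```python
import shutil
import sqlite3

def insert_documents(template_db, target_db, rows, table="TPaper"):
    """Copy the template database, then insert rows (dicts keyed by column name)."""
    shutil.copyfile(template_db, target_db)   # the template itself is never modified
    con = sqlite3.connect(target_db)
    try:
        # Read the real column names rather than assuming a schema.
        columns = [info[1] for info in con.execute(f"PRAGMA table_info({table})")]
        if not columns:
            raise ValueError(f"table {table} not found in {target_db}")
        for row in rows:
            unknown = set(row) - set(columns)
            if unknown:
                raise KeyError(f"columns not in {table}: {sorted(unknown)}")
            names = ", ".join(f'"{c}"' for c in row)
            marks = ", ".join("?" for _ in row)
            con.execute(f'INSERT INTO "{table}" ({names}) VALUES ({marks})',
                        list(row.values()))
        con.commit()
        return columns
    finally:
        con.close()
```

A call might look like insert_documents(r"C:\CATAR\src\Paper_org.db", r"C:\CATAR\Source_Data\MySet\MySet.db", rows), where MySet is a made-up record set name and each dict in rows uses only the column names that the function reports for TPaper.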

4. Analysis

Three kinds of analyses can be done with CATAR:

Overview analysis : this includes

general trends of the topics (or domains) represented by the documents;
most productive authors, institutes, or countries;
most cited references, authors, or journals;

Breakdown analysis based on bibliographic coupling (BC);
(This is possible only when the CR (Cited Reference) field is available.)
Breakdown analysis based on co-word (CW, co-occurrence words) analysis.

The last two (breakdown analyses) identify the sub-topics revealed in the document set and provide cross-tabulation analyses for each sub-topic and actor (such as most productive authors or institutes in a certain sub-topic).
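
To make the two breakdown approaches more concrete, the sketch below scores a pair of documents by the references they share (BC) and by the words they share (CW). It is an illustration in Python only: this document does not state the exact similarity formula used by CATAR, so the Dice coefficient here is an assumption, and the entries "REF X", "REF Y", etc. are made-up placeholders.

```python
def dice_similarity(items_a, items_b):
    """Dice coefficient of two collections: 2*|A & B| / (|A| + |B|)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Bibliographic coupling: compare the cited references (CR field) of two papers.
refs_a = {"POSNER GJ, 1982, SCI EDUC, V66, P211", "REF X", "REF Y", "REF Z"}
refs_b = {"POSNER GJ, 1982, SCI EDUC, V66, P211", "REF X", "REF P", "REF Q"}
bc = dice_similarity(refs_a, refs_b)    # 2 shared of 4 + 4 references -> 0.5

# Co-word analysis: compare the words used in titles and abstracts.
words_a = "metalearning and conceptual change".split()
words_b = "conceptual change in science classrooms".split()
cw = dice_similarity(words_a, words_b)  # 2 shared of 4 + 5 words -> 4/9
```

Document pairs with higher scores are more likely to end up in the same sub-topic cluster. Real co-word analysis would normally also filter stop words such as "and" or "in", which this sketch does not do.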

The simplest way to use CATAR is to download the WoK data, save them in an example folder, and then run the command batch file corresponding to that folder. See the following figures for an example. The left figure shows the example folder where the downloaded data were saved. The middle figure shows where the corresponding command batch file resides. Just double click the batch file and CATAR will run all three analyses from start to end. The right figure shows the resulting folders after all the analyses have been done.

To run an individual analysis (or to understand the CATAR commands in the example batch files), the following examples explain the commands for running CATAR in the MS-DOS command console. Some familiarity with DOS commands, folders (directories), and file paths is helpful.

Using the publication records in Source_Data\Sam\data as an example, the overview analysis can be run with the following command under DOS:

C:\CATAR\src>perl -s automc.pl -OOA Sam ..\Source_Data\Sam\data

The result is in Result\Sam\_Sam_by_field.xls. In addition, the content in the record files was converted and saved in Source_Data\Sam\Sam.db for use by breakdown analysis.

For the breakdown analysis based on bibliographic coupling (BC), run the command:

C:\CATAR\src>perl -s automc.pl -OBC Sam ..\Source_Data\Sam\Sam.db

CATAR will prompt you with some questions during the analysis. Accept the default answer if you do not know how to respond. The results will be in the sub-folders starting with Sam_BC under the Result\ folder.

To do the co-word analysis (CW), run the command:

C:\CATAR\src>perl -s automc.pl -OCW Sam ..\Source_Data\Sam\Sam.db

Again, you will be prompted with some questions during the analysis. Accept the default answer if you do not know how to respond. The results will be in the sub-folders starting with Sam_CW under the Result\ folder.

In summary, if the record set you want to analyze is called SE and you place the WoK record files under Source_Data\SE\data\, then the above three commands would become:

C:\CATAR\src>perl -s automc.pl -OOA SE ..\Source_Data\SE\data
C:\CATAR\src>perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.db
C:\CATAR\src>perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.db

Note: you must run overview analysis before you can run breakdown analysis.
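
The naming pattern above can be captured in a few lines. The following Python sketch (the function name catar_commands is hypothetical) only builds the three command lines for a given record set name, in the order they must be run. It does not execute them.

```python
def catar_commands(name):
    """Return the three CATAR command lines for record set `name`, in run order."""
    data_dir = "..\\Source_Data\\" + name + "\\data"
    db_file = "..\\Source_Data\\" + name + "\\" + name + ".db"
    return [
        "perl -s automc.pl -OOA " + name + " " + data_dir,  # overview first (creates the .db)
        "perl -s automc.pl -OBC " + name + " " + db_file,   # bibliographic coupling
        "perl -s automc.pl -OCW " + name + " " + db_file,   # co-word analysis
    ]

for command in catar_commands("SE"):
    print(command)
```

The printed lines are meant to be run from C:\CATAR\src, as in the examples above.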

Limitations:

For the overview analysis, the document set can be as large as tens of thousands of records.
For the breakdown analysis, the number of documents is limited to under 4000 for bibliographic coupling and under 3000 for co-word analysis due to the memory limit (tested on a 2GB RAM computer).
The above breakdown analysis commands are for article clustering. To do journal clustering, use the option -OBC=JBC instead of -OBC, and -OCW=JCW instead of -OCW. For journal clustering, the document set can contain tens of thousands of documents.
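
As a quick pre-check before a long run, the limits above can be encoded as follows. This Python sketch is not part of CATAR. The numbers are the ones quoted above for a 2GB RAM computer, so a machine with more memory may tolerate more, and the function name breakdown_option is hypothetical.

```python
# Article-clustering limits quoted in the text above.
ARTICLE_LIMITS = {"BC": 4000, "CW": 3000}

def breakdown_option(method, n_documents, journal=False):
    """Return the option flag for a breakdown analysis, or raise if the set is too large."""
    if method not in ARTICLE_LIMITS:
        raise ValueError("method must be 'BC' or 'CW'")
    if journal:
        # Journal clustering: -OBC=JBC or -OCW=JCW, usable with much larger sets.
        return "-O" + method + "=J" + method
    if n_documents >= ARTICLE_LIMITS[method]:
        raise ValueError(
            f"{n_documents} documents is not under the {ARTICLE_LIMITS[method]} "
            f"limit quoted for {method} article clustering"
        )
    return "-O" + method
```

For example, breakdown_option("BC", 3500) returns "-OBC", while breakdown_option("CW", 20000, journal=True) returns "-OCW=JCW".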

5. Interpretation

The results (in the Result folder) obtained from the above analyses require your interpretation to make them useful. How to make use of them depends on your domain knowledge, insights, imagination, and maybe luck!

To help understand the abbreviated field names presented in all the results, the following lists their meanings and examples:

AU: authors' names, e.g., WHITE, RT; GUNSTONE, RF;
AF: the authors' full names, such as Tseng, Yuen-Hsien (instead of Tseng, YH)
Note: authors' full names have been provided by WoK since 2007 (or possibly 2006). For earlier records, the content of AF is empty or identical to that of AU.
TI: publication title, e.g., METALEARNING AND CONCEPTUAL CHANGE;
SO: journal title, e.g., INTERNATIONAL JOURNAL OF SCIENCE EDUCATION;
J9: abbreviation of journal title, e.g., INT J SCI EDUC;
AB: publication's abstract;
C1: authors' countries (extracted from the authors' addresses in the original C1 field);
IU: authors' institutes/universities (extracted from the authors' addresses in the original C1 field);
DP: authors' departments (extracted from the authors' addresses in the original C1 field);
Note: Do not use this field for analysis if your data contains multiple universities, because different universities may have the same department names.
CR: normalized cited references, e.g., POSNER GJ, 1982, SCI EDUC, V66, P211;
TC: times cited, e.g., 60 means the publication had been cited by sixty other publications indexed by WoS at the time the record was downloaded;
PY: year of publication, e.g., 1989;
UT: primary key (record ID) of the publication used in WoS, e.g., WOS:A1989CY40300009;
SC: source categories given by WoS, e.g., Education & Educational Research
ID: identifiers given by WoK to describe the topics of the article, not the record ID;
DE: keywords given by the authors. In contrast, ID contains broader concepts, while SC contains even broader fields.
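
As an illustration of how a CR entry such as the example above breaks down, the following Python sketch splits one normalized cited reference into its parts. It assumes the "AUTHOR, YEAR, SOURCE, Vnn, Pnn" layout of that example. Real CR strings vary (some parts may be missing or extra ones present), so this is not a complete parser, and the function name parse_cited_reference is hypothetical.

```python
import re

def parse_cited_reference(cr):
    """Split a normalized CR string into named parts; missing parts stay None."""
    parts = [p.strip() for p in cr.split(",") if p.strip()]
    ref = {"author": None, "year": None, "source": None, "volume": None, "page": None}
    for part in parts:
        if ref["year"] is None and re.fullmatch(r"\d{4}", part):
            ref["year"] = int(part)               # first four-digit part is the year
        elif re.fullmatch(r"V\d+", part):
            ref["volume"] = part[1:]              # "V66" -> "66"
        elif re.fullmatch(r"P\w+", part) and ref["source"] is not None:
            ref["page"] = part[1:]                # "P211" -> "211"
        elif ref["year"] is None and ref["author"] is None:
            ref["author"] = part                  # text before the year
        elif ref["source"] is None:
            ref["source"] = part                  # first text part after the year
    return ref

example = parse_cited_reference("POSNER GJ, 1982, SCI EDUC, V66, P211")
```

Here example holds the author "POSNER GJ", the year 1982, the source "SCI EDUC", the volume "66", and the page "211".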

Note:

When interpreting the number of articles published by certain authors, be careful about the possible problem of different authors sharing the same names (AU or AF).
Some fields may have no content. For example, DE and ID may be empty because: 1) the journal does not require keywords from authors; or 2) WoK did not record DE or did not label ID. Before your analysis, you should check CATAR's report for the number of empty records in each field (in the Sam example above, the report is in the _Sam_stat worksheet in Result\Sam\_Sam_by_field.xls). Try to avoid using results from fields with too many empty records.
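
CATAR already reports the number of empty records per field, as noted above. If you want a rough check of the raw download before running CATAR, something like the following Python sketch could be used. It assumes the plain-text tagged export format of WoS (a two-letter field tag at the start of a line, indented continuation lines, and an ER line ending each record). The function names are hypothetical, and the second sample record is made up for illustration.

```python
FILE_TAGS = {"FN", "VR", "EF"}   # file-level tags, not part of any record

def read_wos_records(text):
    """Parse WoS-style tagged text into a list of {tag: value} dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        head = line[:2]
        if head == "ER":                           # end of one record
            records.append(current)
            current, tag = {}, None
        elif len(head.strip()) == 2:               # a new two-letter tag
            tag = None if head in FILE_TAGS else head
            if tag:
                current[tag] = line[3:].strip()
        elif tag and line.startswith("  ") and line.strip():
            current[tag] += "; " + line.strip()    # continuation of the previous tag
    return records

def count_empty(records, fields=("DE", "ID", "AB")):
    """Count the records in which each field is missing or empty."""
    return {f: sum(1 for r in records if not r.get(f)) for f in fields}

SAMPLE = """FN Thomson Reuters Web of Knowledge
VR 1.0
PT J
AU WHITE, RT
   GUNSTONE, RF
TI METALEARNING AND CONCEPTUAL CHANGE
PY 1989
ER

PT J
AU Tseng, YH
TI A second title
DE text mining; clustering
PY 2010
ER

EF
"""

records = read_wos_records(SAMPLE)
empty = count_empty(records)
```

With this sample, DE is empty in one of the two records, while ID and AB are empty in both. A field with counts like these would be a poor basis for analysis.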

6. Frequently Asked Questions

Q: How should the threshold be determined during multi-stage clustering?
A: The similarity measures, whether based on bibliographic coupling or co-occurring words, are only an approximation of the "real similarity" of two documents. For example, documents A and B sharing 50% of their references (or 50% of the words used in their titles and abstracts) does not necessarily mean that their topics are 50% in common. It is only fair to say that A and B are more likely to be on the same topic than A and C, if A and C share only 10% of their references (or 10% of their words).
So it is not necessary to insist on computing an optimal similarity threshold for clustering; the uncertainty in the similarity measure makes any such optimum uncertain as well. (To what degree a domain fits the methodology of bibliographic coupling and/or co-word analysis may itself be a research question.)
Therefore, it is recommended that you explore as many thresholds as you can and see which one leads to an interpretable result.
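
The effect of exploring several thresholds can be seen on a toy example. The Python sketch below is not CATAR's multi-stage clustering. As a simple stand-in it links any two documents whose similarity reaches the threshold (single-link grouping), it uses the Dice coefficient as an assumed similarity measure, and the four documents with numbered references are made up.

```python
from itertools import combinations

def dice(a, b):
    """Dice coefficient of two sets (assumed measure, for illustration only)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def clusters_at(docs, threshold):
    """Group documents connected by pairs with similarity >= threshold."""
    parent = {name: name for name in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if dice(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)        # merge the two groups

    groups = {}
    for name in docs:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=len, reverse=True)

def sweep(docs, thresholds):
    """Map each threshold to (number of multi-document clusters, number of outliers)."""
    summary = {}
    for t in thresholds:
        groups = clusters_at(docs, t)
        summary[t] = (sum(len(g) > 1 for g in groups),
                      sum(len(g) == 1 for g in groups))
    return summary

# Toy data: each document is represented by the set of references it cites.
docs = {"A": {1, 2, 3, 4}, "B": {1, 2, 5, 6}, "C": {1, 7, 8, 9}, "D": {10, 11}}
result = sweep(docs, [0.25, 0.5, 0.6])
```

In this toy case a threshold of 0.25 puts A, B, and C into one cluster and leaves D out, 0.5 keeps only A and B together, and 0.6 leaves every document unclustered. Which of these is "right" depends on which grouping you can interpret.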
Q: In some cases, some fields are empty, such as DE (author keywords), ID (identifiers), or even AB (abstract). What can we do?
A: These cases are quite common. The journal publisher may not have provided such information, or the database provider (e.g., WoK, Thomson Reuters) may not have added it to these fields.
Try to analyze and interpret the data in alternative ways, and avoid results derived from insufficient data.
Q: The papers included in the final stage of clustering are few compared to the initial papers for analysis. For example, up to 50% of the papers may not be included in the final clustering. Is this abnormal?
A: No. It is common for many items to be regarded as outliers in a clustering analysis. As can be imagined, although many papers deal with major topics, there are many more dealing with independent and probably less-noticed issues, a phenomenon similar to the long tail effect seen in online book sales statistics. These numerous independent issues are excluded from the clustering. Like other clustering analyses (such as those based on singular value decomposition), the multi-stage clustering used in CATAR tends to retain only some major topics for clarity, especially when a non-zero threshold is applied to each successive clustering stage.

7. History

CATAR was developed under the support and demand of the following projects:

曾元顯, 國立臺灣師範大學頂尖研究中心計畫, 2011~2017.
曾元顯, 「科學發展趨勢調查-科學地圖製作」, 財團法人國家實驗研究院科技政策研究與資訊中心, 2009/05/10-2009/09/10。
曾元顯, 「產業技術發展之文字探勘與趨勢分析」研究計畫, 財團法人工業技術研究院, 2008/09/15-2008/12/15。
曾元顯, 「社會經濟需求分析研究案」, 財團法人國家實驗研究院科技政策研究與資訊中心, 2008/06/10-2008/11/30。
曾元顯, 「農業研究前沿探勘模式與系統之開發」, 財團法人國家實驗研究院科技政策研究與資訊中心, 2007/09/01-2007/11/30。
曾元顯, 「科學與技術文獻之主題趨勢探勘」, 國科會96學年度專題研究計畫報告, NSC 96-2221-E-003-017-。
曾元顯, 「專利主題萃取之研究開發」研究計畫, 財團法人工業技術研究院, 2007/07/01-2007/11/30。
曾元顯, 「文字探勘技術在教育評鑑研究發展趨勢分析之應用」, 國立臺灣師範大學教育評鑑與發展研究中心研究計畫, 96/01/01-96/12/13。
曾元顯, 「文字探勘之視覺化模式與系統的研發」, 國科會95學年度提升產業技術及人才培育研究計畫報告, NSC 95-2622-E-003-009-CC3, 2006/11/01-2007/10/31。
曾元顯, 「農業創新研究前沿分析方法之開發」研究計畫, 財團法人國家實驗研究院科技政策研究與資訊中心, 2006/08-2006/12。
曾元顯, 「文字探勘技術之發展與其在科學與技術文獻分析之應用」, 國科會95學年度專題研究計畫報告, NSC 95-2221-E-003-016-。
曾元顯, 「引用網路分析等先進資訊工具於政策研究分析之應用」研究計畫, 財團法人國家實驗研究院科技政策研究與資訊中心, 2005/12-2006/07。

The technology used in CATAR is the result of a series of long-term research efforts, which were published in the following papers:

Yuen-Hsien Tseng, "Generic Title Labeling for Clustered Documents", Expert Systems With Applications, Vol. 37, No. 3, 15 March 2010, pp. 2247-2254. (SCI)
Yuen-Hsien Tseng, Yu-I Lin, Yi-Yang Lee, Wen-Chi Hung, and Chun-Hsiang Lee, "A Comparison of Methods for Detecting Hot Topics", Scientometrics, Vol. 81, No. 1, Oct. 2009, pp. 73-90. (SCI, SSCI)
Yuen-Hsien Tseng, Yeong-Ming Wang, Yu-I Lin, Chi-Jen Lin and Dai-Wei Juang, "Patent Surrogate Extraction and Evaluation in the Context of Patent Mapping", Journal of Information Science, Vol. 33, No. 6, pp. 718-736, Dec. 2007. (SCI, SSCI)
Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin, "Text Mining Techniques for Patent Analysis", Information Processing and Management, Vol. 43, No. 5, 2007, pp. 1216-1247. (SCI, SSCI, EI)
Yuen-Hsien Tseng, "Automatic Thesaurus Generation for Chinese Documents", Journal of the American Society for Information Science and Technology, Vol. 53, No. 13, Nov. 2002, pp. 1130-1138. (SSCI and SCI)

Applications of CATAR were published in various domains:

Yulan Yuan, Ulrike Gretzel, and Yuen-Hsien Tseng, "Revealing the Nature of Contemporary Tourism Research: Extracting Common Subject Areas through Bibliographic Coupling", International Journal of Tourism Research, Vol. 17, No. 5, pp. 417-431, Sep./Oct. 2015, DOI: 10.1002/jtr.2004. (SSCI)
Yulan Yuan, Yuen-Hsien Tseng, and Chun-Yen Chang, "Tourism Subfield Identification via Journal Clustering", Annals of Tourism Research, Vol. 47, July 2014, pp. 77-80. (SSCI)
Yuen-Hsien Tseng, Chun-Yen Chang, M. Shane Tutwiler, Ming-Chao Lin, and James Barufaldi, "A Scientometric Analysis of the Effectiveness of Taiwan's Educational Research Projects", Scientometrics, Vol. 95, No. 3, pp. 1141-1166, June 2013. (SCI, SSCI) See also the news report and slides on the 2013 international publication performance of the education field.
Yuen-Hsien Tseng and Ming-Yueh Tsay, "Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. (SCI, SSCI)
李清福, 陳志銘, 曾元顯, "數位學習領域主題分析之研究", 教育資料與圖書館學, 50卷, 3期, 頁319-354, 2013年4月。(TSSCI)
曾元顯, "文獻內容探勘工具 - CATAR 之發展和應用", 圖書館學與資訊科學半年刊, 第 37 卷第 1 期, 頁 31-49, 2011 年 4 月.
曾元顯, 林瑜一, "內容探勘技術在教育評鑑研究發展趨勢分析之應用", 教育科學研究期刊, 第 56 卷第 1 期, 頁 129-166, 2011 年 3 月. (TSSCI)
Yueh-Hsia Chang, Chun-Yen Chang, and Yuen-Hsien Tseng, "Trends of Science Education Research: An Automatic Content Analysis", Journal of Science Education and Technology, Vol. 19, No. 4, 2010, pp. 315-331.
Yi-Yang Lee, Yuen-Hsien Tseng, Wen-Chi Hung, and Michael Huang, "The Application of Knowledge Mining to the Discovery of Trends in Future Agricultural Technological Development" Proceedings of the 10th International Conference on Science and Technology Indicators, Sep. 17-20, 2008, Vienna, Austria, page 481-483.
陳淑貞（2010）。以自動化主題分析探索免疫學領域研究主題之發展。國立台灣師範大學圖書資訊學研究所碩士論文，未出版，台北市。
許育聞（2009）。會議與期刊文獻對預測主題趨勢之比較研究—以「資訊檢索」領域為例。國立台灣師範大學圖書資訊學研究所碩士論文，未出版，台北市。
谷佳臻（2008）。電腦輔助分析軟體運用於質性研究訪談稿內容分析之探討。國立台灣師範大學圖書資訊學研究所碩士論文，未出版，台北市。

From the above projects (most of them small, experimental ones) and papers, you can imagine that the CATAR presented here covers only some of the analysis tasks I have done so far. Many more experimental analyses are not mentioned (due to the specificity of their requirements and applications). This suggests that the analysis approaches provided by the current version of CATAR are the most mature ones.

If you have an idea for analyzing unstructured data in a way that may be novel and useful, I would be happy to modify CATAR and include that way of analysis in future versions, especially when the modification leads to a research opportunity. The only limit is my time and energy.

Established on May 19, 2010 by Yuen-Hsien Tseng