Content Analysis Toolkit for Academic Research (CATAR)

Yuen-Hsien Tseng, 2022/07/18 (previous version: 2020/03/12)

1. Introduction

CATAR is a software toolkit for users who would like to analyze a set of documents (semi-structured or unstructured free-text data), especially publication records from the Web of Science (WoS), full-text patent documents (in HTML) from the USPTO, or any other text worth analyzing (web pages, interview transcripts, etc.), for the purpose of strategic reading, planning, or research.

The content analysis that CATAR provides can be used to:

For a more concrete idea of what it can do, please see the examples:

Please check the English Tutorial or Chinese tutorial for a quick overview.


2. CATAR Installation

CATAR was developed in the Perl language on MS Windows, with MS Excel for data tabulation and MS Access as its database. So you should have MS Windows and MS Office on your computer.
Note: Since 2017/10/14, the CATAR version for Windows 10 uses SQLite as its database. For USPTO documents, MS Access is still used, because many queries for patent analyses are written as MS Access queries, so that once the USPTO patent documents are downloaded, around 20 overview graphs are ready for review.

In addition to MS Excel, OpenOffice can be used to open the XLS files produced by CATAR. The SQLite databases can be opened with SQLite Browser.

Below are the steps to prepare your computer to run CATAR:

(1) Download and Install Perl:

First, download the latest Perl interpreter for your computer, e.g., Strawberry Perl at http://strawberryperl.com/. Note: the 32-bit version is used to analyze USPTO patents.
Then install the downloaded Perl interpreter by following the installation steps and choosing the default settings.
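To verify the installation, open a DOS command console and check the Perl version:

C:\>perl -v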

(2) Download CATAR:

Download CATAR from: https://github.com/SamTseng/CATAR (for Windows 10).

(3) Decompress CATAR to a target folder:

Decompress the downloaded CATAR file to the folder C:\CATAR.
After decompression, you should find sub-folders such as src\ and Source_Data\ under C:\CATAR\, which are referenced throughout this document.

(4) Install Perl Modules (Packages):

Make sure your computer is connected to the Internet before installing the required modules.
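The exact module list depends on your CATAR version, so the following is only a hedged example: if running CATAR reports a missing module (for instance DBI or DBD::SQLite, which fit the SQLite database mentioned above), it can be installed from the DOS console with the cpan client (or cpanm, which recent Strawberry Perl releases bundle):

C:\>cpan DBI DBD::SQLite
(or)
C:\>cpanm DBI DBD::SQLite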

3. Data Preparation

CATAR is ready for the paper records downloaded from the Web of Knowledge (WoK; note that WoS is a database on the WoK platform, so hereafter WoS and WoK are used interchangeably). This means that CATAR knows how to read those downloaded files and run the analysis right away, without any further file conversion. To see the record format that CATAR expects, have a look at the example files under Source_Data\sam\data. The following figure shows the options for saving records from WoK. (Note: WoK changed its interface in 2014, so the screenshots below may not match the interface you are using now, but the download process remains similar.)
Please check WoS_Record_Download.ppt for instructions to download a set of records from WoS.



When downloading data from WoK, choose these options for correct content and format.
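The example files under Source_Data\sam\data use the WoS tagged plain-text format (full records with cited references). For orientation only, such a file looks roughly like the sketch below; the header line and all field values here are invented placeholders, so treat the files under Source_Data\sam\data as the authoritative format:

FN Thomson Reuters Web of Science
VR 1.0
PT J
AU Tseng, YH
AF Tseng, Yuen-Hsien
TI An example article title
SO AN EXAMPLE JOURNAL NAME
DE example keyword one; example keyword two
ID EXAMPLE IDENTIFIER
AB An example abstract ...
CR Author A., 2010, EXAMPLE SOURCE, V1, P1
PY 2013
ER
EF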

CATAR is also ready for the patent files downloaded from the USPTO. Just supply a list of patent numbers (patent IDs), and CATAR will download them from the USPTO (PatFT and AppFT only) and analyze them for you. For details, please refer to: https://github.com/SamTseng/CATAR.

For other types of documents you would like to analyze, copy the file src\Paper_org.db to a new file of your own and insert your documents into the table TPaper in that new file. (You will need some familiarity with SQL commands.) In fact, CATAR starts its various analyses from this table in the new file. You can check the example in the sub-folder Source_Data\movie\movie.db. A sketch of how to do the insertion is given below.
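As a hedged sketch only: the column names used below (UT, TI, AB) are placeholders, so inspect the actual TPaper schema in src\Paper_org.db or Source_Data\movie\movie.db first. With Perl's DBI and DBD::SQLite modules, the insertion could look like this:

  # insert_docs.pl : insert your own documents into a copy of Paper_org.db
  # The column names (UT, TI, AB) are placeholders; check the real schema
  # of table TPaper before using this script.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:SQLite:dbname=../Source_Data/MyDocs/MyDocs.db',
                         '', '', { RaiseError => 1 });
  my $sth = $dbh->prepare('INSERT INTO TPaper (UT, TI, AB) VALUES (?, ?, ?)');
  $sth->execute('DOC0001', 'An example document title', 'An example abstract ...');
  $dbh->disconnect;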


4. Analysis

Three kinds of analyses can be done with CATAR:

  1. Overview analysis: this includes descriptive statistics of the records tabulated field by field (the _by_field.xls result described in Section 4);
  2. Breakdown analysis based on bibliographic coupling (BC);
    (This is possible only when the CR (Cited Reference) field is available.)
  3. Breakdown analysis based on co-word (CW, co-occurrence words) analysis.
The last two (breakdown) analyses identify the sub-topics revealed in the document set and provide cross-tabulations of each sub-topic against actors (such as the most productive authors or institutes in a certain sub-topic).

The simplest way to use CATAR is to download the WoK data, save them in an example folder, and then run the command batch file corresponding to that folder. See the following figures for an example. The left figure shows the example folder where the downloaded data were saved. The middle figure shows where the corresponding command batch file resides. Just double-click the batch file and CATAR will run all three analyses from start to end. The right figure shows the resulting folders after all the analyses have been done.



To run an individual analysis (or to understand the CATAR commands in the example batch files), the following examples explain the commands for running CATAR from the MS-DOS command console. Some familiarity with DOS commands, folders (directories), and file paths is helpful.


Using the publication records in Source_Data\Sam\data as an example, the overview analysis can be done with the following command under DOS:

C:\CATAR\src>perl -s automc.pl -OOA Sam ..\Source_Data\Sam\data

The result is in Result\Sam\_Sam_by_field.xls. In addition, the content of the record files is converted and saved into Source_Data\Sam\Sam.db for use by the breakdown analyses.


For the breakdown analysis based on bibliographic coupling (BC), run the command:

C:\CATAR\src>perl -s automc.pl -OBC Sam ..\Source_Data\Sam\Sam.db

CATAR will prompt you with some questions during the analysis. Accept the default answers if you do not know how to respond. The results will be in the sub-folders whose names start with Sam_BC under the Result\ folder.


To do the co-word analysis (CW), run the command:

C:\CATAR\src>perl -s automc.pl -OCW Sam ..\Source_Data\Sam\Sam.db

Again, you will be prompted with some questions during the analysis. Accept the default answers if you do not know how to respond. The results will be in the sub-folders whose names start with Sam_CW under the Result\ folder.

In summary, if the record set you want to analyze is called SE and you place the WoK record files under Source_Data\SE\data\, then the above three commands become:

C:\CATAR\src>perl -s automc.pl -OOA SE ..\Source_Data\SE\data
C:\CATAR\src>perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.db
C:\CATAR\src>perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.db
Note: you must run the overview analysis before you can run either breakdown analysis.
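If you prefer the one-click approach described with the figures above, the corresponding command batch file is essentially these commands collected in a .bat file. A minimal sketch (the batch files shipped with CATAR may differ) would be:

cd C:\CATAR\src
perl -s automc.pl -OOA SE ..\Source_Data\SE\data
perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.db
perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.db
pause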

Limitations:

  1. For the overview analysis, the document set can be as large as tens of thousands of records.
  2. For the breakdown analyses, the number of documents is limited to under 4000 for bibliographic coupling and under 3000 for co-word analysis, due to memory limits (tested on a computer with 2 GB of RAM).
  3. The breakdown analysis commands above are for article clustering. To do journal clustering, use the option -OBC=JBC instead of -OBC, and -OCW=JCW instead of -OCW, as shown in the example below. For journal clustering, the document set can contain tens of thousands of documents.
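For example, following the SE naming above, the journal-clustering versions of the two breakdown commands would be:

C:\CATAR\src>perl -s automc.pl -OBC=JBC SE ..\Source_Data\SE\SE.db
C:\CATAR\src>perl -s automc.pl -OCW=JCW SE ..\Source_Data\SE\SE.db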

5. Interpretation

The results (in the Result folder) obtained from the above analyses require your interpretation to make them useful. How to make use of them depends on your domain knowledge, insight, imagination, and maybe luck!

To help you understand the abbreviated field names presented in the results, note that they follow the WoS field tags; those referred to in this document include AU and AF (author name and author full name), DE (author keywords), ID (identifiers), AB (abstract), and CR (cited references).

Note:

  1. When interpreting the number of articles published by certain authors, be careful about the possible problem of different authors sharing the same name (AU or AF).
  2. Some fields may have no content. For example, DE and ID may be empty because: 1) the journal does not require author-supplied keywords; or 2) WoK did not record DE or did not assign ID. Before your analysis, check CATAR's report on the number of empty records for each field (for the Sam example above, the report is in the _Sam_stat worksheet of Result\Sam\_Sam_by_field.xls). Avoid using results from a field that contains too many empty records.


6. Frequently Asked Questions

  1. Q: How to determine the threshold during the multi-stage clustering?
    A: The similarity measures, whether based on bibliographic coupling or on co-occurring words, are only an approximation to the "real similarity" of two documents. For example, documents A and B sharing 50% of their references (or 50% of the words in their titles and abstracts) does not necessarily mean their topics overlap by 50%. It is only fair to say that A and B are more likely to be on the same topic than A and C are, if A and C share only 10% of their references (or 10% of their words).
    So there is no need to insist on computing an optimal similarity threshold for clustering; the uncertainty in the similarity measure makes any such optimal threshold questionable. (It may be a research question to know to what degree a domain fits the methodology of bibliographic coupling and/or co-word analysis.)
    Therefore, it is recommended that you explore as many thresholds as you can and see which one leads to an interpretable result.

  2. Q: In some cases, there are empty data in some fields, such as DE (author keywords), ID (identifiers), or even AB (abstract). What can we do?
    A: These cases are quite common. They may occur because the journal publisher did not provide such information, or because the database provider (e.g., WoK, Thomson Reuters) did not add it to these fields.
    Try to analyze and interpret the data in alternative ways, and avoid results derived from insufficient data.

  3. Q: The number of papers included in the final stage of clustering is small compared to the initial number of papers for analysis. For example, up to 50% of the papers may not be included in the final clustering. Is this abnormal?
    A: No. It is a common phenomenon that many items are treated as outliers in a clustering analysis. As can be imagined, although many papers deal with major topics, many more deal with independent and probably less-noticed issues, a phenomenon similar to the long-tail effect seen in online book-sales statistics. Because these independent issues are numerous yet each attracts only a few papers, they are excluded from the clustering. Like other clustering analyses (such as those based on singular value decomposition), the multi-stage clustering used in CATAR tends to retain only the major topics for clarity, especially when a non-zero threshold is applied at each successive clustering stage.

7. History

CATAR was developed with the support of, and in response to the needs of, the following projects:

The technology used in CATAR represents a long-term series of research efforts and was published in the following papers:

Applications of CATAR were published in various domains:

  1. Yulan Yuan, Ulrike Gretzel, and Yuen-Hsien Tseng, "Revealing the Nature of Contemporary Tourism Research: Extracting Common Subject Areas through Bibliographic Coupling", International Journal of Tourism Research, Vol. 17, No. 5, pp. 417-431, Sep./Oct. 2015, DOI: 10.1002/jtr.2004. (SSCI)
  2. Yulan Yuan, Yuen-Hsien Tseng, and Chun-Yen Chang, "Tourism Subfield Identification via Journal Clustering", Annals of Tourism Research, Vol. 47, July 2014, pp. 77-80. (SSCI)
  3. Yuen-Hsien Tseng, Chun-Yen Chang, M. Shane Tutwiler, Ming-Chao Lin, and James Barufaldi, "A Scientometric Analysis of the Effectiveness of Taiwan's Educational Research Projects", Scientometrics, Vol. 95, No. 3, pp. 1141-1166, June 2013. (SCI, SSCI) [2013 news slides on the international publication performance of the education field]
  4. Yuen-Hsien Tseng and Ming-Yueh Tsay, "Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. (SCI, SSCI)
  5. 李清福, 陳志銘, 曾元顯, "A Study of Topic Analysis in the Digital Learning Domain" (in Chinese), 教育資料與圖書館學, Vol. 50, No. 3, pp. 319-354, April 2013. (TSSCI)
  6. 曾元顯, "The Development and Applications of a Document Content Mining Tool: CATAR" (in Chinese), 圖書館學與資訊科學, Vol. 37, No. 1, pp. 31-49, April 2011.
  7. 曾元顯, 林瑜一, "Applications of Content Mining Techniques to the Trend Analysis of Educational Evaluation Research" (in Chinese), 教育科學研究期刊, Vol. 56, No. 1, pp. 129-166, March 2011. (TSSCI)
  8. Yueh-Hsia Chang, Chun-Yen Chang, and Yuen-Hsien Tseng, "Trends of Science Education Research: An Automatic Content Analysis", Journal of Science Education and Technology, Vol. 19, No. 4, 2010, pp. 315-331.
  9. Yi-Yang Lee, Yuen-Hsien Tseng, Wen-Chi Hung, and Michael Huang, "The Application of Knowledge Mining to the Discovery of Trends in Future Agricultural Technological Development", Proceedings of the 10th International Conference on Science and Technology Indicators, Sep. 17-20, 2008, Vienna, Austria, pp. 481-483.
  10. 陳淑貞 (2010). Exploring the Development of Research Topics in Immunology through Automatic Topic Analysis (in Chinese). Unpublished master's thesis, Graduate Institute of Library and Information Studies, National Taiwan Normal University, Taipei.
  11. 許育聞 (2009). A Comparative Study of Conference and Journal Literature for Predicting Topic Trends: The Case of the "Information Retrieval" Field (in Chinese). Unpublished master's thesis, Graduate Institute of Library and Information Studies, National Taiwan Normal University, Taipei.
  12. 谷佳臻 (2008). Applying Computer-Assisted Analysis Software to the Content Analysis of Qualitative Research Interview Transcripts (in Chinese). Unpublished master's thesis, Graduate Institute of Library and Information Studies, National Taiwan Normal University, Taipei.

From the above projects (most are small, experimental ones) and papers, you can see that the CATAR presented here covers only some of the analysis tasks I have done so far. Many more experimental analyses are not mentioned (due to the specificity of their requirements and applications). This means that the analysis approaches provided by the current version of CATAR are the most mature ones.

If you have an idea for analyzing unstructured data in a way that may be novel and useful, I would be happy to modify CATAR to include that kind of analysis in future versions, especially when the modification leads to a research opportunity. The only limits are my time and energy.


Established on May 19, 2010 by Yuen-Hsien Tseng