Content Analysis Toolkit for Academic Research (CATAR)

Yuen-Hsien Tseng, 2018/11/15

1. Introduction

CATAR is a software toolkit for users who would like to analyze a set of documents (semi-structured or unstructured free-text data), especially publication records from ISI's Web of Science (WoS), full-text patent documents (in HTML) from the USPTO, or any other texts worth analyzing (Web pages, interview records, etc.), for the purpose of strategic reading, planning, or research.

The content analysis that CATAR provides can be used to:

To get a more concrete idea of what it can do, please see the examples:

Please check the English tutorial or the Chinese tutorial for a quick overview.


2. CATAR Installation

CATAR was developed in Perl on MS Windows, with MS Excel for data tabulation and MS Access as its database, so you should have MS Windows and MS Office on your computer.
(Note: Since 2017/10/14, the CATAR version for Windows 10 uses SQLite as its database, so MS Office is no longer necessary to use CATAR. Instead, OpenOffice can be used to open the XLS files produced by CATAR, and a SQLite browser can be used to open the databases created by CATAR.)

Below are the steps to prepare your computer to run CATAR:

Firstly, download the recommended version of Strawberry Perl from http://strawberryperl.com/ for your computer (I installed the 64-bit version 5.22.1.1 on my 64-bit Windows 8.1 around 2016/05/30).
Then install it on your computer by following the installation steps and choosing the default settings.

Secondly, get CATAR from the following link:

Note:
  1. CATAR is free to use only for individuals in education and non-profit organizations (or institutes).
  2. A username (samtseng) and password (CATAR10) will be requested when accessing the file.
  3. The above location and access method may change in the future for better management.

Thirdly, uncompress the downloaded CATAR file to a folder on your computer. The recommended folder to host CATAR's files and subfolders is C:\CATAR.
You will find the following sub-folders under C:\CATAR\

Finally, install the Perl modules needed by CATAR.

  1. Make sure your computer is connecting to the Internet.
  2. Open MS-DOS (search for cmd.exe on your computer and run it), and then execute the commands that install the required Perl modules (an illustrative sketch is given after this list):
  3. Install SAMtool by uncompressing C:\CATAR\src\Perl_Module\SAMtool.rar into C:\Strawberry\perl\site\lib so that you see many Perl files under C:\Strawberry\perl\site\lib\SAMtool\. (This assumes your Perl is installed at C:\Strawberry.)
  4. You may need one more step: when running a CATAR command shown in Section 4, if you encounter an error message like:
    Can't locate *.pm in @INC ...
    copy all the *.pm files (files with the .pm extension) under
    C:\CATAR\src\
    to
    C:\strawberry\perl\site\lib\
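
As an illustration only (the exact modules required depend on your CATAR version; the module names below are assumptions, not a definitive list), Perl modules are installed from the command prompt with the cpan client that ships with Strawberry Perl, for example:

C:\> cpan DBI
C:\> cpan DBD::SQLite
C:\> cpan Spreadsheet::WriteExcel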

3. Data Preparation

CATAR is ready for the paper records downloaded from ISI Web of Knowledge (i.e., WoK. Note: WoS is a database on the WoK platform, so hereafter WoS and WoK are used interchangeably). This means that CATAR knows how to read the downloaded files and run the analysis right away, without any further file conversion. To see the record format CATAR expects, please have a look at the example files under Source_Data\sam\data.
The following figure shows the options for saving records from WoK. (Note: WoK has changed its interface since 2014, so the screenshots below may not match the interface you are using now, but the download process remains similar.)
Please check WoS_Record_Download.ppt for instructions on downloading a set of records from WoS.



When downloading data from WoK, choose these options to obtain the correct content and format.
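
For reference, a WoS plain-text export record uses a tagged format roughly like the sketch below (the field values here are invented for illustration; check the example files under Source_Data\sam\data for the exact format CATAR expects):

FN Thomson Reuters Web of Science
VR 1.0
PT J
AU Doe, J
TI An example article title
SO JOURNAL OF EXAMPLES
DE example keyword; another keyword
ID EXAMPLE TOPIC; RELATED TOPIC
AB This is the abstract text of the example record.
CR Smith J, 2001, J EXAMPLE, V1, P1
PY 2005
ER
EF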

CATAR is also ready for patent files downloaded from the USPTO. Just supply a list of patent numbers (patent IDs), and CATAR will download them from the USPTO (PatFT and AppFT only) and analyze them for you. (In this CATAR version, instructions for patent analyses have not yet been included and tested.)

For other types of documents you would like to analyze, copy the Access file src\Paper_org.mdb to a new file of your own and insert your documents into the table TPaper in that new file. CATAR begins its various analyses from this table in the new file. You can check the example in the sub-folder Source_Data\movie\movie.mdb.
Note: For the CATAR version using SQLite as its database, copy the SQLite file src\Paper_org.db to a new file of your own and insert your documents into the table TPaper in that new file (some familiarity with SQL commands is needed). Again, CATAR begins its analyses from this table in the new file. You can check the example in the sub-folder Source_Data\movie\movie.db.
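
Below is a minimal Perl sketch of inserting a document into TPaper for the SQLite version. The database path and the column names TI (title) and AB (abstract) are assumptions for illustration; check them against the actual TPaper schema in src\Paper_org.db (or the movie.db example) before running anything like this.

use strict;
use warnings;
use DBI;

# Open the copy of Paper_org.db that will hold your own documents.
# (The path below is a hypothetical example.)
my $dbh = DBI->connect("dbi:SQLite:dbname=../Source_Data/MyDocs/MyDocs.db",
                       "", "", { RaiseError => 1 });

# Insert one document into the table TPaper.
# NOTE: the column names TI and AB are assumed for illustration only;
# adjust them to match the real TPaper schema.
my $sth = $dbh->prepare("INSERT INTO TPaper (TI, AB) VALUES (?, ?)");
$sth->execute("An example document title",
              "The free text of the document to be analyzed ...");

$dbh->disconnect;

The same kind of INSERT statement can also be issued interactively from a SQLite browser or the sqlite3 command-line shell.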


4. Analysis

Three kinds of analyses can be done with CATAR:

  1. Overview analysis: this includes
  2. Breakdown analysis based on bibliographic coupling (BC);
    (This is possible only when the CR (Cited Reference) field is available.)
  3. Breakdown analysis based on co-word (CW, co-occurrence words) analysis.
The last two (breakdown analyses) identify the sub-topics revealed in the document set and provide cross-tabulation analyses for each sub-topic and actor (such as the most productive authors or institutes in a certain sub-topic).

The simplest way to use CATAR is to download the WoK data, save it in an example folder, and then run the command batch file corresponding to that folder; see the figures below for an example. The left figure shows the example folder where the downloaded data were saved. The middle figure shows where the corresponding command batch file resides. Just double-click the batch file and CATAR will run all three analyses from start to end (a sketch of what such a batch file contains is given below). The right figure shows the resulting folders after all the analyses have been done.
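
As a sketch only (assuming an example folder named SE and the SQLite version of CATAR; the batch files shipped with CATAR may differ in detail), such a batch file essentially runs the three commands explained later in this section:

cd C:\CATAR\src
perl -s automc.pl -OOA SE ..\Source_Data\SE\data
perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.db
perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.db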



To do an individual analysis (or to understand the meaning of the CATAR commands in the example batch files), the following examples explain the commands for running CATAR under the MS-DOS command console. Some familiarity with the concepts of DOS commands, folders (directories), and file paths is helpful.


Now, using the publication records in Source_Data\Sam\data as an example, the overview analysis can be done with the following command under DOS:

C:\CATAR\src>perl -s automc.pl -OOA Sam ..\Source_Data\Sam\data

The result is in Result\Sam\_Sam_by_field.xls. In addition, the content of the record files is converted and saved in Source_Data\Sam\Sam.mdb (or Source_Data\Sam\Sam.db if the SQLite version is used) for use by the breakdown analyses.


For the breakdown analysis based on bibliographic coupling (BC), run the command:

C:\CATAR\src>perl -s automc.pl -OBC Sam ..\Source_Data\Sam\Sam.mdb
or, if the SQLite version of CATAR is used,
C:\CATAR\src>perl -s automc.pl -OBC Sam ..\Source_Data\Sam\Sam.db

CATAR will prompt you with some questions during the analysis. Accept the default answers if you do not know how to respond. The results will be in the sub-folders starting with Sam_BC under the Result\ folder.


To do the co-word analysis (CW), run the command:

C:\CATAR\src>perl -s automc.pl -OCW Sam ..\Source_Data\Sam\Sam.mdb
or, if the SQLite version of CATAR is used,
C:\CATAR\src>perl -s automc.pl -OCW Sam ..\Source_Data\Sam\Sam.db

Again, you will be prompted with some questions during the analysis. Accept the default answers if you do not know how to respond. The results will be in the sub-folders starting with Sam_CW under the Result\ folder.

In summary, if the record set you want to analyze is called SE and you place the WoK record files under Source_Data\SE\data\, then the above three commands become:

C:\CATAR\src>perl -s automc.pl -OOA SE ..\Source_Data\SE\data
C:\CATAR\src>perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.mdb
C:\CATAR\src>perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.mdb
Note: you must run the overview analysis before you can run the breakdown analyses.

Or, for the CATAR version using SQLite, the three commands would be:

C:\CATAR\src>perl -s automc.pl -OOA SE ..\Source_Data\SE\data
C:\CATAR\src>perl -s automc.pl -OBC SE ..\Source_Data\SE\SE.db
C:\CATAR\src>perl -s automc.pl -OCW SE ..\Source_Data\SE\SE.db
Note: you must run the overview analysis before you can run the breakdown analyses.

Limitations:

  1. For the overview analysis, the document set can be as large as tens of thousands of records.
  2. For the breakdown analyses, the number of documents is limited to under 4000 for bibliographic coupling and under 3000 for co-word analysis due to memory limits (tested on a computer with 2 GB of RAM).
  3. The above breakdown-analysis commands are for article clustering. To do journal clustering, use the option -OBC=JBC instead of only -OBC, and -OCW=JCW instead of only -OCW (see the example commands after this list). For journal clustering, the number of documents can be in the tens of thousands.
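
For example, following the command pattern above, journal clustering for the SE data set (SQLite version) would be run as:

C:\CATAR\src>perl -s automc.pl -OBC=JBC SE ..\Source_Data\SE\SE.db
C:\CATAR\src>perl -s automc.pl -OCW=JCW SE ..\Source_Data\SE\SE.db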

5. Interpretation

The results (in the Result folder) obtained from the above analyses require your interpretation to make them useful. How to make use of them depends on your domain knowledge, insights, imagination, and maybe luck!

To help you understand the abbreviated field names presented in the results, the following lists their meanings and examples:

Note:

  1. When interpreting the number of articles published by certain authors, be careful about the possible problem of different authors sharing the same name (AU or AF).
  2. Some fields may have no content. For example, DE and ID may be empty because: 1) the journal does not require author-supplied keywords; or 2) WoK did not record DE or did not label ID. Before your analysis, you should check CATAR's report for the number of empty records in each field (for the Sam example above, the report is in the _Sam_stat worksheet of Result\Sam\_Sam_by_field.xls). Avoid using results from a field that contains too many empty records.


6. Frequently Asked Questions

  1. Q: How should the threshold be determined during the multi-stage clustering?
    A: The similarity measures, whether based on bibliographic coupling or on co-occurring words, are only an approximation of the "real similarity" of two documents. For example, documents A and B sharing 50% of their references (or 50% of the words used in their titles and abstracts) does not necessarily mean that their topics are 50% in common. It is only fair to say that A and B are more likely to belong to the same topic than A and C are, if A and C share only 10% of their references (or 10% of their words). (An illustrative sketch of such an overlap-based similarity is given after this list of questions.)
    So insisting on an optimal similarity threshold for clustering is not necessary; the uncertainty in the similarity measure makes such an optimal threshold somewhat unreliable. (It may be a research question to know to what degree a domain fits the methodology of bibliographic coupling and/or co-word analysis.)
    Therefore, it is recommended that you explore as many thresholds as you can and see which one leads to an interpretable result.

  2. Q: In some cases, there are empty data in some fields, such as DE (author keywords), ID (identifiers), or even AB (abstract). What can we do?
    A: These cases are quite common. They may occur because the journal publisher did not provide such information or because the database provider (e.g., WoK, Thomson Reuters) did not add it to these fields.
    Try to analyze and interpret the data in alternative ways, and avoid results derived from insufficient data.

  3. Q: The number of papers included in the final stage of clustering is small compared to the initial number of papers for analysis. For example, up to 50% of the papers may not be included in the final clustering. Is this abnormal?
    A: No. It is a common phenomenon that many items are regarded as outliers in a clustering analysis. As can be imagined, although many papers deal with major topics, many more deal with independent and probably less-noticed issues, a phenomenon similar to the long-tail effect seen in online book sales statistics. Since these independent issues are large in number, they are excluded from the clustering. Like other clustering analyses (such as those based on singular value decomposition), the multi-stage clustering used in CATAR tends to retain only the major topics for clarity, especially when a non-zero threshold is applied at each successive clustering stage.
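
Here is the illustrative sketch mentioned in Q1: a minimal Perl example of an overlap-based similarity between two reference lists. It uses a generic Dice coefficient for illustration and is not necessarily the exact formula CATAR uses.

use strict;
use warnings;

# Dice coefficient between two reference lists: an approximation of
# how strongly two documents are bibliographically coupled.
sub dice_similarity {
    my ($refs_a, $refs_b) = @_;
    my %in_a   = map { $_ => 1 } @$refs_a;
    my $shared = grep { $in_a{$_} } @$refs_b;   # number of shared references
    return 2 * $shared / (@$refs_a + @$refs_b);
}

my @doc_a = ('Ref1', 'Ref2', 'Ref3', 'Ref4');
my @doc_b = ('Ref1', 'Ref2', 'Ref5', 'Ref6');

# Half of the references are shared, so the similarity is 0.50, but this
# does not guarantee that the two papers are "50% about the same topic".
printf "similarity = %.2f\n", dice_similarity(\@doc_a, \@doc_b);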

7. History

CATAR was developed under the support and demand of the following projects:

The technology used in CATAR is the result of a series of long-term research efforts and was published in the following papers:

Applications of CATAR were published in various domains:

  1. Yulan Yuan, Ulrike Gretzel, and Yuen-Hsien Tseng, "Revealing the Nature of Contemporary Tourism Research: Extracting Common Subject Areas through Bibliographic Coupling", International Journal of Tourism Research, Vol. 17, No. 5, pp. 417-431, Sep./Oct. 2015, DOI: 10.1002/jtr.2004. (SSCI)
  2. Yulan Yuan, Yuen-Hsien Tseng, and Chun-Yen Chang, "Tourism Subfield Identification via Journal Clustering", Annals of Tourism Research, Vol. 47, July 2014, pp. 77-80. (SSCI)
  3. Yuen-Hsien Tseng, Chun-Yen Chang, M. Shane Tutwiler, Ming-Chao Lin, and James Barufaldi, "A Scientometric Analysis of the Effectiveness of Taiwan's Educational Research Projects", Scientometrics, Vol. 95, No. 3, pp. 1141-1166, June 2013. (SCI, SSCI)
  4. Yuen-Hsien Tseng and Ming-Yueh Tsay, " Journal clustering of Library and Information Science for subfield delineation using the bibliometric analysis toolkit: CATAR", Scientometrics, Vol. 95, No. 2, pp. 503-528, May 2013. (SCI, SSCI)
  5. Chinese-language journal article (authors and title garbled in the source), Vol. 50, No. 3, pp. 319-354, April 2013. (TSSCI)
  6. Chinese-language journal article on CATAR (author, title, and journal name garbled in the source), Vol. 37, No. 1, pp. 31-49, April 2011.
  7. Chinese-language journal article (authors and title garbled in the source), Vol. 56, No. 1, pp. 129-166, March 2011. (TSSCI)
  8. Yueh-Hsia Chang, Chun-Yen Chang, and Yuen-Hsien Tseng, " Trends of Science Education Research: An Automatic Content Analysis", Journal of Science Education and Technology, Vol. 19, No. 4, 2010, pp. 315-331.
  9. Chang, Y.-H., Tseng, Y.-H., Chang, C.-Y., & Chen, C. L. D. (2012). Research trends of pedagogical content knowledge: An automatic content analysis. International Conference of the European Association for Practitioner Research on Improving Learning (EAPRIL), Nov. 28-30, 2012, Jyvaskyla, Finland.
  10. Yuen-Hsien Tseng and Chun-Yen Chang, "Performance Comparison for Educational Institutes - Identifying Sub-Field Characteristics by Journal Clustering for Better Ranking", The 17th International Conference on Science and Technology Indicators (STI 2012), Montreal, Canada, Sept. 5th-8th, 2012, pp. 897-898.
  11. Yi-Yang Lee, Yuen-Hsien Tseng, Wen-Chi Hung, and Michael Huang, "The Application of Knowledge Mining to the Discovery of Trends in Future Agricultural Technological Development", Proceedings of the 10th International Conference on Science and Technology Indicators, Sep. 17-20, 2008, Vienna, Austria, pp. 481-483.
  12. Chinese-language master's thesis (author and title garbled in the source), 2008.
  13. Chinese-language master's thesis (author and title garbled in the source), 2009.
  14. Chinese-language master's thesis (author and title garbled in the source), 2010.

From the above projects (most of them small, experimental ones) and papers, you can see that the CATAR presented here covers only some of the analysis tasks I have done so far. Many more experimental analyses are not mentioned (due to the specificity of their requirements and applications). The analysis approaches provided in the current version of CATAR are therefore the most mature ones.

If you have an idea for analyzing unstructured data in a way that may be novel and useful, I would be happy to modify CATAR and include that kind of analysis in future versions, especially when the modification leads to a research opportunity. The only limit is my time and energy.


Established on May 19, 2010 by Yuen-Hsien Tseng