(Last updated on 2021-09-03)
Alvin Cheng-Hsien Chen, National Taiwan Normal University
Táňa Dluhošová*, Oriental Institute of the Czech Academy of Sciences
*Corresponding Author: Táňa Dluhošová, Oriental Institute, Academy of Sciences of the Czech Republic. Email: dluhosova@orient.cas.cz
This study examines the semantic development of ideologically loaded key terms in Taiwanese cultural discourse from a corpus-based inductive perspective, contextualizing them ideologically and socially. We investigated lexical patterns in the Taiwan Early Post-war Corpus (TEPC), which consists of culturally oriented articles from the period 1945–1949. We used hierarchical cluster analysis of lexical distributions to identify text subgroups. Statistically defined keywords within clusters helped characterize post-war ideological undercurrents. Network analyses revealed interrelationships of these cluster-based statistical keywords, uncovering three ideologically inclined cluster-specific semantic fields. Additionally, we examined networks associated with the authors and periodicals in each cluster through shared keyword usage, emphasizing connections to cultural policies of the central and local governments and the propagation of Taiwanese subjectivity. Finally, with the help of the Taiwan Biographical Ontology (TBIO), we complemented ideological patterns with positional analyses of authors’ social involvement, determining the predispositions of each cluster and its position on an ideological map. These combined analyses create a multilayered network that can be used to identify and characterize ideological camps in post-war Taiwan.
This page is the Supplementary Data of the study.
This research was supported by bilateral grants from the Taiwan Ministry of Science and Technology [106-2923-H-003-001-MY2], granted to the first author, and the Czech Grant Agency [CZ 17-03529J], granted to the second author.
For TEPC text preprocessing, we first tokenised the articles into words with the word segmenter JiebaR (Qin & Wu, 2019). To improve segmentation performance, we supplied a custom dictionary of proper names relevant to the early post-war context. Second, we segmented each article into sentence-like units, using non-word punctuation tokens as boundaries. These units formed the basis for the later extraction of multiword combinations (i.e., n-grams) from each article. Our methodology combined a series of quantitative analyses. All data processing and analyses employed R scripts developed by the first author.
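The unit segmentation and n-gram extraction described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R scripts: the punctuation set, the function names, and the toy token list are all assumptions made for the example, and the tokens are taken to be already produced by a word segmenter.

```python
from collections import Counter

# Hypothetical set of punctuation tokens treated as unit boundaries.
PUNCTUATION = {"，", "。", "！", "？", "；", "：", "、"}


def split_units(tokens):
    """Split a token list into sentence-like units at punctuation tokens."""
    units, current = [], []
    for tok in tokens:
        if tok in PUNCTUATION:
            if current:  # skip empty units between adjacent punctuation
                units.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        units.append(current)
    return units


def extract_ngrams(units, n):
    """Count n-grams that lie entirely inside a single unit."""
    counts = Counter()
    for unit in units:
        for i in range(len(unit) - n + 1):
            counts[tuple(unit[i:i + n])] += 1
    return counts


# Toy, already-tokenised input (illustrative only, not corpus data).
tokens = ["台灣", "文化", "重建", "，", "民主", "文化", "。"]
units = split_units(tokens)
bigrams = extract_ngrams(units, 2)
```

Because n-grams are collected unit by unit, no bigram spans a punctuation boundary: in the toy input, 重建 and 民主 are never paired.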
Exploratory hierarchical cluster analysis identified document clusters through lexical patterns: it grouped texts into different sectors according to similarities in lexical distribution. We used each text’s unigrams and bigrams for cluster analysis and considered both the frequencies of selected lexical types in each document and their distributions across documents, i.e., lexical dispersion. Distributional cut-offs ensured that the included n-grams were representative in terms of frequency and dispersion within and across articles. Specifically, an n-gram was included as a classifying feature if it occurred (a) at least five times and (b) in at least two different articles in the corpus. Very high n-gram dispersion points to grammatical words or sequences with little ideational content. To reduce the impact of such non-informative function words and sequences, we excluded n-grams occurring in more than 80% of TEPC items.
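The three cut-offs can be expressed as a simple filter over per-document n-gram counts. The sketch below is in Python for illustration only (the original analyses were run in R); the function name, parameter names, and toy data are assumptions, while the thresholds are those stated above.

```python
from collections import Counter


def select_features(doc_counts, min_freq=5, min_docs=2, max_doc_share=0.8):
    """Return the n-grams that pass the frequency and dispersion cut-offs.

    doc_counts: one mapping {ngram: count} per document.
    """
    total = Counter()     # corpus frequency of each n-gram
    doc_freq = Counter()  # number of documents containing each n-gram
    for counts in doc_counts:
        total.update(counts)
        doc_freq.update(g for g, c in counts.items() if c > 0)
    n_docs = len(doc_counts)
    return {
        g for g in total
        if total[g] >= min_freq                      # (a) at least five times
        and doc_freq[g] >= min_docs                  # (b) at least two articles
        and doc_freq[g] / n_docs <= max_doc_share    # not in more than 80% of items
    }


# Toy corpus of five documents (illustrative only).
docs = [
    {"a": 3, "the": 1, "rare": 5},
    {"a": 2, "the": 2},
    {"the": 1, "b": 1},
    {"the": 1, "b": 1},
    {"the": 1},
]
features = select_features(docs)
```

In the toy corpus only `"a"` survives: `"the"` appears in every document, `"rare"` occurs in a single document, and `"b"` is too infrequent.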
Having determined a relevant set of n-grams as classifying features of each text, we subjected this document-by-n-gram co-occurrence matrix to hierarchical cluster analysis using Ward’s method. We computed pairwise distances between texts with a correlation-based metric, the cosine distance. To determine the optimal number of clusters, we used the scree-plot method (Kaufman & Rousseeuw, 2005), which suggested that a five-cluster solution would provide the most parsimonious balance between minimising the number of clusters and minimising within-cluster variance. The internal hierarchical structure of the dendrogram indicates how closely these clusters match in terms of shared vocabulary.
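As an illustration of the distance step, the following Python sketch computes pairwise cosine distances between rows of a document-by-n-gram matrix. It is not the authors' R code; the function names and the toy matrix are assumptions, and the subsequent Ward linkage and scree plot are not shown.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(rows):
    """Symmetric matrix of pairwise cosine distances between documents."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(rows[i], rows[j])
    return dist


# Toy document-by-n-gram counts (rows are documents; illustrative only).
rows = [[1, 0], [0, 1], [2, 0]]
dist = distance_matrix(rows)
```

The cosine distance ignores document length: the first and third toy documents use the same vocabulary in the same proportions, so their distance is zero even though their raw counts differ.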
Although the cluster algorithm provided lexically based groupings, it is important to isolate lexical features distinctive of each cluster’s texts. Our second analysis utilised a quantitative corpus-linguistic method – multiple distinctive collexeme analysis (MDCA, Gilquin, 2006; Gries & Stefanowitsch, 2004) – as a robust statistical variant of keyword analysis (Baron et al., 2009; Pojanapunya & Todd, 2018; Rayson, 2008) to determine each cluster’s keywords, i.e., lexical features strongly associated with it. The description of each cluster’s semantic structures and ideological position derives from these keywords.
MDCA was initially proposed by Gries and Stefanowitsch (2004) to identify words that could help effectively differentiate between two or more semantically similar constructions. Words attracted to a particular linguistic pattern are termed collexemes in this research paradigm. The present study extended MDCA to the analysis of distinctive collexemes for five relevant linguistic contexts, i.e., our five clusters of texts. MDCA identified two types of distinctive lexical items in each cluster: (a) n-grams that were strongly attracted to the cluster and (b) n-grams that were repelled by it.
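The logic of attraction and repulsion can be sketched as below. The text above does not state which association statistic was used; this Python illustration assumes a one-tailed exact binomial test with a signed negative log10 p-value, which is one common choice for this kind of analysis. The function names and toy counts are hypothetical, and the sketch is not the authors' R implementation.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def association_score(observed, word_total, cluster_share):
    """Signed -log10 p-value: positive = attracted, negative = repelled.

    Assumption: one-tailed exact binomial test of the n-gram's frequency in
    the cluster against the cluster's share of all counted n-gram tokens.
    """
    expected = word_total * cluster_share
    if observed >= expected:
        ks, sign = range(observed, word_total + 1), 1   # upper tail
    else:
        ks, sign = range(0, observed + 1), -1           # lower tail
    p = sum(binom_pmf(k, word_total, cluster_share) for k in ks)
    p = min(max(p, 1e-300), 1.0)  # guard against underflow and rounding
    return sign * -math.log10(p)


def mdca(cluster_counts):
    """Score every n-gram against every cluster.

    cluster_counts: {cluster: {ngram: frequency}}.
    """
    totals = {c: sum(counts.values()) for c, counts in cluster_counts.items()}
    grand_total = sum(totals.values())
    vocab = set().union(*(counts.keys() for counts in cluster_counts.values()))
    scores = {}
    for gram in vocab:
        word_total = sum(counts.get(gram, 0) for counts in cluster_counts.values())
        scores[gram] = {
            c: association_score(counts.get(gram, 0), word_total, totals[c] / grand_total)
            for c, counts in cluster_counts.items()
        }
    return scores


# Two toy clusters (illustrative only, not TEPC data).
scores = mdca({"A": {"x": 9, "y": 1}, "B": {"x": 1, "y": 9}})
```

Under these assumptions, `"x"` receives a positive score for cluster A (attracted) and a negative score of equal magnitude for cluster B (repelled); ranking n-grams by score within a cluster would then yield that cluster's keyword list.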
Author networks represent more than authors’ relationships and proximity in terms of lexical choices. They also indicate shared worldviews. Having identified groups of authors, we contextualised the positions of these clusters by utilising TBIO for a study of the authors’ backgrounds (i.e., habitus). TBIO was designed by the corresponding author as a biographical database of Taiwanese elites, both of Taiwanese and Mainland origin, for the period 1900–1949. It facilitates analysis of career-path trajectories. Currently 29,108 individuals, 82,936 organisations, and 3,219 positions within these organisations are included.
TBIO is organised around two entities: person and organisation. Organisations are classified according to the societal sectors regularly used for positional analysis. Societal sectors usually comprise politics, public administration, the armed forces, private business, mass media, academia and education, and voluntary associations (Hoffman-Lange, 2018). To reflect the specificity of Taiwanese society, we added cultural organisations (sources of social and artistic prestige) and the medical sector and police (the commonest sources of social prestige under Japanese rule). Each societal sector generates different types of prestige, and this article uses the mapping proxies introduced in Dluhošová (2020). Combinations of these point to particular dispositions of individuals or groups. This helps characterise the social groups propagating certain ideas, combining the study of discourse with sociology.
In our positional analyses, we first identified authors in each network whose biographical records had been included in TBIO and then computed the overall distribution of societal sectors that the cluster’s authors were involved in, normalised by the total number of authors.
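The normalisation step can be illustrated with a short Python sketch (again not the authors' R code). It assumes that each author is counted at most once per sector; the function name, the placeholder author labels, and the toy sector assignments are hypothetical.

```python
from collections import Counter


def sector_profile(author_sectors):
    """Share of a cluster's authors involved in each societal sector.

    author_sectors: {author: iterable of sectors the author was involved in}.
    Assumption: an author counts at most once per sector.
    """
    n_authors = len(author_sectors)
    if n_authors == 0:
        return {}
    counts = Counter()
    for sectors in author_sectors.values():
        counts.update(set(sectors))  # de-duplicate sectors per author
    return {sector: c / n_authors for sector, c in counts.items()}


# Placeholder authors and sectors (illustrative only, not TBIO records).
cluster = {
    "author_1": ["politics", "mass media", "politics"],
    "author_2": ["politics"],
}
profile = sector_profile(cluster)
```

Because one author can be active in several sectors, the shares in a profile need not sum to one; each value is read as the proportion of the cluster's authors with any involvement in that sector.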