Overview

Methods

Abstract

(Last updated on 2021-09-03)

Concepts in Contexts: Discourse-based Semantic Networks of Ideologies in Taiwan (1945–1949)

Alvin Cheng-Hsien Chen, National Taiwan Normal University

Táňa Dluhošová*, Oriental Institute of the Czech Academy of Sciences

*Corresponding Author: Táňa Dluhošová, Oriental Institute, Academy of Sciences of the Czech Republic. Email: dluhosova@oient.cas.cz

This study examines the semantic development of ideologically loaded key terms in Taiwanese cultural discourse from a corpus-based inductive perspective, contextualizing them ideologically and socially. We investigated lexical patterns in the Taiwan Early Post-war Corpus (TEPC), which consists of culturally oriented articles from the period 1945–1949. We used hierarchical cluster analysis of lexical distributions to identify text subgroups. Statistically defined keywords within clusters helped characterize post-war ideological undercurrents. Network analyses revealed interrelationships of these cluster-based statistical keywords, uncovering three ideologically inclined cluster-specific semantic fields. Additionally, we examined networks associated with the authors and periodicals in each cluster through shared keyword usage, emphasizing connections to cultural policies of the central and local governments and the propagation of Taiwanese subjectivity. Finally, with the help of the Taiwan Biographical Ontology (TBIO), we complemented ideological patterns with positional analyses of authors’ social involvement, determining the predispositions of each cluster and its position on an ideological map. These combined analyses create a multilayered network that can be used to identify and characterize ideological camps in post-war Taiwan.

This page presents the supplementary data for the study.

Funding Acknowledgements

This research was supported by bilateral grants from the Taiwan Ministry of Science and Technology [106-2923-H-003-001-MY2], awarded to the first author, and the Czech Grant Agency [CZ 17-03529J], awarded to the second author.

Flowchart

Quantitative Methods

To preprocess the TEPC texts, we first tokenised each article into words with the word segmenter JiebaR (Qin & Wu, 2019). To improve segmentation performance, we supplied a custom dictionary that included proper names from the early post-war context. Second, we segmented each article into sentence-like units, using non-word punctuation tokens as boundaries. These units formed the basis for the later extraction of multiword combinations (i.e., n-grams) from each article. Our methodology combined a series of quantitative analyses. All data processing and analyses employed R scripts developed by the first author.
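The unit segmentation and n-gram extraction described above can be sketched as follows. This is a minimal Python illustration rather than the authors' R scripts: it assumes the article has already been tokenised by a word segmenter, and the punctuation pattern `BOUNDARY`, the helper names `split_units` and `ngrams`, and the toy tokens are all hypothetical.

```python
import re

# Hypothetical set of non-word punctuation tokens treated as unit boundaries.
BOUNDARY = re.compile(r"^[，。！？；：、,.!?;:]+$")

def split_units(tokens):
    """Split a list of word tokens into sentence-like units at punctuation tokens."""
    units, current = [], []
    for tok in tokens:
        if BOUNDARY.match(tok):
            if current:
                units.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        units.append(current)
    return units

def ngrams(units, n):
    """Collect n-grams inside each unit; no n-gram crosses a unit boundary."""
    return [tuple(u[i:i + n]) for u in units for i in range(len(u) - n + 1)]

# Toy article, already segmented into tokens (assumed output of a word segmenter).
tokens = ["臺灣", "文化", "運動", "，", "民主", "政治", "。"]
units = split_units(tokens)
unigrams = ngrams(units, 1)
bigrams = ngrams(units, 2)
```

Because n-grams are built unit by unit, a bigram such as (運動, 民主) that would straddle the comma is never produced.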

Cluster Analysis

Exploratory hierarchical cluster analysis identified document clusters through lexical patterns: it grouped texts into different sectors according to similarities in lexical distribution. We used each text’s unigrams and bigrams for cluster analysis and considered both the frequencies of selected lexical types in each document and their distributions across documents, i.e., lexical dispersion. Distributional cut-offs ensured that the included n-grams were representative in terms of frequency and dispersion within and across articles. Specifically, an n-gram was included as a classifying feature if it occurred (a) at least five times and (b) in at least two different articles in the corpus. N-grams with very high dispersion tend to be grammatical words or sequences with little ideational content. To reduce the impact of such non-informative function words and sequences, we excluded n-grams occurring in more than 80% of TEPC items.
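The three cut-offs (corpus frequency of at least five, presence in at least two articles, and presence in no more than 80% of articles) can be expressed as a simple filter. The sketch below is an illustration in Python under the assumption that each article is represented as a list of its n-grams; the function name `select_features` and the toy data are hypothetical.

```python
from collections import Counter

def select_features(docs, min_freq=5, min_docs=2, max_doc_prop=0.8):
    """docs: one list of n-grams per article. Returns the n-grams kept as features."""
    total = Counter()     # corpus frequency of each n-gram
    doc_freq = Counter()  # number of articles containing each n-gram
    for grams in docs:
        total.update(grams)
        doc_freq.update(set(grams))
    n_docs = len(docs)
    return sorted(
        g for g in total
        if total[g] >= min_freq                  # (a) at least five occurrences
        and doc_freq[g] >= min_docs              # (b) in at least two articles
        and doc_freq[g] / n_docs <= max_doc_prop # not in more than 80% of articles
    )

# Toy corpus of five articles.
docs = [
    ["的"] * 3 + ["民主"] * 3 + ["文化"] * 5,
    ["的"] * 2 + ["民主"] * 2,
    ["的"] * 2 + ["自治"],
    ["的"],
    ["的"],
]
features = select_features(docs)
```

Here 的 is dropped because it appears in every article, 文化 because it is confined to a single article, and 自治 because it is too rare; only 民主 survives.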

Having determined a relevant set of n-grams as classifying features of each text, we subjected this document-by-n-gram co-occurrence matrix to hierarchical cluster analysis using Ward’s method. Pairwise distances between texts were computed with a correlation-based metric, the cosine distance. To determine the optimal number of clusters, we used the scree-plot method (Kaufman & Rousseeuw, 2005), which suggested that a five-cluster solution would provide the most parsimonious balance between a small number of clusters and low within-cluster variance. The internal hierarchical structure of the dendrogram indicates how closely these clusters match in terms of shared vocabulary.
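To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering on cosine distances with a Ward-style Lance-Williams update. It is not the authors' R code: the choice to apply the update to squared dissimilarities is an assumption (the source does not say which Ward variant was used), and the function names and the toy matrix are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def ward_merges(rows):
    """Return the merge sequence as (members_a, members_b, height) tuples."""
    clusters = {i: [i] for i in range(len(rows))}
    # Squared pairwise dissimilarities, keyed by (smaller id, larger id).
    d2 = {(i, j): cosine_distance(rows[i], rows[j]) ** 2
          for i in clusters for j in clusters if i < j}
    merges, next_id = [], len(rows)
    while len(clusters) > 1:
        (a, b), best = min(d2.items(), key=lambda item: item[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            dak = d2[(min(a, k), max(a, k))]
            dbk = d2[(min(b, k), max(b, k))]
            # Lance-Williams update for Ward's criterion.
            updated[(k, next_id)] = (
                (na + nk) * dak + (nb + nk) * dbk - nk * best
            ) / (na + nb + nk)
        merges.append((sorted(clusters[a]), sorted(clusters[b]),
                       math.sqrt(max(best, 0.0))))
        d2 = {pair: v for pair, v in d2.items() if a not in pair and b not in pair}
        d2.update(updated)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Toy document-by-n-gram matrix: rows are texts, columns are n-gram frequencies.
doc_by_ngram = [[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 3], [0, 0, 1, 4]]
merges = ward_merges(doc_by_ngram)
```

Plotting the merge heights against the number of remaining clusters gives the kind of scree plot used to choose a cut; in this toy case the large jump at the final merge points to two clusters.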

Keyword Analysis: Semantic Coherence of the Clusters

Although the cluster algorithm provided lexically based groupings, it is important to isolate lexical features distinctive of each cluster’s texts. Our second analysis utilised a quantitative corpus-linguistic method – multiple distinctive collexeme analysis (MDCA, Gilquin, 2006; Gries & Stefanowitsch, 2004) – as a robust statistical variant of keyword analysis (Baron et al., 2009; Pojanapunya & Todd, 2018; Rayson, 2008) to determine each cluster’s keywords, i.e., lexical features strongly associated with it. The description of each cluster’s semantic structures and ideological position derives from these keywords.

MDCA was initially proposed by Gries and Stefanowitsch (2004) to identify words that could help effectively differentiate between two or more semantically similar constructions. Words attracted to a particular linguistic pattern are termed collexemes in this research paradigm. The present study extended MDCA to the analysis of distinctive collexemes for five relevant linguistic contexts, i.e., our five clusters of texts. MDCA identified two types of distinctive lexical items in each cluster: (a) n-grams that were strongly attracted to the cluster and (b) n-grams that were repelled by it.
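One common way to quantify attraction and repulsion in MDCA is an exact binomial test of an item's observed frequency in each context against the frequency expected from that context's share of the data, reported as a signed log-transformed p-value. The sketch below follows that formulation as an assumption; the source does not specify the exact statistic, and the function names and toy frequencies are hypothetical.

```python
import math

def binom_tail(k, n, p, upper):
    """P(X >= k) if upper else P(X <= k), for X ~ Binomial(n, p)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in ks)

def distinctiveness(freqs, cluster_sizes):
    """freqs: one n-gram's frequency in each cluster.
    cluster_sizes: total number of tokens in each cluster.
    Returns signed -log10(p) per cluster: positive = attracted, negative = repelled."""
    n = sum(freqs)
    total = sum(cluster_sizes)
    scores = []
    for observed, size in zip(freqs, cluster_sizes):
        p = size / total              # cluster's share of the corpus
        attracted = observed >= n * p # compare with the expected frequency
        pval = binom_tail(observed, n, p, upper=attracted)
        score = -math.log10(pval)
        scores.append(score if attracted else -score)
    return scores

# Toy example: an n-gram occurring 18, 1 and 1 times in three equal-sized clusters.
scores = distinctiveness([18, 1, 1], [1000, 1000, 1000])
```

In this toy case the n-gram is strongly attracted to the first cluster and repelled by the other two. With corpus-scale counts the plain summation would be slow and could underflow, so a statistics library would be the practical choice.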

Network Analyses: Keywords, Authors, and Periodicals

After quantitatively determining distinctive keywords for each cluster through MDCA, our third analysis utilised network science (Newman, 2010; Barabási, 2016) to visualise the semantic structures of each cluster in terms of keyword associations. Additionally, as each text was connected to a writer or periodical, we constructed author and periodical networks for each text cluster based on the shared use of distinctive keywords to examine the intricacies of their interrelationships.

A network consists of nodes, with edges connecting nodes that are associated by some parametric factor. Each cluster’s keyword network represents the co-occurrence of each distinctive collexeme with each text. If two keywords co-occurred more often in similar sets of articles, they would also share a stronger connection in the network. For keyword networks, we included the top thirty distinctive keywords of each cluster according to their respective distinctiveness values from MDCA. Keywords formed an edge if they co-occurred in a certain proportion of texts in the cluster. Our cut-off proportion was 0.7 for keyword networks. Similar mechanisms were applied to the creation of the author and periodical networks.
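The edge rule can be illustrated with a short Python sketch. The source states only that keywords were linked when they co-occurred in a certain proportion of texts, with a cut-off of 0.7, so the overlap measure used here (shared texts divided by the number of texts containing the rarer keyword) is an assumption, as are the function name `keyword_edges` and the romanised toy keywords.

```python
from itertools import combinations

def keyword_edges(keyword_docs, cutoff=0.7):
    """keyword_docs: dict mapping keyword -> set of text ids in the cluster containing it.
    Returns a dict mapping (keyword_a, keyword_b) -> edge weight."""
    edges = {}
    for a, b in combinations(sorted(keyword_docs), 2):
        shared = keyword_docs[a] & keyword_docs[b]
        smaller = min(len(keyword_docs[a]), len(keyword_docs[b]))
        # Assumed overlap measure: share of the rarer keyword's texts
        # that also contain the other keyword.
        weight = len(shared) / smaller if smaller else 0.0
        if weight >= cutoff:
            edges[(a, b)] = weight
    return edges

# Toy cluster: three keywords and the ids of the texts in which each occurs.
keyword_docs = {
    "minzhu": {1, 2, 3, 4},
    "zizhi": {1, 2, 3},
    "wenyi": {4, 5, 6},
}
edges = keyword_edges(keyword_docs)
```

The resulting weights can serve as edge widths in an undirected weighted graph; the same scheme carries over to authors or periodicals by replacing text ids with the keywords each author or periodical uses.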

The author network was based on the co-occurrence of each author in the cluster with the distinctive keywords. If two authors often used similar sets of distinctive keywords, they would share a stronger link in the network. Authors were included as nodes if their keyword uses exceeded a minimum threshold, set between 20 and 40 depending on the size of the cluster. Edges between authors were created if the proportion of keyword uses they shared exceeded a cut-off (ranging from 0.7 to 0.9, depending on the complexity of the network). These author-based networks indicate agents’ prominence within clusters. Similarly, periodical networks were based on co-occurrences of each periodical appearing in a cluster with distinctive keywords.

We present network analyses of keywords, authors, and periodicals, and discuss how our findings clarify the dynamics of the literary and intellectual landscape. All networks presented in our analysis were undirected weighted graphs in which edge widths were determined by normalised keyword proportions. Node sizes were proportional to their degree values. Communities in each network were identified using the fast-greedy algorithm of the library igraph in R.

Positional Analysis

Author networks represent more than authors’ relationships and proximity in terms of lexical choices. They also indicate shared worldviews. Having identified groups of authors, we contextualised the positions of these clusters by utilising TBIO for a study of the authors’ backgrounds (i.e., habitus). TBIO was designed by the corresponding author as a biographical database of Taiwanese elites, both of Taiwanese and Mainland origin, for the period 1900–1949. It facilitates analysis of career-path trajectories. Currently 29,108 individuals, 82,936 organisations, and 3,219 positions within these organisations are included.

TBIO is organised around two entities: person and organisation. Organisations are classified according to societal sectors regularly used for positional analysis. Societal sectors usually comprise politics, public administration, the armed forces, private business, mass media, academia and education, and voluntary associations (Hoffman-Lange, 2018). To reflect the specificity of Taiwanese society, we added cultural organisations (sources of social and artistic prestige) and the medical sector and police (the commonest sources of social prestige under Japanese rule). Each societal sector generates different types of prestige, and the article uses the mapping proxies introduced in Dluhošová (2020). Combinations of these point to particular dispositions of individuals or groups. This helps characterise social groups propagating certain ideas, combining the study of discourse with sociology.

In our positional analyses, we first identified authors in each network whose biographical records had been included in TBIO and then computed the overall distribution of societal sectors that the cluster’s authors were involved in, normalised by the total number of authors.
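The normalisation step can be sketched as follows. This Python illustration assumes each author found in TBIO is mapped to the set of societal sectors in which they held positions; the function name `sector_distribution`, the author labels, and the toy sector assignments are hypothetical and are not drawn from TBIO.

```python
from collections import Counter

def sector_distribution(author_sectors):
    """author_sectors: dict mapping author -> set of societal sectors they were involved in.
    Returns the proportion of the cluster's authors involved in each sector."""
    counts = Counter()
    for sectors in author_sectors.values():
        counts.update(sectors)
    n_authors = len(author_sectors)
    return {sector: counts[sector] / n_authors for sector in sorted(counts)}

# Toy cluster of four authors with hypothetical sector memberships.
cluster_authors = {
    "author_a": {"politics", "mass media"},
    "author_b": {"mass media"},
    "author_c": {"academia and education", "mass media"},
    "author_d": {"politics"},
}
distribution = sector_distribution(cluster_authors)
```

Because one author can be active in several sectors, the proportions need not sum to one; each value reads as the share of the cluster's authors involved in that sector.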

TEPC

Distributions by Periodicals

TEPC Distribution by Periodicals

Distributions by Authors

TEPC Distribution by Authors

TEPC Descriptive Statistics

Corpus size (characters): 1648644

Number of texts: 1168

Number of words: 813929

Number of chars/text: 1411.51

Number of words/text: 696.857

Number of periodicals (types): 25

Number of authors (types): 694

HCL

Scree Plot; Cluster Size; Dendrogram; Sub-Cluster Trees (Sub-Clusters 1–5)

Collexemes

Dendrogram; Collexemes of Two Super-Clusters; Collexemes of Five Clusters

Lexical Networks

Cluster Dendrogram; Clusters 1–5

Author Networks

Community Tabs (Clusters 1–5); Author Tabs (Clusters 1–5)

Periodical Networks

Cluster Dendrogram; Clusters 1–5

Positional Analysis

Proportion of Authors Included in TBIO; Clusters 3–5