Abstract:
This article asks whether the BERTopic topic modeling framework can be used to obtain topics that meaningfully distinguish two corpora of Buddhist Chinese texts dating from 500 to 800 CE. The first corpus consists of translated "Indian-Chinese" Buddhist texts, the second of “Chinese-Chinese” texts, i.e., texts authored directly in Buddhist Chinese. Does topic modeling reveal aspects that are typical of these corpora, and do the resulting topics suggest avenues for future research into the sinicization of Buddhism that took place during that time? For our implementation of BERTopic, we used a customized GuwenBERT, a language model trained on classical Chinese, to produce the embeddings. To reduce the dimensionality of the embeddings, we used the UMAP algorithm. The HDBSCAN algorithm then clusters the reduced embeddings hierarchically, and the most relevant words of each cluster are identified with c-TF-IDF. As a last step, we score each cluster by its monochromaticity, a measure of how likely the documents in the cluster are to derive from just the Chinese-Chinese or just the Indian-Chinese documents. To communicate the topics, we create virtual paragraphs that combine most of the top twenty terms representing a sample of ten highly monochromatic topics. Discussing these topics from a Buddhist Studies point of view, we find that our modified BERTopic workflow does indeed return topics that are characteristic of their corpus and highlights facets that help explain how Buddhism became sinicized in the three centuries between 500 and 800 CE. Distant reading of latent topics in the corpus is thus possible. While some topics are unsurprising in themselves, others point to promising new areas for research.
Authors: Marcus Bingenheimer (Temple University), Justin Brody (Franklin and Marshall College), Ryan Nichols (California State University, Fullerton)
Publication: Digital Humanities Quarterly, Vol. 19, No. 1 (2025)
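
The monochromaticity scoring mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition used here (the share of a cluster's documents that come from its majority corpus) is an assumption, as are the function names, the "CC"/"IC" corpus labels, and the toy data. The only library-specific convention relied on is that HDBSCAN assigns the label -1 to documents it treats as noise.

```python
from collections import Counter


def monochromaticity(labels):
    """Assumed definition: fraction of a cluster's documents that belong
    to the cluster's majority corpus (1.0 = drawn from a single corpus)."""
    if not labels:
        raise ValueError("cluster has no documents")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def score_clusters(cluster_ids, corpus_labels, noise_id=-1):
    """Group documents by cluster id and score each cluster.
    Documents labelled as noise (-1 in HDBSCAN output) are skipped."""
    clusters = {}
    for cid, label in zip(cluster_ids, corpus_labels):
        if cid == noise_id:
            continue
        clusters.setdefault(cid, []).append(label)
    return {cid: monochromaticity(labels) for cid, labels in clusters.items()}


# Toy data (hypothetical): "CC" = Chinese-Chinese, "IC" = Indian-Chinese.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, -1]
corpus_labels = ["CC", "CC", "CC", "IC", "CC", "IC", "CC", "IC"]
scores = score_clusters(cluster_ids, corpus_labels)
print(scores)
```

Under this assumed definition, a cluster drawn entirely from one corpus scores 1.0 and an evenly mixed cluster scores 0.5, so ranking clusters by score would surface the topics most characteristic of a single corpus.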