Title
Content-based classification of research articles: comparing keyword extraction, BERT, and random forest classifiers
Author
Abstract
The classification of publications into disciplines has multiple applications in scientometrics, from studies of the dynamics of research to the responsible use of research metrics. However, the most common ways to classify publications into disciplines rely on citation data, which are not always available. We therefore compare a set of algorithms that classify publications based on the textual data in their titles and abstracts. The algorithms learn from a training dataset of Web of Science (WoS) articles that, after mapping their subject categories to the OECD FORD classification scheme, have exactly one assigned discipline. We present different implementations of the Random Forest algorithm, evaluate a BERT-based classifier, and introduce a keyword-based methodology for comparison. We find that the BERT classifier performs best, with an accuracy of 0.7 when predicting the discipline and an accuracy of 0.91 for the true discipline appearing in the top 3. Additionally, we present confusion matrices indicating that misclassified articles are frequently assigned to disciplines similar to their true ones. We conclude that, overall, Random Forest-based methods offer a good compromise between interpretability and performance, and are also the fastest to execute.
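As a rough illustration of the keyword-based baseline mentioned in the abstract, the sketch below assigns a text to the discipline whose keyword set it overlaps most. The discipline names, keyword lists, and function names are hypothetical examples, not taken from the paper:

```python
# Hypothetical sketch of a keyword-based discipline classifier (illustrative
# only, not the authors' actual method): each discipline is described by a
# small keyword set, and a text is ranked by keyword overlap.

DISCIPLINE_KEYWORDS = {
    "Computer science": {"algorithm", "software", "neural", "computing"},
    "Medicine": {"patient", "clinical", "disease", "treatment"},
    "Physics": {"quantum", "particle", "relativity", "photon"},
}

def classify(text):
    """Return disciplines ranked by keyword overlap with the text."""
    tokens = set(text.lower().split())
    scores = {d: len(kw & tokens) for d, kw in DISCIPLINE_KEYWORDS.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = classify("A neural network algorithm for clinical image analysis")
print(ranking[0])  # top-1 prediction: "Computer science" (2 keyword hits)
```

A ranking like this also supports the top-3 evaluation reported in the abstract: the prediction counts as correct if the true discipline appears among the first three entries.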
Language
English
Source (book)
ISSI 2023: the 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI2023), 2-5 July, 2023, Bloomington, Indiana
Source (series)
ISSI Conference Proceedings
Publication
2023
DOI
10.5281/zenodo.8305874
Volume/pages
1 (2023), p. 43-63