Publication
Title
Content-based classification of research articles : comparing keyword extraction, BERT, and random forest classifiers
Author
Abstract
The classification of publications into disciplines has multiple applications in scientometrics – from contributing to further studies of the dynamics of research to allowing responsible use of research metrics. However, the most common ways to classify publications into disciplines rely on citation data, which is not always available. We therefore compare a set of algorithms that classify publications based on the text of their abstracts and titles. The algorithms learn from a training dataset of Web of Science (WoS) articles that, after their subject categories are mapped to the OECD FORD classification schema, have only one assigned discipline. We present different implementations of the Random Forest algorithm, evaluate a BERT-based classifier, and introduce a keyword-based methodology for comparison. We find that the BERT classifier performs best, with an accuracy of 0.7 when predicting the discipline and an accuracy of 0.91 for the “real” discipline being among the top 3 predictions. Additionally, we present confusion matrices indicating that misclassified articles are frequently assigned to disciplines similar to the “real” ones. We conclude that, overall, Random Forest-based methods offer a compromise between interpretability and performance, and are also the fastest to execute.
Language
English
Source (book)
ISSI 2023: the 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI2023), 2-5 July, 2023, Bloomington, Indiana
Source (series)
ISSI Conference Proceedings
Publication
2023
DOI
10.5281/ZENODO.8305874
Volume/pages
1 (2023) , p. 43-63
UAntwerpen
Affiliation
Publications with a UAntwerp address
Record
Creation 22.01.2024
Last edited 17.06.2024