Title
Content-based classification of research articles: comparing keyword extraction, BERT, and random forest classifiers
Author
Abstract
The classification of publications into disciplines has multiple applications in scientometrics, from studies of the dynamics of research to the responsible use of research metrics. However, the most common ways to classify publications into disciplines rely on citation data, which are not always available. We therefore compare a set of algorithms that classify publications based on the textual data in their titles and abstracts. The algorithms learn from a training dataset of Web of Science (WoS) articles that, after mapping their subject categories to the OECD FORD classification scheme, have exactly one assigned discipline. We present different implementations of the Random Forest algorithm, evaluate a BERT-based classifier, and introduce a keyword-based methodology for comparison. We find that the BERT classifier performs best, with an accuracy of 0.7 when predicting the discipline and an accuracy of 0.91 for the true discipline appearing in the top 3. Additionally, we present confusion matrices indicating that misclassified articles are frequently assigned to disciplines similar to their true ones. We conclude that, overall, Random Forest-based methods offer a good compromise between interpretability and performance, and are also the fastest to execute.
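As a rough illustration of the keyword-based baseline mentioned in the abstract, the sketch below assigns a text to the discipline whose keyword set it overlaps most. The discipline names, keyword lists, and function names are hypothetical examples, not taken from the paper:

```python
# Hypothetical sketch of a keyword-based discipline classifier (illustrative
# only, not the authors' actual method): each discipline is described by a
# small keyword set, and a text is ranked by keyword overlap.

DISCIPLINE_KEYWORDS = {
    "Computer science": {"algorithm", "software", "neural", "computing"},
    "Medicine": {"patient", "clinical", "disease", "treatment"},
    "Physics": {"quantum", "particle", "relativity", "photon"},
}

def classify(text):
    """Return disciplines ranked by keyword overlap with the text."""
    tokens = set(text.lower().split())
    scores = {d: len(kw & tokens) for d, kw in DISCIPLINE_KEYWORDS.items()}
    return sorted(scores, key=scores.get, reverse=True)

ranking = classify("A neural network algorithm for clinical image analysis")
print(ranking[0])  # top-1 prediction: "Computer science" (2 keyword hits)
```

A ranking like this also supports the top-3 evaluation reported in the abstract: the prediction counts as correct if the true discipline appears among the first three entries.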
Language
English
Source (book)
ISSI 2023: the 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI2023), 2-5 July, 2023, Bloomington, Indiana
Source (series)
ISSI Conference Proceedings
Publication
2023
DOI
10.5281/zenodo.8305874
Volume/pages
1 (2023), p. 43-63