Publication
Title
A benchmarking study of classification techniques for behavioral data
Author
Abstract
The predictive power of increasingly common large-scale, behavioral data has been demonstrated by previous research. Such data capture human behavior through the actions and/or interactions of people. Their sparsity and ultra-high dimensionality pose significant challenges to state-of-the-art classification techniques. Moreover, no prior work has systematically explored the choice of methods with respect to the trade-off between classification performance and computational expense. This paper provides a contribution in this direction through a benchmarking study. Eleven classification models are compared on forty-one fine-grained behavioral data sets. Statistical performance comparisons enriched with learning curve analyses demonstrate two important findings. First, there is an inherent generalization performance versus time trade-off, rendering the choice of an appropriate classifier dependent on computation constraints and data set characteristics. Well-regularized logistic regression achieves the best AUC; however, it takes the longest time to train. L2 regularization performs better than sparse L1 regularization. An attractive generalization/time trade-off is achieved by a similarity-based technique. Second, although the data sets used are large, the learning curve results illustrate that as a direct consequence of their high dimensionality and sparseness, significant value lies in collecting and analyzing even more data. This finding is observed both in the instance and in the feature dimensions, contrasting with learning curve studies on traditional data. The results of this study provide guidance for researchers and practitioners for the selection of appropriate classification techniques, sample sizes and data features, while also providing focus in scalable algorithm design in the face of large, behavioral data.
Language
English
Source (journal)
International journal of data science and analytics. - Cham, 2016, currens
Publication
Cham : Springer , 2020
ISSN
2364-415X [print]
2364-4168 [online]
DOI
10.1007/S41060-019-00185-1
Volume/pages
9 :2 (2020) , p. 131-174
ISI
000590211300001
UAntwerpen
Project info
Digitalisation and Tax (DigiTax).
Big Data Mining for Customer Analytics.
Affiliation
Publications with a UAntwerp address
Record
Creation 24.04.2019
Last edited 08.11.2024