The effect of author set size and data size in authorship attribution

Luyckx, Kim; Daelemans, Walter

doi:10.1093/LLC/FQQ013

Abstract

Applications of authorship attribution 'in the wild' [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010, doi:10.1007/s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning to discriminate between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and the size of the training data decreases, although the machine learning approach continues to perform significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.

Language

English

Source (journal)

Literary and linguistic computing. - Oxford, 1986 - 2014

Publication

Oxford : 2011

ISSN

0268-1145 [print]

1477-4615 [online]

DOI

10.1093/LLC/FQQ013

Volume/pages

26:1 (2011), p. 35-55

ISI

000288801500005

Full text (Publisher's DOI)

https://doi.org/10.1093/LLC/FQQ013

Faculty/Department

Faculty of Arts. Linguistics

Research group

Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS)

Publication type

A1 Journal article

Subject

Computer. Automation

Linguistics

Literature

Affiliation

Publications with a UAntwerp address

Identifier

Creation

29.03.2011

Last edited

15.11.2022

To cite this reference

https://hdl.handle.net/10067/876360151162165141