The Netlog corpus : a resource for the study of Flemish Dutch internet language

Kestemont, Mike; Peersman, Claudia; De Decker, Benny; De Pauw, Guy; Luyckx, Kim; Morante, Roser; Vaassen, Frederik; van de Loo, Janneke; Daelemans, Walter

Abstract

Although numerous forms of Internet communication, such as e-mail, blogs, chat rooms and social network environments, have emerged in recent years, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, analyzing such a corpus on a large scale requires NLP tools, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language, we propose to normalize this 'anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.

Language

English

Source (book)

Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12) / Calzolari, Nicoletta [edit.]; et al.

Publication

Istanbul : European Language Resources Association, 2012

ISBN

978-2-9517408-7-7

Volume/pages

p. 1569-1572

ISI

000323927701108

Faculty/Department

Faculty of Arts. Applied Linguistics; Faculty of Arts. Linguistics; Faculty of Arts. Literature

Research group

Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS); Antwerp Centre for Digital humanities and literary Criticism (ACDC); Translation, Interpreting and Intercultural Studies (TricS)

Project info

A Safer Internet: (Semi)automatically Recognizing Internet Paedophilia in Multilingual Online Social Networks.

Publication type

P1 Proceeding

Subject

Linguistics

Affiliation

Publications with a UAntwerp address

Creation

31.05.2012

Last edited

09.10.2023

To cite this reference

https://hdl.handle.net/10067/981230151162165141