Multimodular text normalization of Dutch user-generated content

Schulz, Sarah; De Pauw, Guy; De Clercq, Orphee; Desmet, Bart; Hoste, Veronique; Daelemans, Walter; Macken, Lieve

doi:10.1145/2850422

Title

Multimodular text normalization of Dutch user-generated content

Author

Schulz, Sarah

De Pauw, Guy

De Clercq, Orphee

Desmet, Bart

Hoste, Veronique

Daelemans, Walter

Macken, Lieve

Abstract

As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.

Language

English

Source (journal)

ACM Transactions on Intelligent Systems and Technology (TIST)

Publication

2016

ISSN

2157-6904

2157-6912

DOI

10.1145/2850422

Volume/pages

7:4 (2016), p. 1-22

Article Reference

61

ISI

000380322200018

Medium

E-only publication

Full text (Publisher's DOI)

https://doi.org/10.1145/2850422

Full text (publisher's version - intranet only)

https://repository.uantwerpen.be/docman/iruaauth/1cc362/135024.pdf

Faculty/Department

Faculty of Arts. Linguistics

Research group

Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS)

Project info

PARIS - Personalised advertisements built from web sources.

Publication type

A1 Journal article

Subject

Computer. Automation

Affiliation

Publications with a UAntwerp address

Creation

02.09.2016

Last edited

09.10.2023

To cite this reference

https://hdl.handle.net/10067/1350240151162165141