Title
Multimodular text normalization of Dutch user-generated content Multimodular text normalization of Dutch user-generated content
Author
Faculty/Department
Faculty of Arts. Linguistics and Literature
Publication type
article
Publication
Subject
Computer. Automation
Source (journal)
ACM Transactions on Intelligent Systems and Technology (TIST)
Volume/pages
7(2016) :4 , 22 p.
ISSN
2157-6904
2157-6912
2157-6904
Article Reference
61
Carrier
E
Target language
English (eng)
Full text (Publishers DOI)
Affiliation
University of Antwerp
Abstract
As social media constitutes a valuable source for data analysis for a wide range of applications, the need for handling such data arises. However, the nonstandard language used on social media poses problems for natural language processing (NLP) tools, as these are typically trained on standard language material. We propose a text normalization approach to tackle this problem. More specifically, we investigate the usefulness of a multimodular approach to account for the diversity of normalization issues encountered in user-generated content (UGC). We consider three different types of UGC written in Dutch (SNS, SMS, and tweets) and provide a detailed analysis of the performance of the different modules and the overall system. We also apply an extrinsic evaluation by evaluating the performance of a part-of-speech tagger, lemmatizer, and named-entity recognizer before and after normalization.
E-info
http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000380322200018&DestLinkType=RelatedRecords&DestApp=ALL_WOS&UsrCustomerID=ef845e08c439e550330acc77c7d2d848
http://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcApp=PARTNER_APP&SrcAuth=LinksAMR&KeyUT=WOS:000380322200018&DestLinkType=FullRecord&DestApp=ALL_WOS&UsrCustomerID=ef845e08c439e550330acc77c7d2d848
https://repository.uantwerpen.be/docman/iruaauth/1cc362/135024.pdf
Handle