Weigh your words : memory-based lemmatization for Middle Dutch

Kestemont, Mike; Daelemans, Walter; De Pauw, Guy

doi:10.1093/LLC/FQQ011

Abstract

This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes computational analysis of this kind of data difficult. Lemmatization is therefore an essential preprocessing step in many applications, since it abstracts away from superficial textual variation, for instance in spelling. Our data is the Corpus-Gysseling, which contains all surviving Middle Dutch literary manuscripts dated before 1300 AD. We present a language-independent system that can learn intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants; the second seeks to implement a novel string distance metric to better detect spelling variants. The latter system attempts to rerank candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and marks a substantial step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques might be of interest to other research domains as well.
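The candidate-retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It covers only the classic Levenshtein baseline: known spelling forms are ranked by edit distance to an unseen token, and their lemmas become candidates. The memory-based reranking and the novel distance metric from the article are not reproduced here. The names `levenshtein`, `candidate_lemmas` and `TOY_LEXICON` are hypothetical, and the word forms are invented toy data, not entries from the Corpus-Gysseling.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical toy lexicon mapping attested spellings to lemmas (illustrative only).
TOY_LEXICON: Dict[str, str] = {
    "coninc": "koning",
    "koninc": "koning",
    "lant": "land",
    "landt": "land",
    "here": "heer",
}


def candidate_lemmas(
    token: str, lexicon: Dict[str, str], k: int = 3
) -> List[Tuple[str, str, int]]:
    """Return the k known forms closest to `token` as (form, lemma, distance)."""
    scored = [(form, lemma, levenshtein(token, form)) for form, lemma in lexicon.items()]
    # Sort by distance first, then alphabetically so that ties are deterministic.
    scored.sort(key=lambda item: (item[2], item[0]))
    return scored[:k]


if __name__ == "__main__":
    for form, lemma, distance in candidate_lemmas("conync", TOY_LEXICON, k=2):
        print(form, lemma, distance)
```

In this sketch, a reranker of the kind the abstract describes would take the list returned by `candidate_lemmas` and reorder it with a learned measure instead of relying on raw edit distance.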

Language

English

Source (journal)

Literary and linguistic computing. - Oxford, 1986 - 2014

Publication

Oxford : 2010

ISSN

0268-1145 [print]

1477-4615 [online]

DOI

10.1093/LLC/FQQ011

Volume/pages

25:3 (2010), p. 287-301

ISI

000284758900002

Full text (Publisher's DOI)

https://doi.org/10.1093/LLC/FQQ011

Full text (publisher's version - intranet only)

https://repository.uantwerpen.be/docman/iruaauth/3fb663/eced7745754.pdf

Faculty/Department

Faculty of Arts. Linguistics

Faculty of Arts. Literature

Research group

Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS)

Antwerp Centre for Digital humanities and literary Criticism (ACDC)

Publication type

A1 Journal article

Subject

Computer. Automation

Linguistics

Affiliation

Publications with a UAntwerp address

Identifier

Creation

17.03.2012

Last edited

25.05.2022

To cite this reference

https://hdl.handle.net/10067/965130151162165141