Publication
Title
Weigh your words : memory-based lemmatization for Middle Dutch
Author
Abstract
This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes computational analysis of this kind of data difficult. Lemmatization is therefore an essential preprocessing step in many applications, since it allows abstraction from superficial textual variation, for instance in spelling. The data we work with is the Corpus-Gysseling, containing all surviving Middle Dutch literary manuscripts dated before 1300 AD. In this article we present a language-independent system that can learn intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants; the second seeks to implement a novel string distance metric to better detect spelling variants. The latter system attempts to rerank candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a substantial step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques might be of interest to other research domains as well.
Language
English
Source (journal)
Literary and linguistic computing. - Oxford, 1986 - 2014
Publication
Oxford : 2010
ISSN
0268-1145 [print]
1477-4615 [online]
Volume/pages
25:3(2010), p. 287-301
ISI
000284758900002
Record
Identification
Creation 17.03.2012
Last edited 16.07.2017