Publication
Title
Exploring the SAWA corpus : collection and deployment of a parallel corpus English-Swahili
Author
Abstract
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English-Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
Language
English
Source (journal)
Language resources and evaluation. - New York, N.Y., 2005, currens
Publication
New York, N.Y. : 2011
ISSN
1574-020X [print]
1574-0218 [online]
DOI
10.1007/S10579-011-9159-7
Volume/pages
45:3 (2011), p. 331-344
ISI
000293709900005
UAntwerpen
Faculty/Department
Research group
Publication type
Subject
Affiliation
Publications with a UAntwerp address
Record
Identifier
Creation 09.09.2011
Last edited 15.11.2022