Exploring the SAWA corpus : collection and deployment of a parallel corpus English-Swahili

De Pauw, Guy; Waiganjo Wagacha, Peter; de Schryver, Gilles-Maurice

doi:10.1007/S10579-011-9159-7

Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English-Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.

Language

English

Source (journal)

Language resources and evaluation. - New York, N.Y., 2005, currens

Publication

New York, N.Y. : 2011

ISSN

1574-020X [print]

1574-0218 [online]

DOI

10.1007/S10579-011-9159-7

Volume/pages

45:3 (2011), p. 331-344

ISI

000293709900005

Full text (Publisher's DOI)

https://doi.org/10.1007/S10579-011-9159-7

Full text (publisher's version - intranet only)

https://repository.uantwerpen.be/docman/iruaauth/28a979/a9fb7ca961f.pdf

Faculty/Department

Faculty of Arts. Linguistics

Research group

Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics (CLiPS)

Publication type

A1 Journal article

Subject

Linguistics

Affiliation

Publications with a UAntwerp address

Identifier

Creation

09.09.2011

Last edited

15.11.2022

To cite this reference

https://hdl.handle.net/10067/911210151162165141