Publication
Title
CorpusCollie : a web corpus mining tool for resource-scarce languages
Author
Abstract
This paper describes CORPUSCOLLIE, an open-source software package geared towards collecting clean web corpora for resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of its most powerful components is a maximum-entropy-based language identification module that can classify documents in over five hundred languages with state-of-the-art accuracy. As a proof of concept, we describe and evaluate the fully automatic compilation of a web corpus for the Nilotic language Luo (Dholuo) using CORPUSCOLLIE.
Language
English
Source (book)
Proceedings of the Conference on Human Language Technology for Development
Publication
S.l. : 2011
Volume/pages
p. 44-49
UAntwerpen
Affiliation
Publications with a UAntwerp address
Creation 17.03.2012
Last edited 07.10.2022