Title
CorpusCollie: a web corpus mining tool for resource-scarce languages
Author
Faculty/Department
Faculty of Arts. Linguistics and Literature
Publication type
conferenceObject
Publication
S.l. , [*]
Subject
Computer. Automation
Linguistics
Source (book)
Proceedings of the Conference on Human Language Technology for Development
Carrier
E
Target language
English (eng)
Affiliation
University of Antwerp
Abstract
This paper describes CORPUSCOLLIE, an open-source software package geared towards collecting clean web corpora for resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of its most powerful components is a maximum-entropy-based language identification module that can classify documents in over five hundred different languages with state-of-the-art accuracy. As a proof of concept, we describe and evaluate the fully automatic compilation of a web corpus for the Nilotic language Luo (Dholuo) using CORPUSCOLLIE.
Handle