Title
CorpusCollie : a web corpus mining tool for resource-scarce languages
Author
Abstract
This paper describes CORPUSCOLLIE, an open-source software package geared towards the collection of clean web corpora for resource-scarce languages. CORPUSCOLLIE uses a wide range of information sources to find, classify and clean documents for a given target language. One of the most powerful components in CORPUSCOLLIE is a maximum-entropy-based language identification module that can classify documents in over five hundred different languages with state-of-the-art accuracy. As a proof of concept, we describe and evaluate the fully automatic compilation of a web corpus for the Nilotic language Luo (Dholuo) using CORPUSCOLLIE.
Language
English
Source (book)
Proceedings of the Conference on Human Language Technology for Development
Publication
S.l. : 2011
Volume/pages
p. 44-49