Publication
Title
Transfer learning for digital heritage collections: comparing neural machine translation at the subword-level and character-level
Author
Abstract
Transfer learning via pre-training has become an important strategy for the efficient application of NLP methods in domains where only limited training data is available. This paper reports on a focused case study in which we apply transfer learning in the context of neural machine translation (French-Dutch) for cultural heritage metadata (i.e. titles of artistic works). Nowadays, neural machine translation (NMT) is commonly applied at the subword level using byte-pair encoding (BPE), because word-level models struggle with rare and out-of-vocabulary words. Because unseen vocabulary is a significant issue in domain adaptation, BPE seems a better fit for transfer learning across text varieties. We discuss an experiment in which we compare a subword-level to a character-level NMT approach. We pre-trained models on a large, generic corpus and fine-tuned them in a two-stage process: first, on a domain-specific dataset extracted from Wikipedia, and then on our metadata. While our experiments show comparable performance for character-level and BPEbased models on the general dataset, we demonstrate that the character-level approach nevertheless yields major downstream performance gains during the subsequent stages of fine-tuning. We therefore conclude that character-level translation can be beneficial compared to the popular subword-level approach in the cultural heritage domain.
Language
English
Source (book)
Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: ARTIDIGH]
Publication
Setubal : Scitepress , 2020
ISBN
978-989-758-395-7
DOI
10.5220/0009167205220529
Volume/pages
(2020) , p. 522-529
ISI
000570767700058
Full text (Publisher's DOI)
Full text (open access)
UAntwerpen
Faculty/Department
Research group
Publication type
Subject
Affiliation
Publications with a UAntwerp address
External links
Web of Science
Record
Identifier
Creation 19.03.2020
Last edited 12.11.2024
To cite this reference