De-identification of clinical free text in Dutch with limited training data : a case study
Faculty of Sciences. Mathematics and Computer Science
Faculty of Arts. Linguistics and Literature
Faculty of Applied Engineering Sciences
S.l., 2013
Workshop on NLP for Medicine and Biology
University of Antwerp
To analyse the information in medical records while maintaining patient privacy, techniques are needed that automatically de-identify the free text in these records. This paper presents a machine learning de-identification system for clinical free text in Dutch, relying on best practices from the state of the art in de-identification of English-language texts. We combine string and pattern matching features with machine learning algorithms and compare the performance of three experimental setups using Support Vector Machines and Random Forests on a limited data set of one hundred manually obfuscated texts provided by Antwerp University Hospital (UZA). The setup with the best balance between precision and recall during development was tested on an unseen set of raw clinical texts and evaluated manually at the hospital site.
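As an illustration of what string and pattern matching features for a token classifier might look like, the following is a minimal sketch and not the feature set used in the paper. The name list, the date pattern, the feature names, and the helper functions `token_features` and `extract` are all assumptions made for this example. The resulting feature dictionaries could then be vectorised and passed to a classifier such as a Support Vector Machine or a Random Forest.

```python
import re

# Hypothetical gazetteer: a real system would load curated lists of names,
# streets, cities, and so on. These entries are illustrative only.
FIRST_NAMES = {"jan", "marie", "pieter"}

# Assumed date pattern (e.g. 12/03/2012 or 1-4-12); not taken from the paper.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")


def token_features(tokens, i):
    """Build string and pattern matching features for the token at index i."""
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "lower": tok.lower(),                        # string feature
        "is_capitalized": tok[:1].isupper(),         # orthographic feature
        "has_digit": any(c.isdigit() for c in tok),  # character-class feature
        "is_date": bool(DATE_RE.match(tok)),         # pattern matching feature
        "in_name_list": tok.lower() in FIRST_NAMES,  # gazetteer lookup
        "prev_lower": prev_tok,                      # left context
    }


def extract(text):
    """Whitespace-tokenise a text and return one feature dict per token."""
    tokens = text.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in extract("Patient Jan opgenomen op 12/03/2012"):
        print(feats)
```

Whitespace tokenisation keeps the sketch short; real clinical text would likely need a tokeniser that handles punctuation attached to names and dates.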