Publication
Title
Keeping the data lake in form: DS-kNN datasets categorization using proximity mining
Author
Abstract
With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
Language
English
Source (journal)
Lecture notes in computer science. - Berlin, 1973, currens
Source (book)
9th International Conference on Model and Data Engineering (MEDI), October 28-31, 2019, Toulouse, France
Publication
Cham: Springer International Publishing AG, 2019
ISBN
978-3-030-32065-2
978-3-030-32064-5
DOI
10.1007/978-3-030-32065-2_3
Volume/pages
11815 (2019), p. 35-49
ISI
000567294500003
Full text (Publisher's DOI)
UAntwerpen
Faculty/Department
Research group
Project info
Digitalisation and Tax (DigiTax).
Publication type
Subject
Affiliation
Publications with a UAntwerp address
External links
Web of Science
Record
Identifier
Creation 19.10.2020
Last edited 02.10.2024