Title
Keeping the data lake in form: DS-kNN datasets categorization using proximity mining
Author
Abstract
With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, with accessibility available using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. We utilise a k-NN approach for this task which uses a proximity score for computing similarities of datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers with a precision of more than 90% and recall rates exceeding 75% in specific settings.
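The k-NN idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_proximity` cut-off, the majority vote, and the Jaccard overlap of attribute names used as a stand-in proximity score are all assumptions made for illustration, and the paper's own metadata-based proximity score is not reproduced here.

```python
from collections import Counter


def jaccard(a, b):
    """Hypothetical stand-in proximity: overlap of two sets of attribute names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(new_item, labeled, proximity, k=3, min_proximity=0.5):
    """Assign the majority category among the k closest labeled datasets.

    Returns None (treated here as an outlier) when no labeled dataset
    reaches the assumed min_proximity threshold.
    """
    scored = [(proximity(new_item, item), category) for item, category in labeled]
    # Keep only sufficiently similar neighbours, closest first, at most k of them.
    close = [s for s in scored if s[0] >= min_proximity]
    neighbours = sorted(close, key=lambda s: s[0], reverse=True)[:k]
    if not neighbours:
        return None
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]


# Toy "data lake": each dataset is described only by its attribute names.
labeled = [
    ({"name", "age", "city"}, "people"),
    ({"name", "age", "email"}, "people"),
    ({"price", "sku", "stock"}, "products"),
]

print(categorize({"name", "age", "phone"}, labeled, jaccard, k=2, min_proximity=0.3))
print(categorize({"lat", "lon"}, labeled, jaccard, k=2, min_proximity=0.3))
```

In this toy data the first query shares two of four attribute names with each "people" dataset and is assigned to that category, while the second shares nothing with any labeled dataset and is reported as an outlier. Raising `min_proximity` in such a scheme would be expected to trade recall for precision, since fewer datasets receive a category.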
Language
English
Source (journal)
Lecture notes in computer science. - Berlin, 1973, currens
Source (book)
9th International Conference on Model and Data Engineering (MEDI), October 28-31, 2019, Toulouse, France
Publication
Cham: Springer International Publishing AG, 2019
ISBN
978-3-030-32065-2
978-3-030-32064-5
DOI
10.1007/978-3-030-32065-2_3
Volume/pages
11815 (2019), p. 35-49
ISI
000567294500003