Publication
Title
DS-Prox : dataset proximity mining for governing the Data Lake
Author
Abstract
With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early pruning of unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL; the results show significant efficiency gains above 25% compared to matching without early pruning, and recall rates higher than 90% under certain scenarios.
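The early-pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the meta-features, the proximity formula, the threshold of 0.9, and all names (`META`, `proximity`, `candidate_pairs`) are assumptions chosen only to show how a cheap proximity score can filter dataset pairs before expensive schema matching.

```python
from itertools import combinations

# Hypothetical meta-features per dataset (assumed for illustration):
# (number of attributes, number of instances, fraction of numeric attributes).
META = {
    "iris_a": (5, 150, 0.8),
    "iris_b": (5, 148, 0.8),
    "census": (42, 30000, 0.3),
}


def proximity(a, b):
    """Illustrative proximity in [0, 1]: 1 minus the mean relative difference per feature."""
    diffs = []
    for x, y in zip(a, b):
        largest = max(abs(x), abs(y))
        diffs.append(0.0 if largest == 0 else abs(x - y) / largest)
    return 1.0 - sum(diffs) / len(diffs)


def candidate_pairs(meta, threshold=0.9):
    """Keep only pairs whose proximity reaches the threshold; all other pairs are pruned early."""
    return [
        (m, n)
        for m, n in combinations(sorted(meta), 2)
        if proximity(meta[m], meta[n]) >= threshold
    ]


# Only the surviving pairs would be passed on to costly schema matching and deduplication.
pairs = candidate_pairs(META)
```

With these assumed values, only the two near-identical datasets survive the filter, so two of the three possible pairwise comparisons would be skipped.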
Language
English
Source (journal)
Lecture notes in computer science. - Berlin, 1973, currens
Source (book)
Similarity search and applications : 10th International Conference, SISAP 2017, 4-6 October 2017, Munich, Germany, Proceedings / Beecks, Christian [edit.]; et al.
Source (series)
Information systems and applications, incl. internet/web, and HC (LNISA) ; 10609
Publication
Cham : Springer, 2017
ISBN
978-3-319-68473-4
Volume/pages
(2018), p. 284-299
UAntwerpen
Affiliation
Publications with a UAntwerp address
Record
Creation 01.08.2018
Last edited 15.07.2021