DS-Prox : dataset proximity mining for governing the Data Lake

Alserafi, Ayman; Calders, Toon; Abelló, Alberto; Romero, Oscar

doi:10.1007/978-3-319-68474-1_20

Title

DS-Prox : dataset proximity mining for governing the Data Lake

Author

Alserafi, Ayman

Calders, Toon

Abelló, Alberto

Romero, Oscar

Abstract

With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores serve as an efficient first step: only pairs of datasets with high proximity are selected for further, time-consuming schema matching and deduplication. The proposed approach prunes unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments on the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates above 90% under certain scenarios.
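
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the meta-features, their values, the `proximity` function (a simple averaged relative difference) and the threshold are all hypothetical choices made for illustration only. It shows how a cheap score over meta-features could select the dataset pairs that go on to expensive schema matching.

```python
from itertools import combinations

# Hypothetical meta-features per dataset (illustrative values, not from the paper):
# (number of attributes, fraction of numeric attributes, fraction of missing values)
META = {
    "ds_a": (10, 0.8, 0.0),
    "ds_b": (11, 0.8, 0.0),
    "ds_c": (50, 0.2, 0.5),
}

def proximity(x, y):
    """Assumed score in [0, 1] for non-negative features; 1 means identical."""
    # Use relative differences so features on different scales are comparable.
    diffs = []
    for a, b in zip(x, y):
        largest = max(abs(a), abs(b))
        diffs.append(0.0 if largest == 0 else abs(a - b) / largest)
    return 1.0 - sum(diffs) / len(diffs)

def candidate_pairs(meta, threshold):
    """Keep only the dataset pairs whose proximity reaches the threshold."""
    return [
        (p, q)
        for p, q in combinations(sorted(meta), 2)
        if proximity(meta[p], meta[q]) >= threshold
    ]

# Only the surviving pairs would be passed to schema matching and deduplication.
pairs = candidate_pairs(META, threshold=0.9)
print(pairs)
```

With these made-up values, only the pair of `ds_a` and `ds_b` passes the threshold, so two of the three possible comparisons would be skipped before any detailed matching is attempted.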

Language

English

Source (journal)

Lecture notes in computer science. - Berlin, 1973, currens

Source (book)

Similarity search and applications : 10th International Conference, SISAP 2017, 4-6 October 2017, Munich, Germany, Proceedings / Beecks, Christian [edit.]; et al.

Source (series)

Information systems and applications, incl. internet/web, and HCI (LNISA) ; 10609

Publication

Cham : Springer, 2017

ISSN

0302-9743 [print]

1611-3349 [online]

ISBN

978-3-319-68473-4

DOI

10.1007/978-3-319-68474-1_20

Volume/pages

(2018), p. 284-299

ISI

000616693000020

Full text (Publisher's DOI)

https://doi.org/10.1007/978-3-319-68474-1_20

Full text (publisher's version - intranet only)

https://repository.uantwerpen.be/docman/iruaauth/ba4a03/152350.pdf

Faculty/Department

Faculty of Sciences. Mathematics and Computer Science

Research group

ADReM Data Lab (ADReM)

Publication type

A1 Journal article

Subject

Computer. Automation

Affiliation

Publications with a UAntwerp address

Identifier

c:irua:152350

Creation

01.08.2018

Last edited

02.10.2024

To cite this reference

https://hdl.handle.net/10067/1523500151162165141