Результаты исследований: Материалы конференций › материалы › Рецензирование
Fast Discovery of Inclusion Dependencies with Desbordante. / Smirnov, Alexander; Chizhov, Anton; Shchuckin, Ilya; Bobrov, Nikita; Chernishev, George A.
2023. 264-275 Работа представлена на The 33rd Conference of the Open Innovations Association FRUCT, Жилина, Словакия.Результаты исследований: Материалы конференций › материалы › Рецензирование
}
TY - CONF
T1 - Fast Discovery of Inclusion Dependencies with Desbordante.
AU - Smirnov, Alexander
AU - Chizhov, Anton
AU - Shchuckin, Ilya
AU - Bobrov, Nikita
AU - Chernishev, George A.
N1 - Conference code: 33
PY - 2023/5/24
Y1 - 2023/5/24
N2 - Inclusion dependency is a relation between attributes of tables that indicates possible Primary Key-Foreign Key references. Automatic discovery of inclusion dependencies is a relevant problem for both academic and industrial communities. The core concern for this problem is the efficiency of discovery process, since it is a computationally expensive task. However, existing studies only address the algorithmic side, while leaving out the implementation aspect. At the same time, engineering details are at least as important as the algorithmic ones for achieving good performance. In this paper, we describe techniques for efficient implementation of two algorithms for discovery of inclusion dependencies - Spider and Faida. The first one is a classic algorithm whose ideas lie in the foundation of many other inclusion dependency discovery algorithms. We propose an efficient parallelization technique, which greatly speeds up the algorithm while simultaneously reducing its memory consumption. The second one is the state-of-the-art approximate algorithm, which we approach by applying four types of optimizations: data buffering, SIMD-enabled execution, careful hash-table selection and parallelization. In order to experimentally evaluate our techniques, we have implemented these algorithms in Desbordante - an open-source science-intensive data profiler written in C++. For Spider, we have evaluated several different options, and in case of Faida we have demonstrated that all our optimization techniques yield results. We also compared our implementations with Metanome - a Java-based data profiler. Overall, we report up to 5x improvement in terms of run time reduction for Spider and up to 8x for Faida.
AB - Inclusion dependency is a relation between attributes of tables that indicates possible Primary Key-Foreign Key references. Automatic discovery of inclusion dependencies is a relevant problem for both academic and industrial communities. The core concern for this problem is the efficiency of discovery process, since it is a computationally expensive task. However, existing studies only address the algorithmic side, while leaving out the implementation aspect. At the same time, engineering details are at least as important as the algorithmic ones for achieving good performance. In this paper, we describe techniques for efficient implementation of two algorithms for discovery of inclusion dependencies - Spider and Faida. The first one is a classic algorithm whose ideas lie in the foundation of many other inclusion dependency discovery algorithms. We propose an efficient parallelization technique, which greatly speeds up the algorithm while simultaneously reducing its memory consumption. The second one is the state-of-the-art approximate algorithm, which we approach by applying four types of optimizations: data buffering, SIMD-enabled execution, careful hash-table selection and parallelization. In order to experimentally evaluate our techniques, we have implemented these algorithms in Desbordante - an open-source science-intensive data profiler written in C++. For Spider, we have evaluated several different options, and in case of Faida we have demonstrated that all our optimization techniques yield results. We also compared our implementations with Metanome - a Java-based data profiler. Overall, we report up to 5x improvement in terms of run time reduction for Spider and up to 8x for Faida.
UR - https://www.mendeley.com/catalogue/720fcf75-8307-3f42-b289-1b0ac2be3440/
U2 - 10.23919/fruct58615.2023.10143047
DO - 10.23919/fruct58615.2023.10143047
M3 - Paper
SP - 264
EP - 275
Y2 - 24 May 2023 through 26 May 2023
ER -
ID: 108632334