Smart Caching for Efficient Functional Dependency Discovery

Standard

Smart Caching for Efficient Functional Dependency Discovery. / Birillo, Anastasia; Bobrov, Nikita.

New Trends in Databases and Information Systems. Springer Nature, 2019. p. 52-59 (Communications in Computer and Information Science; Vol. 1064).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Harvard

Birillo, A & Bobrov, N 2019, Smart Caching for Efficient Functional Dependency Discovery. in New Trends in Databases and Information Systems. Communications in Computer and Information Science, vol. 1064, Springer Nature, pp. 52-59, 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019, Bled, Slovenia, 8/09/19. https://doi.org/10.1007/978-3-030-30278-8_7

APA

Birillo, A., & Bobrov, N. (2019). Smart Caching for Efficient Functional Dependency Discovery. In New Trends in Databases and Information Systems (pp. 52-59). (Communications in Computer and Information Science; Vol. 1064). Springer Nature. https://doi.org/10.1007/978-3-030-30278-8_7

Vancouver

Birillo A, Bobrov N. Smart Caching for Efficient Functional Dependency Discovery. In New Trends in Databases and Information Systems. Springer Nature. 2019. p. 52-59. (Communications in Computer and Information Science). https://doi.org/10.1007/978-3-030-30278-8_7

Author

Birillo, Anastasia ; Bobrov, Nikita. / Smart Caching for Efficient Functional Dependency Discovery. New Trends in Databases and Information Systems. Springer Nature, 2019. pp. 52-59 (Communications in Computer and Information Science).

BibTeX

@inproceedings{88fa7adecb1245f4b414988d4d00de6f,

title = "Smart Caching for Efficient Functional Dependency Discovery",

abstract = "Functional dependency (FD) is a concept for describing regularities in data, that traditionally was used for schema normalization. Recently, a different problem has emerged: For a given dataset, find FDs that are contained in it. Efficient FD discovery is of great importance due to FD usage in many tasks such as data cleaning, schema recovery, and query optimization. In this paper we consider a particular class of FD discovery algorithms that rely on partition intersection. We present a simple approach for caching intermediate results that are generated during run times of these algorithms. Our approach is essentially a heuristic that selects which partitions to cache based on measures of disorder in data. For this purpose, we adopt such well-known measures as Entropy and Gini Impurity, and propose a novel one—Inverted Entropy. Our approach has a negligible computational overhead, and can be used with a number of FD discovery algorithms, both exact and approximate. Our experiments demonstrate that our heuristic allows to decrease algorithm run times by up to 40% while simultaneously requiring up to an order of magnitude less space compared to the state-of-the-art caching approaches.",

keywords = "Caching, Entropy, Functional dependency, Gini impurity, Partition caching, Partition intersection, PYRO, TANE",

author = "Anastasia Birillo and Nikita Bobrov",

year = "2019",

month = jan,

day = "1",

doi = "10.1007/978-3-030-30278-8_7",

language = "English",

isbn = "978-3-030-30277-1",

series = "Communications in Computer and Information Science",

publisher = "Springer Nature",

pages = "52--59",

booktitle = "New Trends in Databases and Information Systems",

address = "Germany",

note = "23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019 ; Conference date: 08-09-2019 Through 11-09-2019",

}

RIS

TY - GEN

T1 - Smart Caching for Efficient Functional Dependency Discovery

AU - Birillo, Anastasia

AU - Bobrov, Nikita

N1 - Conference code: 231719

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Functional dependency (FD) is a concept for describing regularities in data, that traditionally was used for schema normalization. Recently, a different problem has emerged: For a given dataset, find FDs that are contained in it. Efficient FD discovery is of great importance due to FD usage in many tasks such as data cleaning, schema recovery, and query optimization. In this paper we consider a particular class of FD discovery algorithms that rely on partition intersection. We present a simple approach for caching intermediate results that are generated during run times of these algorithms. Our approach is essentially a heuristic that selects which partitions to cache based on measures of disorder in data. For this purpose, we adopt such well-known measures as Entropy and Gini Impurity, and propose a novel one—Inverted Entropy. Our approach has a negligible computational overhead, and can be used with a number of FD discovery algorithms, both exact and approximate. Our experiments demonstrate that our heuristic allows to decrease algorithm run times by up to 40% while simultaneously requiring up to an order of magnitude less space compared to the state-of-the-art caching approaches.

AB - Functional dependency (FD) is a concept for describing regularities in data, that traditionally was used for schema normalization. Recently, a different problem has emerged: For a given dataset, find FDs that are contained in it. Efficient FD discovery is of great importance due to FD usage in many tasks such as data cleaning, schema recovery, and query optimization. In this paper we consider a particular class of FD discovery algorithms that rely on partition intersection. We present a simple approach for caching intermediate results that are generated during run times of these algorithms. Our approach is essentially a heuristic that selects which partitions to cache based on measures of disorder in data. For this purpose, we adopt such well-known measures as Entropy and Gini Impurity, and propose a novel one—Inverted Entropy. Our approach has a negligible computational overhead, and can be used with a number of FD discovery algorithms, both exact and approximate. Our experiments demonstrate that our heuristic allows to decrease algorithm run times by up to 40% while simultaneously requiring up to an order of magnitude less space compared to the state-of-the-art caching approaches.

KW - Caching

KW - Entropy

KW - Functional dependency

KW - Gini impurity

KW - Partition caching

KW - Partition intersection

KW - PYRO

KW - TANE

UR - http://www.scopus.com/inward/record.url?scp=85072952979&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-30278-8_7

DO - 10.1007/978-3-030-30278-8_7

M3 - Conference contribution

AN - SCOPUS:85072952979

SN - 978-3-030-30277-1

T3 - Communications in Computer and Information Science

SP - 52

EP - 59

BT - New Trends in Databases and Information Systems

PB - Springer Nature

T2 - 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019

Y2 - 8 September 2019 through 11 September 2019

ER -

ID: 113380819