Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review
Desbordante: from benchmarking suite to high-performance science-intensive data profiler. / Чернышев, Георгий Алексеевич; Полынцов, Михаил Александрович; Чижов, Антон Игоревич; Ступаков, Кирилл Валерьевич; Щукин, Илья Вячеславович; Смирнов, Александр; Струтовский, Максим Андреевич; Шлёнских, Алексей Анатольевич; Фирсов, Михаил Александрович; Мананников, Степан Дмитриевич; Бобров, Никита; Гончаров, Даниил Юрьевич; Баруткин, Илья Дмитриевич; Якшигулов, Вадим Наилевич; Шальнев, Владислав Александрович; Муравьев, Кирилл Ильич; Рахмукова, Анна Игоревна; Щека, Дмитрий Вадимович; Черников, Антон Александрович; Кузин, Яков Сергеевич; Синельников, Михаил Алексеевич; Абросимов, Григорий; Попов, Дмитрий; Демченко, Артем Евгеньевич; Белоконный, Сергей Александрович; Соловьёва, Лиана-Юлия Викторовна; Курбатов, Ярослав Андреевич; Выродов, Михаил Владимирович; Салью, Артур Кристофович; Гайсин, Эдуард Ринатович; Смирнов, Кирилл Константинович.
CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data. Association for Computing Machinery, 2025. p. 234-243.
TY - GEN
T1 - Desbordante: from benchmarking suite to high-performance science-intensive data profiler
AU - Чернышев, Георгий Алексеевич
AU - Полынцов, Михаил Александрович
AU - Чижов, Антон Игоревич
AU - Ступаков, Кирилл Валерьевич
AU - Щукин, Илья Вячеславович
AU - Смирнов, Александр
AU - Струтовский, Максим Андреевич
AU - Шлёнских, Алексей Анатольевич
AU - Фирсов, Михаил Александрович
AU - Мананников, Степан Дмитриевич
AU - Бобров, Никита
AU - Гончаров, Даниил Юрьевич
AU - Баруткин, Илья Дмитриевич
AU - Якшигулов, Вадим Наилевич
AU - Шальнев, Владислав Александрович
AU - Муравьев, Кирилл Ильич
AU - Рахмукова, Анна Игоревна
AU - Щека, Дмитрий Вадимович
AU - Черников, Антон Александрович
AU - Кузин, Яков Сергеевич
AU - Синельников, Михаил Алексеевич
AU - Абросимов, Григорий
AU - Попов, Дмитрий
AU - Демченко, Артем Евгеньевич
AU - Белоконный, Сергей Александрович
AU - Соловьёва, Лиана-Юлия Викторовна
AU - Курбатов, Ярослав Андреевич
AU - Выродов, Михаил Владимирович
AU - Салью, Артур Кристофович
AU - Гайсин, Эдуард Ринатович
AU - Смирнов, Кирилл Константинович
PY - 2025/6/25
Y1 - 2025/6/25
N2 - Pioneering data profiling systems — such as Metanome and OpenClean — brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns, such as data dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. To address this drawback, we have developed Desbordante — a science-intensive, high-performance and open-source data profiling tool implemented in C++. To the best of our knowledge, Desbordante is currently the only profiler which possesses these three characteristics. Unlike similar systems, it is built with an emphasis on industrial use, and is up to an order of magnitude faster than similar systems, while requiring up to three times less memory. Desbordante aims to open industrial-grade primitive discovery to a broader audience, focusing on domain experts that are not IT professionals. Aside from discovery of various types of patterns, Desbordante offers pattern validation, which not only reports whether a given instance of a pattern holds or not, but also points out what prevents it from holding. Next, Desbordante lets users employ its functionality from Python scripts. Together with other Python libraries, it enables developing ad-hoc solutions for data deduplication, data cleaning, anomaly detection, and other data quality problems. In this paper, we present Desbordante, the vision behind it, and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. As a consolidation paper, it synthesizes more than six years of development work, integrating findings from numerous studies to provide a comprehensive overview.
KW - Knowledge Extraction
KW - Data Processing
KW - Anomaly Detection
KW - Data Analysis
KW - Data Exploration
KW - Data Mining
KW - Data Profiling
KW - Data Wrangling
KW - Knowledge Discovery
KW - Pattern Extraction
UR - https://dl.acm.org/doi/10.1145/3703323.3703725
UR - https://www.mendeley.com/catalogue/96ae8f21-3f8a-3129-a841-82b7bfaedd8b/
U2 - 10.1145/3703323.3703725
DO - 10.1145/3703323.3703725
M3 - Conference contribution
SN - 9798400711244
SP - 234
EP - 243
BT - CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data
PB - Association for Computing Machinery
T2 - 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD)
Y2 - 18 December 2024 through 21 December 2024
ER -
ID: 140332521