Standard

Desbordante: from benchmarking suite to high-performance science-intensive data profiler. / Чернышев, Георгий Алексеевич; Полынцов, Михаил Александрович; Чижов, Антон Игоревич; Ступаков, Кирилл Валерьевич; Щукин, Илья Вячеславович; Смирнов, Александр; Струтовский, Максим Андреевич; Шлёнских, Алексей Анатольевич; Фирсов, Михаил Александрович; Мананников, Степан Дмитриевич; Бобров, Никита; Гончаров, Даниил Юрьевич; Баруткин, Илья Дмитриевич; Якшигулов, Вадим Наилевич; Шальнев, Владислав Александрович; Муравьев, Кирилл Ильич; Рахмукова, Анна Игоревна; Щека, Дмитрий Вадимович; Черников, Антон Александрович; Кузин, Яков Сергеевич; Синельников, Михаил Алексеевич; Абросимов, Григорий; Попов, Дмитрий; Демченко, Артем Евгеньевич; Белоконный, Сергей Александрович; Соловьёва, Лиана-Юлия Викторовна; Курбатов, Ярослав Андреевич; Выродов, Михаил Владимирович; Салью, Артур Кристофович; Гайсин, Эдуард Ринатович; Смирнов, Кирилл Константинович.

CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data. Association for Computing Machinery, 2025. p. 234-243.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Vancouver

Чернышев ГА, Полынцов МА, Чижов АИ, Ступаков КВ, Щукин ИВ, Смирнов А et al. Desbordante: from benchmarking suite to high-performance science-intensive data profiler. In CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data. Association for Computing Machinery. 2025. p. 234-243 https://doi.org/10.1145/3703323.3703725

Author

Чернышев, Георгий Алексеевич ; Полынцов, Михаил Александрович ; Чижов, Антон Игоревич ; Ступаков, Кирилл Валерьевич ; Щукин, Илья Вячеславович ; Смирнов, Александр ; Струтовский, Максим Андреевич ; Шлёнских, Алексей Анатольевич ; Фирсов, Михаил Александрович ; Мананников, Степан Дмитриевич ; Бобров, Никита ; Гончаров, Даниил Юрьевич ; Баруткин, Илья Дмитриевич ; Якшигулов, Вадим Наилевич ; Шальнев, Владислав Александрович ; Муравьев, Кирилл Ильич ; Рахмукова, Анна Игоревна ; Щека, Дмитрий Вадимович ; Черников, Антон Александрович ; Кузин, Яков Сергеевич ; Синельников, Михаил Алексеевич ; Абросимов, Григорий ; Попов, Дмитрий ; Демченко, Артем Евгеньевич ; Белоконный, Сергей Александрович ; Соловьёва, Лиана-Юлия Викторовна ; Курбатов, Ярослав Андреевич ; Выродов, Михаил Владимирович ; Салью, Артур Кристофович ; Гайсин, Эдуард Ринатович ; Смирнов, Кирилл Константинович. / Desbordante: from benchmarking suite to high-performance science-intensive data profiler. CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data. Association for Computing Machinery, 2025. pp. 234-243

BibTeX

@inproceedings{5630f5857f8f4950abf2b89e454a1789,
title = "Desbordante: from benchmarking suite to high-performance science-intensive data profiler",
abstract = "Pioneering data profiling systems — such as Metanome and OpenClean — brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns, such as data dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. To address this drawback, we have developed Desbordante — a science-intensive, high-performance and open-source data profiling tool implemented in C++. To the best of our knowledge, Desbordante is currently the only profiler which possesses these three characteristics. Unlike similar systems, it is built with an emphasis on industrial use, and is up to an order of magnitude faster than similar systems, while requiring up to three times less memory. Desbordante aims to open industrial-grade primitive discovery to a broader audience, focusing on domain experts that are not IT professionals. Aside from discovery of various types of patterns, Desbordante offers pattern validation, which not only reports whether a given instance of a pattern holds or not, but also points out what prevents it from holding. Next, Desbordante lets users employ its functionality from Python scripts. Together with other Python libraries, it enables developing ad-hoc solutions for data deduplication, data cleaning, anomaly detection, and other data quality problems. In this paper, we present Desbordante, the vision behind it, and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. As a consolidation paper, it synthesizes more than six years of development work, integrating findings from numerous studies to provide a comprehensive overview.",
keywords = "Извлечение знаний, Профилирование данных, Исследование данных, Обработка данных, Извлечение шаблонов, Обнаружение аномалий, Анализ данных, Anomaly Detection, Data Analysis, Data Exploration, Data Mining, Data Profiling, Data Wrangling, Knowledge Discovery, Pattern Extraction",
author = "Чернышев, {Георгий Алексеевич} and Полынцов, {Михаил Александрович} and Чижов, {Антон Игоревич} and Ступаков, {Кирилл Валерьевич} and Щукин, {Илья Вячеславович} and Александр Смирнов and Струтовский, {Максим Андреевич} and Шлёнских, {Алексей Анатольевич} and Фирсов, {Михаил Александрович} and Мананников, {Степан Дмитриевич} and Никита Бобров and Гончаров, {Даниил Юрьевич} and Баруткин, {Илья Дмитриевич} and Якшигулов, {Вадим Наилевич} and Шальнев, {Владислав Александрович} and Муравьев, {Кирилл Ильич} and Рахмукова, {Анна Игоревна} and Щека, {Дмитрий Вадимович} and Черников, {Антон Александрович} and Кузин, {Яков Сергеевич} and Синельников, {Михаил Алексеевич} and Григорий Абросимов and Дмитрий Попов and Демченко, {Артем Евгеньевич} and Белоконный, {Сергей Александрович} and Соловьёва, {Лиана-Юлия Викторовна} and Курбатов, {Ярослав Андреевич} and Выродов, {Михаил Владимирович} and Салью, {Артур Кристофович} and Гайсин, {Эдуард Ринатович} and Смирнов, {Кирилл Константинович}",
year = "2025",
month = jun,
day = "25",
doi = "10.1145/3703323.3703725",
language = "English",
isbn = "9798400711244",
pages = "234--243",
booktitle = "CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data",
publisher = "Association for Computing Machinery",
address = "United States",
note = "8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD), CODS-COMAD Dec'24 ; Conference date: 18-12-2024 Through 21-12-2024",
url = "https://cods-comad.in/dec-2024/",

}

RIS

TY - GEN

T1 - Desbordante: from benchmarking suite to high-performance science-intensive data profiler

AU - Чернышев, Георгий Алексеевич

AU - Полынцов, Михаил Александрович

AU - Чижов, Антон Игоревич

AU - Ступаков, Кирилл Валерьевич

AU - Щукин, Илья Вячеславович

AU - Смирнов, Александр

AU - Струтовский, Максим Андреевич

AU - Шлёнских, Алексей Анатольевич

AU - Фирсов, Михаил Александрович

AU - Мананников, Степан Дмитриевич

AU - Бобров, Никита

AU - Гончаров, Даниил Юрьевич

AU - Баруткин, Илья Дмитриевич

AU - Якшигулов, Вадим Наилевич

AU - Шальнев, Владислав Александрович

AU - Муравьев, Кирилл Ильич

AU - Рахмукова, Анна Игоревна

AU - Щека, Дмитрий Вадимович

AU - Черников, Антон Александрович

AU - Кузин, Яков Сергеевич

AU - Синельников, Михаил Алексеевич

AU - Абросимов, Григорий

AU - Попов, Дмитрий

AU - Демченко, Артем Евгеньевич

AU - Белоконный, Сергей Александрович

AU - Соловьёва, Лиана-Юлия Викторовна

AU - Курбатов, Ярослав Андреевич

AU - Выродов, Михаил Владимирович

AU - Салью, Артур Кристофович

AU - Гайсин, Эдуард Ринатович

AU - Смирнов, Кирилл Константинович

PY - 2025/6/25

Y1 - 2025/6/25

AB - Pioneering data profiling systems — such as Metanome and OpenClean — brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns, such as data dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. To address this drawback, we have developed Desbordante — a science-intensive, high-performance and open-source data profiling tool implemented in C++. To the best of our knowledge, Desbordante is currently the only profiler which possesses these three characteristics. Unlike similar systems, it is built with an emphasis on industrial use, and is up to an order of magnitude faster than similar systems, while requiring up to three times less memory. Desbordante aims to open industrial-grade primitive discovery to a broader audience, focusing on domain experts that are not IT professionals. Aside from discovery of various types of patterns, Desbordante offers pattern validation, which not only reports whether a given instance of a pattern holds or not, but also points out what prevents it from holding. Next, Desbordante lets users employ its functionality from Python scripts. Together with other Python libraries, it enables developing ad-hoc solutions for data deduplication, data cleaning, anomaly detection, and other data quality problems. In this paper, we present Desbordante, the vision behind it, and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. As a consolidation paper, it synthesizes more than six years of development work, integrating findings from numerous studies to provide a comprehensive overview.

KW - Извлечение знаний

KW - Профилирование данных

KW - Исследование данных

KW - Обработка данных

KW - Извлечение шаблонов

KW - Обнаружение аномалий

KW - Анализ данных

KW - Anomaly Detection

KW - Data Analysis

KW - Data Exploration

KW - Data Mining

KW - Data Profiling

KW - Data Wrangling

KW - Knowledge Discovery

KW - Pattern Extraction

UR - https://dl.acm.org/doi/10.1145/3703323.3703725

UR - https://www.mendeley.com/catalogue/96ae8f21-3f8a-3129-a841-82b7bfaedd8b/

U2 - 10.1145/3703323.3703725

DO - 10.1145/3703323.3703725

M3 - Conference contribution

SN - 9798400711244

SP - 234

EP - 243

BT - CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data

PB - Association for Computing Machinery

T2 - 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD)

Y2 - 18 December 2024 through 21 December 2024

ER -

ID: 140332521