Documents

  • 3703323.3703725

    Final published version, 609 KB, PDF document

Pioneering data profiling systems, such as Metanome and OpenClean, brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns, such as data dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. To address this drawback, we have developed Desbordante, a science-intensive, high-performance, open-source data profiling tool implemented in C++. To the best of our knowledge, Desbordante is currently the only profiler that possesses all three characteristics. It is built with an emphasis on industrial use, and is up to an order of magnitude faster than similar systems while requiring up to three times less memory. Desbordante aims to open industrial-grade primitive discovery to a broader audience, focusing on domain experts who are not IT professionals. Aside from discovering various types of patterns, Desbordante offers pattern validation, which not only reports whether a given instance of a pattern holds, but also points out what prevents it from holding. Desbordante also lets users employ its functionality from Python scripts. Together with other Python libraries, this enables the development of ad hoc solutions for data deduplication, data cleaning, anomaly detection, and other data quality problems. In this paper, we present Desbordante, the vision behind it, and its use cases. To provide a more in-depth perspective, we discuss its current state, its architecture, and the design decisions it is built on. As a consolidation paper, this work synthesizes more than six years of development, integrating findings from numerous studies to provide a comprehensive overview.
Original language: English
Title of host publication: CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data
Publisher: Association for Computing Machinery
Pages: 234-243
Number of pages: 10
ISBN (electronic): 9798400711244
ISBN (print): 9798400711244
Status: Published - 25 Jun 2025
Event: 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD) - Jodhpur India, Jodhpur, India
Duration: 18 Dec 2024 to 21 Dec 2024
https://cods-comad.in/dec-2024/

Conference

Conference: 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD)
Abbreviated title: CODS-COMAD Dec'24
Country/Territory: India
City: Jodhpur
Period: 18/12/24 to 21/12/24

    Scopus subject areas

  • Information Systems

    Research areas

  • Knowledge extraction, Data profiling, Data exploration, Data processing, Pattern extraction, Anomaly detection, Data analysis

ID: 140332521