Pioneering data profiling systems, such as Metanome and OpenClean, brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns, such as data dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. To address this drawback, we have developed Desbordante, a science-intensive, high-performance, open-source data profiling tool implemented in C++. To the best of our knowledge, Desbordante is currently the only profiler that possesses all three characteristics. Unlike similar systems, it is built with an emphasis on industrial use, and it is up to an order of magnitude faster than them while requiring up to three times less memory. Desbordante aims to open industrial-grade primitive discovery to a broader audience, focusing on domain experts who are not IT professionals. Aside from discovering various types of patterns, Desbordante offers pattern validation, which not only reports whether a given instance of a pattern holds, but also points out what prevents it from holding. Desbordante also lets users employ its functionality from Python scripts. Together with other Python libraries, this enables developing ad hoc solutions for data deduplication, data cleaning, anomaly detection, and other data quality problems. In this paper, we present Desbordante, the vision behind it, and its use cases. To provide a more in-depth perspective, we discuss its current state, its architecture, and the design decisions it is built on. As a consolidation paper, it synthesizes more than six years of development work, integrating findings from numerous studies to provide a comprehensive overview.
Original language: English
Title of host publication: CODS-COMAD 2024 - Proceedings of the 8th Joint International Conference on Data Science and Management of Data
Publisher: Association for Computing Machinery
Pages: 234-243
Number of pages: 10
ISBN (Electronic): 9798400711244
ISBN (Print): 9798400711244
State: Published - 25 Jun 2025
Event: 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD) - Jodhpur, India
Duration: 18 Dec 2024 - 21 Dec 2024
https://cods-comad.in/dec-2024/

Conference

Conference: 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD)
Abbreviated title: CODS-COMAD Dec'24
Country/Territory: India
City: Jodhpur
Period: 18/12/24 - 21/12/24

    Scopus subject areas

  • Information Systems

    Research areas

  • Anomaly Detection, Data Analysis, Data Exploration, Data Mining, Data Profiling, Data Wrangling, Knowledge Discovery, Pattern Extraction

ID: 140332521