Data Cleaning: The Hardest Job in Data Science
Abstract: Data quality is one of the most important problems in data management and data science, since dirty data often leads to inaccurate analytics results and wrong business decisions. This helps explain why recent studies find that data scientists spend 60-80% of their time cleaning and transforming datasets. A typical data cleaning process consists of three steps: data quality rule specification, error detection, and error repair.
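To make the three-step pipeline concrete, here is a minimal sketch in Python. The rule, relation, and values are hypothetical, not taken from the talk: a functional dependency "zip determines city" is specified as the rule, rows that disagree within a zip group are flagged as errors, and each group is repaired by majority vote over the observed city values.

```python
from collections import Counter, defaultdict

# Step 1: specify a data quality rule -- a functional dependency
# stating that the value of "zip" determines the value of "city".
# (Hypothetical rule and data, for illustration only.)
FD = ("zip", "city")

records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Nwe York"},  # dirty value
    {"zip": "60601", "city": "Chicago"},
]

def detect_violations(rows, fd):
    """Step 2: flag rows whose right-hand-side value disagrees with
    other rows sharing the same left-hand-side value."""
    lhs, rhs = fd
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        groups[r[lhs]].append(i)
    violations = []
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.extend(idxs)
    return violations

def repair(rows, fd):
    """Step 3: repair each group by majority vote on the RHS value."""
    lhs, rhs = fd
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        groups[r[lhs]].append(i)
    for idxs in groups.values():
        majority, _ = Counter(rows[i][rhs] for i in idxs).most_common(1)[0]
        for i in idxs:
            rows[i][rhs] = majority
    return rows

violations = detect_violations(records, FD)  # rows 0-2 disagree on "city"
repaired = repair(records, FD)
```

Real systems are far more sophisticated (many rule types, probabilistic repairs), but the pipeline shape is the same.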
In this talk, I will discuss my proposals for addressing key challenges in data cleaning workflows. First, I will introduce a system that automatically discovers data quality rules from possibly dirty data instances, rather than requiring domain experts to design these rules, which is expensive and rarely done in practice. Second, I will present a holistic error detection and repair technique that accumulates evidence from a broad spectrum of data quality rules and suggests more accurate repairs. I will conclude by discussing ongoing work on IoT data quality and my long-term vision of debugging data analytics.
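The idea of discovering rules from possibly dirty instances can also be sketched briefly. The following hypothetical example (not the talk's actual algorithm) enumerates candidate single-column functional dependencies and keeps those that hold approximately, i.e., on all but a small fraction of rows, so that a little dirt does not hide an otherwise valid rule.

```python
from collections import Counter, defaultdict
from itertools import permutations

def discover_fds(rows, max_violation_ratio=0.2):
    """Return candidate FDs (lhs -> rhs) between pairs of columns that
    hold on all but a small fraction of rows.  A row counts as a
    violation if its rhs value is not the majority value within its
    lhs group.  (Illustrative sketch, not a production algorithm.)"""
    columns = list(rows[0].keys())
    fds = []
    for lhs, rhs in permutations(columns, 2):
        groups = defaultdict(Counter)
        for r in rows:
            groups[r[lhs]][r[rhs]] += 1
        violations = sum(sum(c.values()) - max(c.values())
                         for c in groups.values())
        if violations / len(rows) <= max_violation_ratio:
            fds.append((lhs, rhs))
    return fds

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10002", "city": "New York"},
    {"zip": "10002", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicagoo"},  # dirty value
]
found = discover_fds(rows)  # zip -> city survives despite the dirt;
                            # city -> zip does not, since one city
                            # legitimately spans two zips
```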
Bio: Xu Chu is a PhD candidate in the Data Systems Group at the University of Waterloo, advised by Prof. Ihab Ilyas. He is broadly interested in data management, with a special focus on new theories, algorithms, and systems for managing large-scale dirty and inconsistent data, including data quality rule discovery, error and outlier detection, and automatic data repair. Xu was awarded the Microsoft Research PhD Fellowship in 2015 and the Cheriton Fellowship from the University of Waterloo in 2013.