Beyond declarative data cleaning
VCLA and WPI will host a talk by Paolo Papotti on July 2, 2015.
DATE: Thursday, July 2, 2015
VENUE: Seminar room Zemanek, Favoritenstraße 9-11, 1040 Vienna
In the "big data" era, data preparation is a popular activity in both academia and industry. Unfortunately, it requires a great amount of manual work to obtain data that is polished enough to be useful in final applications, such as querying and mining. The problem is even harder in practice, as data is often dirty in nature because of errors, such as typos and duplicates, that can lead to poor results in analytic tasks. To achieve the level of automation and scalability required by large data volumes, several successful cleaning systems have been proposed. They rely on a declarative approach, where users provide high-level specifications of the tasks, such as rules, and the systems compute optimal solutions without requiring human intervention on the generated code. However, despite the positive results, there is still a gap between these proposals and the leading commercial systems. The latter are harder to maintain, debug, and test, but provide the level of personalization and detail that is needed to solve real-world problems. In this tutorial, I will describe some of the recent approaches to data cleaning that deliver pragmatic solutions without being ad hoc. These approaches address practical challenges such as combining declarative and procedural cleaning tasks and supporting human involvement in the process through exploration tools and crowdsourcing. I will then sketch some promising future directions and a range of open problems for the community.