Project Description
The Information Integration Group has recently started a new project called Midas, with the goal of extracting, cleansing, and integrating data from multiple publicly available data sources. We are initially focused on two domains: 1) a financial domain, where the input dataset consists of a heterogeneous collection of company filings with the US Securities & Exchange Commission (SEC), and 2) a US government domain, with data sources containing information about Congress members, earmarks, and federal spending. In both domains, we are building a scalable Hadoop-based system whose goal is to transform the data from a document or record view of the world to an object-centric view, where multiple facts about the same real-world entity are merged into one object with, ideally, clean and complete attributes.
One of the most salient features of our system is the integration, within a single framework, of multiple components spanning the entire end-to-end integration flow. The main stages in such a flow are:
Unstructured information extraction (e.g., extraction of various facts from the multitude of text or HTML documents archived by the SEC or present in other public data repositories).
Structured information integration, including: 1) mapping of the extracted facts to a target model or schema, 2) resolving and merging references to the same real-world entity (i.e., entity resolution), and 3) creating correct relationships among the resulting objects.
Temporal analysis and fusion (e.g., transforming a collection of unprocessed but time-stamped facts into objects with a clearly defined timeline or history, which in turn provides the ability to go back in time while issuing queries).
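To make the last two stages concrete, the following is a minimal Python sketch of how time-stamped facts might be resolved to entities and fused into per-entity timelines that support "as of" queries. It is an illustration under stated assumptions, not the Midas implementation: the Fact record shape, the name-normalization rule used as a stand-in for entity resolution, and the function names build_timelines and value_as_of are all hypothetical, and the sketch runs in memory rather than on Hadoop.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One extracted, time-stamped fact (hypothetical record shape)."""
    entity_name: str  # raw mention as it appeared in a document
    attribute: str    # e.g. "ceo"
    value: str
    date: str         # ISO date string, so string order equals time order


# Assumed list of corporate suffixes to ignore when matching names.
SUFFIXES = {"inc", "corp", "corporation", "co"}


def normalize(name: str) -> str:
    """Toy entity-resolution key: lowercase, strip punctuation and suffixes."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(w for w in cleaned.split() if w not in SUFFIXES)


def build_timelines(facts):
    """Merge facts about the same resolved entity into date-ordered histories."""
    timelines = defaultdict(lambda: defaultdict(list))
    for fact in sorted(facts, key=lambda f: f.date):
        key = normalize(fact.entity_name)
        timelines[key][fact.attribute].append((fact.date, fact.value))
    return timelines


def value_as_of(timelines, entity, attribute, date):
    """Return the latest value recorded on or before `date`, or None."""
    history = timelines.get(normalize(entity), {}).get(attribute, [])
    idx = bisect_right([d for d, _ in history], date)
    return history[idx - 1][1] if idx else None


# Made-up example data: three mentions that resolve to one entity.
facts = [
    Fact("Acme Corp.", "ceo", "A. Smith", "2005-03-01"),
    Fact("ACME Corporation", "ceo", "B. Jones", "2008-06-15"),
    Fact("Acme, Inc.", "headquarters", "Springfield", "2006-01-10"),
]
timelines = build_timelines(facts)
print(value_as_of(timelines, "Acme Corp", "ceo", "2007-01-01"))  # A. Smith
```

A realistic entity-resolution step would use far richer matching than name normalization; the point here is only that, once references are merged, each attribute becomes a sorted history that can be queried at any past date.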
Our research aims to develop novel algorithms and tools as well as scalable and reusable software modules for all the stages mentioned above. In particular, we are looking at new algorithms for entity resolution that can be integrated with mapping and fusion algorithms and that can be applied on a continuous basis (i.e., as new documents or data sources are discovered). We are also investigating new ways of visualizing, browsing, and understanding the integrated objects and their relationships. Finally, one of our most important goals is defining high-level abstractions and models that can be used to specify the entire integration flow declaratively. In turn, this will make the resulting framework and system applicable to new domains (beyond financial and government) and to new users (i.e., domain experts who are not necessarily data integration experts).
Project Contact: Rajasekar Krishnamurthy
Publications
Barna Saha, Ioana R. Stanoi, Kenneth Clarkson: "Schema Covering: A Step Towards Enabling Reusability in Information Integration". ICDE 2010
Ronald Fagin, Laura M. Haas, Mauricio A. Hernández, Renée J. Miller, Lucian Popa, Yannis Velegrakis: "Clio: Schema Mapping Creation and Data Exchange". Conceptual Modeling: Foundations and Applications 2009: 198-236
Kevin S. Beyer, Vuk Ercegovac, Rajasekar Krishnamurthy, Sriram Raghavan, Jun Rao, Frederick Reiss, Eugene J. Shekita, David E. Simmen, Sandeep Tata, Shivakumar Vaithyanathan, Huaiyu Zhu: "Towards a Scalable Enterprise Content Analytics Platform". IEEE Data Eng. Bull. 32(1): 28-35 (2009)
Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan: "Reverse data exchange: coping with nulls". PODS 2009: 23-32
Ahmed Radwan, Lucian Popa, Ioana R. Stanoi, Akmal A. Younis: "Top-k generation of integrated schemas based on directed and weighted correspondences". SIGMOD Conference 2009: 641-654
Melanie Herschel, Mauricio A. Hernández, Wang Chiew Tan: "Artemis: A System for Analyzing Missing Answers". PVLDB 2(2): 1550-1553 (2009)
Howard Ho: "Simplifying Information Integration: Object-Based Flow-of-Mappings Framework for Integration". BIRTE (Informal Proceedings) 2008
Stefan Dessloch, Mauricio A. Hernández, Ryan Wisnesky, Ahmed Radwan, Jindan Zhou: "Orchid: Integrating Schema Mapping and ETL". ICDE 2008: 1307-1316
Mauricio A. Hernández, Paolo Papotti, Wang Chiew Tan: "Data Exchange with Data-metadata Translations". PVLDB 1(1): 260-273 (2008)
Laura Chiticariu, Phokion G. Kolaitis, Lucian Popa: "Interactive Generation of Integrated Schemas". SIGMOD Conference 2008: 833-846

