Project Description
With the tremendous growth in the volume of semi-structured and unstructured content within enterprises (e.g., email archives and customer support databases), there is increasing interest in harnessing this content to power search and business intelligence applications. Traditional enterprise analytics infrastructure is not designed to meet the demands of large-scale, compute-intensive analytics over semi-structured content.
In the CAP project, we are developing an enterprise content analytics platform that leverages the Hadoop map-reduce framework to support this emerging class of analytic workloads. Two core components of the platform are Jaql, a declarative language for expressing transformations over semi-structured data, and SystemT-IE, a high-performance information extraction engine. In addition, we are building MetaTracker, a data-centric flow manager for defining, managing, and deploying analytic workflows on this software stack.
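
As a rough illustration of the kind of workload described above, the following is a minimal sketch in plain Python of extraction followed by aggregation in a map-reduce style. It does not use Hadoop, Jaql, or SystemT-IE; the `PHONE` pattern, the function names, and the sample documents are all hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule; a real extractor would use far richer rules.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def map_extract(doc):
    """Map step: emit (phone_number, doc_id) pairs found in one document."""
    for match in PHONE.finditer(doc["body"]):
        yield match.group(), doc["id"]


def reduce_group(pairs):
    """Reduce step: collect the sorted document ids for each phone number."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    return {key: sorted(ids) for key, ids in grouped.items()}


def run(docs):
    """Run the map step over every document, then reduce the emitted pairs."""
    pairs = (pair for doc in docs for pair in map_extract(doc))
    return reduce_group(pairs)


# Made-up sample content standing in for email and support-ticket text.
docs = [
    {"id": "email-1", "body": "Call me at 408-555-0100 tomorrow."},
    {"id": "ticket-7", "body": "Customer phone: 408-555-0100, alt 650-555-0199."},
]
result = run(docs)
```

Here `result` maps each extracted phone number to the documents that mention it. On a real cluster the map and reduce steps would run in parallel across many machines; this sketch runs them sequentially in one process.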

Project Contact: Sriram Raghavan
Publications
Kevin S. Beyer, Vuk Ercegovac, Rajasekar Krishnamurthy, Sriram Raghavan, Jun Rao, Frederick Reiss, Eugene J. Shekita, David E. Simmen, Sandeep Tata, Shivakumar Vaithyanathan, Huaiyu Zhu: "Towards a Scalable Enterprise Content Analytics Platform". IEEE Data Eng. Bull. 32(1): 28-35 (2009)

