Project Overview
The tremendous growth in the price-performance of networking and storage
has fueled the explosive growth of the web. The amount of information easily
accessible from the desktop has increased by several orders
of magnitude in the last few years, and the growth shows no sign of abating. Users
of the web are consequently confronted with an information overload
problem: it can be exceedingly difficult to locate resources that are both
high-quality and relevant to their information needs. Traditional automated
methods for locating information are easily overwhelmed by low-quality
and unrelated content. The second generation of search engines will therefore
need effective methods for focusing on the most authoritative
among these documents. The rich structure implicit in the hyperlinks among
Web documents offers a simple and effective means of dealing with many of
these problems. The CLEVER search engine incorporates several algorithms
that make use of hyperlink structure for discovering high-quality information
on the Web.
Ongoing work in the CLEVER project focuses on higher-level applications
built on the basic CLEVER engine described in the publications below.
Several new directions are emerging within the project:
-
Enhancements to the HITS algorithm. A number of algorithmic methods
to improve the precision and functionality of the basic HITS algorithm.
There are several such related efforts, at Almaden and elsewhere (see, for
instance, our SIGIR98 Workshop paper).
-
Hypertext Classification. Classifying hypertext into a hierarchical
topic taxonomy, using a hyperlink-induced feature set to significantly
improve classification accuracy (see the VLDB Journal paper).
-
Focused Crawling. Using hypertext classification and topic distillation
tools to focus a crawler on a specific topic domain, ignoring
unrelated material. (See the WWW8 paper.)
-
Mining Communities. The web is home to more than 100,000 communities:
groups of people, and the web pages they create and maintain, built around
a shared interest in a particular topic. Finding these communities and organizing
them within a coherent informational framework presents significant technical challenges.
(See the WWW8 paper.)
-
Modeling the web as a graph. What is a good stochastic model for
the web as a graph? An answer to this question would give us ways of predicting
the growth and interconnection structure of the web, and allow us to tune
efficient algorithms for the web. (See the VLDB 99 paper.)
A more detailed technical summary will appear in the August 1999 issue
of IEEE Computer.
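As an illustration of the link-analysis idea behind the basic HITS algorithm mentioned above, here is a minimal Python sketch of the hub/authority iteration as it is commonly described: a page's authority score is the sum of the hub scores of pages linking to it, a page's hub score is the sum of the authority scores of the pages it links to, and both score vectors are renormalized after each round. This is not the CLEVER engine's implementation. The function names `hits` and `normalize`, the fixed iteration count, and the toy graph `EXAMPLE_EDGES` are all assumptions made for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for directed (source, target) links.

    A fixed number of iterations is an illustrative simplification; a real
    system would more likely test for convergence.
    """
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: add up the hub scores of pages that link in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: add up the fresh authority scores of linked-to pages.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: three "hub" pages pointing at two "authority" pages.
EXAMPLE_EDGES = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a1"),
]

hub, auth = hits(EXAMPLE_EDGES)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

In this toy graph the page with the most in-links from good hubs (`a1`) ends up with the highest authority score, while pages that only link out receive an authority score of zero.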
Publications
-
J. Kleinberg. Authoritative
sources in a hyperlinked environment. To appear in the Journal of
the ACM, 1999. Also appears as IBM Research Report RJ 10076, May 1997.
-
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan.
Automatic
Resource Compilation by Analyzing Hyperlink Structure and Associated Text.
Proceedings of the 7th World-Wide Web conference, 1998. Copyright
owned by Elsevier Sciences, Amsterdam.
-
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring
Web Communities from Link Topologies. Proceedings of The Ninth ACM
Conference on Hypertext and Hypermedia, 1998. Copyright owned by ACM.
-
S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan. Scalable
feature selection, classification and signature generation for organizing
large text databases into hierarchical topic taxonomies. VLDB Journal,
1998 (invited).
-
S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan
and A. Tomkins. Spectral filtering for resource discovery. ACM SIGIR Workshop
on Hypertext Information Retrieval on the Web (1998), Melbourne, Australia.
-
S. Chakrabarti, B. Dom and P. Indyk. Enhanced
hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD
1998.
-
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan,
S. Rajagopalan and A. Tomkins. Hypersearching
the web. Scientific American, June, 1999.
-
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan,
S. Rajagopalan and A. Tomkins. Mining
the link structure of the World Wide Web. IEEE Computer.
-
S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling
the Web for emerging cyber-communities. Eighth World Wide Web conference,
Toronto, Canada, May 1999.
-
S. Chakrabarti, M. Van den Berg, and B. Dom. Focused
crawling: a new approach to topic-specific resource discovery. Eighth
World Wide Web conference, Toronto, 1999.
-
J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
The web as a graph: Measurements, models and methods.
Proceedings of the International Conference on Combinatorics and Computing,
1999; invited paper.
-
S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting
large scale knowledge bases from the web. IEEE International conference
on Very Large Databases (VLDB), Edinburgh, Scotland.
-
D. Gibson, J. Kleinberg and P. Raghavan. Clustering categorical data: an
approach based on dynamical systems. Proceedings of the VLDB conference,
1998.
Contacts
The Clever project is a part of the Computer
Science Principles and Methodologies Department at the IBM
Almaden Research Center.
Inquiries on licensing Clever technology may be directed to Rick LeVee,
rplevee@us.ibm.com,
(408) 927 1272.
Press inquiries may be directed to Nam LaMore, nlamore@us.ibm.com,
(408) 927 1282.
To request a copy of the publications listed above, please send email to
sridhar@almaden.ibm.com,
tomkins@almaden.ibm.com,
or ravi@almaden.ibm.com.