IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120.
1. Starting from a user-supplied query, HITS assembles an initial set of pages: typically, up to 200 pages returned by a text search engine such as AltaVista on that query. These pages are then expanded to a larger root set by adding any pages that are linked to or from any page in the initial set.
2. HITS then associates with each page p a hub-weight h(p) and an authority weight a(p), all initialized to 1.
3. HITS then iteratively updates the hub and authority weights of each page in the root set as follows. First, under the intuition that a page pointing to good authorities should be considered a good hub, we replace the hub score of each page by the sum of the authorities of the pages it points to. And second, dually, under the intuition that a page pointed to by good hubs should be considered a good authority, we replace the authority score of each page by the sum of the hub scores of the pages that point to it.
4. The update operations are performed for all the pages, and the process repeated (normalizing the weights after each iteration) for some number of rounds. Following this, the pages with the highest h(p) and a(p) scores are output as the best hubs and authorities.
See [Chakrabarti et al], [Kleinberg] for more details on HITS and other link-based search methods.
Next, in the course of developing CLEVER, we encountered a number of idiosyncracies of the web that caused us to modify the algorithm. We now describe these situations, and sketch the resultant modifications:
1. We assume that two pages on the same logical website were created by the same organization or individual, and so should not be allowed to confer authority upon one another. To address this, we vary the weights of the links between pages based on the domains of their end-points.
2. We seek a final set of hubs and authorities that provide good access to a wide variety of information. For instance, two pages that are extremely high quality but contain virtually identical information would not both be returned. To this end, after we output a hub, we diminish the scores of pages that are very similar to that hub.
3. We return only a single point-of-entry into a particular internet resource.
4. We often encounter situations in which a good hub page, for instance, appears with a different level of generality than the query for which it would be useful. As an example, consider the query "mango fruit." A high-quality hub page devoted to exotic fruit might have a section of links on papaya, another section on mango, and finally a section of guava. However, if we consider the page to be a universally good hub, the unrelated destination pages about papaya and guava will be considered to be good authorities. To address this, we identify interesting (physically contiguous) sections of web pages and use these sections to determine which other pages might be good hubs or authorities in their entirety.
1. The web currently contains around 270 million documents
BB98.
so rating the
entire corpus is a daunting task. But we cannot label only a small subset
because broad-topic queries routinely return a million page hits scattered
around the web.
2. Even if it were possible to create such a corpus, a million new pages
arrive daily, so the corpus could be used to evaluate actual search
engines for only a brief window (probably shorter than the time to gather the
relevance judgements) before too many new pages arrived.
3. The composition of a "typical" web document in terms of inlinks,
outlinks, average size, graphical content, layout, function, etc., is dynamic.
Even if we are not interested in comparisons that include actual search
engines, and instead wish to label a snapshot of the web then compare
algorithms on this snapshot, results for the web of a few years ago may not
generalize to the web of today.
4. Existing labeled corpora such as TREC are primarily
standalone text while we examine algorithms that rely fundamentally on the
hyperlinked nature of the web. Even other hyperlinked corpora tend to contain
documents of much higher quality than the web. So modifying existing labeled
corpora for evaluating web-targeted algorithms is also difficult.
5. The closest approximations to "relevance judgements" on today's web are
sites such as Yahoo!, which through human involvement collect
high-quality pages on a number of topics. Unfortunately, due to the growth
figures given above these sites cannot hope to index all pages on the web. If
a search engine returns a page on a particular topic that is also indexed by
Yahoo! under that topic, we have a strong indication that the page is high
quality. If, however (as is more likely), Yahoo! does not index the page, we
have no information about the quality of the page.
More precisely, for each of our 27 broad-topic queries we extracted ten pages from each of our three sources. Altavista and Clever were both run on the query in the usual manner. The same search was entered manually into Yahoo!'s search engine, and of the resulting leaf nodes, the one best matching the query was picked by hand. If the best match contained too few links, the process was repeated to generate additional links. Using this procedure we took the top ten pages from Altavista, the top five hubs and five authorities returned by CLEVER, and a random ten pages from the most relevant node or nodes of Yahoo! (Yahoo! lists pages alphabetically and performs no ranking, hence the unbiased choice of ten random pages.) We then merged these three sets and sorted the resulting approximately 30 pages alphabetically (there are almost never duplicate pages from the three sources). We asked each user to rank each of these pages "bad," "fair," "good," or "fantastic" based on how useful the page would be in learning about the topic. We took good and fantastic pages to be relevant, noted the source (AltaVista/CLEVER/Yahoo) of each page, and then computed precision in the traditional manner. Since our users evaluated only the pages returned from our three sources, but did not know which source returned which page, we refer to this type of data as blind post-facto relevance judgements.
The subjective evaluation of relevance was performed by a set of 37 subjects, yielding 1369 datapoints. The subjects were required to have access to the web and familiarity with a common web browser, but were not experts in computer science, or on the topics. Each subject was free to browse the list of pages at leisure, visiting each page as many times as desired, before deciding on a final quality score.
Figure 2: Average precision of each engine over all queries.