Personalization, Intent, and modifying PageRank calculations
On the probabilities of transitioning across a link in the link graph, the paper’s example on pp. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead including one derived from “web usage logs”.
Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition and personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is it creates a fast positive feedback loop which may only reflect what’s well publicized rather than relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results page for “PageRank”. So…I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied on a significantly reduced set of relevant pages.
2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.
3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.
4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.