Personalization, Intent, and modifying PageRank calculations
Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.
On the probabilities of transitioning across a link in the link graph, the paper’s example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead, including one derived from “web usage logs”.
Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
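To make the two knobs concrete, here is a minimal sketch of a PageRank power iteration in which both the link-following probabilities and the personalization vector come from supplied weights (for example, click counts) rather than being uniform. This is my own illustration, not code from the paper: the function name, the toy graph, the damping factor of 0.85, and the choice to redistribute rank from pages with no outgoing links according to the personalization vector are all assumptions.

```python
def pagerank(link_weights, personalization, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration with non-uniform link and jump probabilities.

    link_weights: {page: {target: weight}}, e.g. observed click counts per link.
    personalization: {page: weight} for jumps; must list every page in the graph.
    """
    pages = sorted(personalization)
    total_jump = sum(personalization.values())
    jump = {p: personalization[p] / total_jump for p in pages}

    # Normalize each page's outgoing weights into transition probabilities.
    transitions = {}
    for p in pages:
        out = link_weights.get(p, {})
        total = sum(out.values())
        transitions[p] = {q: w / total for q, w in out.items()} if total > 0 else {}

    rank = dict(jump)
    for _ in range(max_iter):
        # Assumption: rank sitting on pages without outgoing links follows the jump vector.
        dangling = sum(rank[p] for p in pages if not transitions[p])
        new_rank = {p: (1 - alpha + alpha * dangling) * jump[p] for p in pages}
        for p in pages:
            for q, prob in transitions[p].items():
                new_rank[q] += alpha * rank[p] * prob
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page graph: same links, different assumptions about clicks.
uniform_links = {"a": {"b": 1, "c": 1}, "b": {"c": 1}, "c": {"a": 1}}
click_counts = {"a": {"b": 90, "c": 10}, "b": {"c": 1}, "c": {"a": 1}}
equal_jumps = {"a": 1, "b": 1, "c": 1}

uniform_rank = pagerank(uniform_links, equal_jumps)
usage_rank = pagerank(click_counts, equal_jumps)
```

In this toy graph, page b ends up with a larger share of the rank under the click-weighted transitions than under the uniform ones, simply because most of the traffic leaving page a goes to b.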
Some thoughts:
1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition probabilities and the personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than what’s relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results for “PageRank”. So I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely solely (or mostly) on usage data. A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied to a significantly reduced set of relevant pages.
2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable because it links mostly to other known spam content, which tends not to be linked to by real pages.
3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than a uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.
4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
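The fallback idea in point 3 can be sketched in a few lines. This is only one possible reading of it: pages from a small topic-specific collection are ordered by their topic-specific score, and the remaining slots are filled from the generic ranking. The function name, the scores, and the "topical results first" policy are all hypothetical.

```python
def merge_results(topic_scores, generic_scores, limit=10):
    """Order topic-collection pages by topic score, then fill from the generic ranking."""
    # Sort by descending score, breaking ties by page name for a stable order.
    topical = sorted(topic_scores, key=lambda p: (-topic_scores[p], p))
    seen = set(topical)
    fallback = sorted(
        (p for p in generic_scores if p not in seen),
        key=lambda p: (-generic_scores[p], p),
    )
    return (topical + fallback)[:limit]


# Hypothetical scores for a query that matched a small topic collection.
topic_scores = {"x": 0.2, "y": 0.9}
generic_scores = {"x": 0.5, "z": 0.8, "w": 0.1}
merged = merge_results(topic_scores, generic_scores, limit=3)
```

A real system would probably blend the two scores rather than strictly concatenating the lists, but the strict version keeps the fallback behavior easy to see.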
Tags: pagerank, google, search, algorithms, research, intent, collaboration, yahoo, spam

December 8th, 2005 at 4:56 pm
Thanks for the follow-up post. Great point on the potential for showing what’s popularized with usage data. However, to the extent that PageRank is attempting to be an indirect estimate of usage — by using link transitions as a proxy for traffic flow — I would think that this problem may already exist.
I think you also make a good point that never-followed outgoing links may have value, though I am concerned that they may usually be spam, as in the example you gave.
You mentioned an interest in personalized search here and in your previous post. This paper focused on a profile-based method of personalized search, building a list of your interests, group or individualized personalization vectors for those interests, and using that to bias all of your searches. This is also the approach described by the Kaltix team and used in Google Personalized Search.
The problems with this approach are that it is expensive to compute all the personalization vectors (or vector fragments), the personalization will not adapt quickly to new data or trends, and the personalization has to be fairly coarse-grained to have any chance of being feasible.
The approach where I have focused my attention is using short-term behavior to do fine-grained search personalization. For example, if I do search A, don’t find what I want, then refine that search to search B, the two searches are treated independently. I see the same results for search B as everyone else sees. That is clearly wrong. There is valuable information in what I found or failed to find in search A that should be applied to improve the results in search B.
More generally, the search and clickstream history of each user seems like it should be part of computing the relevance of the search results for that user.
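As a rough sketch of the short-term idea, assume the engine remembers which results were shown and which were clicked for search A in the same session. One simple (and entirely hypothetical) rule is to demote results for search B that were already shown for search A and skipped. The function name, the penalty factor, and the scores below are illustrative assumptions, not a description of any deployed system.

```python
def rerank_with_session(results, shown_before, clicked_before, penalty=0.5):
    """Demote results the user already saw and skipped earlier in the session.

    results: list of (url, score) pairs for the current search.
    shown_before / clicked_before: URLs shown and clicked for the earlier search.
    """
    skipped = set(shown_before) - set(clicked_before)
    adjusted = [
        (url, score * penalty if url in skipped else score)
        for url, score in results
    ]
    # Highest adjusted score first; ties keep their original relative order.
    return sorted(adjusted, key=lambda pair: -pair[1])


# Search A showed "a" and "b"; only "b" was clicked. Search B returns these:
search_b = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
reranked = rerank_with_session(search_b, shown_before=["a", "b"], clicked_before=["b"])
```

Here the skipped result "a" drops below the others for search B, while everyone without that session history would still see it first.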
December 9th, 2005 at 1:29 pm
Yahoo goes after more tagging assets, buys del.icio.us
Yahoo continues down the path of more tagging and more collaborative content. Having already purchased Flickr, this morning they’re acquiring del.icio.us (terms undisclosed):
From Joshua Schachter at the del.icio.us blog:
We’re proud to…