A reading list on PageRank and search algorithms
If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.
- Deeper Inside PageRank (PDF) - Internet Mathematics Vol. 1, No. 3: 335-380 Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
- Online Reputation Systems: The Cost of Attack of PageRank (PDF) - Andrew Clausen. A detailed look at the value and costs of reputation, and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
- SpamRank - Fully Automatic Link Spam Detection - Work in progress (PDF) - András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized PageRank and the local PageRank distribution of linking sites.
- Detecting Duplicate and near duplicate files - William Pugh's presentation slides on US patent 6,658,423 (assigned to Google) for an approach using shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google, and he doesn’t know if this particular method is being used in production (vs some other method).
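To make the shingle idea from the last item concrete, here is a minimal sketch of comparing two texts by their overlapping word windows. It is a generic shingle-plus-Jaccard comparison, not the specific method described in the patent or the slides; the function names (`shingles`, `jaccard`, `similarity`) and the window size `k` are my own assumptions for illustration.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (sliding windows of k consecutive words)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=4):
    """Similarity in [0, 1] based on shared k-word shingles (hypothetical helper)."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    near_copy = "the quick brown fox leaps over the lazy dog"
    print(similarity(original, near_copy, k=3))
```

Comparing full shingle sets like this is quadratic across a large collection, so a real system would presumably hash or sample the shingles first; that part is left out here.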
I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.
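As a rough illustration of what using a trust network to personalize ranking could look like, here is a minimal sketch of personalized PageRank computed by power iteration. It assumes a tiny in-memory link graph and a hand-picked set of trusted pages; the function name, the `damping` value of 0.85, and the example graph are all hypothetical, and none of this is taken from the papers above or from any production system.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=50):
    """Power-iteration PageRank where random jumps land only on trusted pages.

    links:   dict mapping each page to the list of pages it links to
    trusted: iterable of pages forming the (assumed) network of trust
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    trusted = {page for page in trusted if page in nodes}
    if not trusted:
        raise ValueError("at least one trusted page must appear in the graph")

    # Teleport vector: uniform over trusted pages, zero everywhere else.
    teleport = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            if out:
                share = damping * rank[n] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Dangling page: hand its rank back to the trusted set.
                for target in nodes:
                    new_rank[target] += damping * rank[n] * teleport[target]
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {
        "editor": ["news", "blog"],
        "news": ["blog"],
        "blog": ["editor"],
        "spam1": ["spam2"],
        "spam2": ["spam1", "blog"],
    }
    for page, score in sorted(personalized_pagerank(graph, ["editor"]).items()):
        print(page, round(score, 4))
```

In this toy graph the two spam pages link to each other and to the blog, but nothing reachable from the trusted page links back to them, so they end up with no score at all. That is the appeal of a small trust network: rank only flows outward from pages someone has vouched for.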
Tags: search, seo, google, algorithms, spam, research, authority, trust

December 8th, 2005 at 4:09 pm
Personalization, Intent, and modifying PageRank calculations
Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.
On the probabilities of transitioning acros…
December 9th, 2005 at 1:06 am
R&D articles on PageRank, SpamRank, and spam…
Ho John Lee summarizes some interesting recent articles:
Deeper Inside PageRank (PDF): a very thorough article on PageRank by Amy N. Langville and Carl D. Meyer (46 pages). Be warned, a heavy dose of mathematics is guaranteed…
Online…
December 10th, 2005 at 11:43 am
It’s only a small list, though - there are plenty more patents to reference.
December 11th, 2005 at 12:51 am
[…] Ho John Lee’s Weblog » A reading list on PageRank and search algorithms If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. A Compilation Search Technology Book Reviews Want to hack together your own search engine? Curious to dig deeper into data mining? Here’s a compilation of various search-related book reviews published in SearchDay over the past several years. And here’s a websites that you should read if you are interested in Search Engine. […]
December 11th, 2005 at 4:20 pm
Newsweek on white hat and black hat search engine optimization
via Seomoz:
This week’s Newsweek (December 12, 2005) features an article on white hat vs black hat search engine optimization.
A “black hat” anecdote:
Using an illicit software program he downloaded from the Net, he forcibly i…
December 22nd, 2005 at 3:32 pm
Why Link Farms (used to) Work
I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical M…
October 13th, 2006 at 2:06 am
[…] We all know the principle of search engines like Google and if you don’t there are plenty of articles on the mechanics of the web from Ho John Lee’s site though they are a bit technical you could also crawl through my de.icio.us accounts search tag and SEO tag. Search engines allow us to search through vast quantities of known links which are picked up either by users submitting those links or through a bot finding them, and while the page ranking system and similar algorithms used by the search engines are very clever they can not guarantee the content of the site. This became a problem in the late nineties as more and more link farms appeared in the search engines listings. Around the time the search engines developed managed directories also appeared including DMOZ these directories were maintained by humans rather then bots and often provided more relevant results, but these two were easy to fool principally because most categories only had one or a small group of editors. […]