Bookmarks for June 11th through June 12th

These are my links for June 11th through June 12th:

Randomly exploring the long tail of search results

I sometimes click on a random “deep” search result page to see if anything interesting turns up, because of the limitations of popularity and PageRank for some queries.

Paul Kedrosky points at a recent paper from CMU which suggests that randomly mixing in some low-ranking pages may improve search results over time.

Unfortunately, the correlation between popularity and quality
is very weak for newly-created pages that have few
visits and/or in-links. Worse, the process by which new,
high-quality pages accumulate popularity is actually inhibited
by search engines. Since search engines dole out
a limited number of clicks per unit time among a large
number of pages, always listing highly popular pages at
the top, and because users usually focus their attention on
the top few results, newly-created but high-quality
pages are “shut out.”

We propose a simple and elegant solution to
this problem: the introduction of a controlled
amount of randomness into search result ranking
methods. Doing so offers new pages a chance
to prove their worth, although clearly using too
much randomness will degrade result quality and
annul any benefits achieved. Hence there is a
tradeoff between exploration to estimate the quality
of new pages and exploitation of pages already
known to be of high quality. We study this tradeoff
both analytically and via simulation, in the context
of an economic objective function based on
aggregate result quality amortized over time. We
show that a modest amount of randomness leads
to improved search results.

Link: Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
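To make the exploration versus exploitation tradeoff a bit more concrete, here is a minimal sketch in Python. It is not the scheme from the paper; it is a simple epsilon-greedy style mix that I am assuming for illustration, where each result slot is filled from a pool of unproven pages with probability `epsilon` and from the popularity ranking otherwise. The function name and parameters are hypothetical.

```python
import random


def partially_randomized_ranking(ranked, pool, k=10, epsilon=0.1, rng=None):
    """Return up to k results, mostly from the head of `ranked`.

    ranked:  pages ordered by popularity, best first
    pool:    new or low-ranking pages that have not had a chance yet
    epsilon: probability that a slot is given to a random page from `pool`
    (Illustrative assumption, not the exact method from the paper.)
    """
    if rng is None:
        rng = random.Random()
    candidates = list(pool)
    rng.shuffle(candidates)      # random order, so pop() is a random draw
    top = iter(ranked)
    results, seen = [], set()
    while len(results) < k:
        if candidates and rng.random() < epsilon:
            page = candidates.pop()          # exploration
        else:
            page = next(top, None)           # exploitation
            if page is None:
                if not candidates:
                    break                    # nothing left to show
                page = candidates.pop()
        if page not in seen:
            seen.add(page)
            results.append(page)
    return results
```

With `epsilon=0` this is the ordinary popularity ranking, and pushing `epsilon` toward 1 replaces it with noise, which is the degradation the abstract warns about.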

Reverse engineering a referer spam campaign

It looks like someone’s launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114.x, somewhere in Ukraine.

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comment links from increasing Google rank for the linked items, but the links are still there for people to click on.
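As a rough illustration of the nofollow wrapping, here is a small Python sketch that forces `rel="nofollow"` onto every link in a comment. It uses regular expressions for brevity, which is an assumption on my part; real blog software would more likely use a proper HTML sanitizer, and `add_nofollow` is a hypothetical name.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute so a commenter cannot supply their own.
REL_ATTR = re.compile(r"""\s+rel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" on every <a> tag."""
    def rewrite(match):
        attrs = REL_ATTR.sub("", match.group(1))   # drop any existing rel
        return f'<a{attrs} rel="nofollow">'
    return A_TAG.sub(rewrite, comment_html)
```

The link still renders and is still clickable; the attribute only asks search engines not to count it.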

Referrer spam is more indirect. It is created by making an HTTP request with the Referer header set to the URL being promoted. Most of the time, this will only be visible in the web server log.

Here is a typical HTTP log entry:

87.219.8.210 	[04/Feb/2006:15:20:35 	-0800]
    GET 	/weblog/archives/2005/09/15/google-blog-search-referrers-working-now 	HTTP/1.1
    403 	- 	"http://every-search.com"
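For anyone who wants to see who is showing up in their own logs, here is a minimal Python sketch that pulls the referrer out of log lines and tallies the hosts. It assumes one entry per line, with the referrer as a double-quoted http(s) URL like the entry above; your log format may differ, and the function names are made up for this example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# A double-quoted field that starts with http:// or https://
QUOTED_URL = re.compile(r'"(https?://[^"]+)"')


def referrer_host(log_line):
    """Return the host name of the quoted referrer URL, or None."""
    match = QUOTED_URL.search(log_line)
    if match is None:
        return None
    return urlsplit(match.group(1)).hostname


def count_referrers(lines):
    """Count how many requests each referrer host accounts for."""
    hosts = (referrer_host(line) for line in lines)
    return Counter(host for host in hosts if host)
```

Calling `count_referrers(open("access.log")).most_common(20)` (the file name is a placeholder) gives a quick view of which hosts dominate during a spam run.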

Some blogs and other web sites post an automatically generated list of “recent referrers” on their home page or on a sidebar. In normal use, this would show a list of the sites that had linked to the site being viewed. Recent referrer lists are less common now, because of the rise of referrer spam.

Referrer spam will also show up in web site statistics and traffic summaries. These are usually private, but are sometimes left open to the public and to search engines.

One presumed objective of a link spam campaign is to increase the target site’s search engine ranking. In general this requires building a collection of valid inbound links, preferably without the “nofollow” attribute. Referrer spam may be more effective for generating inbound links, since recent referrer lists and web site reports typically don’t wrap their links with nofollow.

The landing pages for the links in this campaign are interesting in that they don’t contain advertising at all. This suggests that this campaign is trying to build a sort of PageRank farm to promote something else.

The actual pages are all built on the same blog template, and contain a combination of gibberish and sidebar links to subdomains based on “valuable” keywords. Using the blog format automatically provides a lot of site interlinking, and they also have “recent” and “top referer” lists, which are all from other spam sites in the network.

It looks like the content text should be easy to identify as spam based on frequency analysis. Perhaps having a very large cloud of spam sites linking to each other along with a dispersed set of incoming referrer spam links makes the sites look more plausible to a search engine? These sites don’t appear to have any, but I have come across other spam sites and comment spam posts that have links to non-spam sites such as .gov and .edu sites, perhaps trying to look more credible to a search engine ranking algorithm. All the sites being on the same subnet makes them easier to spot, though.

Given that there aren’t that many public web site stat pages and recent referrer lists around, I’m surprised that referrer spamming is worth the effort. If the spam network can achieve good ranking in Google and the other search engines, they can probably boost the ranking for a selected target site by pruning back some of their initial links and adding some links pointing at the sites that they want to promote. Affiliate links to porn, gambling, or online pharmacy sites must pay reasonably well for this to work out for the spammers.

More reading: A list of references on PageRank and link spam detection.

If you’re having referrer spam problems on your site, you may find my notes on blocking referer spam useful.

Here’s some sample text from “search-buy.com”:

I search-buy over least and and next train. Ne so at cruelty the search-buy in after anaesthesia difficulty general urinating. T pastry a ben for search-buy boy. An refuses trip search-buy romances seemed azusa pacific university ca. Stoc of my is and search-buy direct having sex teen titans. Kid philadelphiaa would and york search-buy. G search-buy wore shed i dads. obstacles future search-buy right had satire nineteenth. The that i ups this on search-buy least finds audio express richmond. have this window been wonderful me search-buy so. Surel in actually search-buy our boy deep franklin notions. An search-buy it of my has of. To at head boy that a search-buy. O james search-buy everywhere of but. Alread originate search-buy good about since.
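One very crude form of frequency analysis is to check how much of the text is taken up by a single repeated keyword. The Python sketch below does just that. The tokenizer, the tiny stopword list, and the idea of using the top token's share as a signal are all my own assumptions, not a tested spam filter.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, just for illustration.
STOPWORDS = {"a", "an", "and", "at", "but", "for", "had", "has", "have", "i",
             "in", "is", "it", "me", "my", "of", "on", "so", "that", "the",
             "this", "to", "would"}


def top_keyword_share(text):
    """Return (token, share): the most frequent non-stopword token and
    the fraction of non-stopword tokens it accounts for."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if not tokens:
        return None, 0.0
    token, count = Counter(tokens).most_common(1)[0]
    return token, count / len(tokens)
```

On text like the sample above, the promoted keyword comes out on top with a share far higher than any single content word would normally have in ordinary prose.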

Here are a few spam sites from this campaign and their IP addresses:

bikini-now.com          A       85.255.114.212
babestrips.com          A       85.255.114.229
search-biz.biz          A       85.255.114.245
bustytart.com           A       85.255.114.250
cjtalk.net              A       85.255.114.227
search-galaxy.org       A       85.255.114.252
moresearch.org          A       85.255.114.237
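Spotting that these all sit on one subnet is easy to automate. Here is a short Python sketch using the standard `ipaddress` module to group the hosts listed above by /24 network; the `group_by_subnet` helper is a name I made up.

```python
import ipaddress
from collections import defaultdict

# Host names and A records copied from the list above.
campaign = {
    "bikini-now.com": "85.255.114.212",
    "babestrips.com": "85.255.114.229",
    "search-biz.biz": "85.255.114.245",
    "bustytart.com": "85.255.114.250",
    "cjtalk.net": "85.255.114.227",
    "search-galaxy.org": "85.255.114.252",
    "moresearch.org": "85.255.114.237",
}


def group_by_subnet(hosts, prefix=24):
    """Map each network (as a CIDR string) to the host names inside it."""
    groups = defaultdict(list)
    for name, addr in hosts.items():
        # strict=False lets us pass a host address and get its network
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        groups[str(net)].append(name)
    return dict(groups)
```

For this list, everything lands in a single group, 85.255.114.0/24.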

Here is the WHOIS output for that netblock:

% Information related to '85.255.112.0 - 85.255.127.255'

inetnum:        85.255.112.0 - 85.255.127.255
netname:        inhoster
descr:          Inhoster hosting company
descr:          OOO Inhoster, Poltavskij Shliax 24, Kharkiv, 61000, Ukraine
remarks:        -----------------------------------
remarks:        Abuse notifications to: abuse@inhoster.com
remarks:        Network problems to: noc@inhoster.com
remarks:        Peering requests to: peering@inhoster.com
remarks:        -----------------------------------
country:        UA
org:            ORG-EST1-RIPE
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
tech-c:         FWHS1-RIPE
status:         ASSIGNED PI
mnt-by:         RIPE-NCC-HM-PI-MNT
mnt-lower:      RIPE-NCC-HM-PI-MNT
mnt-by:         RECIT-MNT
mnt-routes:     RECIT-MNT
mnt-domains:    RECIT-MNT
mnt-by:         DAV-MNT
mnt-routes:     DAV-MNT
mnt-domains:    DAV-MNT
source:         RIPE # Filtered

organisation:   ORG-EST1-RIPE
org-name:       INHOSTER
org-type:       NON-REGISTRY
remarks:        *************************************
remarks:        * Abuse contacts: abuse@inhoster.com *
remarks:        *************************************
address:        OOO Inhoster
address:        Poltavskij Shliax 24, Xarkov,
address:        61000, Ukraine
phone:          +38 066 4633621
e-mail:         support@inhoster.com
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
mnt-ref:        DAV-MNT
mnt-by:         DAV-MNT
source:         RIPE # Filtered

person:         Andrei Kislizin
address:        OOO Inhoster,
address:        ul.Antonova 5, Kiev,
address:        03186, Ukraine
phone:          +38 044 2404332
nic-hdl:        AK4026-RIPE
source:         RIPE # Filtered

person:       Fast Web Hosting Support
address:      01110, Ukraine, Kiev, 20Á, Solomenskaya street. room 201.
address:      UA
phone:        +357 99 117759
e-mail:       support@fwebhost.com
nic-hdl:      FWHS1-RIPE
source:       RIPE # Filtered
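The inetnum range in the WHOIS record is wider than the single class C the spam sites sit in. As a small sketch (helper names are hypothetical), Python's `ipaddress` module can turn the range into CIDR notation and test whether an address falls inside it, which is handy if you want to block or flag the whole allocation.

```python
import ipaddress


def cidr_blocks(first, last):
    """Convert an inclusive first-last address range into CIDR networks."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))


def in_blocks(addr, blocks):
    """True if addr falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in blocks)


# The inetnum range from the WHOIS output above.
inhoster = cidr_blocks("85.255.112.0", "85.255.127.255")
```

Here `inhoster` is the single network 85.255.112.0/20, and `in_blocks("85.255.114.212", inhoster)` is true.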

P.R.A.S.E. – PageRank assisted search engine – compare ranking on Google, Yahoo, and MSN

P.R.A.S.E., aka “Prase” is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
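I don't know how Prase is implemented, but the presentation it describes is easy to sketch. The Python below merges per-engine result lists and sorts them by a PageRank score, highest first, keeping the engine and the original result rank. The function name, the input shapes, and the tie-breaking order are my assumptions.

```python
def merge_by_pagerank(results_by_engine, pagerank):
    """Combine result lists into rows of (pagerank, engine, rank, url),
    sorted from highest PageRank to lowest.

    results_by_engine: {"engine name": [url, url, ...]} in result order
    pagerank:          {url: score}; unknown URLs are treated as PR 0
    """
    rows = []
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            rows.append((pagerank.get(url, 0), engine, rank, url))
    # Highest PageRank first; ties broken by result rank, then engine name.
    rows.sort(key=lambda row: (-row[0], row[2], row[1]))
    return rows
```

A URL returned by more than one engine appears once per engine, so you can compare where each engine placed it.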

I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.

As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.

PageRank is only used by Google, which no longer uses the original PageRank algorithm for ranking results, but it’s still interesting to see the top search results from the three major search engines laid out with PR scores to get some sense of the page linkage.

See also:

Why Link Farms (used to) Work

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have an unrealistic importance. The author then goes on to explain a new algorithm with the same complexity as the original PageRank algorithm that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.
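To see the effect on a toy graph, here is a minimal Python sketch of the textbook PageRank power iteration (not the corrected algorithm from the paper). Rank held by pages with no outbound links is spread uniformly across all pages, which is one common convention and an assumption here. The graph, with ten farm pages pointing at a single sink page, is made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links: {page: [pages it links to]}; pages with no outbound links
    have their rank redistributed uniformly on each iteration.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank


# A small "real" cycle of three pages plus a ten-page link farm
# pointing at one target page that links to nothing.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"], "target": []}
toy_web.update({f"farm{i}": ["target"] for i in range(10)})
toy_ranks = pagerank(toy_web)
```

In this toy graph the sink page ends up ranked above the three interlinked pages, even though its only inbound links come from pages nothing else links to.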

This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.

See also: A reading list on PageRank and Search Algorithms

Personalization, Intent, and modifying PageRank calculations

Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.

On the probabilities of transitioning across a link in the link graph, the paper’s example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead including one derived from “web usage logs”.

Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
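Here is a rough Python sketch of what swapping in those two distributions looks like. It is the usual power iteration, except that the link-following probabilities and the personalization vector are passed in rather than assumed uniform. The function name and the convention that pages without outbound links jump according to the personalization vector are my assumptions, not something taken from the paper.

```python
def personalized_pagerank(transitions, personalization,
                          damping=0.85, iterations=100):
    """PageRank with caller-supplied probabilities.

    transitions:     {page: {target: probability of following that link}},
                     each row summing to 1
    personalization: {page: probability of jumping to it}, summing to 1,
                     with a key for every page in the graph
    """
    pages = list(personalization)
    rank = dict(personalization)
    for _ in range(iterations):
        new = {p: (1 - damping) * personalization[p] for p in pages}
        for p in pages:
            row = transitions.get(p)
            if row:
                for q, prob in row.items():
                    new[q] += damping * rank[p] * prob
            else:
                # No outbound links: jump using the personalization vector.
                for q in pages:
                    new[q] += damping * rank[p] * personalization[q]
        rank = new
    return rank
```

Both inputs could come from usage logs: the rows of `transitions` from observed clicks, and `personalization` from where sessions actually start.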

Some thoughts:

1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition probabilities and the personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than what’s relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results for “PageRank”. So I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely only (or mostly) on usage data. A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied to a significantly reduced set of relevant pages.

2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.

3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.

4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
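As a sketch of the conditional distribution idea from point 3, here is one simple way to estimate link-following probabilities from click data in Python. The add-one style smoothing, which falls back to the uniform assumption when a page has no recorded clicks, is my own choice for illustration, and the function name and input shapes are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(links, clicks, smoothing=1.0):
    """Estimate P(next page | current page) from observed clicks.

    links:     {page: [pages it links to]}
    clicks:    iterable of (source, target) pairs from usage logs
    smoothing: pseudo-count added to every link, so unclicked links
               keep some probability
    """
    counts = defaultdict(Counter)
    for source, target in clicks:
        counts[source][target] += 1
    probs = {}
    for source, outs in links.items():
        outs = list(dict.fromkeys(outs))     # drop duplicate links
        if not outs:
            continue
        total = sum(counts[source][q] for q in outs) + smoothing * len(outs)
        if total == 0:
            # No data and no smoothing: fall back to uniform.
            probs[source] = {q: 1.0 / len(outs) for q in outs}
        else:
            probs[source] = {q: (counts[source][q] + smoothing) / total
                             for q in outs}
    return probs
```

Restricting `clicks` to sessions that share a topic or intent would give a separate table per topic, which is closer to the topic-specific ranking described above.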