
Hacked by keymachine.de

I just noticed that my WordPress installation got hacked by a search engine spam injection attack sometime in the past few weeks. This particular one inserts invisible text with lots of keywords in footer.php. The changes to the file were made using the built-in theme editor, originating from ns.km20725.keymachine.de, which is currently at 84.19.188.144. The spam campaign automatically updates the spam payload every day or so. The links point to a variety of servers that have also been hacked to host the spam content. Here is a sample: http://www.nanosolar.com/feb3/talk.php?28/82138131762.html
I’ve sent an e-mail to Nanosolar, so they’ll probably have that content cleaned up before long. But the automated SEO spam campaign updates the keyword and link payload regularly, so any affected WordPress sites will be updated to point at the new hosting victims.
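If you want to check your own theme files for this kind of payload, a rough scan for hidden blocks stuffed with links is a reasonable first pass. This is just a sketch in Python; the theme path and thresholds are my assumptions, not details from the attack above.

import pathlib
import re

THEME_DIR = pathlib.Path("wp-content/themes")  # adjust to your install
HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

for path in THEME_DIR.rglob("*.php"):
    text = path.read_text(errors="ignore")
    for match in HIDDEN.finditer(text):
        # Look at the text just after the hidden style rule.
        window = text[match.start():match.start() + 500]
        link_count = window.lower().count("<a href")
        if link_count >= 3:  # a hidden block full of links is suspicious
            print(f"{path}: hidden block followed by {link_count} links")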

From a quick check on Google, it looks like keymachine.de is a regular offender.

More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who likes playing EA Sports Baseball 2006, is a White Sox fan, and was planning on going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his newfound time, he’s going to finally get to see the Sixers.

#1521 likes the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.
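Those clusters fall out of a simple group-by on user ID. Here’s a minimal sketch, assuming the released files are tab-separated with the user ID in the first column and the query text in the second (the filename is a placeholder):

from collections import defaultdict

def query_clusters(path):
    # Group raw query lines by user ID: column 0 = user ID, column 1 = query.
    by_user = defaultdict(list)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                by_user[fields[0]].append(fields[1])
    return by_user

# e.g. inspect one user's cluster:
# print(query_clusters("aol-queries.txt")["479"])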

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67), observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

I expect to see more interesting nuggets mined out of the query data, and some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs in the coming days.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT – The first online interface for exploring the AOL search query data is up at www.aolsearchdatabase.com (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT – Here’s another online interface at dontdelete.com (via Infectious Greed).

Update Wednesday 08-09-2006 19:14 PDT – A profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database in the New York Times.

AOL Research publishes 20 million search queries

More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
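A minimal reader for records in that shape, assuming tab-separated lines in the field order above (the actual files may carry a header line with different names):

import csv

FIELDS = ["UserID", "Query", "QueryTime", "ClickedRank", "DestinationDomainUrl"]

def read_events(path):
    # One event per line, tab-separated, in the field order listed above.
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f, fieldnames=FIELDS, delimiter="\t"):
            yield row

# e.g. count distinct users in one file:
# print(len({row["UserID"] for row in read_events("aol-queries.txt")}))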

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.

Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.

Adam D’Angelo says:

This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.

Plentyoffish sees an opportunity for PPC and Adsense spammers:

Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.

I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.

More on the privacy angle from SiliconBeat, Zoli Erdos

See also: Coming soon to DVD – 1,146,580,664 common five-word sequences

Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords that users clicked through to a ringtone site, as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.

Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data

The Long Tail of Invalid Clicks and other Google click fraud concepts

Some fine weekend reading for search engineers, SEOs, and spam network operators:

A 47-page independent report on Google Adwords / Adsense click fraud, filed yesterday as part of a legal dispute between Lane’s Gifts and Google, provides a great overview of the history and current state of click fraud, invalid clicks of all types, and the four-layered filtering process that Google uses to detect them.

Google has built the following four “lines of defense” against invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.

An interesting observation is that most click fraud can be eliminated through simple filters. Alexander Tuzhilin, author of the report, speculates on a Zipf-law Long Tail of less common invalid-click attacks, and observes:

Despite its current reasonable performance, this situation may change significantly in the future if new attacks will shift towards the Long Tail of the Zipf distribution by becoming more sophisticated and diverse. This means that their effects will be more prominent in comparison to the current situation and that the current set of simple filters deployed by Google may not be sufficient in the future. Google engineers recognize that they should remain vigilant against new possible types of attacks and are currently working on the Next Generation filters to address this problem and to stay “ahead of the curve” in the never-ending battle of detecting new types of invalid clicks.

He also highlights the irreducible problem of click fraud in a PPC model:

  • Click fraud and invalid clicks can be defined conceptually, but the only working definition is an operational one.
  • The operational definition of invalid clicks cannot be fully disclosed to the general public, because disclosure would lead to massive click fraud.
  • If the operational definition is not disclosed to some degree, advertisers cannot verify or dispute why they have been charged for certain clicks.

The court settlement asks for an independent evaluation of whether Google’s efforts to combat click fraud are reasonable, which Tuzhilin believes they are. The more interesting question is whether they will continue to be sufficient as time progresses and the Long Tail of click fraud expands.
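To make the “simple filters” point concrete, here’s a toy duplicate-click rule of the kind that could sit at the front of such a pipeline. This is my own illustration, not a rule from the report or from Google:

WINDOW_SECONDS = 60

def filter_clicks(clicks):
    # clicks: iterable of (timestamp_seconds, ip, ad_id), in time order.
    # Drop repeat clicks on the same ad from the same IP inside the window.
    last_seen = {}
    for ts, ip, ad_id in clicks:
        key = (ip, ad_id)
        if key in last_seen and ts - last_seen[key] < WINDOW_SECONDS:
            continue  # treat as invalid: duplicate inside the window
        last_seen[key] = ts
        yield ts, ip, ad_id

clicks = [(0, "1.2.3.4", "ad1"), (5, "1.2.3.4", "ad1"), (90, "1.2.3.4", "ad1")]
print(list(filter_clicks(clicks)))  # the 5-second repeat is dropped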


Google’s PageRank and Beyond – summer reading for search hackers

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)

This book could serve as an excellent introduction to search algorithms for someone with a programming or mathematics background, covering PageRank at length, along with some discussion of HITS, SALSA, and antispam approaches. Some current topics, such as clustering, personalization, and reputation (TrustRank/SpamRank), are not covered in depth, although they are mentioned briefly. The bibliography and web resources provide a comprehensive source list for further research (up through around 2004), which will help point motivated readers in the right direction. I’m sure it will be popular at Google and Yahoo, and perhaps at various SEO agencies as well.

Those with less interest in the innards of search technology may prefer a more casual summer read about Google, such as John Battelle’s The Search. Or get Langville and Meyer’s book, skip the math, and just read the sidebars.

See also: A Reading List on PageRank and Search Algorithms, my del.icio.us links on search algorithms

Reverse engineering a referer spam campaign

It looks like someone launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114 somewhere in Ukraine.

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comments links from increasing Google rank for the linked items, but the links are still there for people to click on.

Referrer spam is more indirect. It is created by making an HTTP request with the REFERER header set to the URL being promoted. Most of the time, this will only be visible in the web server log.
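Mechanically it’s trivial, which is part of why it’s so common. A sketch of a single spam hit in Python (all URLs are placeholders):

import urllib.request

# An ordinary GET whose Referer header carries the URL being promoted.
req = urllib.request.Request(
    "http://blog-being-targeted.example/some/post",
    headers={"Referer": "http://site-being-promoted.example/"},
)
with urllib.request.urlopen(req) as resp:
    pass  # the promoted URL now appears in the target's access log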

Here is a typical HTTP log entry:

87.219.8.210 [04/Feb/2006:15:20:35 -0800]
    GET /weblog/archives/2005/09/15/google-blog-search-referrers-working-now HTTP/1.1
    403 - "http://every-search.com"

Some blogs and other web sites post an automatically generated list of “recent referrers” on their home page or on a sidebar. In normal use, this would show a list of the sites that had linked to the site being viewed. Recent referrer lists are less common now, because of the rise of referrer spam.

Referrer spam will also show up in web site statistic and traffic summaries. These are usually private, but are sometimes left open to the public and to search engines.

One presumed objective of a link spam campaign is to increase the target site’s search engine ranking. In general this requires building a collection of valid inbound links, preferably without the “nofollow” attribute. Referrer spam may be more effective for generating inbound links, since recent referrer lists and web site reports typically don’t wrap their links with nofollow.

The landing pages for the links in this campaign are interesting in that they don’t contain advertising at all. This suggests that this campaign is trying to build a sort of PageRank farm to promote something else.

The actual pages are all built on the same blog template, and contain a combination of gibberish and sidebar links to subdomains based on “valuable” keywords. Using the blog format automatically provides a lot of site interlinking, and they also have “recent” and “top referer” lists, which are all from other spam sites in the network.

It looks like the content text should be easy to identify as spam based on frequency analysis. Perhaps having a very large cloud of spam sites linking to each other, along with a dispersed set of incoming referrer spam links, makes the sites look more plausible to a search engine? These sites don’t appear to link to anything legitimate, but I have come across other spam sites and comment spam posts with links to non-spam sites such as .gov and .edu domains, perhaps trying to look more credible to a search engine ranking algorithm. All the sites being on the same subnet makes them easier to spot, though.

Given that there aren’t that many public web site stat pages and recent referrer lists around, I’m surprised that referrer spamming is worth the effort. If the spam network can achieve good ranking in Google and the other search engines, the operators can probably boost the ranking for a selected target site by pruning back some of their initial links and adding links pointing at the sites they want to promote. Affiliate links to porn, gambling, or online pharmacy sites must pay reasonably well for this to work out for the spammers.

More reading: A list of references on PageRank and link spam detection.

If you’re having referrer spam problems on your site, you may find my notes on blocking referer spam useful.

Here’s some sample text from “search-buy.com”:

I search-buy over least and and next train. Ne so at cruelty the search-buy in after anaesthesia difficulty general urinating. T pastry a ben for search-buy boy. An refuses trip search-buy romances seemed azusa pacific university ca. Stoc of my is and search-buy direct having sex teen titans. Kid philadelphiaa would and york search-buy. G search-buy wore shed i dads. obstacles future search-buy right had satire nineteenth. The that i ups this on search-buy least finds audio express richmond. have this window been wonderful me search-buy so. Surel in actually search-buy our boy deep franklin notions. An search-buy it of my has of. To at head boy that a search-buy. O james search-buy everywhere of but. Alread originate search-buy good about since.

Here are a few spam sites from this campaign and their IP addresses:

bikini-now.com       A    85.255.114.212
babestrips.com       A    85.255.114.229
search-biz.biz       A    85.255.114.245
bustytart.com        A    85.255.114.250
cjtalk.net           A    85.255.114.227
search-galaxy.org    A    85.255.114.252
moresearch.org       A    85.255.114.237
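A quick way to verify the shared class C is to resolve each domain and group by the first three octets (domains from the list above; results obviously depend on live DNS, and these hosts will presumably move or disappear):

import socket
from collections import defaultdict

domains = ["bikini-now.com", "babestrips.com", "search-biz.biz",
           "bustytart.com", "cjtalk.net", "search-galaxy.org",
           "moresearch.org"]

by_subnet = defaultdict(list)
for name in domains:
    try:
        ip = socket.gethostbyname(name)
    except socket.gaierror:
        continue  # domain no longer resolves
    by_subnet[".".join(ip.split(".")[:3])].append((name, ip))

for subnet, hosts in sorted(by_subnet.items()):
    print(subnet + ".0/24:", hosts)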

Here is the WHOIS output for that netblock:

% Information related to '85.255.112.0 - 85.255.127.255'

inetnum:        85.255.112.0 - 85.255.127.255
netname:        inhoster
descr:          Inhoster hosting company
descr:          OOO Inhoster, Poltavskij Shliax 24, Kharkiv, 61000, Ukraine
remarks:        -----------------------------------
remarks:        Abuse notifications to: abuse@inhoster.com
remarks:        Network problems to: noc@inhoster.com
remarks:        Peering requests to: peering@inhoster.com
remarks:        -----------------------------------
country:        UA
org:            ORG-EST1-RIPE
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
tech-c:         FWHS1-RIPE
status:         ASSIGNED PI
mnt-by:         RIPE-NCC-HM-PI-MNT
mnt-lower:      RIPE-NCC-HM-PI-MNT
mnt-by:         RECIT-MNT
mnt-routes:     RECIT-MNT
mnt-domains:    RECIT-MNT
mnt-by:         DAV-MNT
mnt-routes:     DAV-MNT
mnt-domains:    DAV-MNT
source:         RIPE # Filtered

organisation:   ORG-EST1-RIPE
org-name:       INHOSTER
org-type:       NON-REGISTRY
remarks:        *************************************
remarks:        * Abuse contacts: abuse@inhoster.com *
remarks:        *************************************
address:        OOO Inhoster
address:        Poltavskij Shliax 24, Xarkov,
address:        61000, Ukraine
phone:          +38 066 4633621
e-mail:         support@inhoster.com
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
mnt-ref:        DAV-MNT
mnt-by:         DAV-MNT
source:         RIPE # Filtered

person:         Andrei Kislizin
address:        OOO Inhoster,
address:        ul.Antonova 5, Kiev,
address:        03186, Ukraine
phone:          +38 044 2404332
nic-hdl:        AK4026-RIPE
source:         RIPE # Filtered

person:       Fast Web Hosting Support
address:      01110, Ukraine, Kiev, 20Á, Solomenskaya street. room 201.
address:      UA
phone:        +357 99 117759
e-mail:       support@fwebhost.com
nic-hdl:      FWHS1-RIPE
source:       RIPE # Filtered

P.R.A.S.E. – PageRank assisted search engine – compare ranking on Google, Yahoo, and MSN

P.R.A.S.E., aka “Prase”, is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.

I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, and “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.

As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.

PageRank is only used by Google, which no longer uses the original PageRank algorithm for ranking results, but it’s still interesting to see the top search results from the three major search engines laid out with PR scores to get some sense of the page linkage.


Why Link Farms (used to) Work

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to take on unrealistic importance. The author then goes on to describe a new algorithm, with the same complexity as the original PageRank algorithm, that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.
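To see the effect numerically, here’s a toy power-iteration PageRank; this is my sketch of the textbook algorithm with the common uniform treatment of dangling pages, not the paper’s corrected method or Google’s production code. A handful of farm pages pointing at a single sink push its score far above an ordinary page:

import numpy as np

def pagerank(links, n, d=0.85, iters=100):
    # Build a column-stochastic transition matrix from (src, dst) links.
    M = np.zeros((n, n))
    out = [0] * n
    for src, dst in links:
        M[dst, src] += 1.0
        out[src] += 1
    for j in range(n):
        if out[j] == 0:
            M[:, j] = 1.0 / n  # dangling page: pretend it links everywhere
        else:
            M[:, j] /= out[j]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)  # teleport + follow-links step
    return r

# Pages 0-2: a small "normal" web linking in a cycle.
# Pages 3-9: farm pages that all point at page 10, a sink with no outlinks.
links = [(0, 1), (1, 2), (2, 0)] + [(farm, 10) for farm in range(3, 10)]
scores = pagerank(links, 11)
print("sink target page:", scores[10])  # far higher than...
print("ordinary page:   ", scores[0])   # ...a normal page's score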

This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.

See also: A reading list on PageRank and Search Algorithms

Newsweek on white hat and black hat search engine optimization

via Seomoz:

This week’s Newsweek (December 12, 2005) features an article on white hat vs black hat search engine optimization. Among other things, it’s interesting that the topic has made it into the mainstream media.

A “black hat” anecdote:

Using an illicit software program he downloaded from the Net, he forcibly injected a link to his own private-detectives referral site onto the site of Long Island’s Stony Brook University. Most search engines give a higher value to a link on a reputable university site.

The site in question appears to be “www.private-detectives.org”, still currently #1 at MSN and #4 at Yahoo for searches on “private detectives”. It appears to have been sandboxed on Google.

Another interesting post at Seomoz features comments from “randfish” and “EarlGrey”, the two SEO consultants interviewed by Newsweek on the merits of “White Hat” vs “Black Hat” search engine optimization, and gives further perspective on the motivation and outlook of the two approaches.

In some ways one can think of the difference between search engine optimization approaches as a “trading” approach vs a “building” approach to investment. The “Black Hat” approach articulated in the Seomoz article tends to focus purely on a tactical present cash return to the operator, while the “White Hat” approach presumes that the operator will realize ongoing future value by developing a useful information asset and making it visible to the search engines. This makes an implicit assumption that the site itself offers some unique and valuable information content, which usually isn’t the case in the long run.

From an information retrieval point of view, I’m obviously in the latter camp of thinking that identifying the most relevant results for the search user is a good thing. However, the black hat approach makes perfect sense if you consider it in terms of optimizing the short term value return to the publisher (cash as information), while possibly still presenting a useable information return to the search user. This is especially the case for commodity information or products, in which the actual information or goods are identical, such as affiliate sales.

I’m a little curious about the link from Stony Brook University. I took a quick look but wasn’t able to turn up a backlink. One of the problems with simply relying on trusted link sources is that they can be gamed, corrupted, or hacked.

See also: A reading list on PageRank and search algorithms

Update 12-12-2005 00:30 PST: Lots of comments on Matt Cutts’s post, plus Slashdot

A reading list on PageRank and search algorithms

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.

  • Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380, Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
  • Online Reputation Systems: The Cost of Attack of PageRank (PDF)
    Andrew Clausen. A detailed look at the value and costs of reputation, with some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
  • SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF)
    András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized pagerank and local pagerank distribution of linking sites.
  • Detecting Duplicate and Near-Duplicate Files – William Pugh’s presentation slides on US patent 6,658,423 (assigned to Google), describing an approach that uses shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google, and he doesn’t know whether this particular method is being used in production (vs. some other method). A small shingling sketch follows this list.
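Here’s the core shingling idea from that last item reduced to a few lines; the patented scheme hashes and samples shingles to scale, which this sketch skips:

def shingles(text, k=4):
    # Every k-word window in the document, as a set of tuples.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Set overlap: 1.0 for identical shingle sets, 0.0 for disjoint.
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaped over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))  # high score = near duplicate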

I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.

Mod_rewrite for moving web content to a new domain

I just wasted 10 minutes getting this to work correctly, so I thought I’d write it down…

Here’s what you need to use mod_rewrite to implement a permanent 301 Moved HTTP response when you move a web site from a subdirectory on one domain to a new top level domain.

(Assuming you’re on a hosted service, and can use .htaccess):

RewriteEngine on
RewriteBase /
RewriteRule ^olddir/?(.*)$ http://new-domain.com/$1  [R=permanent,L]

where the old content was originally in a subdirectory called “olddir” and is getting moved to a new directory on a different server.

This allows you to move the content to a new, separate domain and/or server without breaking your existing links.
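To check the rule is doing what you expect, make a request that doesn’t follow redirects and inspect the response. A sketch (the old-domain URL is a placeholder for your own site):

import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # don't follow; we want to see the redirect itself

opener = urllib.request.build_opener(NoRedirect)
try:
    opener.open("http://old-domain.example/olddir/some/page.html")
except urllib.error.HTTPError as e:
    # expect: 301 http://new-domain.com/some/page.html
    print(e.code, e.headers.get("Location"))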

link: more on .htaccess and mod_rewrite in the Apache documentation

Yahoo Site Explorer

Yahoo Search Blog announces Yahoo Site Explorer, a handy alternative to searching with “site:” or “link:” to see what’s getting indexed and linked at Yahoo Search. It’s billed as a work in progress; at the moment you can:

  • Show all subpages within a URL indexed by Yahoo!, which you can see for stanford.edu here. You can also see subpages under a path, such as Professor Knuth’s pages.
  • Show inlinks indexed by Yahoo! to a URL, such as for Professor Knuth’s pages, or for an entire site like stanford.edu.
  • Submit missing URLs to Yahoo

There is also a web service API for programmatic queries.
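For programmatic access, something like the request below should work; note that the endpoint and parameter names here are my recollection of Yahoo’s V1 web service conventions rather than verified documentation, so treat them as assumptions:

import urllib.parse
import urllib.request

# Endpoint and parameter names are assumptions based on Yahoo's V1
# search web services; check the official documentation before use.
params = urllib.parse.urlencode({
    "appid": "YOUR_APP_ID",  # Yahoo requires a registered application ID
    "query": "http://www.stanford.edu/",
    "results": 10,
})
url = "http://search.yahooapis.com/SiteExplorerService/V1/inlinkData?" + params
with urllib.request.urlopen(url) as resp:
    print(resp.read()[:500])  # XML listing of inlinks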

Discussion at Search Engine Watch, Webmaster World.

Danny Sullivan at Search Engine Watch posted a synopsis on the SEW Forum:

I’ve done a summary of things over here on the blog, which also links to a detailed look for SEW paid members.

Here are my top line thoughts:

You can see all pages from all domains, one domain, or a directory/section within a domain. Thumbs up!

You can NOT pattern match to find all URLs from a domain. That would be nice.

You can see all links to a specific page or a domain. Thumbs up!

You can NOT exclude your own links, very unfortunately. Two thumbs down!

You can export data, but only the first 50 items, unfortunately. Thumbs down!

More wish list stuff:

Search commands such as link: aren’t supported, and I hope that might come.

You can get a feed of your top pages, but I want a feed of backlinks to inform me of new ones that are found. Site owners deserve just as much fun as blog owners in knowing about new links to them!

Some of the other posts discuss interesting things you can do with the existing “advanced search” options. I’ll have to try some out, both through Yahoo Site Explorer and using some of the suggested link queries which apparently can’t be done yet through Site Explorer.

Dredging for Search Relevancy

I am apparently a well-trained, atypical search user.

Users studied in a recently published paper clicked on the top search result almost half the time. That’s not new, but in this study the researchers also swapped the result order for some users, and people still mostly clicked on the top result.

I routinely scan the full page of search results, especially when I’m not sure where I’m going to find the information I’m looking for. I often randomly click on the deeper results pages as well, especially when looking for material from less-visible sites. This works for me because I’m able to scan the text on the page quickly, and the additional search pages also return quickly. This seems to work especially well on blog search, where many sites are essentially unranked for relevancy.

This approach doesn’t work well if you’re not used to scanning over pages of text, and also doesn’t work if the search page response time is slow.

On the other hand, I took a quick try at some of the examples in the research paper, and my queries (on Google) generally have the answer in the top 1-2 results already.

From Jakob Nielsen’s Alertbox, September 2005:

Professor Thorsten Joachims and colleagues at Cornell University conducted a study of search engines. Among other things, their study examined the links users followed on the SERP (search engine results page). They found that 42% of users clicked the top search hit, and 8% of users clicked the second hit. So far, no news. Many previous studies, including my own, have shown that the top few entries in search listings get the preponderance of clicks and that the number one hit gets vastly more clicks than anything else.

What is interesting is the researchers’ second test, wherein they secretly fed the search results through a script before displaying them to users. This script swapped the order of the top two search hits. In other words, what was originally the number two entry in the search engine’s prioritization ended up on top, and the top entry was relegated to second place.

In this swapped condition, users still clicked on the top entry 34% of the time and on the second hit 12% of the time.

For reference, here are the questions that were asked in the original study (182KB, PDF).

Navigational

  • Find the homepage of Michael Jordan, the statistician.
  • Find the page displaying the route map for Greyhound buses.
  • Find the homepage of the 1000 Acres Dude Ranch.
  • Find the homepage for graduate housing at Carnegie Mellon University.
  • Find the homepage of Emeril – the chef who has a television cooking program.

Informational

  • Where is the tallest mountain in New York located?
  • With the heavy coverage of the democratic presidential primaries, you are excited to cast your vote for a candidate. When are democratic presidential primaries in New York?
  • Which actor starred as the main character in the original Time Machine movie?
  • A friend told you that Mr. Cornell used to live close to campus – near University and Steward Ave. Does anybody live in his house now? If so, who?
  • What is the name of the researcher who discovered the first modern antibiotic?

Google Blog Search – Referrers Working Now

Looks like Google Blog Search took out the redirects that were breaking the referrer headers.

Now the search keywords are visible again. Here’s a typical log entry:

xxx.xxx.xxx.xxx - - [15/Sep/2005:15:58:13 -0700]
"GET /weblog/archives/2005/09/15/podcasting-and-audio-search-at-sdforum-searchsig-september-2005/
HTTP/1.1" 200 26981 "http://blogsearch.google.com/blogsearch?hl=en&q=odeo&btnG=Search+Blogs&scoring=d"
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.10) Gecko/20050716
Firefox/1.0.6"
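Extracting the search terms from a referrer like that takes one line with the standard URL tools; a minimal sketch using the referrer from the entry above:

from urllib.parse import parse_qs, urlparse

referrer = ("http://blogsearch.google.com/blogsearch"
            "?hl=en&q=odeo&btnG=Search+Blogs&scoring=d")
query = parse_qs(urlparse(referrer).query).get("q", [""])[0]
print(query)  # -> odeo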

Blogger Buzz says the redirect was in place during development to help keep the project under wraps.

Google Blog Search – No Referrer Keywords?

Feature request to Google Blog Search team: please add search query info to the referrer string.

Lots of coverage this morning from people trying out Google Blog Search. (Search Engine Watch, Anil Dash, lots more)

I’m seeing some traffic from Google Blog Search overnight, but it looks like they don’t send the search query in the referrer. Here’s a sample log entry:

xxx.xxx.xxx.xxx - - [14/Sep/2005:00:51:09 -0700] "GET /weblog/archives/2005/09/14/google-blog-search-launches/ HTTP/1.1" 200 22964 "http://www.google.com/url?sa=D&q=http://www.hojohnlee.com/weblog/archives/2005/09/14/google-blog-search-launches/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"

So there’s no way to know the original search query. I have a pretty good idea how the overnight traffic looking for the Google post got here, but there are also people landing on fairly obscure pages here and I’m always curious how they found them. I’m sure the SEO crowd will be all over this shortly.

There have been a number of comments that Google Blog Search is sort of boring, but I’m finding that there’s good novelty value in having really fast search result pages. Haven’t used it enough to get a sense of how good the coverage is, or how fast it updates, but it will be a welcome alternative to Technorati and the others.

Update 09-14-2005 14:01 PDT: These guys think Google forgot to remove some redirect headers.

Update 09-14-2005 23:25 PDT: Over at Blogger Buzz, Google says they left the redirect in by accident, will be taking them out shortly:

“After clicking on a result in Blog Search, I’m being passed through a redirect. Why?”
Sadly, this wasn’t part of an overly clever click-harvesting scheme. We had the redirects in place during testing to prevent referrer-leaking and simply didn’t remove them prior to launch. But they should be gone in the next 24 hours … which will have the advantage of improving click-through time.