Bookmarks for February 18th through February 19th

These are my links for February 18th through February 19th:

Google search results and DMOZ editorializing?

I’ve never seen a search result page like this before. The meta text “Conservative think tank claiming to report about events and nations strategically important to the United States” doesn’t appear anywhere in the referenced page, which doesn’t contain any useful <META> content. Searching for that text, it looks like it originated from the DMOZ directory listing.

Another entry from the same DMOZ list, the Kensington Review, also returns the DMOZ meta text, this time in place of the <META> text in the actual page. DMOZ says “An e-magazine of political and social commentary. When the left says the glass is half full and the right says it is half empty, Kensington suggests that it might be too big.” Kensington’s own META says “An electronic journal of political, financial and social commentary”. The DMOZ text is a more interesting description, but again it does not originate from the content itself.

So it appears that DMOZ editors have greater influence over certain Google search descriptions than the actual sites themselves, which is not necessarily bad, but was certainly unexpected (to me). Overall I’d prefer that Google limit its editorial function to ranking and presenting the search results, and perhaps make the editorial opinions known without presenting them as definitive.

I’m not particularly familiar with the Jamestown Foundation, which is why I was searching in the first place. The DMOZ editor is clearly skeptical but I’d rather form my own opinion. 

Hello stealthy readers

Hello, dear readers. I had lunch with some friends the other day and they mentioned that I hadn’t posted in a while. Sorry I haven’t been paying much attention to this site lately, other than knocking back comment and link spam. I recently saw that Google Reader is starting to report subscription statistics, which prompted me to take a look. It’s been a while since I looked over the server logs, and I was surprised at the number of RSS subscriptions that have accumulated (i.e. it’s more than I can account for by friends, family, and random acquaintances). I didn’t know you were out there, but now that you’re decloaked and I can see you, I wanted to say hello.

I ended up taking a break from posting for a few weeks (since the beginning of the year). Not by coincidence, I’ve also ramped up my running since the beginning of the year, prepping for this year’s Big Sur Marathon, while holding other obligations roughly constant.

Anyway, I think I’ll try some different approaches to posting here and see how it works out.

Ms. Dewey – Stylish search, with whips, guns, and dating tips


It’s been a while since I’ve come across something I haven’t seen before online. Ms. Dewey fits the bill. It is a Flash-based application combining video clips of actress Janina Gavankar with Windows Live search.

As a search application, it’s fat, slow, and the query results aren’t great. However, as John Battelle observes, “clearly, search ain’t the point.” This is search with a flirty attitude, where the speed and quality of the results aren’t at the top of the priority list.

As short-attention-span theater goes, it’s quite entertaining.

If you can’t think of anything to search for, Ms. Dewey will fidget for a while and eventually reach out and tap on the screen. “Helloooo…type something here…”

It’s far more interesting to try some queries and check out the responses. I spent over half an hour typing in keywords to see what would come up, starting with some of the suggestions from Digg and Channel9. The application provides a semi-random set of video responses based on the search keywords, so you won’t always get the same reaction each time.

The whip and riding crop don’t always appear when you’d think, the lab coat seems to be keyed to science and math (try “partial differential equation”), and I’m not sure what brings on the automatic weapons.

“Ms. Dewey” also has a MySpace page with more video clips. The way the application is constructed, they can probably keep updating and adding responses as long as they want to.

I briefly tried using Ms. Dewey in place of Google, as a working search engine, but it takes too long to respond to a series of queries (you have to wait for the video to play) and the search results aren’t great (Live is continuing to improve, though). At the moment this is a fun conceptual experiment.

I wonder if we’ll see a new category of search emphasizing style (entertainment, attitude, sex) over substance (relevance, speed, scope). Today’s version might already work for the occasional search user, but imagine Ms. Dewey with faster, non-blocking search results, a better search UI, and Google’s results. It all vaguely reminds me of a William Gibson novel.

More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who like playing EA Sports Baseball 2006, is a White Sox fan, and was planning going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.

#1521 like the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67) observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

I expect to see more interesting nuggets mined out of the query data, and some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs in the coming days.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT – The first online interface for exploring the AOL search query data is up at www.aolsearchdatabase.com (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT – Here’s another online interface at dontdelete.com (via Infectious Greed)

Update Wednesday 08-09-2006 19:14 PDT – A profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database in the New York Times.

AOL Research publishes 20 million search queries

More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
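To illustrate why per-user clusters are the sensitive part, here’s a minimal Python sketch that groups records by user ID. It uses the five fields listed above, but the tab delimiter, the sample rows, and the helper name `group_queries_by_user` are my own assumptions for illustration, not details taken from the actual AOL files.

```python
import csv
import io
from collections import defaultdict

# Field names follow the list in the post; the tab delimiter is an assumption.
FIELDS = ["UserID", "Query", "QueryTime", "ClickedRank", "DestinationDomainUrl"]

# Invented sample rows for illustration only (not real AOL data).
SAMPLE = (
    "1001\tmarathon training plan\t2006-03-01 07:15:02\t1\thttp://www.example.com\n"
    "1001\tbig sur marathon course map\t2006-03-01 07:19:44\t\t\n"
    "2002\tused cars detroit\t2006-03-02 21:03:10\t3\thttp://www.example.org\n"
)

def group_queries_by_user(text):
    """Return {user_id: [query, ...]} in the order the queries appear."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="\t")
    clusters = defaultdict(list)
    for row in reader:
        clusters[row["UserID"]].append(row["Query"])
    return dict(clusters)

clusters = group_queries_by_user(SAMPLE)
for user_id, queries in clusters.items():
    print(user_id, queries)
```

Even with the user name replaced by a number, everything one person searched for ends up in a single list, which is exactly what makes re-identification plausible.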

Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.

Adam D’Angelo says:

This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.

Plentyoffish sees an opportunity for PPC and Adsense spammers:

Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.

I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.

More on the privacy angle from SiliconBeat, Zoli Erdos

See also: Coming soon to DVD – 1,146,580,664 common five-word sequences

Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords which users clicked through to a ringtone site as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.

Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data

Google is having problems this evening?

For the past half hour or so this evening (20:30 – 21:00 PDT), I’ve been getting slow responses or connection timeouts from Google. Usually this means that the local network is having problems, but other major sites (Yahoo, CNN) are running as quickly as ever, along with various SSH sessions around the world, so it seems to be specific to Google.

So far I get slow or no response from the main search page, Gmail, Adsense, Adwords, Analytics, and Finance.

Pages that do respond are coming back in 10+ seconds, and some pages are loading without graphics or with templates only and no content.

Anyone else seeing these problems? This is the first time I’ve seen Google unusable for more than a minute or two (unlike this site, which has been bouncing up and down lately due to problems at Dreamhost).

Search referrals – July 2006 snapshot


Here’s a quick snapshot of incoming search engine referrals for the past few weeks. Compare this with another post last year on search engine referral share, recently referenced in a post at Alexa noting the discrepancy between the published search engine traffic reports and anecdotal observations by webmasters.

Is it just me, or are these charts a bit goofy? Does Yahoo really still have 23% of the search market? Is Google at less than half the search market?

I don’t believe it. Any webmaster will tell you that Google represents almost ALL of the search engine traffic. Yahoo is nowhere near 23%. Just read the blogs, here, here, here and here and on countless other blogs.

Google was already at 82% of the incoming search traffic here last October, and has since increased its share to 92%, largely at the expense of “Other”. In the fall, it looked like those were mostly miscellaneous Chinese search engines, so perhaps my site is not getting indexed or ranked well there anymore, or Google is picking up market share, or both.

Some of the commenters at the Alexa post noted increasing traffic from Microsoft / MSN / Live search, including one who got most of their traffic through MSN search. I’m a little surprised that I don’t see more traffic from Yahoo and Microsoft search here, but that may also be a function of who’s likely to be searching for a given topic.

See also Greg Linden’s comments on the competitiveness of Yahoo and Microsoft search efforts

The Long Tail of Invalid Clicks and other Google click fraud concepts

Some fine weekend reading for search engineers, SEOs, and spam network operators:

A 47-page independent report on Google Adwords / Adsense click fraud, filed yesterday as part of a legal dispute between Lane’s Gifts and Google, provides a great overview of the history and current state of click fraud, invalid clicks of all types, and the four-layered filtering process that Google uses to detect them.

Google has built the following four “lines of defense” against invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.

An interesting observation is that most click fraud can be eliminated through simple filters. Alexander Tuzhilin, author of the report, speculates on a Zipf-law Long Tail of invalid clicks of less common attacks, and observes:

Despite its current reasonable performance, this situation may change significantly in the future if new attacks will shift towards the Long Tail of the Zipf distribution by becoming more sophisticated and diverse. This means that their effects will be more prominent in comparison to the current situation and that the current set of simple filters deployed by Google may not be sufficient in the future. Google engineers recognize that they should remain vigilant against new possible types of attacks and are currently working on the Next Generation filters to address this problem and to stay “ahead of the curve” in the never-ending battle of detecting new types of invalid clicks.
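To make “simple filters” concrete, here’s a minimal Python sketch of one rule-based filter: flag a click as invalid if the same IP clicked the same ad within a short window. The rule, the 10-second window, the `Click` record, and the function name are all my own hypothetical choices. This is not one of Google’s actual filters, which the report says can’t be fully disclosed.

```python
from collections import namedtuple

# Hypothetical click record; timestamp is in seconds.
Click = namedtuple("Click", ["ip", "ad_id", "timestamp"])

def flag_rapid_repeats(clicks, window_seconds=10):
    """Return (click, is_invalid) pairs in timestamp order.

    A click is marked invalid if the same IP clicked the same ad
    no more than window_seconds earlier.
    """
    last_seen = {}  # (ip, ad_id) -> timestamp of the most recent click
    results = []
    for click in sorted(clicks, key=lambda c: c.timestamp):
        key = (click.ip, click.ad_id)
        previous = last_seen.get(key)
        is_invalid = (
            previous is not None
            and click.timestamp - previous <= window_seconds
        )
        results.append((click, is_invalid))
        last_seen[key] = click.timestamp
    return results

clicks = [
    Click("192.0.2.1", "ad-1", 0),
    Click("192.0.2.1", "ad-1", 3),   # repeat within the window
    Click("192.0.2.1", "ad-1", 60),  # well outside the window
    Click("192.0.2.2", "ad-1", 4),   # different IP
]
flags = [invalid for _, invalid in flag_rapid_repeats(clicks)]
print(flags)
```

A rule like this catches the crude, common cases. The Long Tail concern is about attacks varied enough that no short list of such rules covers them.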

He also highlights the irreducible problem of click fraud in a PPC model:

  • Click fraud and invalid clicks can be defined conceptually, but the only working definition is an operationally defined one.
  • The operational definition of invalid clicks cannot be fully disclosed to the general public, because it will lead to massive click fraud.
  • If the operational definition is not disclosed to some degree, advertisers cannot verify or dispute why they have been charged for certain clicks.

The court settlement asks for an independent evaluation of whether Google’s efforts to combat click fraud are reasonable, which Tuzhilin believes they are. The more interesting question is whether they will continue to be sufficient as time progresses and the Long Tail of click fraud expands.

Google’s PageRank and Beyond – summer reading for search hackers

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)

This book could serve as an excellent introduction to search algorithms for someone with a programming or mathematics background, covering PageRank at length, along with some discussion of HITS, SALSA, and antispam approaches. Some current topics, such as clustering, personalization, and reputation (TrustRank/SpamRank) are not covered here, although they are mentioned briefly. The bibliography and web resources provide a comprehensive source list for further research (up through around 2004), which will help point motivated readers in the right direction. I’m sure it will be popular at Google and Yahoo, and perhaps at various SEO agencies as well.

Those with less interest in the innards of search technology who would enjoy a more casual summer read about Google should try John Battelle’s The Search. Or get Langville and Meyer’s book, skip the math, and just read the sidebars.

See also: A Reading List on PageRank and Search Algorithms, my del.icio.us links on search algorithms

Google Finance launches

Google launched Google Finance today. Lots of people have written about it already, generally nonplussed. Here’s my quick reaction.

I like:

  • News events plotted on the stock chart timeline. I wish they’d add this to Yahoo Finance.
  • Ajax UI for scrolling the stock chart around and changing the time window
  • Recent blog search results on the right sidebar (although they seem to be a few hours behind)

I wish for:

  • More charting features. There basically aren’t any right now.
  • Better integration of the “More Resources” features. Things like SEC filings, institutional holders, and earning estimates are all provided by 3rd parties via outbound links, making it hard to flip through.

Technical charting and research reports are provided via Yahoo Finance, although the discussions are hosted at Google Groups.

The feature I’d really like to see is an intelligently filtered view of the Yahoo Finance discussion boards. There is some interesting and useful information there, but a far larger quantity of rants, spam, and trolling in between.

More tea leaves from Google’s analyst day presentation

It seems that a lot of the interesting content from last week’s analyst event at Google is in the speaker notes from the PowerPoint slide deck. Greg Linden and others have already pointed out the notes about Google’s storage plans (GDrive, Lighthouse on slide 19).

This afternoon there’s another blip on CNBC about accidental communications in the slides.

The previously undisclosed notes stated that Google’s core advertising business was expected to grow by nearly 60 percent to $9.5 billion in 2006 but that profit margins in its mainstay AdSense business could be squeezed this year and beyond.

I didn’t remember seeing a revenue forecast in there, so I went back and looked to see what it actually said (slide 14).

Our ads business for the moment is healthy and growing and we’re on a strong trajectory
projected to grow from $6bn this year to $9.5bn next year based purely on trends in traffic and monetization growth

But strong competitors are attempting to aggregate traffic
AdSense margins will be squeezed in 2006 and beyond
Y! and MSN will do un-economic things to grow share
The ad network will be commoditized over time
So, we need to build a more complete ads system that is characterized by two words: wider and deeper. That is, cast the net wider to attract new customer types) and deeper to enhance our relationship with existing customers.

Reuters says these particular notes were supposedly left in accidentally from internal planning discussions in late 2005.

“These notes were not created for financial planning purposes, and should not be regarded as financial guidance. Consistent with past practice, Google is not providing revenue guidance,” Google said in the filing.

I liked “Y! and MSN will do un-economic things to grow share”.

Don’t think we’ll be getting PowerPoint files from Google investor relations next time around. There’s a PDF file up now.

Update 03-08-2006 21:34 PDT: Paul Kedrosky has posted a copy of the original PPT slides.

Will Google grow at this rate forever? No? Then DIE!!

Today was a moderately exciting or irritating day to be an investor in public technology companies. Google’s CFO, George Reyes, apparently forgot that he was webcasting to a public group of investors rather than conferencing with an in-house team at the Googleplex during the Q&A session at the Merrill Lynch Internet, Advertising, Information, & Education conference: (Yahoo/AP News)

Q: Looking back to Q3 2005, was there anything in there that was maybe sort of one-time in nature that accounted for such strong revenue growth…?

A: So we went through a period of probably 18 months where we thought we had…well, let me characterize it…we had what was called a RevForce initiative–Revenue Force–which was really a team of really very bright technical engineers that were trying to tweak and optimize the ad system, and not–you know in very very responsible ways [Don't Be Evil!]–and that sort of paid off nicely with the fruits of that labor.

And what’s happened since then is that we got so good and so efficient at that back then that really most of what’s left is just organic growth, which means you have to grow your traffic and your have to grow your monetization.

But so, I think, we’re now, clearly our growth rates are slowing. And you see that each and every quarter. And we’re going to have to find other ways, you know, to monetize the business.

Later in the Q&A there’s something about the “law of large numbers” ultimately limiting growth due to running out of people to look at advertising. These are high class problems to have, and these sound like perfectly intelligent comments for an internal coffeetalk or private discussion. But when your stock is trading at 72x earnings, it’s a bad thing when the CFO says “growth is slowing” to a room of investors looking for extreme growth. The response is going to be “shoot first and figure it out later”, which is what happened this morning.

Reminds me of a scene in Ghostbusters:

Gozer: Are you a God?
Ray: No.
Gozer: Then — DIE!!

Winston: Ray, when someone asks if you’re a God, you say YES!


How big is the growth rate? Pulling some data from Google’s IR site, this graph shows GOOG’s quarterly gross revenue growth for 2003-2005. The maroon line is Adsense sites, the light blue line is for Google-owned sites, and the dark blue line is the total.

One simplistic lower bound for future growth at Google would be to assume that it tracks the overall growth of internet use. I’ve inserted an additional blue line just above 4% per quarter, which is a rough estimate of the overall growth rate of the internet. I haven’t tried to find detailed data; the figure comes from Jakob Nielsen’s Alertbox, which cites an 18% annualized growth rate from 2002 through 2005.
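For anyone wondering how an 18% annual rate becomes a line just above 4% on a quarterly chart, here’s the arithmetic as a short Python sketch. It assumes steady compounding across four quarters, and the function name is just for illustration.

```python
def quarterly_rate(annual_rate):
    """Convert an annualized growth rate to the equivalent compounded quarterly rate."""
    return (1.0 + annual_rate) ** (1.0 / 4.0) - 1.0

rate = quarterly_rate(0.18)
print(f"{rate:.2%}")  # a bit over 4% per quarter
```

Simply dividing 18% by four would give 4.5%, which overstates the quarterly rate slightly because it ignores compounding.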

“We are getting to the point where the law of large numbers start to take root,” Reyes said Tuesday. “At the end of the day, growth will slow. Will it be precipitous? I doubt it.”

Google issued a press statement late in the afternoon:

As we have stated before, monetization improvements will continue to be a key factor in driving future revenue growth. We still see significant opportunities to improve monetization and intend to continue to focus our efforts in this area.

Moreover, as we have stated in our SEC filings, our revenue growth rate has generally declined over time and we expect that it will continue to do so as a result of the difficulty of maintaining growth rates on a percentage basis as our revenues increase to higher levels.

Hey, how’s that GBuy project going, anyway…

Webcast of the conference presentation (registration required)

Henry Blodget has a number of interesting posts on Google, including why he doesn’t own it, approaches to valuation, the most recent earnings, and today’s adventures.

The Google analyst day coming up this Thursday should be pretty interesting. Might be worth trying to catch the webcast. Bet George is getting some extra practice in.

Google and magazine covers as a contrary indicator

Is Google headed for a downturn? Not only is it featured in a generally negative cover article in this week’s Barron’s, but now it’s featured on the cover of Time as well. These magazines cater to very different audiences, so turning up on both at the same time could be considered a sign that Google is reaching a peak of sorts on both the financial and general cultural fronts.

There’s a long tradition of things going badly for companies and people after getting this sort of high profile magazine cover treatment. If Google turns up next on the cover of People or Entertainment Weekly they’re probably doomed…

Update 02-12-2006 18:31 PST: John Battelle suggests that having made the cover of Time, Google has “jumped the shark”, while Matt Cutts offers a recent historical perspective of Google’s non-shark-jumping behavior while simultaneously demonstrating effective link baiting technique.

I don’t consider myself an expert on shark-jumping, but I do think that hitting the covers of Barron’s and Time is qualitatively different than the counter-examples that Matt offers. Google is transitioning out of being loved for being better, new, and whizzy, and into a stage where people expect it to “just work”. Google has gotten large enough that people are developing a love/hate relationship with it (and web services in general) like they have with e-mail, and where the discussion about privacy, media, and commerce is just starting to get some critical attention from people outside tech land.

Reverse engineering a referer spam campaign

It looks like someone’s launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114 somewhere in Ukraine.

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comment links from increasing Google rank for the linked items, but the links are still there for people to click on.

Referrer spam is more indirect. It is created by making an HTTP request with the REFERER header set to the URL being promoted. Most of the time, this will only be visible in the web server log.

Here is a typical HTTP log entry:

87.219.8.210 	[04/Feb/2006:15:20:35 	-0800]
    GET 	/weblog/archives/2005/09/15/google-blog-search-referrers-working-now 	HTTP/1.1
    403 	- 	"http://every-search.com"
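If you want to pull the referrer out of entries like this one, here’s a minimal Python sketch. The regular expression is written for the layout shown above (with tabs and line breaks collapsed to spaces), which is an assumption on my part; your server’s log format will probably differ, so treat the pattern and the `parse_entry` name as illustrative.

```python
import re

# The entry as shown above, with the tabs and line breaks collapsed to spaces.
ENTRY = (
    '87.219.8.210 [04/Feb/2006:15:20:35 -0800] '
    'GET /weblog/archives/2005/09/15/google-blog-search-referrers-working-now HTTP/1.1 '
    '403 - "http://every-search.com"'
)

# Pattern written for this particular layout; adjust it for your own log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+)\s+\[(?P<time>[^\]]+)\]\s+'
    r'(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)\s+'
    r'(?P<status>\d{3})\s+(?P<size>\S+)\s+"(?P<referrer>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(" ".join(line.split()))
    return match.groupdict() if match else None

fields = parse_entry(ENTRY)
print(fields["ip"], fields["status"], fields["referrer"])
```

Counting the `referrer` values across a day of logs is a quick way to see which promoted URLs a campaign is pushing.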

Some blogs and other web sites post an automatically generated list of “recent referrers” on their home page or on a sidebar. In normal use, this would show a list of the sites that had linked to the site being viewed. Recent referrer lists are less common now, because of the rise of referrer spam.

Referrer spam will also show up in web site statistics and traffic summaries. These are usually private, but are sometimes left open to the public and to search engines.

One presumed objective of a link spam campaign is to increase the target site’s search engine ranking. In general this requires building a collection of valid inbound links, preferably without the “nofollow” attribute. Referrer spam may be more effective for generating inbound links, since recent referrer lists and web site reports typically don’t wrap their links with nofollow.

The landing pages for the links in this campaign are interesting in that they don’t contain advertising at all. This suggests that this campaign is trying to build a sort of PageRank farm to promote something else.

The actual pages are all built on the same blog template, and contain a combination of gibberish and sidebar links to subdomains based on “valuable” keywords. Using the blog format automatically provides a lot of site interlinking, and they also have “recent” and “top referer” lists, which are all from other spam sites in the network.

It looks like the content text should be easy to identify as spam based on frequency analysis. Perhaps having a very large cloud of spam sites linking to each other along with a dispersed set of incoming referrer spam links makes the sites look more plausible to a search engine? These sites don’t appear to have any, but I have come across other spam sites and comment spam posts that have links to non-spam sites such as .gov and .edu sites, perhaps trying to look more credible to a search engine ranking algorithm. All the sites being on the same subnet makes them easier to spot, though.

Given that there aren’t that many public web site stat pages and recent referrer lists around, I’m surprised that referrer spamming is worth the effort. If the spam network can achieve good rankings in Google and the other search engines, they can probably boost the ranking for a selected target site by pruning back some of their initial links and adding some links pointing at the sites that they want to promote. Affiliate links to porn, gambling, or online pharmacy sites must pay reasonably well for this to work out for the spammers.

More reading: A list of references on PageRank and link spam detection.

If you’re having referrer spam problems on your site, you may find my notes on blocking referer spam useful.

Here’s some sample text from “search-buy.com”:

I search-buy over least and and next train. Ne so at cruelty the search-buy in after anaesthesia difficulty general urinating. T pastry a ben for search-buy boy. An refuses trip search-buy romances seemed azusa pacific university ca. Stoc of my is and search-buy direct having sex teen titans. Kid philadelphiaa would and york search-buy. G search-buy wore shed i dads. obstacles future search-buy right had satire nineteenth. The that i ups this on search-buy least finds audio express richmond. have this window been wonderful me search-buy so. Surel in actually search-buy our boy deep franklin notions. An search-buy it of my has of. To at head boy that a search-buy. O james search-buy everywhere of but. Alread originate search-buy good about since.
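
The frequency analysis mentioned above could start with something as simple as the following Python sketch. It measures how much of a text is taken up by its single most common token. The function name `top_token_share`, the tokenizer regex, and the three-sentence excerpt are my own choices for illustration, and a real spam filter would need a good deal more than this.

```python
import re
from collections import Counter

# First three sentences of the sample text quoted above.
SAMPLE = (
    "I search-buy over least and and next train. "
    "Ne so at cruelty the search-buy in after anaesthesia difficulty "
    "general urinating. T pastry a ben for search-buy boy."
)


def top_token_share(text):
    """Return (most common token, fraction of all tokens it accounts for)."""
    # Assumed tokenizer: lowercase runs of letters, digits, and hyphens.
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    if not tokens:
        return None, 0.0
    token, count = Counter(tokens).most_common(1)[0]
    return token, count / len(tokens)


token, share = top_token_share(SAMPLE)
print(f"most common token: {token!r}, share of text: {share:.1%}")
```

In ordinary English prose the most common token tends to be a function word like “the”, so a hyphenated keyword sitting at the top of the list is at least a hint that something is off.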

Here are a few spam sites from this campaign and their IP addresses:

bikini-now.com          A       85.255.114.212
babestrips.com          A       85.255.114.229
search-biz.biz          A       85.255.114.245
bustytart.com           A       85.255.114.250
cjtalk.net              A       85.255.114.227
search-galaxy.org       A       85.255.114.252
moresearch.org          A       85.255.114.237

Here is the WHOIS output for that netblock:

% Information related to '85.255.112.0 - 85.255.127.255'

inetnum:        85.255.112.0 - 85.255.127.255
netname:        inhoster
descr:          Inhoster hosting company
descr:          OOO Inhoster, Poltavskij Shliax 24, Kharkiv, 61000, Ukraine
remarks:        -----------------------------------
remarks:        Abuse notifications to: abuse@inhoster.com
remarks:        Network problems to: noc@inhoster.com
remarks:        Peering requests to: peering@inhoster.com
remarks:        -----------------------------------
country:        UA
org:            ORG-EST1-RIPE
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
tech-c:         FWHS1-RIPE
status:         ASSIGNED PI
mnt-by:         RIPE-NCC-HM-PI-MNT
mnt-lower:      RIPE-NCC-HM-PI-MNT
mnt-by:         RECIT-MNT
mnt-routes:     RECIT-MNT
mnt-domains:    RECIT-MNT
mnt-by:         DAV-MNT
mnt-routes:     DAV-MNT
mnt-domains:    DAV-MNT
source:         RIPE # Filtered

organisation:   ORG-EST1-RIPE
org-name:       INHOSTER
org-type:       NON-REGISTRY
remarks:        *************************************
remarks:        * Abuse contacts: abuse@inhoster.com *
remarks:        *************************************
address:        OOO Inhoster
address:        Poltavskij Shliax 24, Xarkov,
address:        61000, Ukraine
phone:          +38 066 4633621
e-mail:         support@inhoster.com
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
mnt-ref:        DAV-MNT
mnt-by:         DAV-MNT
source:         RIPE # Filtered

person:         Andrei Kislizin
address:        OOO Inhoster,
address:        ul.Antonova 5, Kiev,
address:        03186, Ukraine
phone:          +38 044 2404332
nic-hdl:        AK4026-RIPE
source:         RIPE # Filtered

person:       Fast Web Hosting Support
address:      01110, Ukraine, Kiev, 20Á, Solomenskaya street. room 201.
address:      UA
phone:        +357 99 117759
e-mail:       support@fwebhost.com
nic-hdl:      FWHS1-RIPE
source:       RIPE # Filtered
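
Checking that a batch of referrer hosts shares a subnet is easy to script. This is a minimal sketch using Python’s standard `ipaddress` module with the addresses and the inetnum range listed above. The helper names `in_netblock` and `group_by_slash24` are made up for this example, and grouping by /24 is just one convenient granularity.

```python
import ipaddress

# Host names and A records copied from the list above.
SPAM_HOSTS = {
    "bikini-now.com": "85.255.114.212",
    "babestrips.com": "85.255.114.229",
    "search-biz.biz": "85.255.114.245",
    "bustytart.com": "85.255.114.250",
    "cjtalk.net": "85.255.114.227",
    "search-galaxy.org": "85.255.114.252",
    "moresearch.org": "85.255.114.237",
}

# The inetnum range from the WHOIS output, converted to CIDR form.
NETBLOCKS = list(
    ipaddress.summarize_address_range(
        ipaddress.IPv4Address("85.255.112.0"),
        ipaddress.IPv4Address("85.255.127.255"),
    )
)


def in_netblock(ip):
    """True if the address falls inside the WHOIS range."""
    return any(ipaddress.ip_address(ip) in net for net in NETBLOCKS)


def group_by_slash24(hosts):
    """Group host names by the /24 network containing their address."""
    groups = {}
    for host, ip in hosts.items():
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        groups.setdefault(str(net), []).append(host)
    return groups


print("netblock:", [str(net) for net in NETBLOCKS])
for net, names in group_by_slash24(SPAM_HOSTS).items():
    print(net, len(names), "hosts")
```

A web server log filter could apply the same `in_netblock` test to the resolved address of each referring host, though I haven’t tried that here.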

P.R.A.S.E. – PageRank assisted search engine – compare ranking on Google, Yahoo, and MSN

P.R.A.S.E., aka “Prase” is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
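
I don’t know how Prase is implemented, but the presentation it describes amounts to merging the result lists and sorting on PageRank. Here is a small Python sketch of that idea. The `Result` type, the tie-breaking rule, and the sample rows are all invented for illustration and are not real search data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    engine: str    # which search engine returned the page
    position: int  # rank of the page within that engine's results
    url: str
    pagerank: int  # toolbar-style PageRank, 0 to 10


def merge_by_pagerank(results):
    """Combine results from several engines, highest PageRank first.

    Ties are broken by the engine's own result position (an assumption;
    the tool may order ties differently).
    """
    return sorted(results, key=lambda r: (-r.pagerank, r.position, r.engine))


# Made-up sample rows for three engines.
sample = [
    Result("Google", 1, "http://example.com/a", 7),
    Result("Yahoo", 1, "http://example.org/b", 0),
    Result("MSN", 1, "http://example.net/c", 5),
    Result("Google", 2, "http://example.com/d", 5),
    Result("Yahoo", 2, "http://example.org/e", 6),
]

for r in merge_by_pagerank(sample):
    print(f"PR{r.pagerank}  {r.engine} #{r.position}  {r.url}")
```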

I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.

As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.

PageRank is only used by Google, which no longer uses the original PageRank algorithm for ranking results, but it’s still interesting to see the top search results from the three major search engines laid out with PR scores to get some sense of the page linkage.


Watching 4th graders use search engines

Last Friday I spent an hour with my daughter’s 4th grade class, helping them do online research for reports on early California explorers. They were individually assigned an explorer, and were looking for basic biographical information such as dates and places of birth and death, and notable historical achievements or other interesting items to write about. From my perspective, this turned out to be a sort of small focus group on using search engines.

I spend most of my time around people who are pretty good at using search engines and online research tools, so it was interesting to see what they would do with this assignment.

The kids are all familiar with computers to varying degrees. They have had classroom activities using the computer at least once a week since kindergarten, and most of them have some experience using computers at home (this is Palo Alto, after all). I don’t think they’ve done any organized “internet research” in school up to this point, though.

They all started with their research subject’s name written on a piece of paper and had about 20 minutes to find some useful information.

Here are some observations:

  • Simply typing in the names of the explorers was challenging for many of them (“Joseph Joaquin Moraga”, “Ivan Alexandrovich Kuskov”, and others I can’t recall).
  • They often tried to type the search phrase into the address bar. I also saw at least one person try to type the search phrase into a form entry field in an advertisement.
  • Their default home page is set to Yahooligans!, which is kid friendly but seems to sharply limit the search results. I had the kids try their queries there first, but most of them returned zero search results.
  • I then let the kids choose which search engine they wanted to use. About a third of the kids voluntarily expressed a preference for using Google, most of the rest didn’t know or care (I sent about half to Yahoo and half to Google), and one kid really wanted to use A9 (strange, I didn’t have a chance to find out why).
  • None of the kids were familiar with using quote marks to specify exact phrase matching. Some of the explorers’ names contain commonly occurring components and return a large number of irrelevant results without quotes.
  • None of the kids were familiar with the advanced search operators for excluding or qualifying search results. I had to help out in a couple of cases where they were having trouble finding relevant pages.
  • Some of them didn’t understand the difference between page content and the ads in the headers, footers, and sidebars.
  • Some of them were already familiar with Wikipedia, including both the benefit and the problem that anyone can change a page. One person wanted to look exclusively on Wikipedia after the subject came up.
  • The absence of a bookmarking system for the students to use tends to force them to print out pages they want to use later. This isn’t wonderful at a school lab, since the content is semi-disposable and they’re usually scrounging to conserve printer consumables like toner and paper. The kids liked having something to take back to the classroom with them, though.
  • The variations in spelling for the mostly Spanish names caused problems for some queries. Google’s “did you mean” suggestions were helpful. At least one query (which I can’t recall) consisted entirely of common Hispanic names, which matched several famous people other than the intended query subject. This is similar to the problem of searching on common Asian names (like mine).
  • Some students quickly clicked themselves into a rathole of completely unrelated pages, usually after clicking on an ad.

Watching the kids trying to find useful pages highlighted the differences with my usual search behavior, which is to quickly scan the search results page, then refine the query using additional keywords and/or search operators, both of which are hard for 9- and 10-year-olds to do. In “research mode” I usually open results in a new browser tab or window. The kids actually click through the link, making it hard to work through a list of candidate results.

Coincidentally, earlier this week I came across a post on Google Blogoscoped which points to a recent dissertation on search user interface design geared towards kids, by Hilary Browne Hutchinson at the University of Maryland, which has some interesting observations and ideas.

Why Link Farms (used to) Work

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have an unrealistic importance. The author then goes on to explain a new algorithm with the same complexity as the original PageRank algorithm that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.

This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.
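
To get a feel for the effect, here is a minimal Python sketch of the textbook PageRank iteration on a toy graph, run with and without a small link farm pointing at a page that has no outbound links. This is the standard published formulation with a damping factor of 0.85, not the replacement algorithm proposed in the paper and certainly not whatever Google runs today. The page names, the farm size, and the choice to spread a sink page’s rank evenly over all pages are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to. Pages with no
    outbound links (sinks) have their rank spread evenly over all pages,
    which is one common convention.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        sink_rank = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * sink_rank / n
        new_rank = {node: base for node in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# A tiny "main web": a, b, c link in a cycle, and a also links to target.
web = {"a": ["b", "target"], "b": ["c"], "c": ["a"], "target": []}

# The same web plus five farm pages that do nothing but link to target.
farmed_web = dict(web)
for i in range(5):
    farmed_web[f"farm{i}"] = ["target"]

before = pagerank(web)
after = pagerank(farmed_web)
print("without farm: a=%.3f target=%.3f" % (before["a"], before["target"]))
print("with farm:    a=%.3f target=%.3f" % (after["a"], after["target"]))
```

In this toy graph the target page starts out ranked below page a and ends up ranked above it once the farm is added, even though none of the farm pages has a single inbound link of its own.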

See also: A reading list on PageRank and Search Algorithms

Googlepark: the battle for AOL


More business comics – the latest installment of Googlepark is up at Channel 9 (via Google Blogoscoped)

If you haven’t seen the previous episodes of Googlepark, here are links to the other installments: Googlepark.

Deconstructing search at Alexa

Wow! Although the basic idea is straightforward, crawling and indexing for a general purpose search engine requires huge resources. Web crawlers are effectively downloading copies of the entire internet over and over, turning them over to indexing applications which scan the contents for structure and meaning.

The sheer scale of the task is a substantial barrier to entry for anyone wanting to develop a new indexing or retrieval application. Some projects have narrowed the problem domain, which can reduce the problem scope to a manageable level, but this announcement from Alexa looks like it may offer an exciting alternative for building new search applications.

John Battelle writes:

Alexa, an Amazon-owned search company started by Bruce Gilliat and Brewster Kahle (and the spider that fuels the Internet Archive), is going to offer its index up to anyone who wants it (details are not up yet, but soon). Alexa has about 5 billion documents in its index – about 100 terabytes of data.

Anyone can also use Alexa’s servers and processing power to mine its index to discover things – perhaps, to outsource the crawl needed to create a vertical search engine, for example. Or maybe to build new kinds of search engines entirely, or …well, whatever creative folks can dream up. And then, anyone can run that new service on Alexa’s (er…Amazon’s) platform, should they wish.

The service will be priced on a usage basis: $1 per CPU hour, $1 per GB stored or uploaded, $1 per 50GB data processed.
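
To put those rates in perspective, here is a back-of-the-envelope calculation in Python. It assumes the three charges simply add up linearly and that 1 TB is 1000 GB. The 500 CPU hours and 10 GB of stored output are hypothetical numbers, chosen only to show the arithmetic for a single pass over the whole 100 terabyte index.

```python
def estimate_cost(cpu_hours, gb_stored_or_uploaded, gb_processed):
    """Estimated cost in dollars at the quoted rates.

    $1 per CPU hour, $1 per GB stored or uploaded, $1 per 50 GB processed.
    Assumes simple linear pricing with no minimums or rounding.
    """
    return cpu_hours * 1.0 + gb_stored_or_uploaded * 1.0 + gb_processed / 50.0


index_gb = 100 * 1000  # about 100 TB, taking 1 TB = 1000 GB

# One full pass over the index costs $2000 in processing charges alone.
print("processing only:", estimate_cost(0, 0, index_gb))

# Hypothetical job: 500 CPU hours, 10 GB of stored results, one full pass.
print("whole job:", estimate_cost(500, 10, index_gb))
```

Under those assumptions, reading the entire index once costs about as much in processing charges as 2000 CPU hours.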

There’s no announcement posted on the Alexa or Amazon sites yet; it’s apparently due out overnight. (Updated 12-13-2005 00:25 – the site is up now)

Not every search and retrieval application is necessarily going to fit onto the way Alexa has built their crawler and indexing infrastructure, or onto any other search engine platform, for that matter. But opening up access to more of the platform should make it possible for a lot of new ideas to be tried out quickly without having to build yet another crawler for each project. Up to this point, many search ideas can’t be evaluated without working at one of the major search engines. I suspect most development teams would prefer to get access to Google’s crawl and index data, but I’m certainly looking forward to seeing what’s available at Alexa when they get their documentation online in the morning.

More from Om Malik, TechCrunch, ReadWrite Web
