Newsweek on white hat and black hat search engine optimization

via Seomoz:

This week’s Newsweek (December 12, 2005) features an article on white hat vs black hat search engine optimization. Among other things, it’s interesting that the topic has made it into the mainstream media.

A “black hat” anecdote:

Using an illicit software program he downloaded from the Net, he forcibly injected a link to his own private-detectives referral site onto the site of Long Island’s Stony Brook University. Most search engines give a higher value to a link on a reputable university site.

The site in question appears to be “”, still currently #1 at MSN and #4 at Yahoo for searches on “private detectives”. It appears to have been sandboxed on Google.

Another interesting post at Seomoz features comments from “randfish” and “EarlGrey”, the two SEO consultants interviewed by Newsweek on the merits of “White Hat” vs “Black Hat” search engine optimization, and gives further perspective on the motivation and outlook of the two approaches.

In some ways one can think of the difference between search engine optimization approaches as a “trading” approach vs a “building” approach to investment. The “Black Hat” approach articulated in the Seomoz article tends to focus purely on a tactical present cash return to the operator, while the “White Hat” approach presumes that the operator will realize ongoing future value by developing a useful information asset and making it visible to the search engines. This makes an implicit assumption that the site itself offers some unique and valuable information content, which can’t usually be the case in the long run.

From an information retrieval point of view, I’m obviously in the latter camp of thinking that identifying the most relevant results for the search user is a good thing. However, the black hat approach makes perfect sense if you consider it in terms of optimizing the short term value return to the publisher (cash as information), while possibly still presenting a useable information return to the search user. This is especially the case for commodity information or products, in which the actual information or goods are identical, such as affiliate sales.

I’m a little curious about the link from Stony Brook University. I took a quick look but wasn’t able to turn up a backlink. One of the problems with simply relying on trusted link sources is that they can be gamed, corrupted, or hacked.

See also: A reading list on PageRank and search algorithms

Update 12-12-2005 00:30 PST: Lots of comments on Matt Cutts’s post, plus Slashdot

Yahoo goes after more tagging assets, buys

Yahoo continues down the path of more tagging and more collaborative content. Having already purchased Flickr, this morning they’re acquiring (terms undisclosed):

From Joshua Schachter at the blog:

We’re proud to announce that has joined the Yahoo! family. Together we’ll continue to improve how people discover, remember and share on the Internet, with a big emphasis on the power of community. We’re excited to be working with the Yahoo! Search team – they definitely get social systems and their potential to change the web. (We’re also excited to be joining our fraternal twin Flickr!)

From Jeremy Zawodny at Yahoo Search Blog:

And just like we’ve done with Flickr, we plan to give the resources, support, and room it needs to continue growing the service and community. Finally, don’t be surprised if you see My Web and borrow a few ideas from each other in the future.

From Lisa McMillan, an enthusiastic user of all 3 services (comment on the blog):

Yahoo that’s delicious! I live here. I live in flickr. I live at yahoo. This is insane. You deserve this success dude. Just please g-d don’t let me lose my bookmarks :-D I’m practically my own search engine. LOL

Tagged bookmarking sites such as can provide a rich source of input data for developing contextual and topical search. The early adopters that have used up to this point are unlikely to bookmark spam or very uninteresting pages, and the aggregate set of bookmarks and tags is likely to expose clustering of links and related tags which can be used to refine search results by improving estimates of user intent. Individuals are becoming their own search engine in a very personal, narrow way, which could be coupled to general purpose search engines such as Yahoo or Google.
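As a rough illustration of how the aggregate bookmarks might expose related tags, here is a minimal Python sketch that counts how often pairs of tags appear together on the same bookmark. The bookmark records, URLs, and function names are all hypothetical, and simple co-occurrence counting is only one of many ways to look for clusters.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks: (user, url, tags). Not real data.
bookmarks = [
    ("u1", "http://example.com/pagerank-paper", {"search", "pagerank", "math"}),
    ("u2", "http://example.com/pagerank-paper", {"search", "pagerank"}),
    ("u3", "http://example.com/crawler-howto", {"search", "crawler"}),
]


def tag_cooccurrence(items):
    """Count how often each pair of tags is applied to the same bookmark."""
    counts = Counter()
    for _user, _url, tags in items:
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


def related_tags(tag, counts):
    """List the tags seen alongside `tag`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == tag:
            related[b] += n
        elif b == tag:
            related[a] += n
    return [t for t, _ in related.most_common()]


pairs = tag_cooccurrence(bookmarks)
suggestions = related_tags("search", pairs)
```

A search engine could use a list like `suggestions` to offer refinements for an ambiguous query, or to guess which sense of a term the user probably means.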

I think Google needs to identify resources it can use to incorporate more user feedback into search results. Looking over the users’ shoulders via AdSense is interesting but inadequate on its own because there are a lot of sites that will never be AdSense publishers. Explicit input capturing the user’s intent, whether through tagging, voting, posting, publishing, is a strong indication of relevance and interest by that user. I think the basic Google philosophy of letting the algorithm do everything is much more scalable, but it looks like time to capture more human input into the algorithms.

In a recent post, I pointed out some work at Yahoo on computing conditional search ranking based on user intent. The range of topics on tends to be predictably biased, but for the areas that it covers well, I’d be looking for some opportunities to improve search results based on what humans thought was interesting. As far as I know, Google doesn’t have any assets in this space. Maybe Blogger or Orkut, but those are very noisy inputs.

This seems like a great move by Yahoo on multiple fronts, and I am very interested to see how this plays out.

See also:

Update 12-12-2005 12:30 PST: No hard numbers, but something like $10-15MM with earnouts looks plausible. More posts, analysis, and reader comments: Om Malik, John Battelle, Paul Kedrosky.

Personalization, Intent, and modifying PageRank calculations

Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.

On the probabilities of transitioning across a link in the link graph, the paper’s example on pp. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead including one derived from “web usage logs”.

Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
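To make those two knobs concrete, here is a minimal Python sketch of the PageRank power iteration with a non-uniform link-following distribution and a personalization vector. It is an illustration under my own assumptions, not code from the paper: the function name `pagerank`, the 0.85 damping factor, and the click counts are all hypothetical.

```python
def pagerank(transitions, personalization, damping=0.85, iterations=100):
    """Power iteration with weighted links and a personalization (jump) vector.

    transitions: {page: {target: weight}}; weights could be observed click counts.
    personalization: {page: weight}; where a surfer lands on a random jump.
    Every link target is assumed to appear as a key in `personalization`.
    """
    pages = list(personalization)
    total_v = sum(personalization.values())
    v = {p: personalization[p] / total_v for p in pages}
    rank = dict(v)  # start from the jump distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * v[p] for p in pages}
        for page in pages:
            out = transitions.get(page, {})
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling page: hand its rank back through the jump vector.
                for p in pages:
                    new_rank[p] += damping * rank[page] * v[p]
            else:
                for target, weight in out.items():
                    new_rank[target] += damping * rank[page] * weight / out_total
        rank = new_rank
    return rank


# Hypothetical usage data: from page "a", the link to "b" is clicked
# nine times for every click on the link to "c".
clicks = {"a": {"b": 9, "c": 1}, "b": {"a": 1}, "c": {"a": 1}}
uniform_jump = {"a": 1, "b": 1, "c": 1}
ranks = pagerank(clicks, uniform_jump)
```

With uniform link weights, "b" and "c" would end up with the same rank. Feeding in the click counts pushes "b" ahead, which is the effect the paper hints at when it mentions web usage logs.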

Some thoughts:

1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition and personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results page for “PageRank”. So…I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied on a significantly reduced set of relevant pages.

2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.

3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.

4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
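The idea in point 3, a small topic-specific ranking with a fallback to the generic one, can be sketched in a few lines of Python. Everything here is hypothetical (the index, the scores, the topic name, and the `search` function); it only shows the ordering logic, not how the scores would be computed.

```python
def search(query_terms, topic, topic_ranks, generic_ranks, index, limit=5):
    """Put pages ranked for the topic first, then fall back to the generic ranking."""
    matches = [page for page, terms in index.items() if query_terms <= terms]
    ranked_for_topic = topic_ranks.get(topic, {})
    in_topic = [p for p in matches if p in ranked_for_topic]
    others = [p for p in matches if p not in ranked_for_topic]
    in_topic.sort(key=lambda p: ranked_for_topic[p], reverse=True)
    others.sort(key=lambda p: generic_ranks.get(p, 0.0), reverse=True)
    return (in_topic + others)[:limit]


# Hypothetical index (page -> terms on the page) and scores.
index = {
    "shop.example/pagerank-book": {"pagerank", "book"},
    "uni.example/pagerank-paper": {"pagerank", "paper"},
    "blog.example/pagerank": {"pagerank"},
}
generic_ranks = {
    "shop.example/pagerank-book": 0.5,
    "uni.example/pagerank-paper": 0.2,
    "blog.example/pagerank": 0.3,
}
topic_ranks = {"research": {"uni.example/pagerank-paper": 0.9}}

results = search({"pagerank"}, "research", topic_ranks, generic_ranks, index)
```

The topic collection only needs to cover the pages it knows well; anything outside it still shows up, ordered by the generic score.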

BrainJam, December 2005, search, privacy, transparency

Spent a few hours this afternoon at Chris Heuer’s BrainJam event. Wasn’t able to make it to the morning sessions, but arrived in time for the end of lunch and the “youth user panel”, consisting of four college students. They all love Facebook. Not sure how representative they are of the general student demographic, since two of them are trying to put together a web startup. They all use free online music and movie access, mostly through sharing within the dorm networks.

During the Q&A I asked for the panel members’ thoughts on privacy and about having their college lives online in perpetuity. They’re vaguely concerned, but I don’t think the topic is really raising red flags for them. I think the high school and college users have more confidence in Facebook, MySpace, Xanga and others keeping their data private and/or it not making any difference to them in the future as social norms change. Part of it is that people are simply making things up on their pages, for the sake of attracting attention, and part of it is them not caring or not understanding that their web pages, chat transcripts, and even VOIP are mostly staying online forever. I think there are going to be a lot of interesting conflicts in the future as people start running into their past personae 5, 10, 15 years later in a societal context that hasn’t adjusted yet to perpetual transparency.

Afterwards the group broke out into smaller topical discussions. The first session I went to was on the 2-way RSS proposal from Microsoft (Simple Sharing Extensions, SSE). I’m starting to think of SSE as a way for MSFT to use an RSS container for solving the sync problem for applications like Windows Mobile syncing a device and a desktop, or Active Directory performing distributed synchronization of directory data. I’m not really seeing a federated publishing model based on this, an idea that was floated in the conversation. It really feels like it solves an application sync problem for structured data.

The session on “what to do with all the data?” quickly turned into a discussion on privacy, transparency, and DRM. I’m personally disinclined to depend on trusting anyone’s DRM system to manage my critical personal data, or to allow anyone to index my private data in a way that eventually gets exposed to the world. One point of view expressed in this discussion was that the world would be better off if everyone just got used to the idea that everything they did was recorded and visible to the world (the Global Panopticon), although I think the majority disagreed that this would actually make people behave better. Personally, I think that documenting everything would break a lot of the ambiguity in relationships and conversations that allow the formation of reasonable opinions, by forcing people into adhering to “statements” and “positions” that were nothing more than passing conversation or exploration of a topic. This was part of my thinking behind asking the college kids about privacy. In real life, there are normally various social transitions that call for stepping away or de-emphasizing some aspects of one’s life, in favor of new ones. It doesn’t make the past behaviors and activities go away, but the combination of search engines and infinite, cheap storage is likely to keep some aspects of these folks’ “past” life in their face for a long time, which may make it harder to move forward.

Someone mentioned the idea of “privacy parity”, i.e. you can ask for my data, but I can see that you’re asking for it, sort of like being able to find out when someone has requested your credit report. This is interesting, but there are substantial asymmetries in the value of that information to each party. A bit of parity that would be very interesting would be a feed of who’s seen my site URLs and excerpts in a search results page — not the clickthrough, which I can already see, but when it’s turned up on the page at all.

A few of us continued a sidebar discussion on search, social networks, trust, and attention networks, and eventually got kicked out into the lobby where we were free to speculate on Google’s plan for world domination next to a huge globe in the SRI lobby. I haven’t bumped into anyone yet doing work on integrating the attention, social, and trust data into search. Doing this on a Google/Yahoo/Microsoft scale looks hard, because of the sheer scale, but I’m getting the sense that doing a custom search engine biased by the social / attention data inputs for a limited subject domain (100s to 1000s of GB) and a relatively small social / attention network (1000s: people you know or have heard of) is becoming more reasonable because of cheaper / faster / better IT hardware and because more of the data is actually becoming available now. Still chewing on this. I just came across Danah Boyd’s post on attention networks vs social networks yesterday, which concisely explains the directed vs undirected graph property which underlies part of the ranking algorithms that would be needed.
Perhaps someone’s already done this for a research project.
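For what it’s worth, the directed vs undirected distinction is easy to see in a toy example. In this Python sketch (names and edges are made up), attention is a set of one-way edges, while a social tie only exists when the attention runs both ways. That is my own simplification for illustration, not a definition taken from the post mentioned above.

```python
# Hypothetical "who pays attention to whom" edges: (reader, author).
attention = {
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("carol", "bob"),
}


def in_degree(edges):
    """Count how many readers point at each author (direction matters)."""
    counts = {}
    for _reader, author in edges:
        counts[author] = counts.get(author, 0) + 1
    return counts


def mutual_ties(edges):
    """Collapse directed edges into undirected ties, keeping reciprocated pairs only."""
    return {frozenset((a, b)) for a, b in edges if (b, a) in edges}


attention_received = in_degree(attention)
friendships = mutual_ties(attention)
```

Carol reads both Alice and Bob but nobody reads Carol, so she has no entry in `attention_received` and no tie in `friendships`. A ranking built on the directed graph can use that asymmetry; one built on the undirected graph cannot see it.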

If Google Desktop were open source, it might be a logical place to insert a modified ranking algorithm based on attention, tags and social networks and also to insert an SSE-style interface to allow peer-to-peer federation of local search queries and results. This would keep the search index data local to “me” and “my documents”, but allow sharing with other clients that I trust. Perhaps it’s just an age thing. The college kids didn’t seem to mind having all of their documents on public servers, are counting on robots.txt to keep them out of the global search engines, and apparently think that access controls on sites like Facebook will keep their personal postings out of the public realm. For me, I still think twice sometimes about posting to my bookmarks list and keep anything really critical on physical media in a safe deposit box in a vault. So while I’ve gone from being Ungoogleable to Google search stardom, there’s a good portion of my digital life which is “dark matter” to the search engines. I’d like to find a way to fix it for myself, and share information with people I trust, and refine my searches over the public internet, but without having to give Google or anyone else all of my personal data.

Photos: youth panel discussion; wrap-up session

Took a few photos; photos from others will probably turn up tagged with “brainjams”.

Update 12-04-2005 21:15 PST: Audio from the Youth Panel discussion on Chris’s blog
KRON-4 television piece on BrainJams. Looks like I missed the hula hoop part in the morning. I also seem to have mostly missed the non-profit community-oriented discussion, as you can see from my notes. Perhaps that’s what was going on when we got kicked out into the lobby for being too loud…

A reading list on PageRank and search algorithms

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.

  • Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380 Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
  • Online Reputation Systems: The Cost of Attack of PageRank (PDF)
    Andrew Clausen. A detailed look at the value and costs of reputation and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
  • SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF)
    András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized pagerank and local pagerank distribution of linking sites.
  • Detecting Duplicate and near duplicate files – William Pugh presentation slides on US patent 6,658,423 (assigned to Google) for an approach using shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google and he doesn’t know if this particular method is being used in production (vs some other method).
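To give a feel for the shingling idea in the last item, here is a minimal Python sketch that breaks two texts into overlapping three-word fragments and compares the resulting sets. The Jaccard overlap used for the comparison is my own choice for illustration; I am not claiming it is what the patent or the slides describe, and the function names and sample sentences are made up.

```python
def shingles(text, size=3):
    """Return the set of overlapping word windows ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Similarity of two shingle sets: shared shingles over all shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Made-up sample documents.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
doc_c = "search engines rank pages using link analysis"

near_dup = jaccard(shingles(doc_a), shingles(doc_b))
unrelated = jaccard(shingles(doc_a), shingles(doc_c))
```

Changing a single word in a nine-word sentence knocks out three of the seven shingles, so `near_dup` is well below 1.0 but still far above the score for the unrelated text. On full pages, a small edit disturbs a much smaller fraction of the shingles.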

I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.

Building better personalized search, filtering spam blogs

Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:

  1. Individual users’ interests / search intent change over time
  2. The click and viewing data available to do the personalization is limited
  3. Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
  4. Computers are often shared among multiple users with varying intent
  5. Queries are too short to accurately infer intent

Vivisimo (Clusty) is taking an approach in which groups of search results are clustered together and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results which they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.

Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments observing the absence of structure (in the database sense) on the web a couple of days ago. Having structured data works really well if there is a well defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.
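Here is roughly what that “database-like” query could look like once the event data has a schema. This is a Python sketch with made-up records and field names; it assumes the microformat markup has already been parsed into dictionaries, and it simplifies “near Palo Alto” to an exact city match.

```python
from datetime import date

# Hypothetical event records, as they might look after parsing microformat markup.
events = [
    {"summary": "Jazz night", "category": "music", "city": "Palo Alto", "date": date(2005, 12, 9)},
    {"summary": "Book signing", "category": "books", "city": "Palo Alto", "date": date(2005, 12, 10)},
    {"summary": "Holiday concert", "category": "music", "city": "Palo Alto", "date": date(2005, 12, 20)},
    {"summary": "Indie showcase", "category": "music", "city": "San Jose", "date": date(2005, 12, 5)},
]


def find_events(records, category, city, start, end):
    """Return records matching the category and city within an inclusive date range."""
    return [
        r for r in records
        if r["category"] == category and r["city"] == city and start <= r["date"] <= end
    ]


matches = find_events(events, "music", "Palo Alto", date(2005, 12, 1), date(2005, 12, 15))
```

None of this is possible with keyword matching alone, since a plain text index has no way to know that “12/9” is a date or that “music” is the kind of event rather than a word on the page.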

I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).

The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but not solve, the task of inferring search intent.

I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.

Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to a search engine to provide feedback, i.e. “this is spam”, or “this was useful”, and I were able to aggregate my ratings with other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” search results to the equivalent of the Google “supplemental results” index.
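A minimal Python sketch of that feedback loop, under my own assumptions: each rating is a +1 (“this was useful”) or -1 (“this is spam”) vote, weighted by a reputation score for the rater, and raters with no known reputation count for nothing. The names, weights, and threshold are all hypothetical.

```python
def score_pages(ratings, reputation):
    """Sum reputation-weighted votes per URL. Unknown raters get zero weight."""
    totals = {}
    for user, url, vote in ratings:
        totals[url] = totals.get(url, 0.0) + vote * reputation.get(user, 0.0)
    return totals


def split_results(results, scores, spam_threshold=-1.0):
    """Keep rated, non-spam pages in the main list; move the rest aside."""
    main, supplemental = [], []
    for url in results:
        score = scores.get(url)
        if score is None or score <= spam_threshold:
            supplemental.append(url)  # "don't know" or "known spam"
        else:
            main.append(url)
    return main, supplemental


# Hypothetical ratings: (user, url, vote).
ratings = [
    ("me", "good.example", 1),
    ("friend", "good.example", 1),
    ("me", "spam.example", -1),
    ("friend", "spam.example", -1),
    ("spammer", "spam.example", 1),  # no reputation, so this vote is ignored
]
reputation = {"me": 1.0, "friend": 0.8}

scores = score_pages(ratings, reputation)
main, supplemental = split_results(
    ["spam.example", "good.example", "unknown.example"], scores
)
```

The spammer can vote as often as they like, but without reputation among the people I trust, the votes do nothing. The hard part, which this sketch skips entirely, is deciding where the reputation scores come from.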

The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in the systems is likely to go up considerably, making them less useful as collaborative filters.

I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not, if I happened to come across one, though. That might be a simple place to start.

Map My Run

Map My Run is a new Google Maps-based application for plotting and measuring your runs. I just tried plotting one of my usual loops around the Stanford campus and it’s pretty close to what I get with my GPS running watch.

You can plot routes by clicking points on the map, or upload a GPS tracklog (didn’t try this, though). These sorts of applications are great for estimating your mileage when you don’t actually have a GPS or some way to measure the course. Unfortunately, Google’s map coverage is still somewhat limited outside the US, so it works great for plotting runs around London’s Hyde Park but not so good for loops around the Vidhana Soudha or Cubbon Park in Bangalore, although if you know your way around you can use the satellite view to make a rough guesstimate.
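I don’t know how Map My Run computes its distances, but measuring a clicked route is simple enough to sketch. This Python example sums great-circle (haversine) distances between consecutive points; the 6371 km mean Earth radius, the function names, and the sample coordinates are my own assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def route_length_km(points):
    """Total length of a route given as an ordered list of clicked points."""
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))


# Made-up out-and-back route: one degree of longitude along the equator and back.
out_and_back = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
total_km = route_length_km(out_and_back)
```

For a loop, repeat the starting point at the end of the list. Straight segments between clicks will undercount a winding path, which is probably part of why a plotted route and a GPS watch only come out “pretty close”.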

As an aside, it’s remarkably hard to find a good online map of Bangalore, given the huge number of technology-related business travellers that visit there. Maps of India has a reasonable city overview, but if you want street-level detail, try this one from Superseva (only seems to work on Internet Explorer). It’s an interactive scanned image of a paper map(!).

See also: Gmaps Pedometer, Favorite Run, Walk Jog Run, Motion Based

via Google Maps Mania

Follow the Money – Microsoft Windows Live, Google, and Web 2.0

Some thoughts following the Microsoft splash this week:

The big PR launch for Windows Live last Tuesday announced a set of web services initiatives. It probably drives a lot of Microsoft people crazy to have the technology and business resources that they do, and to have so little mindshare in the “web 2.0” conversations that are going on. I haven’t read through or digested all the traffic in my feed reader, but it looks like a lot of people are unimpressed by the Microsoft pitch. Been there, done that. Which is true, as far as I can see. The more interesting question is whether this starts to change the flow of money and opportunities around developing for and with Microsoft products and technologies.

If I do a quick round of free association, I get something like this:


  • corporate desktop
  • security update
  • vista delayed
  • who’s departed this week

Microsoft is a huge, wildly profitable company. It initially got there by being “good enough” to make a new class of applications and solution developers successful in addressing and building new markets using personal computers, doing things that previously required a minicomputer and an IT staff. Startup companies and individual developers that worked with Microsoft products made a lot of money, doing things that they couldn’t do before. All you needed was a PC and some relatively inexpensive development tools, and you could be off selling applications and utilities, or full business solutions built on packages like dBase or FoxPro.

Microsoft made a lot of money, but the software and solutions developers and other business partners and resellers also made a lot of money, and the customers got a new or cheaper capability than what they had before. Along the way, a huge and previously non-existent consumer market for IT equipment and services also emerged. Meanwhile, the market for expensive, low end minicomputers and applications disappeared (Wang, Data General, DEC Rainbow, HP 98xx) or moved on to engineering workstations (Sun, SGI, HP, DEC/MIPS) where they could still make money.

The current crop of lightweight web services and “web 2.0” sites feels a little like the early days of PC software. In addition to recognizable software companies, individual developers would build yet-another-text editor or game and upload it to USENET or a BBS somewhere, finding an audience of tens or hundreds of people, occasionally breaking out into mass awareness. Bits and pieces are still around, like ZIP compression, but most of it has disappeared or been absorbed and consolidated into other software somewhere. I have a CD snapshot of the old SIMTEL archive from years ago that’s full of freeware and shareware applications that all had a modest following somewhere or another. Very few people made any money that way. In the days before the internet, distribution of software was expensive, and payment meant writing and mailing a check, directly from the end user to the developer.

Google has become a huge, wildly profitable company so far by building a better search engine to draw in a large base of users, and using their platform to do a better job of matching relevant advertising to the content it’s indexing. Now, a small application can quickly find an audience by generating buzz on the blogging circuit, or through search engines, and receive two important kinds of feedback:

  • Usage data – what are the users doing and how is the application behaving
  • Economic data (money) – which advertising sponsors and affiliates provide the best return

Google’s Adsense and other affiliate sales programs are effectively providing a form of micropayments that supply incentives and funding for new content and applications, with no investment in direct sales or payment processing by the developers, and no commitment from the individual end user.

It’s simply a lot easier for a small consumer targeted startup to come up with a near term path to profitability based on maximizing the number of possible clients (=cross platform, browser based), being able to scale out easily by adding more boxes (not hassling with tracking and paying for additional licenses), and with a short path to revenue (i.e. Adsense, affiliate sales). A developer who might have coded a shareware app in the 80s can now build a comparable web site or service and find an audience, and actually make a little (or a lot of) money. Google makes a lot of money from paid search ($675MM from Adsense partner sites in 3Q05), but now some of that money is flowing to teams building interesting web applications and content.

In contrast, in the corporate environment (where it’s effectively all Microsoft desktops now), things are different. Most organizations won’t let individuals or departments randomly throw new applications onto the network and see what happens. This is a space that usually requires deep domain expertise, and/or C-level friends, in order to get close enough to the problems to do something about it. But the desktops all have browsers, and the IT managers don’t want to pay for any more Windows or Oracle licenses than they are forced to, so there’s some economic pressure to move away from Windows. But there’s also huge infrastructure pain, if your company is built on Exchange. There’s less impetus here for new features, the issue is to keep it secure, keep it running, and make it cost less. Network management, security, and application management are all doing OK in the enterprise, along with line-of-business systems, but these are really solutions and consulting businesses in the end. The fastest way to get “web 2.0” into these environments is for Microsoft to build these capabilities into their products, preferably in as boring but useful a way as possible. Not a friendly place for trying out a whizzy new idea, and generally a hard place for a lightweight software project to crack.

On another front, Microsoft also has most of the consumer desktop market, but by default rather than by corporate policy. Mass market consumers are likely to use whatever came with their computer, which is usually Windows. They’re also much more likely to actually click on the advertisements. Jeremy Zawodny posted some data from his site showing that most of his search traffic comes from Google, but the highest conversion rates come from MSN and AOL. MSN users also turn out to be the most valuable on an individual basis, in terms of the effective CPM of those referrals on his site.
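For anyone unfamiliar with the term, effective CPM here is just revenue per thousand referred visits. The Python sketch below shows the arithmetic with made-up numbers; these are not Jeremy’s figures, and the function name is my own.

```python
def effective_cpm(revenue, referrals):
    """Revenue earned per 1000 referred visits."""
    return 1000.0 * revenue / referrals


# Purely hypothetical numbers: (referred visits, ad revenue in dollars).
referral_stats = {
    "Google": (10000, 20.0),
    "MSN": (500, 3.0),
}

ecpm = {
    engine: effective_cpm(revenue, visits)
    for engine, (visits, revenue) in referral_stats.items()
}
```

In this invented example Google sends twenty times the traffic, but each thousand MSN visitors is worth three times as much, which is the shape of the pattern described above.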

So let’s see:

  • Many new application developers are following the shortest path to money, presently leading away from Microsoft and toward open source platforms, with revenue generation by integrating Google and other advertising and affiliate services
  • Microsoft has access to corporate desktops, as well as mainstream consumer desktops, where it’s been increasingly difficult for independent software developers to make any money selling applications
  • Microsoft is launching a lot of new me-too services in terms of technical capability, but which will have some uptake by default in the corporate and mass market
  • Microsoft’s corporate users and MSN users are likely to be later adopters, but may be more likely to be paying customers for the services offered by advertisers.
  • Microsoft could attract more new web service development if there were some technical or economic incentives to do so; at present it costs more to build a new service on Microsoft products, and there’s little alignment of financial incentives between Microsoft, prospective web application developers, and their common customers and partners.

Mike Arrington at TechCrunch has a great set of play-by-play notes from the presentation and a followup summary. He thinks the desktop gadgets and VOIP integration are exciting.

what really got me today was the Gadget extensibility and the full VOIP IM integration.

In the past, Microsoft grew and made a lot of money by helping a lot of other people make money. Today, the developers are following the money and heading elsewhere, mostly to Google. This could quickly change if Microsoft comes up with a way to steer some of their valuable customers and associated indirect revenue toward new web application developers. They are the incumbent, with huge market share and distribution reach. I don’t think they’ll ever have the “cool” factor of today’s web2.0 startups, and I don’t think they’ll regain the levels of market share they have had in the past with Windows, Office, and Internet Explorer. But they could be getting back in the game, and if they come up with a plan to make some real money for 3rd party web developers we’ll know they’re serious.

October 2005 Search Referrals

Jeremy Zawodny posted a summary of his October search referral statistics, and I thought I’d take a quick look at mine.

[Chart: October 2005 search referrals]

Nearly all of the search referrals here come through Google. I also have a relatively large number of “Other”, some of which (I think) are various Chinese search engines.

Jeremy says:

The gap between Google and Yahoo! is hard to interpret, since it doesn’t come close to matching the publicly available market share numbers. The same is true of the numbers for MSN and AOL. They should be higher.

There are two ways I can think to explain this:

1. People who use Google are more likely to be searching for content that’s on my site.
2. The market share numbers are wrong. Google actually generates more traffic than has been reported and MSN and AOL have been over-estimated.

I suspect that #1 is closer to reality. After all, I most often write about topics that are of interest to an audience that’s more technical than average. And I suspect that crowd skews toward Google in a more dramatic fashion than the general population of Internet users. If that’s true, it would seem to confirm many of the stereotypes about AOL and MSN users.

It looks like my site has even less appeal for a consumer audience than his…


Google Park Kids
Brad Feld points out this awesome comic series that went by on Channel9 recently featuring Larry, Sergey, and Scoble (among others) as the South Park kids.

Update 11-06-2005 19:39 PST A new installment! GooglePark: Disruption
Update 12-19-2005 14:35 PST The Battle For AOL

Update 02-13-2006 18:33 PST The Spaghetti Code

Mobile Search = US$1 billion 411 calls per year


Today, mobile search in the US = $1 billion per year in 411 calls.

Well, that’s a gross oversimplification, but it gets to one of the main points from this evening’s sold-out, standing-room-only joint Search SIG and Mobile Monday session on Mobile Search, held at Google.

The panel discussion was moderated by David Weiden from Morgan Stanley, with panelists

  • Elad Gil (Google)
  • Mihir Shah (Yahoo)
  • Mark Grandcolas (Caboodle)
  • Ted Burns (4info)
  • Jack Denenberg (Cingular)

Jack Denenberg from Cingular was the lone representative from the carrier world. During the panel, he made the observation that 411 “voice search” was at least 2-3x the volume of SMS and WAP-based search, and that Cingular (US) is doing around 1 million 411 calls per day at an average billing cost of between $1.25 and $1.40. All US carriers combined do around 3 million 411 calls per day.

This works out to more than $1 billion per year in 411 fees!
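Here is the back-of-the-envelope arithmetic as a small sketch. The call volume and price range are the figures quoted by the panel; applying Cingular's average price to all US carriers and assuming flat volume over 365 days are my own simplifications.

```python
# Figures quoted by the panel: about 3 million 411 calls per day across
# all US carriers, billed at roughly $1.25 to $1.40 per call.
CALLS_PER_DAY = 3_000_000
PRICE_LOW = 1.25
PRICE_HIGH = 1.40
DAYS_PER_YEAR = 365  # assumption: call volume is flat across the year


def annual_revenue(calls_per_day: int, price_per_call: float) -> float:
    """Yearly 411 revenue in dollars for a given daily volume and price."""
    return calls_per_day * price_per_call * DAYS_PER_YEAR


# Assumption: Cingular's average price applies to every US carrier.
low = annual_revenue(CALLS_PER_DAY, PRICE_LOW)
high = annual_revenue(CALLS_PER_DAY, PRICE_HIGH)
print(f"${low / 1e9:.2f}B to ${high / 1e9:.2f}B per year")
```

Under those assumptions the range lands at roughly $1.4 to $1.5 billion per year, comfortably over the $1 billion mark.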

Other comments from Jack: Wireless 411 use is still rising. Wireline 411 use is starting to decline. Today mobile search is based on user fees (airtime) and search fees (411). In the future, we may see some movement toward advertiser listing fees. The carrier provides a channel for business to communicate with prospective customers.

Pithy comment from the audience: “I see 4 guys trying to make the best of a bad situation, and 1 guy creating a bad situation.” More comments on why there are no location-based services, why SMS is still limited to 160 characters, and partner-unfriendly pricing. Why $1.40 for an address when Google is free (except for $.20 for data fees)?

Mihir from Yahoo mentioned that they have been running trials of paid mobile search listings using an Overture back end on Vodafone in the UK, and it’s going well, so mobile paid listings are starting to happen already.

Lots of comments about user interfaces being too complicated. Mark from Caboodle says that for each click into the menu system, 50% of the users will give up trying to buy something, such as a ringtone. They have a system for simplifying this, but unfortunately they weren’t able to get the demo onto the big screen so we never got to see it live.
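To get a feel for what a 50% loss per click does to a purchase flow, here is a minimal sketch. It assumes the drop-off is constant and independent at every click, which is my simplification of Mark's rule of thumb rather than anything presented at the session.

```python
DROP_OFF_PER_CLICK = 0.5  # rule of thumb from the panel: half the users quit at each click


def remaining_fraction(clicks: int, drop_off: float = DROP_OFF_PER_CLICK) -> float:
    """Fraction of users still in the flow after a number of menu clicks."""
    if clicks < 0:
        raise ValueError("clicks must be non-negative")
    return (1.0 - drop_off) ** clicks


for clicks in range(6):
    print(f"{clicks} clicks: {remaining_fraction(clicks):.1%} of users remain")
```

On that model, a ringtone buried five clicks deep is reached by only about 3% of the people who started looking for it.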

Some discussion on the carriers generally having a preference to simplify the user experience by giving them a single, (carrier-branded) aggregated search, taking advantage of the proprietary clickstream and data traffic information available to them through the data billing system.

The Yahoo mobile applications seemed the most plausibly useful. The send-to-phone feature allows you to send driving directions and other info from Yahoo Local to your phone. The mobile shopping application could be used for price comparison while shopping in person (although this doesn’t do much for the online merchants today). Some of their SMS search result messages allow you to reply to get an update, so you can send an SMS with “2” in it to get an update of a previous weather forecast.

For his demo, Jack ran an impromptu contest among 3 audience volunteers, to see who could find the address of the New York Metropolitan Museum of Art the fastest, using voice (411), 2-way SMS search, or WAP search. All of them got the answer, although 411 was the quickest by perhaps a minute or a bit less.


The 4info demo looked interesting. They use a short code (44636) SMS with the query text in the body, geared toward sports, weather, addresses. They also provide recipes and pointers to local bars if you key in the name of a drink. Someone in the audience pointed out that searching 4info for “Linux” returned a drink recipe, which Ted reproduced on the big screen. Not sure what the drink promo or the Linux recipe was about.

During open mike time, someone (mumble) from IBM did a 15 second demo of their speech-activated mobile search, in which he looked up the address of the New York Metropolitan Museum of Art by speaking at the phone, and the results were returned in a text page. Very slick.

Lots of interesting data-oriented mobile search projects are building for the future. But $1 billion in 411 calls right now is pretty interesting too. Who makes those 411 calls? Are they happy paying that much?

I avoid calling 411 because I feel ripped off afterwards, and often get a wrong number or address anyway. But it’s not always possible to key in a search.

Photos at Flickr

Quick Take on Google Reader

My quick notes on trying out Google Reader:


  • The AJAX user interface is whizzy and fun, and is similar to an e-mail reader.
  • Importing feeds is really slow.
  • Keyboard navigation shortcuts are great.
  • Searching through your own feeds or for new feeds is convenient using Google
  • I hate having a single item displayed at a time.
  • “Blog This” action is handy, if you use Blogger. They could easily make this go to other blogging services later.
  • This could be a good “starter” service for introducing someone to feed readers, but
  • No apparent subscription export mechanism
  • Doesn’t deal well with organizing a large number of feeds.

More notes:
I started importing the OPML subscription file from Bloglines into Google Reader on Friday evening. I have around 500 subscriptions in that list, and I’m not sure how long it ended up taking to import. It was more than 15 minutes, which was when I headed off to bed, and completed sometime before this afternoon.

I love having keyboard navigation shortcuts. The AJAX-based user interface is zippy and “fun”. Unfortunately, Google Reader displays articles one at a time, a little like reading e-mail. I’m in the habit of scanning sections of the subscription lists to see which sections I want to look at, then scanning and scrolling through lists of articles in Bloglines. Even though this requires mousing and clicking, it’s a lot faster than flashing one article at a time in Google Reader.

I don’t think the current feed organization system works on Google Reader, at least for me. My current (bad) feed groupings from Bloglines show up on Google Reader as “Labels” for groups of feeds, which is nice. It’s hard to just read a set of feeds, though. Postings show up in chronological order, or by relevance. This is totally unusable for a large set of feeds, especially when several of them are high-traffic, low-priority (e.g. Metafilter,, USGS earthquakes). If I could get the “relevance” tuned by context (based on label or tag?) it might be useful.

When you add a new feed, it starts out empty, and appears to add articles only as they are posted. It would be nice to have them start out with whatever Google has cached already. I’m sure I’m not the first subscriber to most of the feeds on my list.

On the positive side, this seems like a good starting point for someone who’s new to feed readers and wants a web-based solution. It looks nice, people have heard of Google, and the default behaviors probably play better with a modest number of feeds. In the past I’ve steered people to Bloglines, and more recently I’ve been pointing them at Rojo.

I wish the Bloglines user interface could be revised to make it quicker to get around. I really like keyboard navigation. I can also see some potential in the Google Reader’s listing by “relevance” rather than date listing, and improved search and blogging integration. I’m frequently popping up another window to run searches while reading in Bloglines.

Google Reader doesn’t seem like it’s quite what I’m looking for just now, but I’ll keep an eye on it.

Wishful thinking:
I think I want something to manage even more feeds than I have now, but where I’m reading a few regularly, a few articles from a pool of feeds based on “relevance”, and articles from the “neighborhood” of my feeds when they hit some “relevance” criteria. I’d also like to search my pool of identified / tagged feeds, along with some “neighborhood” of feeds and other links. I think a lot of this is about establishing context, intent, and some sort of “authoritativeness”, to augment the usual search keyword matching.

Trying out Google Reader

Just read about Google Reader over at Jeff Clavier’s blog. I’m loading up my 500+ Bloglines feeds to see how it plays. It’s taking a little while (more than 5 minutes so far).

I’ve been operating at about half speed the past couple of days, trying to avoid coming down with a cold. So it’s a good day for trying out alternate feed readers. I took a stab at some regular expressions for knocking back referrer spam this evening but I may be too fuzzyheaded for that right now.

Announcement on the Google Blog.

Ungoogleable to #1 in six months

Despite being online for a very long time by today’s standards (~1980), I have been difficult to find in search engines until fairly recently.

There are basically 4 reasons for this:

  1. The components of my name, “Ho”, “John”, and “Lee” are all short and common in several different contexts, so there are a vast number of indexed documents with those components.
  2. Papers I’ve published are listed under “Lee, H.J.” or something similar, lumping them together with the thousands of other Korean “Lee, H.J.”s. Something like 14% of all Koreans have the “Lee” surname, and “Ho” and “Lee” are both common surnames in Chinese as well. Various misspellings, manglings and transcriptions mean that old papers don’t turn up in searches even when they do eventually make it online.
  3. Much of the work that I’ve done resides behind various corporate firewalls, and is unlikely to be indexed, ever. A fair amount of it is on actual paper, and not digitized at all.
  4. I’ve generally been conscious that everything going into the public space gets recorded or logged somewhere, so even back in the Usenet days I have tended to stay on private networks and e-mail lists rather than posting everything to “world”.

Searching for “Ho John Lee” (no quotes) at the beginning of 2005 would have gotten you a page full of John Lee Hooker and Wen Ho Lee articles. Click here for an approximation. With quotes, you would have seen a few citations here and there from print media working its way online, along with miscellaneous RFCs.

Among various informal objectives for starting a public web site, one was to make myself findable again, especially for people I know but haven’t stayed in contact with. After roughly six months, I’m now the top search result for my name, on all search engines.

As Steve Martin says in The Jerk (upon seeing his name in the phone book for the first time), “That really makes me somebody! Things are going to start happening to me now…”

Wired this month on people who are Ungoogleable:

As the internet makes greater inroads into everyday life, more people are finding they’re leaving an accidental trail of digital bread crumbs on the web — where Google’s merciless crawlers vacuum them up and regurgitate them for anyone who cares to type in a name. Our growing Googleability has already changed the face of dating and hiring, and has become a real concern to spousal-abuse victims and others with life-and-death privacy needs.

But despite Google’s inarguable power to dredge up information, some people have succeeded — either by luck, conscious effort or both — in avoiding the search engine’s all-seeing eye.

Dredging for Search Relevancy

I am apparently a well trained, atypical search user.

Users studied in a recently published paper clicked on the top search result almost half the time. That’s not new, but in this study the researchers also swapped the result order for some users, and people still mostly clicked on the top search result.

I routinely scan the full page of search results, especially when I’m not sure where I’m going to find the information I’m looking for. I often randomly click on the deeper results pages as well, especially when looking for material from less-visible sites. This works for me because I’m able to scan the text on the page quickly, and the additional search pages also return quickly. This seems to work especially well on blog search, where many sites are essentially unranked for relevancy.

This approach doesn’t work well if you’re not used to scanning over pages of text, and also doesn’t work if the search page response time is slow.

On the other hand, I took a quick try at some of the examples in the research paper, and my queries (on Google) generally have the answer in the top 1-2 results already.

From Jakob Nielsen’s Alertbox, September 2005:

Professor Thorsten Joachims and colleagues at Cornell University conducted a study of search engines. Among other things, their study examined the links users followed on the SERP (search engine results page). They found that 42% of users clicked the top search hit, and 8% of users clicked the second hit. So far, no news. Many previous studies, including my own, have shown that the top few entries in search listings get the preponderance of clicks and that the number one hit gets vastly more clicks than anything else.

What is interesting is the researchers’ second test, wherein they secretly fed the search results through a script before displaying them to users. This script swapped the order of the top two search hits. In other words, what was originally the number two entry in the search engine’s prioritization ended up on top, and the top entry was relegated to second place.

In this swapped condition, users still clicked on the top entry 34% of the time and on the second hit 12% of the time.
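A rough sketch of the swapped condition, for anyone who wants to picture it. The click percentages are the ones quoted above; the `swap_top_two` helper and the sample result list are hypothetical and are not the researchers' actual script.

```python
from typing import List, Sequence


def swap_top_two(results: Sequence[str]) -> List[str]:
    """Return a copy of the ranked results with the first two hits exchanged."""
    swapped = list(results)
    if len(swapped) >= 2:
        swapped[0], swapped[1] = swapped[1], swapped[0]
    return swapped


# Click rates by displayed position, as quoted from the Alertbox summary.
normal_clicks = {1: 0.42, 2: 0.08}
swapped_clicks = {1: 0.34, 2: 0.12}

# In the swapped condition the engine's original #1 is displayed second,
# so its click rate is the position-2 rate from that condition.
original_top_hit_when_demoted = swapped_clicks[2]
original_second_hit_when_promoted = swapped_clicks[1]

ranked = ["result A", "result B", "result C"]  # hypothetical result list
print(swap_top_two(ranked))
print(f"Original #1 shown second: {original_top_hit_when_demoted:.0%} of clicks")
print(f"Original #2 shown first: {original_second_hit_when_promoted:.0%} of clicks")
```

Read this way, the engine's second-best hit drew almost three times the clicks of its best hit, just by being displayed on top.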

For reference, here are the questions that were asked in the original study (182KB, PDF)


  • Find the homepage of Michael Jordan, the statistician.
  • Find the page displaying the route map for Greyhound buses.
  • Find the homepage of the 1000 Acres Dude Ranch.
  • Find the homepage for graduate housing at Carnegie Mellon University.
  • Find the homepage of Emeril – the chef who has a television cooking program.


  • Where is the tallest mountain in New York located?
  • With the heavy coverage of the democratic presidential primaries, you are excited to cast your vote for a candidate. When are democratic presidential primaries in New York?
  • Which actor starred as the main character in the original Time Machine movie?
  • A friend told you that Mr. Cornell used to live close to campus – near University and Steward Ave. Does anybody live in his house now? If so, who?
  • What is the name of the researcher who discovered the first modern antibiotic?

Google Secure Access

via Om Malik:

Google seems to have developed a secure WiFi VPN software tool – Google Secure Access Client. The information can be found here. Google Rumors has all the details. To sum it up, what they are doing is giving away a VPN tool that takes some of the security risks out of open WiFi. Companies like JiWire and Boingo also have these type of secure WiFi software solutions. While on paper this sounds like a perfectly good deal, Inside Google says not so fast, and writes, “Google Secure Access has the same benefits for Google as Web Accelerator did, with fewer of the things that scared away people the first time.” They dig deep into the GSA privacy policy …

Another take at Inside Google:

Located at, GSA connects you to a Google-run Virtual Private Network. Your internet traffic becomes encrypted when you send it out, decrypted by Google, the requested data downloaded by Google, encrypted and sent to you, and decrypted on your machine. This has the effect of protecting your traffic data from others who may want to access it. GSA’s FAQ describes it as a Google engineer’s 20% project

Google Secure Access FAQ

Google Blog Search – Referrers Working Now

Looks like Google Blog Search took out the redirects that were breaking the referrer headers.

Now the search keywords are visible again. Here’s a typical log entry: - - [15/Sep/2005:15:58:13 -0700]
"GET /weblog/archives/2005/09/15/podcasting-and-audio-search-at-sdforum-searchsig-september-2005/
HTTP/1.1" 200 26981 ""
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.10) Gecko/20050716

Blogger Buzz says the redirect was in place during development to help keep the project under wraps.
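For the curious, here is a minimal sketch of pulling the search keywords back out of a referrer, assuming the server writes Apache "combined" format logs. The sample line is made up for illustration: the host, the `search.example.com` referrer URL, and the `q` parameter name are all assumptions, not values from my actual logs.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Apache "combined" log format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def search_terms(log_line: str, param: str = "q") -> Optional[str]:
    """Return the search query carried in the referrer URL, or None."""
    match = LOG_PATTERN.match(log_line)
    if match is None:
        return None
    query = urlsplit(match.group("referrer")).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Hypothetical log line: host, referrer URL and query are made up for illustration.
sample = (
    '192.0.2.1 - - [15/Sep/2005:15:58:13 -0700] '
    '"GET /weblog/ HTTP/1.1" 200 26981 '
    '"http://search.example.com/blogsearch?q=podcasting+audio+search" '
    '"Mozilla/5.0"'
)
print(search_terms(sample))
```

A referrer with no query string, which is what the redirect was producing, simply comes back as `None`.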

Google Blog Search – No Referrer Keywords?

Feature request to Google Blog Search team: please add search query info to the referrer string.

Lots of coverage this morning from people trying out Google Blog Search. (Search Engine Watch, Anil Dash, lots more)

I’m seeing some traffic from Google Blog Search overnight, but it looks like they don’t send the search query in the referrer. Here’s a sample log entry: - - [14/Sep/2005:00:51:09 -0700] "GET /weblog/archives/2005/09/14/google-blog-search-launches/ HTTP/1.1" 200 22964 "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"

So there’s no way to know the original search query. I have a pretty good idea how the overnight traffic looking for the Google post got here, but there are also people landing on fairly obscure pages here and I’m always curious how they found them. I’m sure the SEO crowd will be all over this shortly.

There have been a number of comments that Google Blog Search is sort of boring, but I’m finding that there’s good novelty value in having really fast search result pages. Haven’t used it enough to get a sense of how good the coverage is, or how fast it updates, but it will be a welcome alternative to Technorati and the others.

Update 09-14-2005 14:01 PDT: These guys think Google forgot to remove some redirect headers.

Update 09-14-2005 23:25 PDT: Over at Blogger Buzz, Google says they left the redirects in by accident and will be taking them out shortly:

“After clicking on a result in Blog Search, I’m being passed through a redirect. Why?”
Sadly, this wasn’t part of an overly clever click-harvesting scheme. We had the redirects in place during testing to prevent referrer-leaking and simply didn’t remove them prior to launch. But they should be gone in the next 24 hours … which will have the advantage of improving click-through time.

Google Blog Search Launches

Google’s entry into blog search launched this evening. Go try it out or read their help page.

This will be interesting competition for the existing blog search companies. It definitely responds fast at the moment, let’s see how it holds up when the next flash news crowd turns up…

via Niall Kennedy and Kevin Burton

Google Purge – Destroying all Unindexed Information

Google Announces Plan To Destroy All Information It Can’t Index. (via Battelle’s Searchblog)

MOUNTAIN VIEW, CA—Executives at Google, the rapidly growing online-search company that promises to “organize the world’s information,” announced Monday the latest step in their expansion effort: a far-reaching plan to destroy all the information it is unable to index.

I haven’t looked at the Onion in a long time. Good fun…
