Slides from the Social Graph Symposium panel

Some introductory slides from a panel session at the Social Graph Symposium.

Social Graph Symposium Panel – May 2010 – Presentation Transcript

1. Social Graph Symposium Panel
Ho John Lee | Principal Program Manager | Bing Social Search
2. About me:
Ho John Lee
hojohn . lee @ microsoft . com
Past: Bing Twitter (v1), SocialQuant, trading, investing/consulting (China, India)
HP Labs, MIT, Stanford, Harvard
Current: Bing Social Search – graph and time series analysis, data mining
Twitter, Facebook, new products, technical planning
3. What can we do by observing social networks?
On the internet, no one knows you’re a dog.
But in social networks, we can tell if you act like a dog, what groups you belong to, and some of your interests
4. How many Twitter users are there?
from a search on twopular, May 2009
5. Graph analysis for relevance and ranking
Spam marketing campaign
(teeth whitening)
Naturally connected community (#smx)
Real time relevance needs data mining to filter and rank based on history
Spammy communities can be highly visible
Social graph, topic/concept graph, and behavior/gesture graphs are all useful tools
6. Information diffusion in the graph
Observed incidence network of retweets in Twitter
Kwak, Lee, et al, What is Twitter, a Social Network or a News Media? WWW2010
Information flow and behaviors form an implicit interaction graph
7. Topic / sentiment range, volume, trend analysis
What is the baseline rate of mentions / sentiment per unit time?
Look for changes in attention flow around a subject, location, topic
Watch for correlated signals from multiple sources
Consider source relevance and authority as well
8. Applying graph analysis
Attention flow vs information flow
Leads to utility functions, cost functions
Variable diffusion rates by actor / network / info type
Predicting interests and affiliations
Content creation follows attention
Self-organized communities of attention
If there’s no content, you can ask for some
Observable propagation of information
9. Clustering and fuzzing properties and identities
* Frequently used terms can identify interests, affinities, latent query intent
* But can potentially be used to identify likely individual users!
* Infochaff – fuzzing out identity, behavior, properties
10. Thank You
Ho John Lee
hojohn . lee @ microsoft . com

RESEARCH: Insights from the latest social graph studies
Moderator: Eric Siegel – President at Prediction Impact and Conference Chair at Predictive Analytics World
Sharad Goel – Research Scientist at Yahoo
Ho John Lee – Principal Program Manager at Microsoft
DJ Patil – Chief Scientist at LinkedIn
Marc Smith – Chief Social Scientist at Connected Action Consulting Group

My slides from the Real Time Search Panel at SES Chicago last week

Although real time search is fairly new, as we end 2009, the ability to index and search fresh results is rapidly becoming a commodity, with Bing, various startups, and now Google all integrating status feeds from social networking services. The next set of challenges in 2010 will be around providing better relevance, information discovery, and topic exploration for social search, using signals from the dynamic behavior of users and their interaction with the social and topic graphs.

I gave a short talk on real time and social search for a panel at SES Chicago last week. I’ve been heads down for the past few months working on Bing Twitter Search, so now that the first launch is out the door it was a nice chance to talk with people about some of the work we’re doing. There was a lot of interest in the sentiment, trend, and social graph analysis slides (9 and 10). I will write about those in a separate post, but wanted to get the presentation up for those who have been asking about it.
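
The trend analysis slide asks what the baseline rate of mentions per unit time is, and how to spot changes against it. As a rough sketch only (not how Bing does it), one simple approach is to bucket mention timestamps and flag buckets that sit well above a trailing average. The function names, the one-hour bucket, the six-bucket window, and the threshold of three standard deviations are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def mentions_per_bucket(timestamps, bucket_seconds=3600):
    """Count mentions in fixed-width time buckets, starting at the earliest timestamp."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // bucket_seconds) for t in timestamps)
    n_buckets = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_buckets)]


def find_spikes(counts, window=6, threshold=3.0):
    """Return indexes of buckets that exceed the trailing baseline.

    The baseline is the mean of the previous `window` buckets; a bucket is
    flagged when it is more than `threshold` standard deviations above it.
    """
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        baseline = mean(history)
        spread = pstdev(history) or 1.0  # flat history: fall back to a unit spread
        if (counts[i] - baseline) / spread > threshold:
            spikes.append(i)
    return spikes
```

The same two functions could be run separately per source (tweets, links, news mentions) to look for correlated spikes, which is the point of the "multiple sources" bullet on the slide.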

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09 – Presentation Transcript

  1. What’s different about real time and social search?
    Ho John Lee
    Principal Program Manager
    Bing Social Search
    Search Engine Strategies
    Chicago – December 7, 2009
  2. What’s Real Time Search Good For, Anyway?
  3. Twitter is Great for Watching Uninformed Panics Unfold Live
    …or finding balloons
  4. Some characteristics of Twitter / Social media
    Immediacy, Sentiment, Brevity
    Not always accurate
    Feelings, reactions, impressions
    Context is often essential to determine meaning
    Gestural – @user, #hashtag, RT, favorites, follows
    Self-organizing communities of attention and authority
    Content follows attention
    People talk about what others are talking about
    Observations and commentary from everywhere
    If there’s no content, you can ask for some
    Extreme head and tail coverage
    Low relevance “noise” can become “signal” in aggregate
  5. Your product or brand could suddenly be at the center of a huge conversation
    Tiger Woods
    Balloon Boy
    Breaking Story
    Persistent Story
    Big Story
    Bigger Story
  6. Some characteristics of Real time / Social Search
    • Real time and social search is qualitatively different from traditional web search
    • Differences in ranking, relevance, use model
    • Social graph, user behavior, location, event correlation and other input signals
    • Real time search is frequently about discovery, not search per se
    • “what is everyone talking about”, followed by “what are people saying about ”
    • Top real time and social search results will usually differ from top web search results
  7. Bing Twitter Search at a glance
    Top Tweets
    Top Shared Links
    Tweets/Sentiment per link
    Adult /Spam filter; Tweets/Links ranking & relevance
  8. Bing Fall 2009: Twitter vertical, News, MSN, Maps
    MSN Local Edition
    Page 2: Tweets or Links
    Page 1: Tweets & Links
    Twitter Answer on News SERP
    MSN Hot Topics
  9. Topic / sentiment range, volume, trend analysis
    What is the baseline rate of mentions / sentiment per unit time?
    Changes in attention flow around a subject, location, topic
    Watch for correlated signals from multiple sources
    Consider source relevance and authority as well
  10. Graph analysis for relevance and ranking
    Spam marketing campaign
    Naturally connected community
    Spammy communities are highly visible – don’t be part of one!
  11. Bing Twitter Maps Demo
  12. To rise above the noise, there is more to do as search gets more social
  13. Thank You
    Ho John Lee
    hojohn . lee @
The session was moderated by Barbara Coll, CEO, Inc., with panelists Bill Fischer, Co-Founder & Director, Workdigital, Ltd., Rob Walk, Managing Partner, NovaRising, Nathan Stoll, Co-Founder, Aardvark, and Ho John Lee, Principal Program Manager, Social and Real Time Search, Microsoft Bing.

When you come to a fork in the road…

Crossroads of the World at the Beach Bar, Waikiki

As some of you know, I have been exploring a variety of paths forward for SocialQuant, my real time social search and analytics project. My family, friends, and colleagues have given me much support, patience, and advice during this process, which has reached a crossroads, and as Yogi Berra says, “When you come to a fork in the road, take it!”

The rise of Twitter, Facebook, and other social media, combined with web-based applications, smartphones, and cloud computing, has set the stage for new applications and use models based on social discovery, collaboration, and communications, in addition to traditional search. What we’re all calling “real time search” lately isn’t exactly real time, nor is it exactly search, in which you find a definitive/authoritative answer. Much of the opportunity revolves around discovering people, discussions, and events that are relevant to you and bringing them to your attention in a timely, actionable fashion. Information streams from social media are transient, unreliable, and noisy. At the same time, the sheer volume of data can help provide the basis for building better filters. As an added bonus, you can ask questions of people in the social graph itself, and there are numerous examples of communities of interest forming around current events such as Barack Obama’s inauguration, the Iran elections, or even Michael Jackson’s funeral, all of which help surface information content, opinion, and sentiment that were previously inaccessible online. One interesting aspect of real time social media is that it’s not just algorithmic, it’s based on human connections and emotions. So a message that “feels right” from people you trust can be more relevant than one that is “correct” at times.

The challenge then is in filtering and ranking the massive flow of information so that the user’s limited (and non-expanding) time and attention go where they are most valuable. With today’s information technology, amazing things are possible with limited resources. I personally have more computing and storage resources than the facility we launched HP’s original photo site with (for millions of dollars), at a fraction of the cost, routinely pushing around datasets of millions of rows on the local development servers. Unfortunately, that’s just the ante to get started on the problem. Running ranking, clustering, and semantic analysis for filtering the ever-growing stream of social media eventually requires web scale computing, even with careful problem selection and data pruning. The bar is also going up every day as the social media user base grows, and as well funded teams make progress on their platforms (+Google). So very shortly, being competitive in real time and social search and discovery is going to require access to lots of data and either getting a datacenter or working with someone who has one.

In my case, I have recently chosen the latter path, and will be joining the Microsoft Bing search team, focusing on real time and social search. Microsoft itself has been showing signs of a renaissance, with search relaunching, Windows 7 looking leaner, Azure becoming non-vaporous, more web APIs getting published, core online applications starting to turn up, and a cool Office 2010 video. Even Mini-Microsoft is getting positive recently. And Google is starting to have “bigness” issues.

I look forward to working with Sean Suchter and the Microsoft Bing search team (and likely expanding their carbon footprint) in pursuit of new applications and services as the social media and online application space evolves.

You can follow along on Twitter (@hjl). As always, any and all opinions here are solely mine and do not reflect the position of any past, present, or future employer, partner, or business associate.

Twitter’s amazing user growth

Twitter estimated userbase through May 2009

The graph above shows an estimate of Twitter’s user population from its launch in March 2006 through May 2009, based on a sample of around 6 million observed user profiles. The dashed blue line is around the 2009 US inauguration of Barack Obama and where the transition from early adopter to early mass audience seems to have taken off.

The entire user population of Twitter appears to have reached 1 million sometime in January, but today there are several accounts that have over 1M followers each.

Stated another way, if you signed up before February 2009, you can consider yourself something of an early adopter on Twitter, and among the earliest 15% or so of the entire user population.

The numbers in this survey are inexact but representative, taken from research I’ve been doing for SocialQuant and FailWatch. There is some survivor bias built in, since I’m pruning spam and suspended accounts. Only Twitter knows the true state of the user base and the social graph, of course.
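
For anyone who wants to reproduce the "earliest 15% or so" style of estimate from their own sample, here is a minimal sketch. It assumes you already have a list of account creation dates from sampled profiles; the function name is mine, and the result inherits whatever bias the sample has (including the survivor bias mentioned above).

```python
from bisect import bisect_left
from datetime import date


def fraction_joined_before(signup_dates, cutoff):
    """Fraction of sampled accounts created strictly before `cutoff`."""
    ordered = sorted(signup_dates)
    if not ordered:
        raise ValueError("no signup dates in sample")
    # bisect_left returns how many dates sort before the cutoff
    return bisect_left(ordered, cutoff) / len(ordered)
```

Calling `fraction_joined_before(sample, date(2009, 2, 1))` on a sample gives the share of sampled users who signed up before February 2009.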

The initial Twitter users tend to know each other more in real life, since much of the social network grew from friends of founders, SXSW attendees, and the San Francisco / Silicon Valley tech community. The more recent (post-Obama) arrivals tend not to have connections to those networks, and often don’t know anyone else to follow. They arrive via mass media and celebrity campaigns, and end up following mass media and celebrities, either from the suggested users list or because those are the only people they know of.

If you look carefully, you can see the rate of increase slows down toward the end of the graph. There was a huge ramp in new user signups around the time of the Oprah show, which has receded somewhat. This has led to blog posts about Twitter’s impending demise, but looking back, there have been previous surges in the user base (typically around SXSW etc.) which led to a peak, then a drop in new user signups to an off-peak but higher-than-before average. So far the current surge is the largest, but seems to be following the pattern. In the absence of any new driver, user growth should continue at an off-peak but higher level, until the next big jump, or something better comes along.

Google search results and DMOZ editorializing?

I’ve never seen a search result page like this before. The meta text “Conservative think tank claiming to report about events and nations strategically important to the United States” doesn’t appear anywhere in the referenced page, which doesn’t contain any useful <META> content. Searching for that text, it looks like the text originated from the DMOZ directory listing.

Another entry from the same DMOZ list, the Kensington Review, also returns the DMOZ meta text, this time in place of the <META> text in the actual page. DMOZ says “An e-magazine of political and social commentary. When the left says the glass is half full and the right says it is half empty, Kensington suggests that it might be too big.” Kensington’s own META says “An electronic journal of political, financial and social commentary”. The DMOZ text is a more interesting description, but again it does not originate from the content itself.
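
The check I did by hand here (does the snippet come from the page’s own <META> description, from the page text, or from somewhere else?) can be roughed out with the Python standard library. This is only a sketch: the class and function names are mine, and it compares exact strings, so it will miss snippets that a search engine has truncated or reworded.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag != "meta" or self.description is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "description":
            self.description = attr_map.get("content", "").strip()


def meta_description(html_text):
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    return parser.description


def snippet_source(snippet, html_text):
    """Classify where a search result snippet could have come from."""
    snippet = snippet.strip()
    if snippet == meta_description(html_text):
        return "meta description"
    if snippet in html_text:
        return "page text"
    return "external"  # e.g. a directory listing such as DMOZ
```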

So it appears that DMOZ editors have greater influence over certain Google search descriptions than the actual sites themselves, which is not necessarily bad, but was certainly unexpected (to me). Overall I’d prefer that Google limit its editorial function to ranking and presenting the search results, and perhaps make the editorial opinions known, but not presented as definitive. 

I’m not particularly familiar with the Jamestown Foundation, which is why I was searching in the first place. The DMOZ editor is clearly skeptical but I’d rather form my own opinion. 


Hacked by

I just noticed that my WordPress installation got hacked by a search engine spam injection attack sometime in the past few weeks. This particular one inserts invisible text with lots of keywords in footer.php. The changes to the file were made using the built-in theme editor, originating from, which is currently at The spam campaign automatically updates the spam payload every day or so. The links point to a variety of servers that have also been hacked to host the spam content. Here is a sample:
I’ve sent an e-mail to Nanosolar, so they’ll probably have that content cleaned up before long. But the automated SEO spam campaign updates the keyword and link payload regularly, so any affected WordPress sites will be updated to point at the new hosting victims.

From a quick check on Google, it looks like is a regular offender

Ms. Dewey – Stylish search, with whips, guns, and dating tips

It’s been a while since I’ve come across something I haven’t seen before online. Ms. Dewey fits the bill. It is a Flash-based application combining video clips of actress Janina Gavankar with Windows Live search.

As a search application, it’s fat, slow, and the query results aren’t great. However, as John Battelle observes, “clearly, search ain’t the point.” This is search with a flirty attitude, where the speed and quality of the results aren’t at the top of the priority list.

As short-attention-span theater goes, it’s quite entertaining.

If you can’t think of anything to search for, Ms. Dewey will fidget for a while and eventually reach out and tap on the screen. “Helloooo…type something here…”

It’s far more interesting to try some queries and check out the responses. I spent over half an hour typing in keywords to see what would come up, starting with some of the suggestions from Digg and Channel9. The application provides a semi-random set of video responses based on the search keywords, so you won’t always get the same reaction each time.

The whip and riding crop don’t always appear when you’d think, the lab coat seems to be keyed to science and math (try “partial differential equation”), and I’m not sure what brings on the automatic weapons.

“Ms. Dewey” also has a MySpace page with more video clips. The way the application is constructed, they can probably keep updating and adding responses as long as they want to.

I briefly tried using Ms. Dewey in place of Google, as a working search engine, but it takes too long to respond to a series of queries (have to wait for the video to play) and the search results aren’t great (Live is continuing to improve, though). At the moment this is a fun conceptual experiment.

I wonder if we’ll see a new category of search emphasizing style (entertainment, attitude, sex) over substance (relevance, speed, scope). Today’s version might already work for the occasional search user, but imagine Ms. Dewey with faster, non-blocking search results, a better search UI, and Google’s results. It all vaguely reminds me of a William Gibson novel.

More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who like playing EA Sports Baseball 2006, is a White Sox fan, and was planning going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.

#1521 like the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67) observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

I expect to see more interesting nuggets mined out of the query data, and some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs in the coming days.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT – The first online interface for exploring the AOL search query data is up at (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT – Here’s another online interface at (via Infectious Greed)

Update Wednesday 08-09-2006 19:14 PDT – A profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database in the New York Times.

AOL Research publishes 20 million search queries

More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
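
To make the "query clusters" concern concrete, here is a small sketch of how little code it takes to line up every query by user ID. I haven’t looked at the files yet, so the tab-separated layout is an assumption; the field names follow the list above, and the function names are mine.

```python
import csv
from collections import defaultdict

# Field order as listed above; the tab delimiter is an assumption.
FIELDS = ["UserID", "Query", "QueryTime", "ClickedRank", "DestinationDomainUrl"]


def queries_by_user(lines):
    """Group query strings by anonymized user ID, preserving input order."""
    reader = csv.DictReader(
        lines, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["UserID"]].append(row["Query"])
    return grouped


def users_matching(grouped, term):
    """Return the sorted user IDs with at least one query containing `term`."""
    term = term.lower()
    return sorted(
        user for user, queries in grouped.items()
        if any(term in (q or "").lower() for q in queries)
    )
```

Once the queries are grouped this way, a numerical ID is only as anonymous as the most revealing query that user ever typed.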

Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.

Adam D’Angelo says:

This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.

Plentyoffish sees an opportunity for PPC and Adsense spammers:

Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.

I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.

More on the privacy angle from SiliconBeat, Zoli Erdos

See also: Coming soon to DVD – 1,146,580,664 common five-word sequences

Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords which users clicked through to a ringtone site as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.

Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data

Coming soon to DVD – 1,146,580,664 common five-word sequences

Google Research is publishing a huge n-gram dataset distilled from trillions of words perused by Google’s vast search spidering effort:

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

This looks like just the thing for developing some interesting predictive text applications, or just random data mining. The 6-DVD set will be distributed by the Linguistic Data Consortium, which collects and distributes interesting speech and text databases and training sets. Some other items in their collection include transcribed speech from 3000 speakers, a mapping between Chinese and English place, organization, and corporate names, and a transcription of colloquial Levantine Arabic speech.
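
For a sense of what "counts for all five-word sequences that appear at least 40 times" means in practice, here is a toy version of the counting step. It is a sketch on an in-memory token list with names of my own choosing; it says nothing about how Google actually processed a trillion words, which obviously would not fit in one Python Counter.

```python
from collections import Counter


def ngram_counts(tokens, n=5):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequent_ngrams(counts, min_count=40):
    """Keep only the n-grams seen at least `min_count` times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

For a predictive text experiment, the filtered counts can be read as "given the first four words, which fifth word was seen most often".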

Update Sunday 08-06-2006 16:41 PDT: See also AOL Research publishes 20 million search queries

Google is having problems this evening?

This evening I’ve been getting slow responses or connection timeouts from Google for the past half hour or so (20:30 – 21:00 PDT). Usually this means that the local network is having problems, but other major sites (Yahoo, CNN) are running as quickly as ever, along with various SSH sessions around the world, so it seems to be specific to Google.

So far I get slow or no response from the main search page, Gmail, Adsense, Adwords, Analytics, and Finance.

Pages that do respond are coming back in 10+ seconds, and some pages are loading without graphics or with templates only and no content.

Anyone else seeing these problems? This is the first time I’ve seen Google unusable for more than a minute or two. (Unlike this site, which has been bouncing up and down due to problems at Dreamhost lately).

Search referrals – July 2006 snapshot

Here’s a quick snapshot of incoming search engine referrals for the past few weeks. Compare this with another post last year on search engine referral share, recently referenced in a post at Alexa noting the discrepancy between the published search engine traffic reports and anecdotal observations by webmasters.

Is it just me, or are these charts a bit goofy? Does Yahoo really still have 23% of the search market? Is Google at less than half the search market?

I don’t believe it. Any webmaster will tell you that Google represents almost ALL of the search engine traffic. Yahoo is nowhere near 23%. Just read the blogs, here, here, here and here and on countless other blogs.

Already at 82% last October, Google has increased to even more of the incoming search traffic (92%) here, largely at the expense of “Other”. In the fall, it looked like those were mostly miscellaneous Chinese search engines, so perhaps my site is not getting indexed or ranked well there anymore, or Google is picking up market share, or both.

Some of the commenters at the Alexa post noted increasing traffic from Microsoft / MSN / Live search, including one who got most of their traffic through MSN search. I’m a little surprised that I don’t see more traffic from Yahoo and Microsoft search here, but that may also be a function of who’s likely to be searching for a given topic.
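
If you want to compute the same kind of referral share from your own logs, a rough sketch follows. It assumes you have already pulled out the referrer URLs for search visits; the engine table is a small hypothetical mapping keyed on hostname labels, not a complete list, and anything unrecognized lands in "Other".

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, incomplete mapping from hostname label to engine name.
ENGINES = {"google": "Google", "yahoo": "Yahoo", "msn": "Microsoft", "live": "Microsoft"}


def engine_for(referrer):
    """Map a referrer URL to a search engine name, or "Other"."""
    host = (urlparse(referrer).hostname or "").lower()
    labels = host.split(".")
    for key, name in ENGINES.items():
        if key in labels:
            return name
    return "Other"


def referral_share(referrers):
    """Percentage of referrals per engine."""
    counts = Counter(engine_for(r) for r in referrers)
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}
```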

See also Greg Linden’s comments on the competitiveness of Yahoo and Microsoft search efforts

The Long Tail of Invalid Clicks and other Google click fraud concepts

Some fine weekend reading for search engineers, SEOs, and spam network operators:

A 47-page independent report on Google Adwords / Adsense click fraud, filed yesterday as part of a legal dispute between Lane’s Gifts and Google, provides a great overview of the history and current state of click fraud, invalid clicks of all types, and the four-layered filtering process that Google uses to detect them.

Google has built the following four “lines of defense” against invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.

An interesting observation is that most click fraud can be eliminated through simple filters. Alexander Tuzhilin, author of the report, speculates on a Zipf-law Long Tail of invalid clicks of less common attacks, and observes:

Despite its current reasonable performance, this situation may change significantly in the future if new attacks will shift towards the Long Tail of the Zipf distribution by becoming more sophisticated and diverse. This means that their effects will be more prominent in comparison to the current situation and that the current set of simple filters deployed by Google may not be sufficient in the future. Google engineers recognize that they should remain vigilant against new possible types of attacks and are currently working on the Next Generation filters to address this problem and to stay “ahead of the curve” in the never-ending battle of detecting new types of invalid clicks.

He also highlights the irreducible problem of click fraud in a PPC model:

  • Click fraud and invalid clicks can be defined conceptually, but the only working definition is an operationally defined one
  • The operational definition of invalid clicks can not be fully disclosed to the general public, because it will lead to massive click fraud.
  • If the operational definition is not disclosed to some degree, advertisers can not verify or dispute why they have been charged for certain clicks

The court settlement asks for an independent evaluation of whether Google’s efforts to combat click fraud are reasonable, which Tuzhilin believes they are. The more interesting question is whether they will continue to be sufficient as time progresses and the Long Tail of click fraud expands.
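
To give a feel for what a "simple filter" of the rule-based kind might look like, here is a purely hypothetical example. Google’s actual rules are not disclosed (that is the point of the bullets above), so the rule here, which discards repeat clicks from the same source on the same ad inside a short window, along with the 60-second window and all the names, is my own invention for illustration.

```python
def filter_repeat_clicks(clicks, window_seconds=60):
    """Split clicks into (valid, invalid) using one hypothetical rule.

    `clicks` is an iterable of (timestamp_seconds, source_id, ad_id) tuples
    sorted by timestamp. A click is marked invalid when the same source
    clicked the same ad less than `window_seconds` after its last valid click.
    """
    last_valid = {}
    valid, invalid = [], []
    for click in clicks:
        timestamp, source, ad = click
        key = (source, ad)
        previous = last_valid.get(key)
        if previous is not None and timestamp - previous < window_seconds:
            invalid.append(click)
        else:
            valid.append(click)
            last_valid[key] = timestamp
    return valid, invalid
```

A rule like this catches the crude head of the distribution; the long tail the report worries about is exactly the set of attacks that no single rule of this shape will see.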


Google’s PageRank and Beyond – summer reading for search hackers

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)

This book could serve as an excellent introduction to search algorithms for someone with a programming or mathematics background, covering PageRank at length, along with some discussion of HITS, SALSA, and antispam approaches. Some current topics, such as clustering, personalization, and reputation (TrustRank/SpamRank) are not covered here, although they are mentioned briefly. The bibliography and web resources provide a comprehensive source list for further research (up through around 2004), which will help point motivated readers in the right direction. I’m sure it will be popular at Google and Yahoo, and perhaps at various SEO agencies as well.

Those with less interest in the innards of search technology may enjoy a more casual summer read about Google: try John Battelle’s The Search. Or get Langville and Meyer’s book, skip the math, and just read the sidebars.

See also: A Reading List on PageRank and Search Algorithms, my links on search algorithms

adds private bookmarks

is testing out private bookmarks now.

I’ve been playing with a private instance of Scuttle ever since was purchased by Yahoo a few months back, but have continued using for posting public links anyway.

My links are automatically posted here (except when one end or the other is out of service for some reason). Don’t know if that would include the private ones or not. Also don’t know exactly where the private bookmarks might be visible, aside from in one’s own account. I’ll have to give it a try.

More tea leaves from Google’s analyst day presentation

It seems that a lot of the interesting content from last week’s analyst event at Google is in the speaker notes from the PowerPoint slide deck. Greg Linden and others have already pointed out the notes about Google’s storage plans (GDrive, Lighthouse on slide 19).

This afternoon there’s another blip on CNBC about accidental communications in the slides.

The previously undisclosed notes stated that Google’s core advertising business was expected to grow by nearly 60 percent to $9.5 billion in 2006 but that profit margins in its mainstay AdSense business could be squeezed this year and beyond.

I didn’t remember seeing a revenue forecast in there, so I went back and looked to see what it actually said (slide 14).

Our ads business for the moment is healthy and growing and we’re on a strong trajectory
projected to grow from $6bn this year to $9.5bn next year based purely on trends in traffic and monetization growth

But strong competitors are attempting to aggregate traffic
AdSense margins will be squeezed in 2006 and beyond
Y! and MSN will do un-economic things to grow share
The ad network will be commoditized over time
So, we need to build a more complete ads system that is characterized by two words: wider and deeper. That is, cast the net wider (to attract new customer types) and deeper to enhance our relationship with existing customers.

Reuters says these particular notes were supposedly left in accidentally from internal planning discussions in late 2005.

“These notes were not created for financial planning purposes, and should not be regarded as financial guidance. Consistent with past practice, Google is not providing revenue guidance,” Google said in the filing.

I liked “Y! and MSN will do un-economic things to grow share”.

Don’t think we’ll be getting PowerPoint files from Google investor relations next time around. There’s a PDF file up now.

Update 03-08-2006 21:34 PDT: Paul Kedrosky has posted a copy of the original PPT slides.

Randomly exploring the long tail of search results

I sometimes click on a random “deep” search result page to see if anything interesting turns up, because of the limitations of popularity and PageRank for some queries.

Paul Kedrosky points at a recent paper from CMU which suggests randomly mixing in some low ranking pages may improve search results over time.

Unfortunately, the correlation between popularity and quality
is very weak for newly-created pages that have few
visits and/or in-links. Worse, the process by which new,
high-quality pages accumulate popularity is actually inhibited
by search engines. Since search engines dole out
a limited number of clicks per unit time among a large
number of pages, always listing highly popular pages at
the top, and because users usually focus their attention on
the top few results, newly-created but high-quality
pages are “shut out.”

We propose a simple and elegant solution to
this problem: the introduction of a controlled
amount of randomness into search result ranking
methods. Doing so offers new pages a chance
to prove their worth, although clearly using too
much randomness will degrade result quality and
annul any benefits achieved. Hence there is a
tradeoff between exploration to estimate the quality
of new pages and exploitation of pages already
known to be of high quality. We study this tradeoff
both analytically and via simulation, in the context
of an economic objective function based on
aggregate result quality amortized over time. We
show that a modest amount of randomness leads
to improved search results.

Shuffling a Stacked Deck: The Case for Partially
Randomized Ranking of Search Engine Results
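As a rough illustration of the exploration/exploitation idea in the abstract above, here is a minimal Python sketch. It is not the paper’s algorithm, just one simple way to mix a controlled amount of randomness into a result page: keep most of the top slots as ranked, and hand the last few slots to pages picked at random from further down. The function name and parameters (page_size, n_explore) are my own assumptions.

```python
import random


def partially_randomized(ranked, page_size=10, n_explore=1, rng=None):
    """Build one result page with a few randomly chosen lower-ranked pages.

    ranked    -- list of results, best first (e.g. by a popularity-based score)
    page_size -- number of results shown to the user
    n_explore -- slots given to random pages from the tail (hypothetical knob)
    rng       -- optional random.Random instance, for repeatable results
    """
    if rng is None:
        rng = random.Random()
    n_explore = max(0, min(n_explore, page_size))
    head = ranked[:page_size - n_explore]      # exploitation: known good pages
    tail = ranked[page_size - n_explore:]      # everything that would be cut off
    explore = rng.sample(tail, min(n_explore, len(tail)))  # exploration
    return head + explore


if __name__ == "__main__":
    pages = [f"page{i}" for i in range(100)]
    print(partially_randomized(pages, page_size=10, n_explore=2))
```

Setting n_explore to 0 gives the usual deterministic ranking, and raising it trades result quality now for a chance to learn about new pages, which is the tradeoff the authors study.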

Will Google grow at this rate forever? No? Then DIE!!

Today was a moderately exciting or irritating day to be an investor in public technology companies. Google’s CFO, George Reyes, apparently forgot that he was webcasting to a public group of investors rather than conferencing with an in-house team at the Googleplex during the Q&A session at the Merrill Lynch Internet, Advertising, Information, & Education conference: (Yahoo/AP News)

Q: Looking back to Q3 2005, was there anything in there that was maybe sort of one-time in nature that accounted for such strong revenue growth…?

A: So we went through a period of probably 18 months where we thought we had…well, let me characterize it…we had what was called a RevForce initiative–Revenue Force–which was really a team of really very bright technical engineers that were trying to tweak and optimize the ad system, and not–you know in very very responsible ways [Don't Be Evil!]–and that sort of paid off nicely with the fruits of that labor.

And what’s happened since then is that we got so good and so efficient at that back then that really most of what’s left is just organic growth, which means you have to grow your traffic and you have to grow your monetization.

But so, I think, we’re now, clearly our growth rates are slowing. And you see that each and every quarter. And we’re going to have to find other ways, you know, to monetize the business.

Later in the Q&A there’s something about the “law of large numbers” ultimately limiting growth due to running out of people to look at advertising. These are high class problems to have, and these sound like perfectly intelligent comments for an internal coffeetalk or private discussion. But when your stock is trading at 72x earnings, it’s a bad thing when the CFO says “growth is slowing” to a room of investors looking for extreme growth. The response is going to be “shoot first and figure it out later”, which is what happened this morning.

Reminds me of a scene in Ghostbusters:

Gozer: Are you a God?
Ray: No.
Gozer: Then — DIE!!

Winston: Ray, when someone asks if you’re a God, you say YES!

How big is the growth rate? Pulling some data from Google’s IR site, this graph shows GOOG’s quarterly gross revenue growth for 2003-2005. The maroon line is AdSense sites, the light blue line is for Google-owned sites, and the dark blue line is the total.

One simplistic lower bound for future growth at Google would be to assume that it tracks the overall growth of internet use. I’ve inserted an additional blue line just above 4% per quarter, which is a rough estimate of the overall growth rate of the internet. I haven’t tried to find detailed data; this is from Jakob Nielsen’s Alertbox, which cites an 18% annualized growth rate from 2002 through 2005, or a little over 4% per quarter when compounded.
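For anyone checking the arithmetic, the quarterly figure comes from spreading the 18% annual rate over four compounding periods. A short Python sketch (the function name is just for illustration):

```python
def annual_to_quarterly(annual_rate: float) -> float:
    """Convert an annualized growth rate to the equivalent quarterly rate.

    Four quarters of growth at rate q compound to (1 + q)**4 = 1 + annual_rate.
    """
    return (1.0 + annual_rate) ** 0.25 - 1.0


if __name__ == "__main__":
    # 18% per year works out to a bit over 4% per quarter.
    print(f"{annual_to_quarterly(0.18):.2%}")
```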

“We are getting to the point where the law of large numbers start to take root,” Reyes said Tuesday. “At the end of the day, growth will slow. Will it be precipitous? I doubt it.”

Google issued a press statement late in the afternoon:

As we have stated before, monetization improvements will continue to be a key factor in driving future revenue growth. We still see significant opportunities to improve monetization and intend to continue to focus our efforts in this area.

Moreover, as we have stated in our SEC filings, our revenue growth rate has generally declined over time and we expect that it will continue to do so as a result of the difficulty of maintaining growth rates on a percentage basis as our revenues increase to higher levels.

Hey, how’s that GBuy project going, anyway…

Webcast of the conference presentation (registration required)

Henry Blodget has a number of interesting posts on Google, including why he doesn’t own it, approaches to valuation, the most recent earnings, and today’s adventures.

The Google analyst day coming up this Thursday should be pretty interesting. Might be worth trying to catch the webcast. Bet George is getting some extra practice in.

Google and magazine covers as a contrary indicator

Is Google headed for a downturn? Not only is it featured in a generally negative cover article in this week’s Barron’s, but now it’s featured on the cover of Time as well. These magazines cater to very different audiences, so turning up on both at the same time could be considered a sign that Google is reaching a peak of sorts on both the financial and general cultural fronts.

There’s a long tradition of things going badly for companies and people after getting this sort of high profile magazine cover treatment. If Google turns up next on the cover of People or Entertainment Weekly they’re probably doomed…

Update 02-12-2006 18:31 PST: John Battelle suggests that having made the cover of Time, Google has “jumped the shark”, while Matt Cutts offers a recent historical perspective of Google’s non-shark-jumping behavior while simultaneously demonstrating effective link baiting technique.

I don’t consider myself an expert on shark-jumping, but I do think that hitting the covers of Barron’s and Time is qualitatively different than the counter-examples that Matt offers. Google is transitioning out of being loved for being better, new, and whizzy, and into a stage where people expect it to “just work”. Google has gotten large enough that people are developing a love/hate relationship with it (and web services in general) like they have with e-mail, and where the discussion about privacy, media, and commerce is just starting to get some critical attention from people outside tech land.

Reverse engineering a referer spam campaign

It looks like someone’s launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114 somewhere in Ukraine.

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comment links from increasing Google rank for the linked items, but the links are still there for people to click on.

Referrer spam is more indirect. It is created by making an HTTP request with the REFERER header set to the URL being promoted. Most of the time, this will only be visible in the web server log.

Here is a typical HTTP log entry:

    [04/Feb/2006:15:20:35 -0800]
    GET /weblog/archives/2005/09/15/google-blog-search-referrers-working-now HTTP/1.1
    403 - ""
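If you want to see which sites are being promoted in your own logs, a small script can tally the referrer field. The sketch below is in Python and assumes the server writes the common Apache “combined” log format (client, timestamp, request, status, size, referrer, user agent); adjust the pattern if your format differs. The function name is my own.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referer_hosts(lines):
    """Count requests per referring host name, skipping empty referrers."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # not in the assumed format
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no referrer was sent
        host = urlsplit(referer).hostname
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python referers.py < access.log
    for host, n in referer_hosts(sys.stdin).most_common(20):
        print(n, host)
```

A sudden burst of requests that all name the same few referring hosts, from many different client addresses, is the pattern described here.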

Some blogs and other web sites post an automatically generated list of “recent referrers” on their home page or on a sidebar. In normal use, this would show a list of the sites that had linked to the site being viewed. Recent referrer lists are less common now, because of the rise of referrer spam.

Referrer spam will also show up in web site statistic and traffic summaries. These are usually private, but are sometimes left open to the public and to search engines.

One presumed objective of a link spam campaign is to increase the target site’s search engine ranking. In general this requires building a collection of valid inbound links, preferably without the “nofollow” attribute. Referrer spam may be more effective for generating inbound links, since recent referrer lists and web site reports typically don’t wrap their links with nofollow.

The landing pages for the links in this campaign are interesting in that they don’t contain advertising at all. This suggests that this campaign is trying to build a sort of PageRank farm to promote something else.

The actual pages are all built on the same blog template, and contain a combination of gibberish and sidebar links to subdomains based on “valuable” keywords. Using the blog format automatically provides a lot of site interlinking, and they also have “recent” and “top referer” lists, which are all from other spam sites in the network.

It looks like the content text should be easy to identify as spam based on frequency analysis. Perhaps having a very large cloud of spam sites linking to each other along with a dispersed set of incoming referrer spam links makes the sites look more plausible to a search engine? These sites don’t appear to have any, but I have come across other spam sites and comment spam posts that have links to non-spam sites such as .gov and .edu sites, perhaps trying to look more credible to a search engine ranking algorithm. All the sites being on the same subnet makes them easier to spot, though.
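As a crude example of the kind of frequency analysis I have in mind, the Python sketch below finds the most widespread word in a block of text and reports what fraction of sentences contain it. In the spam text the promoted keyword turns up in essentially every sentence, which ordinary prose rarely does. The function name and the four-letter minimum word length (a cheap stand-in for a stopword list) are assumptions for illustration; a real filter would need more than this.

```python
import re
from collections import Counter


def keyword_sentence_share(text: str, min_len: int = 4):
    """Return (word, share): the word found in the most sentences, and the
    fraction of sentences that contain it.

    Words shorter than min_len are ignored as a rough stopword filter.
    """
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    token_sets = [set(re.findall(r"[a-z0-9][a-z0-9'-]*", s)) for s in sentences]
    token_sets = [tokens for tokens in token_sets if tokens]
    if not token_sets:
        return "", 0.0
    counts = Counter(
        tok for tokens in token_sets for tok in tokens if len(tok) >= min_len
    )
    if not counts:
        return "", 0.0
    word, n_sentences = counts.most_common(1)[0]
    return word, n_sentences / len(token_sets)


if __name__ == "__main__":
    # First three sentences of the sample spam text quoted later in this post.
    spam = ("I search-buy over least and and next train. Ne so at cruelty the "
            "search-buy in after anaesthesia difficulty general urinating. "
            "T pastry a ben for search-buy boy.")
    print(keyword_sentence_share(spam))
```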

Given that there aren’t that many public web site stat pages and recent referrer lists around, I’m surprised that referrer spamming is worth the effort. If the spam network can achieve good rankings in Google and the other search engines, they can probably boost the ranking for a selected target site by pruning back some of their initial links and adding some links pointing at the sites that they want to promote. Affiliate links to porn, gambling, or online pharmacy sites must pay reasonably well for this to work out for the spammers.

More reading: A list of references on PageRank and link spam detection.

If you’re having referrer spam problems on your site, you may find my notes on blocking referer spam useful.

Here’s some sample text from “”:

I search-buy over least and and next train. Ne so at cruelty the search-buy in after anaesthesia difficulty general urinating. T pastry a ben for search-buy boy. An refuses trip search-buy romances seemed azusa pacific university ca. Stoc of my is and search-buy direct having sex teen titans. Kid philadelphiaa would and york search-buy. G search-buy wore shed i dads. obstacles future search-buy right had satire nineteenth. The that i ups this on search-buy least finds audio express richmond. have this window been wonderful me search-buy so. Surel in actually search-buy our boy deep franklin notions. An search-buy it of my has of. To at head boy that a search-buy. O james search-buy everywhere of but. Alread originate search-buy good about since.

Here are a few spam sites from this campaign and their IP addresses:          A          A          A           A              A             A             A
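Since everything in this campaign sits in the one class C block mentioned earlier (85.255.114, i.e. a /24), checking whether a resolved address belongs to it is a one-liner with Python’s standard ipaddress module. The addresses in the usage example are made up for illustration and are not the actual spam hosts.

```python
import ipaddress

# The class C (/24) network noted at the top of this post.
SPAM_NET = ipaddress.ip_network("85.255.114.0/24")


def in_spam_net(host_ip: str) -> bool:
    """True if the address falls inside the campaign's netblock."""
    return ipaddress.ip_address(host_ip) in SPAM_NET


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for ip in ("85.255.114.10", "85.255.115.10", "192.0.2.1"):
        print(ip, in_spam_net(ip))
```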

Here is the WHOIS output for that netblock:

% Information related to ' -'

inetnum: -
netname:        inhoster
descr:          Inhoster hosting company
descr:          OOO Inhoster, Poltavskij Shliax 24, Kharkiv, 61000, Ukraine
remarks:        -----------------------------------
remarks:        Abuse notifications to:
remarks:        Network problems to:
remarks:        Peering requests to:
remarks:        -----------------------------------
country:        UA
org:            ORG-EST1-RIPE
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
tech-c:         FWHS1-RIPE
status:         ASSIGNED PI
mnt-by:         RIPE-NCC-HM-PI-MNT
mnt-lower:      RIPE-NCC-HM-PI-MNT
mnt-by:         RECIT-MNT
mnt-routes:     RECIT-MNT
mnt-domains:    RECIT-MNT
mnt-by:         DAV-MNT
mnt-routes:     DAV-MNT
mnt-domains:    DAV-MNT
source:         RIPE # Filtered

organisation:   ORG-EST1-RIPE
org-name:       INHOSTER
org-type:       NON-REGISTRY
remarks:        *************************************
remarks:        * Abuse contacts: *
remarks:        *************************************
address:        OOO Inhoster
address:        Poltavskij Shliax 24, Xarkov,
address:        61000, Ukraine
phone:          +38 066 4633621
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
mnt-ref:        DAV-MNT
mnt-by:         DAV-MNT
source:         RIPE # Filtered

person:         Andrei Kislizin
address:        OOO Inhoster,
address:        ul.Antonova 5, Kiev,
address:        03186, Ukraine
phone:          +38 044 2404332
nic-hdl:        AK4026-RIPE
source:         RIPE # Filtered

person:       Fast Web Hosting Support
address:      01110, Ukraine, Kiev, 20Á, Solomenskaya street. room 201.
address:      UA
phone:        +357 99 117759
nic-hdl:      FWHS1-RIPE
source:       RIPE # Filtered