A reading list on PageRank and search algorithms

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.

  • Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380, by Amy N. Langville and Carl D. Meyer. A detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
  • Online Reputation Systems: The Cost of Attack of PageRank (PDF)
    Andrew Clausen. A detailed look at the value and costs of reputation, and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
  • SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF)
    András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized PageRank and the local PageRank distribution of linking sites.
  • Detecting Duplicate and Near-Duplicate Files – William Pugh’s presentation slides on US patent 6,658,423 (assigned to Google), covering an approach that uses shingles (sliding windowed text fragments) to compare content similarity. The work was done during an internship at Google, and he doesn’t know whether this particular method is being used in production (vs. some other method).
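
The shingling idea is easy to prototype. Here’s a minimal sketch that scores two documents by the overlap of their word shingles; this is my own illustration, not the method claimed in the patent or anything Google runs in production:

```python
# Minimal shingling sketch: estimate near-duplicate similarity by comparing
# sets of sliding-window word fragments ("shingles"). Illustrative only --
# not the patented method or Google's production code.

def shingles(text, k=4):
    """Return the set of k-word sliding-window fragments in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(doc_a, doc_b, k=4):
    """Jaccard overlap of the shingle sets: 1.0 = identical, 0.0 = disjoint."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog near the river bank"
    near_dup = "the quick brown fox jumps over the lazy dog near the river"
    print(resemblance(original, near_dup))   # high score -> probable near-duplicate
```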

I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.
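
As a concrete illustration of biasing rank toward a small trust network, here’s a rough sketch of a personalized PageRank computation in which the random surfer’s teleport jumps land only on a trusted seed set instead of being spread uniformly. The graph and seed set below are invented:

```python
# Personalized PageRank sketch: power iteration in which the random surfer's
# "teleport" jumps land only on a trusted seed set, biasing scores toward the
# neighborhood of sources you already trust. Toy graph, invented for illustration.

def personalized_pagerank(links, trusted, damping=0.85, iters=50):
    """links: {page: [pages it links to]}; trusted: iterable of trusted seed pages."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    seeds = set(trusted) & pages
    rank = {p: 1.0 / len(pages) for p in pages}
    teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - damping) * teleport[p] for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for dest in outs:
                    nxt[dest] += share
            else:  # dangling page: send its rank back to the trusted seeds
                for p in pages:
                    nxt[p] += damping * rank[page] * teleport[p]
        rank = nxt
    return rank

web = {
    "trusted.example": ["blog-a.example", "blog-b.example"],
    "blog-a.example": ["blog-b.example"],
    "blog-b.example": ["trusted.example"],
    "spam-farm.example": ["spam-farm.example"],  # links only to itself
}
ranks = personalized_pagerank(web, trusted=["trusted.example"])
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:20s} {ranks[page]:.3f}")
```

Because the teleport never lands on the spam farm, its self-links can’t prop it up and its score decays toward zero, which is the basic intuition behind the trust-biased ranking idea.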

Building better personalized search, filtering spam blogs

Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:

  1. Individual users’ interests and search intent change over time
  2. The click and viewing data available to do the personalization is limited
  3. Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
  4. Computers are often shared among multiple users with varying intent
  5. Queries are too short to accurately infer intent

Vivisimo (Clusty) is taking an approach in which groups of search results are clustered together and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.
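
The grouping itself doesn’t have to be exotic to be useful. A toy sketch of clustering result snippets by term overlap (nothing like Vivisimo’s actual, unpublished algorithm) might look like this:

```python
# Toy result-clustering sketch: greedily group search-result snippets whose
# word sets overlap, then label each group with its most common terms.
# Purely illustrative -- not Vivisimo/Clusty's actual clustering algorithm.

from collections import Counter

STOP = {"the", "a", "of", "and", "for", "in", "to", "is", "on", "new"}

def terms(snippet):
    return {w.strip(".,:;").lower() for w in snippet.split()} - STOP

def cluster(snippets, threshold=0.1):
    clusters = []  # each cluster: {"members": [...], "terms": Counter()}
    for snippet in snippets:
        t = terms(snippet)
        best, best_sim = None, threshold
        for c in clusters:
            overlap = len(t & set(c["terms"])) / len(t | set(c["terms"]))
            if overlap >= best_sim:
                best, best_sim = c, overlap
        if best is None:
            best = {"members": [], "terms": Counter()}
            clusters.append(best)
        best["members"].append(snippet)
        best["terms"].update(t)
    for c in clusters:
        c["label"] = ", ".join(w for w, _ in c["terms"].most_common(2))
    return clusters

results = [
    "Jaguar the luxury car maker announces new sedan",
    "Jaguar car dealership prices and reviews",
    "Jaguar habitat and diet in the Amazon rainforest",
    "Big cats: jaguar populations in South America",
]
for c in cluster(results):
    print(c["label"], "->", c["members"])
```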

Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments observing the absence of structure (in the database sense) on the web a couple of days ago. Having structured data works really well if there is a well defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.
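
As a trivial illustration of why the explicit structure helps: once event data carries machine-readable fields (as hCalendar-style microformats intend), the Palo Alto music query stops being a keyword-matching problem and becomes an ordinary filter. The field names and data below are invented:

```python
# Sketch: once events carry explicit fields (as microformats intend), the
# search becomes a database-style filter instead of keyword matching.
# Field names and data below are invented for illustration.

from datetime import date

events = [
    {"summary": "Chamber music night", "category": "music",
     "location": "Palo Alto, CA", "start": date(2005, 12, 3)},
    {"summary": "Holiday craft fair", "category": "crafts",
     "location": "Palo Alto, CA", "start": date(2005, 12, 10)},
    {"summary": "Jazz quartet", "category": "music",
     "location": "San Jose, CA", "start": date(2005, 12, 20)},
]

def find_events(events, category, near, start, end):
    """Return events matching an explicit category, location, and date range."""
    return [e for e in events
            if e["category"] == category
            and near.lower() in e["location"].lower()
            and start <= e["start"] <= end]

print(find_events(events, "music", "Palo Alto", date(2005, 12, 1), date(2005, 12, 15)))
```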

I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).

The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but not solve, the task of inferring search intent.

I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.
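
Something like the “suggestions” function can be sketched as simple item-to-item co-occurrence over session clickstreams; the data below is invented, and Amazon’s real system is certainly more sophisticated:

```python
# Item-to-item co-occurrence sketch: suggest pages/products that tend to be
# clicked in the same sessions as the one being viewed. Clickstream data is
# invented; Amazon's actual recommendation system is far more involved.

from collections import defaultdict
from itertools import combinations

sessions = [
    ["pagerank-paper", "spamrank-paper", "search-blog"],
    ["pagerank-paper", "search-blog", "hits-overview"],
    ["kids-toys", "picture-book"],
    ["pagerank-paper", "hits-overview"],
]

cooccur = defaultdict(lambda: defaultdict(int))
for session in sessions:
    for a, b in combinations(set(session), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def suggest(item, top_n=3):
    """Items most often seen in the same sessions as `item`."""
    ranked = sorted(cooccur[item].items(), key=lambda kv: kv[1], reverse=True)
    return [other for other, _ in ranked[:top_n]]

print(suggest("pagerank-paper"))   # e.g. ['search-blog', 'hits-overview', 'spamrank-paper']
```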

Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to the search engine to provide feedback, i.e. “this is spam” or “this was useful”, and I were able to aggregate my ratings with those of other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” search results to the equivalent of the Google “supplemental results” index.
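
Here’s a rough sketch of that feedback loop, assuming per-user “spam”/“useful” votes and some externally supplied reputation score for each rater (all names and numbers below are invented):

```python
# Sketch of reputation-weighted spam voting: aggregate "spam"/"useful" votes
# from raters, weight by each rater's reputation, and route low-scoring URLs
# to a supplemental/"don't know" bucket. Reputations and votes are invented.

reputation = {"alice": 0.9, "bob": 0.7, "spambot": 0.05}

votes = [  # (user, url, verdict)
    ("alice", "http://useful-blog.example", "useful"),
    ("bob", "http://useful-blog.example", "useful"),
    ("spambot", "http://spam-farm.example", "useful"),
    ("alice", "http://spam-farm.example", "spam"),
]

def score(url):
    """Weighted vote total: positive = trusted-useful, negative = likely spam."""
    total = 0.0
    for user, voted_url, verdict in votes:
        if voted_url == url:
            total += reputation.get(user, 0.0) * (1 if verdict == "useful" else -1)
    return total

def bucket(url, spam_cutoff=-0.3, trust_cutoff=0.3):
    s = score(url)
    if s <= spam_cutoff:
        return "known spam"
    if s < trust_cutoff:
        return "don't know (supplemental results)"
    return "main index"

for url in ("http://useful-blog.example", "http://spam-farm.example"):
    print(url, "->", bucket(url))
```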

The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in these systems is likely to go up considerably, making them less useful as collaborative filters.

I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not if I happened to come across one, though. That might be a simple place to start.

Ammazon Mechanikal Truk

Ammazon Mechanikal Truk:

Artificial…um…Real Smart Truk

See also: Amazon Mechanical Turk: Putting Humans in the Loop

(via Turk Lurker)

Free 411 service?

At last month’s Mobile Monday, Jack Denenberg from Cingular Wireless commented that 411 calls accounted for a huge chunk of revenue for the US cellular carriers, with Cingular servicing around 1 million 411 calls per day at an average billing charge of between $1.25 and $1.40. All US carriers combined do around 3 million 411 calls per day, which works out to more than $1 billion per year in 411 fees!

They’re going to be really unhappy if these guys get some traction:

A few weeks ago I met Andre Vanier, CEO of 1-800-411-SAVE (my friend Ajay, the guy with the cool geek car, introduced us). I was intrigued by his new business and he’s on the phone with me announcing his new service that turns on tonight at midnight.

We are considerably cheaper, he says. 1-800-411-SAVE is a free call.

His service is using the same database that the carriers use to provide 411 information. This service is using the latest data the big phone companies use (they are forced to share that data with other phone carriers), while many of the Internet-based services are using much older and less complete databases.

What’s the business model? 1-800-411-SAVE pays for the cost of the 411 call. The model is to recover the cost from advertisers. Not just any advertisers but specifically advertisers that fit into the overall concept of “save.”

(via Scoble)

Update 11-16-2005 00:41 PST – The corollary to saving $1.50 for listening to an ad from a sponsor before getting the phone number from 411 is that the customer service lines for banks and credit cards should pay me for listening to their upsell message that gets played before getting to the automated response or being put on hold. At least with the free 411 I get to make a choice…

Follow the Money – Microsoft Windows Live, Google, and Web 2.0


Some thoughts following the Microsoft splash this week:

The big PR launch for Windows Live last Tuesday announced a set of web services initiatives. It probably drives a lot of Microsoft people crazy to have the technology and business resources that they do, and to have so little mindshare in the “web 2.0” conversations that are going on. I haven’t read through or digested all the traffic in my feed reader, but it looks like a lot of people are unimpressed by the Microsoft pitch. Been there, done that. Which is true, as far as I can see. The more interesting question is whether this starts to change the flow of money and opportunities around developing for and with Microsoft products and technologies.

If I do a quick round of free association, I get something like this:

Microsoft:

  • corporate desktop
  • security update
  • vista delayed
  • who’s departed this week

Microsoft is a huge, wildly profitable company. It initially got there by being “good enough” to make a new class of applications and solution developers successful in addressing and building new markets using personal computers, doing things that previously required a minicomputer and an IT staff. Startup companies and individual developers that worked with Microsoft products made a lot of money, doing things that they couldn’t do before. All you needed was a PC and some relatively inexpensive development tools, and you could be off selling applications and utilities, or full business solutions built on packages like dBase or FoxPro.

Microsoft made a lot of money, but the software and solutions developers and other business partners and resellers also made a lot of money, and the customers got a new or cheaper capability than what they had before. Along the way, a huge and previously non-existent consumer market for IT equipment and services also emerged. Meanwhile, the market for expensive, low end minicomputers and applications disappeared (Wang, Data General, DEC Rainbow, HP 98xx) or moved on to engineering workstations (Sun, SGI, HP, DEC/MIPS) where they could still make money.

The current crop of lightweight web services and “web 2.0” sites feels a little like the early days of PC software. In addition to recognizable software companies, individual developers would build yet another text editor or game and upload it to USENET or a BBS somewhere, finding an audience of tens or hundreds of people, occasionally breaking out into mass awareness. Bits and pieces are still around, like ZIP compression, but most of it has disappeared or been absorbed and consolidated into other software somewhere. I have a CD snapshot of the old SIMTEL archive from years ago that’s full of freeware and shareware applications that all had a modest following somewhere or another. Very few people made any money that way. In the days before the internet, distribution of software was expensive, and payment meant writing and mailing a check, directly from the end user to the developer.

Google has become a huge, wildly profitable company so far by building a better search engine to draw in a large base of users, and using that platform to do a better job of matching relevant advertising to the content it’s indexing. Now, a small application can quickly find an audience by generating buzz on the blogging circuit, or through search engines, and receive two important kinds of feedback:

  • Usage data – what are the users doing and how is the application behaving
  • Economic data (money) – which advertising sponsors and affiliates provide the best return

Google’s Adsense and other affiliate sales programs are effectively providing a form of micropayments that provides incentives and funding for new content and applications, with no investment in direct sales or payment processing by the developers, and no commitment from the individual end user.

It’s simply a lot easier for a small consumer-targeted startup to come up with a near-term path to profitability based on maximizing the number of possible clients (= cross-platform, browser based), being able to scale out easily by adding more boxes (not hassling with tracking and paying for additional licenses), and having a short path to revenue (e.g. Adsense, affiliate sales). A developer who might have coded a shareware app in the ’80s can now build a comparable web site or service, find an audience, and actually make a little (or a lot of) money. Google makes a lot of money from paid search ($675MM from Adsense partner sites in 3Q05), but now some of that money is flowing to teams building interesting web applications and content.

In contrast, in the corporate environment (where it’s effectively all Microsoft desktops now), things are different. Most organizations won’t let individuals or departments randomly throw new applications onto the network and see what happens. This is a space that usually requires deep domain expertise, and/or C-level friends, in order to get close enough to the problems to do something about it. But the desktops all have browsers, and the IT managers don’t want to pay for any more Windows or Oracle licenses than they are forced to, so there’s some economic pressure to move away from Windows. But there’s also huge infrastructure pain if your company is built on Exchange. There’s less impetus here for new features; the issue is to keep it secure, keep it running, and make it cost less. Network management, security, and application management are all doing OK in the enterprise, along with line-of-business systems, but these are really solutions and consulting businesses in the end. The fastest way to get “web 2.0” into these environments is for Microsoft to build these capabilities into their products, preferably in as boring but useful a way as possible. Not a friendly place for trying out a whizzy new idea, and generally a hard place for a lightweight software project to crack.

On another front, Microsoft also has most of the consumer desktop market, but by default rather than by corporate policy. Mass market consumers are likely to use whatever came with their computer, which is usually Windows. They’re also much more likely to actually click on the advertisements. Jeremy Zawodny posted some data from his site showing that most of his search traffic comes from Google, but the highest conversion rates come from MSN and AOL. MSN users also turn out to be the most valuable on an individual basis, in terms of the effective CPM of those referrals on his site.

So let’s see:

  • Many new application developers are following the shortest path to money, presently leading away from Microsoft and toward open source platforms, with revenue generation by integrating Google and other advertising and affiliate services
  • Microsoft has access to corporate desktops, as well as mainstream consumer desktops, where it’s been increasingly difficult for independent software developers to make any money selling applications
  • Microsoft is launching a lot of new services that are me-too in terms of technical capability, but which will have some uptake by default in the corporate and mass markets
  • Microsoft’s corporate users and MSN users are likely to be later adopters, but may be more likely to be paying customers for the services offered by advertisers.
  • Microsoft could attract more new web service development if there were some technical or economic incentives to do so; at present it costs more to build a new service on Microsoft products, and there’s little alignment of financial incentives between Microsoft, prospective web application developers, and their common customers and partners.

Mike Arrington at TechCrunch has a great set of play-by-play notes from the presentation and a followup summary. He thinks the desktop gadgets and VOIP integration are exciting.

what really got me today was the Gadget extensibility and the full VOIP IM integration.

In the past, Microsoft grew and made a lot of money by helping a lot of other people make money. Today, the developers are following the money and heading elsewhere, mostly to Google. This could quickly change if Microsoft comes up with a way to steer some of their valuable customers and associated indirect revenue toward new web application developers. They are the incumbent, with huge market share and distribution reach. I don’t think they’ll ever have the “cool” factor of today’s web2.0 startups, and I don’t think they’ll regain the levels of market share they have had in the past with Windows, Office, and Internet Explorer. But they could be getting back in the game, and if they come up with a plan to make some real money for 3rd party web developers we’ll know they’re serious.

October 2005 Search Referrals

Jeremy Zawodny posted a summary of his October search referral statistics, and I thought I’d take a quick look at mine.

[Chart: October 2005 search referrals]

Nearly all of the search referrals here come through Google. I also have a relatively large number of “Other”, some of which (I think) are various Chinese search engines.

Jeremy says:

The gap between Google and Yahoo! is hard to interpret, since it doesn’t come close to matching the publicly available market share numbers. The same is true of the numbers for MSN and AOL. They should be higher.

There are two ways I can think to explain this:

1. People who use Google are more likely to be searching for content that’s on my site.
2. The market share numbers are wrong. Google actually generates more traffic than has been reported and MSN and AOL have been over-estimated.

I suspect that #1 is closer to reality. After all, I most often write about topics that are of interest to an audience that’s more technical than average. And I suspect that crowd skews toward Google in a more dramatic fashion than the general population of Internet users. If that’s true, it would seem to confirm many of the stereotypes about AOL and MSN users.

It looks like my site has even less appeal for a consumer audience than his…

Alexa Web Information Service

Alexa Web Information Service has been in beta for a year and is officially launched this week.

The Alexa Web Information Service provides the following operations:

  • URL Information – Examples of information that can be accessed are site popularity, related sites, detailed usage/traffic stats, supported character-set/locales, and site contact information. This is most of the data that can be found on the Alexa Web site and in the Alexa toolbar, plus additional information that is being made available for the first time with this release.
  • Web Search – The Web Search operation is a brand new search index based on Alexa’s extensive Web crawl. The search query format is similar to a Google query and allows up to 1,000 results per page.
  • Browse Category – This service returns Web pages and sub-categories within a specified category. The returned URLs are filtered through the Alexa traffic data and then ordered by popularity.
  • Web Map – The Web Map operation gives developers access to links-in and links-out information for all pages in the crawl. For example, given a URL as an input, the service returns a list of all links-in and links-out to or from that URL. This Web map information can be used as input to search-engine ranking algorithms such as PageRank and HITS, and for Internet research (a sketch of the HITS computation follows this list).
  • Crawl Meta Data – The Crawl Meta Data operation gives developers access to metadata collected in Alexa’s Web crawl. For example, a developer can get page size, checksum, total links, link text, images, frames, and any Javascript-embedded URLs for any page in the crawl.
  • Pricing – The first 10,000 requests per month are free; additional requests are $0.00015 each ($0.15 per 1,000 requests).
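
As a small illustration of the Web Map use case, here’s a rough sketch of the HITS hub/authority iteration run over a links-out map of the sort the operation returns. The graph is hand-made, and this is not AWIS client code:

```python
# HITS sketch: compute hub and authority scores from a links-out map like the
# one the Web Map operation describes. The graph is hand-made for illustration;
# this is not AWIS client code.

from math import sqrt

links_out = {
    "portal.example": ["news.example", "papers.example", "blog.example"],
    "blog.example": ["news.example", "papers.example"],
    "news.example": ["papers.example"],
    "papers.example": [],
}

def hits(links_out, iters=30):
    pages = set(links_out) | {p for outs in links_out.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority: sum of hub scores of the pages that link in
        auth = {p: sum(hub[q] for q, outs in links_out.items() if p in outs) for p in pages}
        # hub: sum of authority scores of the pages linked out to
        hub = {p: sum(auth[q] for q in links_out.get(p, [])) for p in pages}
        for scores in (auth, hub):  # normalize each round
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(links_out)
print(max(auth, key=auth.get))   # papers.example -- linked to by all the hubs
```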

via Paul Kedrosky

Mobile Search = US$1 billion 411 calls per year


Today, mobile search in the US = $1 billion per year in 411 calls.

Well, that’s a gross oversimplification, but it gets to one of the main points from this evening’s sold-out, standing-room-only joint Search SIG and Mobile Monday session on Mobile Search, held at Google.

The panel discussion was moderated by David Weiden from Morgan Stanley, with panelists

  • Elad Gil (Google)
  • Mihir Shah (Yahoo)
  • Mark Grandcolas (Caboodle)
  • Ted Burns (4info)
  • Jack Denenberg (Cingular)

Jack Denenberg from Cingular was the lone representative from the carrier world. During the panel, he made the observation that 411 “voice search” was at least 2-3x the volume of SMS and WAP-based search, and that Cingular (US) is doing around 1 million 411 calls per day at an average billing charge of between $1.25 and $1.40. All US carriers combined do around 3 million 411 calls per day.

This works out to more than $1 billion per year in 411 fees!
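
A quick back-of-the-envelope check, using the midpoint of the quoted per-call charge:

```python
# Back-of-the-envelope check on the "$1 billion per year" figure.
calls_per_day = 3_000_000          # all US carriers combined
avg_charge = 1.30                  # midpoint of the quoted $1.25-$1.40 range
annual_revenue = calls_per_day * avg_charge * 365
print(f"${annual_revenue / 1e9:.2f} billion per year")   # ~ $1.42 billion
```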

Other comments from Jack: Wireless 411 use is still rising. Wireline 411 use is starting to decline. Today mobile search is based on user fees (airtime) and search fees (411). In the future, we may see some movement toward advertiser listing fees. The carrier provides a channel for business to communicate with prospective customers.

Pithy comment from the audience: “I see 4 guys trying to make the best of a bad situation, and 1 guy creating a bad situation.” More comments on the lack of location-based services, why SMS is still limited to 160 characters, and partner-unfriendly pricing. Why pay $1.40 for an address when Google is free (except for $0.20 in data fees)?

Mihir from Yahoo mentioned that they have been running trials of paid mobile search listings using an Overture back end on Vodafone in the UK, and it’s going well, so mobile paid listings are starting to happen already.

Lots of comments about user interfaces being too complicated. Mark from Caboodle says that for each click into the menu system, 50% of the users will give up trying to buy something, such as a ringtone. They have a system for simplifying this, but unfortunately they weren’t able to get his demo onto the big screen, so we never got to see it live.

Some discussion on the carriers generally having a preference to simplify the user experience by giving them a single, (carrier-branded) aggregated search, taking advantage of the proprietary clickstream and data traffic information available to them through the data billing system.

The Yahoo mobile applications seemed the most plausibly useful. The send-to-phone feature allows you to send driving directions and other info from Yahoo Local to your phone. The mobile shopping application could be used for price comparison while shopping in person (although this doesn’t do much for the online merchants today). Some of their SMS search result messages allow you to reply to get an update, so you can send an SMS with “2” in it to get an update of a previous weather forecast.

For his demo, Jack ran an impromptu contest among 3 audience volunteers, to see who could find the address of the New York Metropolitan Museum of Art the fastest, using voice (411), 2-way SMS search, or WAP search. All of them got the answer, although 411 was the quickest by perhaps a minute or a bit less.


The 4info demo looked interesting. They use a short code (44636) SMS with the query text in the body, geared toward sports, weather, addresses. They also provide recipes and pointers to local bars if you key in the name of a drink. Someone in the audience pointed out that searching 4info for “Linux” returned a drink recipe, which Ted reproduced on the big screen. Not sure what the drink promo or the Linux recipe was about.

During open mike time, someone (mumble) from IBM did a 15 second demo of their speech-activated mobile search, in which he looked up the address of the New York Metropolitan Museum of Art by speaking into the phone, and the results were returned in a text page. Very slick.

Lots of interesting data-oriented mobile search projects are building for the future. But $1 billion in 411 calls right now is pretty interesting too. Who makes those 411 calls? Are they happy paying that much?

I avoid calling 411 because I feel ripped off afterwards, and often get a wrong number or address anyway. But it’s not always possible to key in a search.

Photos at Flickr

Quick Take on Google Reader

My quick notes on trying out Google Reader:

Summary:

  • The AJAX user interface is whizzy and fun, and is similar to an e-mail reader.
  • Importing feeds is really slow.
  • Keyboard navigation shortcuts are great.
  • Searching through your own feeds or for new feeds is convenient using Google
  • I hate having a single item displayed at a time.
  • “Blog This” action is handy, if you use Blogger. They could easily make this go to other blogging services later.
  • This could be a good “starter” service for introducing someone to feed readers, but
  • No apparent subscription export mechanism
  • Doesn’t deal well with organizing a large number of feeds.

More notes:
I started importing the OPML subscription file from Bloglines into Google Reader on Friday evening. I have around 500 subscriptions in that list, and I’m not sure how long it ended up taking to import. It was more than 15 minutes, which was when I headed off to bed, and completed sometime before this afternoon.

I love having keyboard navigation shortcuts. The AJAX-based user interface is zippy and “fun”. Unfortunately, Google Reader displays articles one at a time, a little like reading e-mail. I’m in the habit of scanning sections of the subscription lists to see which sections I want to look at, then scanning and scrolling through lists of articles in Bloglines. Even though this requires mousing and clicking, it’s a lot faster than flashing one article at a time in Google Reader.

I don’t think the current feed organization system works on Google Reader, at least for me. My current (bad) feed groupings from Bloglines show up on Google Reader as “Labels” for groups of feeds, which is nice. It’s hard to just read a set of feeds, though. Postings show up in chronological order, or by relevance. This is totally unusable for a large set of feeds, especially when several of them are high-traffic, low-priority (e.g. Metafilter, del.icio.us, USGS earthquakes). If I could get the “relevance” tuned by context (based on label or tag?) it might be useful.

When you add a new feed, it starts out empty, and appears to add articles only as they are posted. It would be nice to have them start out with whatever Google has cached already. I’m sure I’m not the first subscriber to most of the feeds on my list.

On the positive side, this seems like a good starting point for someone who’s new to feed readers and wants a web-based solution. It looks nice, people have heard of Google, and the default behaviors probably play better with a modest number of feeds. Up to this point I’ve been steering people to Bloglines, and more recently pointing them at Rojo.

I wish the Bloglines user interface could be revised to make it quicker to get around. I really like keyboard navigation. I can also see some potential in Google Reader’s listing by “relevance” rather than by date, and in the improved search and blogging integration. I’m frequently popping up another window to run searches while reading in Bloglines.

Google Reader doesn’t seem like it’s quite what I’m looking for just now, but I’ll keep an eye on it.

Wishful thinking:
I think I want something to manage even more feeds than I have now, but where I’m reading a few regularly, a few articles from a pool of feeds based on “relevance”, and articles from the “neighborhood” of my feeds when they hit some “relevance” criteria. I’d also like to search my pool of identified / tagged feeds, along with some “neighborhood” of feeds and other links. I think a lot of this is about establishing context, intent, and some sort of “authoritativeness”, to augment the usual search keyword matching.

Ungoogleable to #1 in six months

Despite being online for a very long time by today’s standards (since ~1980), I have been difficult to find in search engines until fairly recently.

There are basically 4 reasons for this:

  1. The components of my name, “Ho”, “John”, and “Lee” are all short and common in several different contexts, so there are a vast number of indexed documents with those components.
  2. Papers I’ve published are listed under “Lee, H.J.” or something similar, lumping them together with the thousands of other Korean “Lee, H.J.”s. Something like 14% of all Koreans have the “Lee” surname, and “Ho” and “Lee” are both common surnames in Chinese as well. Various misspellings, manglings and transcriptions mean that old papers don’t turn up in searches even when they do eventually make it online.
  3. Much of the work that I’ve done resides behind various corporate firewalls, and is unlikely to be indexed, ever. A fair amount of it is on actual paper, and not digitized at all.
  4. I’ve generally been conscious that everything going into the public space gets recorded or logged somewhere, so even back in the Usenet days I have tended to stay on private networks and e-mail lists rather than posting everything to “world”.

Searching for “Ho John Lee” (no quotes) at the beginning of 2005 would have gotten you a page full of John Lee Hooker and Wen Ho Lee articles. Click here for an approximation. With quotes, you would have seen a few citations here and there from print media working its way online, along with miscellaneous RFCs.

Among various informal objectives for starting a public web site, one was to make myself findable again, especially for people I know but haven’t stayed in contact with. After roughly six months, I’m now the top search result for my name, on all search engines.

As Steve Martin says in The Jerk (upon seeing his name in the phone book for the first time), “That really makes me somebody! Things are going to start happening to me now…”

Wired this month on people who are Ungoogleable:

As the internet makes greater inroads into everyday life, more people are finding they’re leaving an accidental trail of digital bread crumbs on the web — where Google’s merciless crawlers vacuum them up and regurgitate them for anyone who cares to type in a name. Our growing Googleability has already changed the face of dating and hiring, and has become a real concern to spousal-abuse victims and others with life-and-death privacy needs.

But despite Google’s inarguable power to dredge up information, some people have succeeded — either by luck, conscious effort or both — in avoiding the search engine’s all-seeing eye.

Yahoo Site Explorer

Yahoo Search Blog announces Yahoo Site Explorer, a handy alternative to searching with “site:” or “link:” to see what’s getting indexed and linked at Yahoo Search. It’s billed as a work in progress; at the moment you can:

  • Show all subpages within a URL indexed by Yahoo!, which you can see for stanford.edu, here. You can also see subpages under a path, such as for Professor Knuth’s pages.
  • Show inlinks indexed by Yahoo! to a URL, such as for Professor Knuth’s pages, or for an entire site like stanford.edu.
  • Submit missing URLs to Yahoo

There is also a web service API for programmatic queries.

Discussion at Search Engine Watch, Webmaster World.

Danny Sullivan at Search Engine Watch posted a synopsis on the SEW Forum:

I’ve done a summary of things over here on the blog, which also links to a detailed look for SEW paid members.

Here are my top line thoughts:

You can see all pages from all domains, one domain, or a directory/section within a domain. Thumbs up!

You can NOT pattern match to find all URLs from a domain. That would be nice.

You can see all links to a specific page or a domain. Thumbs up!

You can NOT exclude your own links, very unfortunately. Two thumbs down!

You can export data, but only the first 50 items, unfortunately. Thumbs down!

More wish list stuff:

Search commands such as link: aren’t supported, and I hope that might come.

You can get a feed of your top pages, but I want a feed of backlinks to inform me of new ones that are found. Site owners deserve just as much fun as blog owners in knowing about new links to them!

Some of the other posts discuss interesting things you can do with the existing “advanced search” options. I’ll have to try some out, both through Yahoo Site Explorer and using some of the suggested link queries which apparently can’t be done yet through Site Explorer.

Search Attenuation and Rollyo

“Search attenuation” is a new term to me, but seems like a good description of the process of filtering feeds and search results to a manageable size. As more content becomes available in RSS, I tend to subscribe to anything that looks interesting, but am looking for improved methods for searching and filtering content within that set.

Catching up a little on the feed aggregator, I see an article at O’Reilly about Rollyo, a new “Roll Your Own Search Engine” site from Dave Pell of Davenetics.

ROLLYO is the latest mind warp from Dave Pell. Rollyo affords anyone the ability to roll their own Yahoo!-powered search engine, attenuating results to a set of up to 25 sites. And while the searchrolls (as they’re called) you create are around a particular topic (e.g. Food and Dining), they are also attached to a real person (e.g. Food and Dining is by Jason Kottke). The result is a topic-specific search created and maintained by a trusted source.

Rolly’s basic premise is one I’ve been preaching of late: attenuation is the next aggregation …

Recently, I’ve been looking at this from a related angle: how to infer topical relevance among people or sources you trust, based on links, tagging, search, and named entity discovery. People are already linking, tagging, and searching, so some data is available as a byproduct of work they’re already doing. On the other hand, if enough people you trust take the additional step of explicitly declaring the sources they think are relevant, that would help a lot.

See also Memeorandum, Findory, Personal Bee.

More on this from TechCrunch

Dredging for Search Relevancy

I am apparently a well trained, atypical search user.

Users studied in a recently published paper clicked on the top search result almost half the time. That’s not new, but in this study the researchers also swapped the result order for some users, and people still mostly clicked on the top search result.

I routinely scan the full page of search results, especially when I’m not sure where I’m going to find the information I’m looking for. I often randomly click on the deeper results pages as well, especially when looking for material from less-visible sites. This works for me because I’m able to scan the text on the page quickly, and the additional search pages also return quickly. This seems to work especially well on blog search, where many sites are essentially unranked for relevancy.

This approach doesn’t work well if you’re not used to scanning over pages of text, and also doesn’t work if the search page response time is slow.

On the other hand, I took a quick try at some of the examples in the research paper, and my queries (on Google) generally have the answer in the top 1-2 results already.

From Jakob Nielsen’s Alertbox, September 2005:

Professor Thorsten Joachim and colleagues at Cornell University conducted a study of search engines. Among other things, their study examined the links users followed on the SERP (search engine results page). They found that 42% of users clicked the top search hit, and 8% of users clicked the second hit. So far, no news. Many previous studies, including my own, have shown that the top few entries in search listings get the preponderance of clicks and that the number one hit gets vastly more clicks than anything else.

What is interesting is the researchers’ second test, wherein they secretly fed the search results through a script before displaying them to users. This script swapped the order of the top two search hits. In other words, what was originally the number two entry in the search engine’s prioritization ended up on top, and the top entry was relegated to second place.

In this swapped condition, users still clicked on the top entry 34% of the time and on the second hit 12% of the time.

For reference, here are the questions that were asked in the original study (182KB, PDF)

Navigational

  • Find the homepage of Michael Jordan, the statistician.
  • Find the page displaying the route map for Greyhound buses.
  • Find the homepage of the 1000 Acres Dude Ranch.
  • Find the homepage for graduate housing at Carnegie Mellon University.
  • Find the homepage of Emeril – the chef who has a television cooking program.

Informational

  • Where is the tallest mountain in New York located?
  • With the heavy coverage of the democratic presidential primaries, you are excited to cast your vote for a candidate. When are democratic presidential primaries in New York?
  • Which actor starred as the main character in the original Time Machine movie?
  • A friend told you that Mr. Cornell used to live close to campus – near University and Steward Ave. Does anybody live in his house now? If so, who?
  • What is the name of the researcher who discovered the first modern antibiotic?

Tagging and Searching: How transparent do you want to be?

This note captures some thoughts in progress, feel free to chip in with your comments…

Here’s a feature wish list for link tagging:

  • Private-only links – only I can see them at all
  • Group-only links – only members of the group can see them
  • Group-only tags – only members of the group can see my application of a set of tags
  • Unattributed links – link counts and tags are visible to the public, but not the contributor or comments

Tagged bookmarking services such as del.icio.us allow individuals to save and organize their own collection of web links, along with user-defined short descriptions and tags. This is already convenient for the individual user, but the interesting part comes from being able to search the entire universe of saved bookmarks by user-defined tags as an alternative or adjunct to conventional search engines.

Bits of collective wisdom embodied in a community can be captured by aggregating user actions representing their attention, i.e. the click streams, bookmarks, tags, and other incremental choices that are incidental to whatever they happened to be doing online. The results of a tag search are typically much smaller in number, but often more focused or topically relevant than the results of a search on Google or Yahoo.

It’s also interesting to browse the bookmarks of other people who have tagged or saved similar items. To some extent the bookmark and tag collection can be treated as a proxy for that person’s set of interests and attention.

In a similar fashion, clicking on a link (or actually purchasing an item) can be treated as an indication of interest. This is part of what makes Google Adsense, Yahoo Publisher Network, and Amazon’s Recommendations work. The individual decisions are incidental to any one person’s experience, and taken on their own have little value, but can be combined to form information sets which are mutually beneficial to the individual and the aggregator. Web 2.0 thrives on the sharing of “privately useless but socially valuable” information, the contribution of individuals toward a shared good.

In the case of bookmarking services, the exchange of value is: I get a convenient way to save my links, and del.icio.us gets my link and tag data to share with other users.

One problem I run into regularly is that everything is public on del.icio.us. For most links I add, I am happy to share them, along with the fact that I looked at them, cared to save them, and any comments and tags I might add. Del.icio.us starts out with the assumption that everyone who bookmarks something there will want to share it. As I use it more regularly, though, I sometimes find situations where I want to save something, but not necessarily in public. Typically I either

      a) don’t want to make the URL visible to the public, or
      b) don’t mind sharing the link, but don’t want to leave a detailed trail open to the public.

The first case, in which I’d like to save a link for my private use, is arguably just private information and shouldn’t actually be in a “social bookmarks” system to begin with. However, there is a social variant of the private link, which is when I’d like to share my link data with a group, but not all users. This might be people such as members of a project team, or family or friends. It’s analogous to the various photo sharing models, in which photos are typically shared to the public, or with varying systems of restrictions.

The second case, in which I’m willing to share my link data, but would like to do so without attribution, is interesting. In thinking about my link bookmarking, I find that I’m actually willing to share my link, and possibly my tag and comment data, but don’t want to have someone browse my bookmark list and find the aggregated collection there, as it probably introduces too much transparency into what I’m working on. At some point in time, it’s also likely that I would be happy to make the link data fully visible, tags, comments, and all, perhaps after some project or activity is completed and the presence of that information is no longer as sensitive.

The feature wish list above would address some of the not-quite-public link data problems, while continuing to accrete community contributed data. In the meantime, I’m still accumulating links back behind the firewall.

Another useful change to existing systems would be to aggregate tag or search results based on a selected set of users to improve relevance. This is along the lines of Memeorandum, which uses a selected set of more-authoritative blogs as a starting point to gauge relevance of blog posts. In the tagged search case, it would be interesting if I could select a number of people as “better” or “more relevant” at generating useful links, and return search results with ranking biased toward search nodes that were in the neighborhood of links that were tagged by my preferred community of taggers.

It’s possible to subscribe to specific tags or users on del.icio.us, but what I had in mind was more like being able to tag the users as “favorites” or by topic and then rank my search results based on their link and tag neighborhoods. I don’t actually want to look at all of their bookmarks all the time.
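
A rough sketch of that kind of biasing, assuming the engine’s base relevance scores and each URL’s bookmark data were available (everything below is invented for illustration):

```python
# Sketch: re-rank search results by boosting URLs that sit in the bookmark
# "neighborhood" of taggers I've marked as favorites for a topic.
# Favorite lists, bookmarks, and base scores are all invented.

favorites = {"search": ["alice", "dmitri"]}          # my trusted taggers, by topic

bookmarks = {                                        # who bookmarked which URL
    "http://pagerank-survey.example": ["alice", "bob"],
    "http://seo-spam-farm.example": ["spambot1", "spambot2", "spambot3"],
    "http://spam-detection-paper.example": ["dmitri"],
}

base_scores = {                                      # scores from the search engine
    "http://pagerank-survey.example": 0.62,
    "http://seo-spam-farm.example": 0.71,
    "http://spam-detection-paper.example": 0.55,
}

def rerank(topic, boost=0.3):
    """Order results by base score plus a boost per trusted-tagger endorsement."""
    trusted = set(favorites.get(topic, []))
    def adjusted(url):
        endorsements = len(trusted & set(bookmarks.get(url, [])))
        return base_scores[url] + boost * endorsements
    return sorted(base_scores, key=adjusted, reverse=True)

print(rerank("search"))
# The spam farm drops below the two trusted-endorsed results despite its higher base score.
```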

Something similar might also work with search result page clickthroughs. These sorts of approaches seem attractive, but also seem too messy to scale very well.

Unattributed links may be too vulnerable to spamming to be useful. One possible fix could be to filter unattributed links based on the authority of the source, without disclosing the source to the public.

I was at the Techcrunch meetup last night, didn’t have a chance to talk with the del.icio.us folks who were apparently around somewhere, but Ofer Ben-Shachar from Raw Sugar did mention that they were looking at providing some sort of group-only access option for their tagging system.

A lot of this could be hacked onto the existing systems to solve the end user problem easily, but some of the initial approaches that come to mind start to break the social value creation; I think that value could be preserved, while making better provisions for “private” or “group” restrictions, by working on the platform side.

Better Information Isn’t Always Beneficial

In today’s WSJ, David Wessel outlines some systemic social problems that can arise as information becomes widely available at lower costs.

I like the description of the problem in a quote by Kenneth Arrow: “socially useless but privately valuable” information, which can provide individual benefits, but at an overall cost to society at large.

This is the inverse of the dynamics driving “web 2.0”, which thrives on the sharing of “privately useless but socially valuable” information such as click streams, tagging, location awareness, presence awareness, etc.

Computer and communications technology is making more and better information available ever more quickly. This is a good thing — usually.
But there are some things we don’t want to do more efficiently. Doing them better adds neither to the U.S. national psyche nor to the gross domestic product. Figuring out which is which is a growing challenge to society as technology makes gathering and analyzing information easier and cheaper.

This issue predates computers. “The contrast between the private profitability and the social uselessness of foreknowledge may seem surprising,” the late economist Jack Hirshleifer wrote in 1971. But there are instances, he argued, where “the community as a whole obtains no benefit … from either the acquisition or the dissemination (by resale or otherwise) of private foreknowledge.”

Imagine a place with uncertain weather where food is plentiful in rainy spots, but not in others. Residents, in essence, buy insurance. The lucky feed the unlucky. No one starves. Then it becomes possible to buy accurate weather forecasts. One who buys the forecast knows whether he needs insurance or not; he profits. But the total amount of food available is unchanged. And if everyone buys the weather forecast, the insurance market becomes impossible. “There is a double social loss — the resources used unnecessarily in acquiring information and the destruction of a market for risk sharing,” Mr. Arrow said when he posed this example in 1973.

Update 09-23-2005 10:31 PDT: Discussion on this topic at Slashdot (noisy, but some interesting comments).

iTunes has video podcasting support

I wrote earlier today about my reluctant late-adopter status for audio podcasting, and now I’ve come across an article about Apple quietly introducing video content to the iTunes Music Store.

The quiet, fanfare-less launch of video podcasting (in fact, it’s not even clear when it was launched) is a bit surprising for the company, but there may be a reason: there’s not too many video podcasts out there in the wild. Furthermore, video podcasts are currently only playable on your computer, although it seems clear enough that a video iPod is on the way. If you didn’t believe it before, you should definitely believe it now.

I don’t recall if anyone mentioned video on iTunes at last night’s Search SIG discussion. Ev Williams (from Odeo) commented that a lot of what makes audio podcasting compelling doesn’t apply to video, in that audio can be consumed anywhere, and has an existing use model (drive-time radio), while video is typically consumed while sitting down in front of an increasingly large television at home. Eric Rice did show a live demo of video blogging on Audioblog, illustrating the possibility of large scale user-created video content in the future. I’m not sure who’s going to look at all the video, though. Perhaps the same people who are watching reality TV shows.

Once again, I’m well outside the demographic, since I barely watch any television at all these days. If I could get a commercial video podcast service to replace cable TV with, I’d probably subscribe now, though.

Google Blog Search – Referrers Working Now

Looks like Google Blog Search took out the redirects that were breaking the referrer headers.

Now the search keywords are visible again. Here’s a typical log entry:

xxx.xxx.xxx.xxx - - [15/Sep/2005:15:58:13 -0700]
"GET /weblog/archives/2005/09/15/podcasting-and-audio-search-at-sdforum-searchsig-september-2005/
HTTP/1.1" 200 26981 "http://blogsearch.google.com/blogsearch?hl=en&q=odeo&btnG=Search+Blogs&scoring=d"
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.10) Gecko/20050716
Firefox/1.0.6"

Blogger Buzz says the redirect was in place during development to help keep the project under wraps.
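
With the redirects gone, pulling the keywords back out of log lines like the one above is a small parsing exercise. A minimal sketch for Apache combined-format logs (the sample line is abbreviated):

```python
# Minimal sketch: extract the search query (q=...) from the referrer field of
# an Apache combined-format log line, like the Google Blog Search entry above.

import re
from urllib.parse import urlparse, parse_qs

# The referrer is the second-to-last quoted field in combined log format.
QUOTED = re.compile(r'"([^"]*)"')

def search_terms(log_line):
    quoted = QUOTED.findall(log_line)
    if len(quoted) < 2:
        return None
    referrer = quoted[-2]
    query = parse_qs(urlparse(referrer).query)
    return query.get("q", [None])[0]

line = ('1.2.3.4 - - [15/Sep/2005:15:58:13 -0700] "GET /weblog/ HTTP/1.1" 200 26981 '
        '"http://blogsearch.google.com/blogsearch?hl=en&q=odeo&btnG=Search+Blogs&scoring=d" '
        '"Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O) Gecko/20050716 Firefox/1.0.6"')
print(search_terms(line))   # -> odeo
```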

Google Blog Search – No Referrer Keywords?

Feature request to Google Blog Search team: please add search query info to the referrer string.

Lots of coverage this morning from people trying out Google Blog Search. (Search Engine Watch, Anil Dash, lots more)

I’m seeing some traffic from Google Blog Search overnight, but it looks like they don’t send the search query in the referrer. Here’s a sample log entry:

xxx.xxx.xxx.xxx - - [14/Sep/2005:00:51:09 -0700] "GET /weblog/archives/2005/09/14/google-blog-search-launches/ HTTP/1.1" 200 22964 "http://www.google.com/url?sa=D&q=http://www.hojohnlee.com/weblog/archives/2005/09/14/google-blog-search-launches/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4"

So there’s no way to know the original search query. I have a pretty good idea how the overnight traffic looking for the Google post got here, but there are also people landing on fairly obscure pages here and I’m always curious how they found them. I’m sure the SEO crowd will be all over this shortly.

There have been a number of comments that Google Blog Search is sort of boring, but I’m finding that there’s good novelty value in having really fast search result pages. Haven’t used it enough to get a sense of how good the coverage is, or how fast it updates, but it will be a welcome alternative to Technorati and the others.

Update 09-14-2005 14:01 PDT: These guys think Google forgot to remove some redirect headers.

Update 09-14-2005 23:25 PDT: Over at Blogger Buzz, Google says they left the redirect in by accident, will be taking them out shortly:

“After clicking on a result in Blog Search, I’m being passed through a redirect. Why?”
Sadly, this wasn’t part of an overly clever click-harvesting scheme. We had the redirects in place during testing to prevent referrer-leaking and simply didn’t remove them prior to launch. But they should be gone in the next 24 hours … which will have the advantage of improving click-through time.

Google Blog Search Launches

Google’s entry into blog search launched this evening, go try it out or read their help page.

This will be interesting competition for the existing blog search companies. It definitely responds fast at the moment, let’s see how it holds up when the next flash news crowd turns up…

via Niall Kennedy and Kevin Burton

Lazy Sheep considered harmful?

Rashmi just posted some thoughts about the Lazy Sheep bookmarklet.

From the Lazy Sheep page:

Using the tags and descriptions shared by other del.icio.us users, Lazy Sheep makes tagging a page a one-click operation. In order to best suit any user, Lazy Sheep also includes a comprehensive set of options that can be configured to your exact specifications.

Rashmi’s comments:

It makes some sense at the individual level – I can gain from the wisdom of the others, without doing any work. But even at the individual level, there are disadvantages. First, the auto-tags might not capture my idiosyncratic associations (reducing findability when I look for the article later on). Second, it replaces the self-knowledge with social knowledge. Instead of a moment of reflection on my current interests, I simply find out how others think about the topic. Social knowledge in the context of self-knowledge is a beautiful thing, mere social knowledge just encourages the sheep mentality (which is the point of the bookmarklet I guess).

At the social level (which is what worries me more), if enough people started doing this, the value of del.icio.us would be diluted. We would loose some of the richness of the longtail, and just reinforce what the majority is saying. The first few people who tagged the article would set the trend – others would merely follow.

I seem to be having a lot of conversations with people lately about tagging and group search. I think of the auto tagging embodied in Lazy Sheep as an amplifier for the biases of the first few taggers. A less problematic solution would be to only use your own tags as input to the Lazy Sheep, or perhaps to select some “similar-thinking” taggers as a starting point.
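
A minimal sketch of the “similar-thinking taggers” variant: measure tag-vocabulary overlap between me and other users, and only auto-suggest tags that sufficiently similar users have applied to the URL in question (all data below is invented):

```python
# Sketch: auto-suggest tags for a URL, but only from users whose overall tag
# vocabulary overlaps with mine ("similar-thinking taggers"), instead of from
# everyone. All user data here is invented.

my_tags = {"search", "pagerank", "spam", "tagging", "python"}

user_tags = {                      # each user's overall tag vocabulary
    "alice": {"search", "pagerank", "ir", "crawling"},
    "bob": {"recipes", "travel", "photos"},
    "carol": {"tagging", "spam", "folksonomy", "search"},
}

url_tags_by_user = {               # tags each user applied to the URL in question
    "alice": {"pagerank", "linkanalysis"},
    "bob": {"cool", "toread"},
    "carol": {"spam", "linkanalysis", "folksonomy"},
}

def similarity(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_tags(min_similarity=0.25):
    suggestions = {}
    for user, vocab in user_tags.items():
        if similarity(my_tags, vocab) >= min_similarity:
            for tag in url_tags_by_user.get(user, set()):
                suggestions[tag] = suggestions.get(tag, 0) + 1
    return sorted(suggestions, key=suggestions.get, reverse=True)

print(suggest_tags())   # tags from alice and carol only; bob's "cool"/"toread" are ignored
```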

I’ve been thinking about something like the latter for building a better personal search and tagging system. I’d like to be able to bias the search results based on the attention choices of people I think might be relevant, not the entire world. On the other hand, I don’t want to give up my entire clickstream for public consumption.

An aside on the tagging bias issue: Hal Abelson mentioned to me the other day that “IRC” and “Mouse” are closely related in some tag relatedness searches, because “IRC” is associated with “Chat”, and “chat” is French for “cat”, which relates to “Mouse”.

In my case, I consciously tend not to look at what tags have already been applied, because I’m hoping in the future to apply some sort of clustering or other relatedness filters on my own bookmarks to improve searches if I eventually accumulate enough data and motivation.

I think auto tagging can be very helpful, but it might be like using PowerPoint templates: after a while everything starts turning out the same way if you’re not careful.
