Most Popular Posts of 2005

As 2005 comes to a close, here’s a look back at some of the top posts this year based on page views, which seem to have been a mix of technology, business, travel, and random.

Go to Sleep!

Go to sleep!

Why Link Farms (used to) Work

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have an unrealistic importance. The author then goes on to explain a new algorithm with the same complexity as the original PageRank algorithm that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.

This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.
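To make the link farm effect concrete, here is a minimal sketch of the textbook PageRank power iteration. It is not Tal-Ezer’s proposed algorithm, and it is certainly not what Google runs in production. The tiny graph, the page names, the damping factor of 0.85, and the choice to spread a sink page’s rank evenly over all pages are assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by sink pages (no outbound links) is spread evenly (an assumption).
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * sink_mass / n for p in pages}
        for p in pages:
            for q in links[p]:
                new_rank[q] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


def build_graph(farm_size):
    """Hypothetical graph: a small 'main web' cycle, one sink target, and a link farm."""
    links = {"a": ["b"], "b": ["c"], "c": ["a", "target"], "target": []}
    for i in range(farm_size):
        # Each farm page exists only to link at the target.
        links[f"farm{i}"] = ["target"]
    return links


without_farm = pagerank(build_graph(0))
with_farm = pagerank(build_graph(20))
print(f"target share without farm: {without_farm['target']:.3f}")
print(f"target share with 20 farm pages: {with_farm['target']:.3f}")
print(f"legitimate page 'a' with farm: {with_farm['a']:.3f}")
```

In this toy graph the target’s share of the total rank goes up once the farm is added, even though the farm pages have no inbound links of their own. The link pattern and reputation analysis mentioned above is aimed at exactly this kind of structure.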

See also: A reading list on PageRank and Search Algorithms

The Return of Vinyl


It’s been a long time since I’ve had a working turntable at home. This evening I suddenly have lots of new old stuff to listen to.

There’s a divide in the music I’ve been listening to for the past ten years or so. I packed away the records and turntable around the time our daughter was born, thinking that I’d put it back together when she was old enough not to destroy the records. So, ten years later, I have a fairly large collection of digital music, and a large collection of analog recordings which don’t overlap much, but which have been languishing in storage.

I’m happy to find that the turntable still works. Modern stereos don’t have phono inputs, so I ended up rummaging in the garage to dig up an old amplifier, which makes for a large but serviceable preamp. Right now I’m listening to Brian Eno’s Music For Airports.

Looking through the boxes I’ve hauled out so far is like receiving a musical time capsule from myself. There are a lot of albums I haven’t heard in a while and that Emily’s never heard at all. Tomorrow I think I’ll see how she likes J. Geils Live or The Roches. The plan is to gradually migrate the vinyl to digital and put it on the server with everything else, but this evening I’m just enjoying a bit of analog technology and album artwork the way it was meant to be.

I haven’t started researching the best solution for digitizing the albums and possibly cleaning up scratches, pops, clicks, and surface noise. Anyone have a favorite method they’d like to recommend?

Random Dreamhost issues

In case you were wondering where the site went, the past 24 hours or so has been a day of random issues with Dreamhost.

Yesterday afternoon they were having connectivity problems, which took all their customers offline for a few hours.

This morning, I discovered that this site was running, but all Dreamhost sites were unreachable via SBC/PacBell here in the Bay Area. From the logs it looks like Comcast and a variety of overseas networks were still able to connect. The Google proxy hack mentioned this morning on O’Reilly provided another quick path for looking at the web site from a different network to verify that connectivity was still working, at least from the Google data center.

A couple of hours ago I got what I thought was a response to my e-mail regarding the network connectivity problem, but which turned out to be one of the CPU utilization warning letters that have been going out lately:

[your] CPU minute usage for today is 56.15. The daily limit is 60 CPU minutes. You will continue to receive these notifications as long as your resource consumption is over 50 CPU minutes.

A little mysterious, since traffic to the site was off because of the network outage, and spam traffic hasn’t spiked either.

There aren’t any resource utilization logs posted yet. I wonder if the flaky networking over the past day contributed to the high CPU use by leaving a lot of processes around waiting for I/O that was coming in slowly or never.

Anyway, the site seems to be running normally as of this afternoon (or at least, I can get to it now).

See also: Dreamhost load average = 1004.16?

Googlepark: the battle for AOL


More business comics – the latest installment of Googlepark is up at Channel 9 (via Google Blogoscoped)

If you haven’t seen the previous episodes of Googlepark, here are links to the other installments: Googlepark.

Dilbert VC comics


Dilbert meets Vijay, the world’s most desperate venture capitalist.

See also: VC Comic Strips, GooglePark

Filtering, aggregating, searching, and monetizing the Long Tail

David Hornik asks: Where’s the Money in the Long Tail?

It is certainly the case that in the aggregate, Long Tail content is extraordinarily valuable. The question for VCs and entrepreneurs is “for whom?”

The real money is in aggregation and filtering and those will continue to be interesting businesses for the foreseeable future.

He points out that aggregators are building convenient one-stop shopping for people looking for topically-focused content, and derive economic value even when the content publishers do not.

David Beisel follows the money a little further:

…in the long run, the value of the network is not only determined by the number of nodes in it, but in the ability for the network to monetize those nodes.
…in calculating the value of a network, any equation describing it should contain a variable with the monetization rate (or proxied by the value to the user which can be monetized in the future). So while the number of nodes in a network surely is a fundamental (if not the majority, in many cases) driver of value, the value of the network itself to the user is also a very important component to the overall total.

Being the provider of a filtered view of online content is somewhat analogous to being an editor at a magazine or newspaper, a program director at a radio station, or an A&R rep at a record label. It usually doesn’t make sense to pursue some topics or styles as there’s either no audience, or a very low value audience, or an audience that’s too hard to reach.

Conversely, some publications do well on a very small base (financial newsletters and independent musicians come to mind). When the individual publisher (writer, musician, artist) develops their own audience, they are able to capture more of the value placed on their content by the consumers of content (readers, listeners, viewers) than when they are simply one of many aggregated content producers. People seek out their favorite writers in newspapers and magazines, talk show hosts on television, or musicians in local concerts. The content producers gain relative power over the distributors and a few can become their own branded media empire. (Think “Oprah”.)

From an investment point of view, it’s difficult to justify betting on any particular content producer becoming an online media star, for the same reasons aspiring writers/musicians/actors don’t get VC investment. (How are you going to know when you’ve got the next J.K. Rowling or Dan Brown on your doorstep looking for seed funding to write their book? )

In contrast, search, filtering and aggregation services can be built for specific audiences. The trick though is not just to find an audience, but to provide a service that is valuable over time to the audience, service provider, and content publishers. The Alexa Web Search Platform announcement this week is interesting not because it’s the best general purpose search engine, but because it may drop the effective cost of building some targeted filtering and aggregation services low enough to uncover some new interesting niches, in addition to the areas that are already being addressed by vertical search startups. Many of these niches may be profitable short term projects for a small team (or single person) but not durable enough to be investable, though.

Greg Linden adds:

Massive selection isn’t enough. To make the long tail accessible, irrelevant items should be hidden. Interesting items should be emphasized. Millions of poor choices should be reduced to tens of good ones. The value is in surfacing the gems from the sea of noise.

David Beisel has some suggestions:

Where’s my “social portal” for me as a skier enthusiast? Better yet, where’s the “About.com of social portals?” Or why isn’t About.com more social?

I suspect that someone will have that social portal for skiing enthusiasts in limited beta somewhere real soon now…

See also: The Home Pages of this New Era

Greenfuel – producing biofuel from smokestack emissions

algae biofuel reactor

Greenfuel Technologies creates bio-fuels or bio-diesel from the emissions of power plants and industrial facilities. The company’s system is being tested at MIT’s 20-megawatt power plant and it has an open invitation to other power plants. Its system produces raw oil stock from smokestack gases, reducing carbon dioxide emissions by 40% and nitrogen oxide emissions by 86%.

The system works by passing the smokestack emissions through an algae cultivation system which captures the carbon dioxide and also breaks down NOx. The algae can eventually be processed into biodiesel fuel.

via alarm:clock

See also: How Algae Clean the Air (Business 2.0, October 2005), Is Algae in your future? (Boston Museum of Science)

Deconstructing search at Alexa

Wow! Although the basic idea is straightforward, crawling and indexing for a general purpose search engine requires huge resources. Web crawlers are effectively downloading copies of the entire internet over and over, turning them over to indexing applications which scan the contents for structure and meaning.

The sheer scale of the task is a substantial barrier to entry for anyone wanting to develop a new indexing or retrieval application. Some projects have narrowed the problem domain, which can reduce the problem scope to a manageable level, but this announcement from Alexa looks like it may offer an exciting alternative for building new search applications.

John Battelle writes:

Alexa, an Amazon-owned search company started by Bruce Gilliat and Brewster Kahle (and the spider that fuels the Internet Archive), is going to offer its index up to anyone who wants it (details are not up yet, but soon). Alexa has about 5 billion documents in its index – about 100 terabytes of data.

Anyone can also use Alexa’s servers and processing power to mine its index to discover things – perhaps, to outsource the crawl needed to create a vertical search engine, for example. Or maybe to build new kinds of search engines entirely, or …well, whatever creative folks can dream up. And then, anyone can run that new service on Alexa’s (er…Amazon’s) platform, should they wish.

The service will be priced on a usage basis: $1 per CPU hour, $1 per GB stored or uploaded, $1 per 50GB data processed.
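As a rough feel for what those rates mean, here is a back-of-the-envelope sketch. The function name and the example job are hypothetical. The quote doesn’t say whether storage is billed per month or one-time, so this treats it as a flat per-GB charge, and it takes 1 TB as 1000 GB.

```python
def alexa_cost_usd(cpu_hours=0.0, gb_stored_or_uploaded=0.0, gb_processed=0.0):
    """Cost under the quoted rates: $1 per CPU hour, $1 per GB stored or
    uploaded, and $1 per 50 GB of data processed."""
    return cpu_hours * 1.0 + gb_stored_or_uploaded * 1.0 + gb_processed / 50.0


# Hypothetical job: one full pass over the 100 TB index (1 TB = 1000 GB assumed),
# 500 CPU hours of processing, and 20 GB of stored results.
full_pass = alexa_cost_usd(cpu_hours=500, gb_stored_or_uploaded=20, gb_processed=100 * 1000)
print(f"${full_pass:,.2f}")
```

Under those assumptions the data processing charge for a full pass is $2,000, and the CPU and storage lines bring the example job to $2,520. That is small next to the cost of running a crawler and storing the results yourself.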

There’s no announcement posted on the Alexa or Amazon sites yet; it’s apparently due out overnight. (Updated 12-13-2005 00:25 – the site is up now)

Not every search and retrieval application is necessarily going to fit onto the way Alexa has built their crawler and indexing infrastructure, or onto any other search engine platform, for that matter. But opening up access to more of the platform should make it possible for a lot of new ideas to be tried out quickly without having to build yet another crawler for each project. Up to this point, many search ideas can’t be evaluated without working at one of the major search engines. I suspect most development teams would prefer to get access to Google’s crawl and index data, but I’m certainly looking forward to seeing what’s available at Alexa when they get their documentation online in the morning.

More from Om Malik, TechCrunch, ReadWrite Web

Bangalore to be renamed Bengaluru

Looks like Bangalore is in line for an official renaming to either “Bengaluru” or “Bengalooru”. Times of India:

Chief minister N Dharam Singh told reporters in Gulbarga on Sunday: “We will rename Bangalore as Bengaluru on November 1, 2006, to mark the launch of Karnataka’s Golden Jubilee year – Suvarna Karnataka – on that day. I have issued a directive to chief secretary B K Das in this regard.”

The name, however, may undergo another change, for Ananthamurthy told The Times of India: “The name should be Bengal-oo-ru.” The CM spelt it out as Bengal-u-ru.

See also: Bangalore boom, traffic congestion

It’s the holiday season


Spent most of yesterday afternoon on the roof stringing lights. Fortunately, it was 60F, clear, and sunny here in California, unlike back east where they’re having huge weather.

I’m having a hard time getting my head into “holiday season” mode, though. Perhaps I’ve gotten too disconnected from mass media? I don’t see TV ads, I don’t go shopping at the mall, I don’t see most web ads, and the bushels of seasonal catalogs and junk mail go straight into the recycling bin. They don’t have Christmas plays at school either, although this year we did manage to catch the remastered “Charlie Brown Christmas” special on broadcast TV. I need to get my copy of the Vince Guaraldi Trio’s soundtrack album off vinyl and onto the server.

We were also up in San Francisco last week; the lights in Union Square were nice.
Christmas tree at Union Square

How (and where) to download your del.icio.us bookmarks

Last Friday’s announcement that Yahoo is buying del.icio.us has probably got more than a few people thinking about the future of the service and whether they want to keep using it. In any case, as with all of the interesting and useful web services out there, it’s good to take time now and then to back up your personal data, in case something goes sideways and the service becomes unavailable or unusable for whatever reason.

I’m personally planning on continuing to use del.icio.us, although there are a number of interesting tagged bookmarking alternatives out there, including running your own.

The first step is to get your personal bookmark data, which can be obtained through the del.icio.us API. You can retrieve all your saved bookmarks at del.icio.us/api/posts/all, which will return an XML file that can be saved to your local system and used as a backup or to import your bookmarks into another web application elsewhere.
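Once the XML file is saved, pulling the bookmarks back out takes only a few lines. This is a minimal sketch. The sample data is made up, and the element and attribute names (`posts`, `post`, `href`, `description`, and a space-separated `tag`) are my assumption about the export format, so check them against your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of a saved export; the element and attribute names
# (posts, post, href, description, tag) are assumptions about the format.
SAMPLE_EXPORT = """<posts user="example">
  <post href="http://example.com/pagerank" description="PageRank notes" tag="search pagerank"/>
  <post href="http://example.org/vinyl" description="Digitizing vinyl" tag="music"/>
</posts>"""


def parse_bookmarks(xml_text):
    """Return one dict per bookmark, with the tag string split into a list."""
    root = ET.fromstring(xml_text)
    bookmarks = []
    for post in root.iter("post"):
        bookmarks.append({
            "url": post.get("href"),
            "title": post.get("description", ""),
            "tags": post.get("tag", "").split(),
        })
    return bookmarks


bookmarks = parse_bookmarks(SAMPLE_EXPORT)
for bookmark in bookmarks:
    print(bookmark["url"], bookmark["tags"])
```

For a real backup, read the saved file into a string and pass it to `parse_bookmarks`. From there the list of dicts can be written out in whatever format the next bookmarking tool wants to import.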

The next step is to decide what you want to do with the data. Some alternative tagged bookmarking solutions include:

The following services are based on open source projects, so you can (or in some cases have to) run your own bookmarking system.

Yahoo already runs MyWeb2.0, which presumably will begin to merge with del.icio.us at some point. It has a lot of interesting features, but hasn’t had enough to get me to switch over up to this point. I’ve been wanting private bookmarks and tags on del.icio.us for a while, although I think I’ll be moving those off my desktop onto a roll-your-own server solution.

Any more suggestions? Reply in the comments and I’ll pull them up to the main post.

Here’s an extensive list of free bookmark managers at lights.com (via David Beisel)

Newsweek on white hat and black hat search engine optimization

via Seomoz:

This week’s Newsweek (December 12, 2005) features an article on white hat vs black hat search engine optimization. Among other things, it’s interesting that the topic has made it into the mainstream media.

A “black hat” anecdote:

Using an illicit software program he downloaded from the Net, he forcibly injected a link to his own private-detectives referral site onto the site of Long Island’s Stony Brook University. Most search engines give a higher value to a link on a reputable university site.

The site in question appears to be “www.private-detectives.org”, still currently #1 at MSN and #4 at Yahoo for searches on “private detectives”. It appears to have been sandboxed on Google.

Another interesting post at Seomoz features comments from “randfish” and “EarlGrey”, the two SEO consultants interviewed by Newsweek on the merits of “White Hat” vs “Black Hat” search engine optimization, and gives further perspective on the motivation and outlook of the two approaches.

In some ways one can think of the difference between search engine optimization approaches as a “trading” approach vs a “building” approach to investment. The “Black Hat” approach articulated in the Seomoz article tends to focus purely on a tactical present cash return to the operator, while the “White Hat” approach presumes that the operator will realize ongoing future value by developing a useful information asset and making it visible to the search engines. This makes an implicit assumption that the site itself offers some unique and valuable information content, which can’t usually be the case in the long run.

From an information retrieval point of view, I’m obviously in the latter camp of thinking that identifying the most relevant results for the search user is a good thing. However, the black hat approach makes perfect sense if you consider it in terms of optimizing the short term value return to the publisher (cash as information), while possibly still presenting a useable information return to the search user. This is especially the case for commodity information or products, in which the actual information or goods are identical, such as affiliate sales.

I’m a little curious about the link from Stony Brook University. I took a quick look but wasn’t able to turn up a backlink. One of the problems with simply relying on trusted link sources is that they can be gamed, corrupted, or hacked.

See also: A reading list on PageRank and search algorithms

Update 12-12-2005 00:30 PST: Lots of comments on Matt Cutts’s post, plus Slashdot

Yahoo goes after more tagging assets, buys del.icio.us

Yahoo continues down the path of more tagging and more collaborative content. Having already purchased Flickr, this morning they’re acquiring del.icio.us (terms undisclosed):

From Joshua Schachter at the del.icio.us blog:

We’re proud to announce that del.icio.us has joined the Yahoo! family. Together we’ll continue to improve how people discover, remember and share on the Internet, with a big emphasis on the power of community. We’re excited to be working with the Yahoo! Search team – they definitely get social systems and their potential to change the web. (We’re also excited to be joining our fraternal twin Flickr!)

From Jeremy Zawodny at Yahoo Search Blog:

And just like we’ve done with Flickr, we plan to give del.icio.us the resources, support, and room it needs to continue growing the service and community. Finally, don’t be surprised if you see My Web and del.icio.us borrow a few ideas from each other in the future.

From Lisa McMillan, an enthusiastic user of all 3 services (comment on the del.icio.us blog):

Yahoo that’s delicious! I live here. I live in flickr. I live at yahoo. This is insane. You deserve this success dude. Just please g-d don’t let me lose my bookmarks :-D I’m practically my own search engine. LOL

Tagged bookmarking sites such as del.icio.us can provide a rich source of input data for developing contextual and topical search. The early adopters that have used del.icio.us up to this point are unlikely to bookmark spam or very uninteresting pages, and the aggregate set of bookmarks and tags is likely to expose clustering of links and related tags which can be used to refine search results by improving estimates of user intent. Individuals are becoming their own search engine in a very personal, narrow way, which could be coupled to general purpose search engines such as Yahoo or Google.

I think Google needs to identify resources it can use to incorporate more user feedback into search results. Looking over the users’ shoulders via AdSense is interesting but inadequate on its own because there are a lot of sites that will never be AdSense publishers. Explicit input capturing the user’s intent, whether through tagging, voting, posting, or publishing, is a strong indication of relevance and interest by that user. I think the basic Google philosophy of letting the algorithm do everything is much more scalable, but it looks like time to capture more human input into the algorithms.

In a recent post, I pointed out some work at Yahoo on computing conditional search ranking based on user intent. The range of topics on del.icio.us tends to be predictably biased, but for the areas that it covers well, I’d be looking for some opportunities to improve search results based on what humans thought was interesting. As far as I know, Google doesn’t have any assets in this space. Maybe Blogger or Orkut, but those are very noisy inputs.

This seems like a great move by Yahoo on multiple fronts, and I am very interested to see how this plays out.

See also:

Update 12-12-2005 12:30 PST: No hard numbers, but something like $10-15MM with earnouts looks plausible. More posts, analysis, and reader comments: Om Malik, John Battelle, Paul Kedrosky.

Personalization, Intent, and modifying PageRank calculations

Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.

On the probabilities of transitioning across a link in the link graph, the paper’s example on pp. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead including one derived from “web usage logs”.

Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.

Some thoughts:

1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition probabilities and the personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than what’s relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results page for “PageRank”. So…I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied on a significantly reduced set of relevant pages.

2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.

3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.

4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
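To make the two knobs discussed above concrete, here is a minimal sketch of a PageRank-style power iteration in which link transitions are weighted by observed clicks and the random jump follows a non-uniform personalization vector. It is a toy illustration under my own assumptions, not the method from the Langville and Meyer paper or anything a search engine actually runs. The page names, click counts, and jump weights are all made up.

```python
def weighted_pagerank(click_counts, personalization, damping=0.85, iterations=100):
    """Power iteration with usage-weighted transitions and a non-uniform jump vector.

    click_counts[p][q] is a hypothetical count of clicks on the link p -> q.
    personalization[p] is the relative weight of jumping to p instead of
    following a link; weights are normalized to probabilities below.
    """
    pages = sorted(personalization)
    total = sum(personalization.values())
    jump = {p: personalization[p] / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for p in pages:
            out = click_counts.get(p, {})
            out_total = sum(out.values())
            if out_total == 0:
                # No observed clicks out of p: hand its rank to the jump vector.
                for q in pages:
                    new_rank[q] += damping * rank[p] * jump[q]
            else:
                # Follow links in proportion to how often they were clicked.
                for q, count in out.items():
                    new_rank[q] += damping * rank[p] * count / out_total
        for q in pages:
            new_rank[q] += (1.0 - damping) * jump[q]
        rank = new_rank
    return rank


# Hypothetical usage data: most visitors to "home" click through to "news".
clicks = {
    "home": {"news": 90, "paper": 10},
    "news": {"home": 50, "paper": 50},
    "paper": {},
}
uniform = {"home": 1, "news": 1, "paper": 1}
researcher = {"home": 1, "news": 1, "paper": 8}  # a user who tends to jump to papers

generic = weighted_pagerank(clicks, uniform)
personal = weighted_pagerank(clicks, researcher)
print("generic: ", {p: round(r, 3) for p, r in generic.items()})
print("personal:", {p: round(r, 3) for p, r in personal.items()})
```

With the same click data, shifting the jump vector toward "paper" raises that page’s score relative to the uniform case. Running something like this over a small, topic-specific collection of pages is the kind of reduced-scale conditional ranking suggested in the points above.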

Five principles of user generated content – Trust, Attention, Relevance, Authority, and Intent

Brad Feld summarizes much of the ongoing discussion about user-generated content into three points, in a recent post. Here’s a recap, with some additions:

  1. Trust
  2. Attention
  3. Relevance
  4. Authority (added in a reader comment)
  5. Intent (added by me)

These are recurring themes for the current generation of collaborative, intent-capturing, tagged, social-network-based, “web 2.0” applications.

It’s interesting to look at the difference between Trust and Authority. As an example, Wikipedia is clearly not “Authoritative” on any subject, yet people ascribe “Trust” to the content there. Topics that are strongly subjective or open to interpretation can sometimes be organized based on Trust more easily than through Authority. The “disputed content” mechanism on Wikipedia allows for a little of this, but part of the confusion comes from the underlying model of an encyclopedia, which is generally intended to be authoritative.

Attention is a big deal because communications and information technology is providing easier access to more and more information content, but there are still only 24 hours in a day, and human cognition isn’t increasing exponentially. This creates a scarcity of attention, and makes the ability for a 3rd party to steer a viewer’s attention more valuable over time. New tools and interaction models can help allocate attention (cell phone conversations while driving, mobile messaging while in meetings, multiple windows and displays on the desktop), but these don’t scale very far.

One of the reader comments also suggests Priorities as a fifth concept, although I think this might really be captured by Relevance. Time and Place are also important, but I would put them with Relevance as well.

I’m adding “Intent” to my augmented list. One of the challenges with keyword-driven search is trying to guess what the user is trying to ask. Social software applications tend to increase relevance and trust of information shared among users, and implicitly create alignment of intent among the participants. If you have better information about what the user is trying to accomplish, search queries and other interactions become much more effective, which is one of the reasons AdSense works so well.

At the end of his post, Brad was fishing around for acronyms using three letters – “TAR”. With my list of five items, it’s no longer a TLA, but “TIARA” or “TARIA” might work.

Bangalore boom, traffic congestion

Today’s (Sunday) San Jose Mercury News features a cover story on Bangalore, India, and draws some parallels with the Bay Area. The headline reads “The tech boom didn’t die. It just moved to India.” I find that I unexpectedly run into people from the Bay Area quite often during trips out there, and there has been amazing growth in salaries and real estate prices which reminds me of late ‘99 here. At the same time they seem to be hitting resource limits of various sorts. The water and power supplies can be spotty, the storm drains routinely flood the streets during monsoon season, the roads are overloaded, there’s often a shortage of hotel rooms, and the airport is remarkably bad, considering that so much of the local economy depends on foreign business travel.

Bangalore, the tech center of India, is booming as the Bay Area once did, becoming a world-class hub for tech jobs, economic activity and, increasingly, innovation. While Silicon Valley still retains a hold on high-end tech jobs, countless lower-level positions, particularly in software — and now some sophisticated research and development work — are shifting to this city of 6.5 million in southern India. The emergence of Bangalore — and of India — as a tech power signals a new world economic order that is both opportunity and threat to Silicon Valley.

The article also mentions the traffic (and the fact that it can take an hour to go a few miles). Reminded me to go dig up some video clips I’ve been meaning to do something with. Nothing spectacular, but as I travel, I find the differentness of the mundane aspects of daily life interesting, and there are lots of little things to see in these. (WM9 only, no Quicktime, I don’t have an encoder handy at the moment.)

See also:

I wonder if the Mercury News found the same cow that hangs out on Hosur Road. There are a few that are always wandering around along the side of the road, they must live nearby somewhere.

BrainJam, December 2005, search, privacy, transparency

brainjams
Spent a few hours this afternoon at Chris Heuer’s BrainJam event. Wasn’t able to make it to the morning sessions, but arrived in time for the end of lunch and the “youth user panel”, consisting of four college students. They all love Facebook. Not sure how representative they are of the general student demographic, since two of them are trying to put together a web startup. They all use free online music and movie access, mostly through sharing within the dorm networks.

During the Q&A I asked for the panel members’ thoughts on privacy and about having their college lives online in perpetuity. They’re vaguely concerned, but I don’t think the topic is really raising red flags for them. I think the high school and college users have more confidence in Facebook, MySpace, Xanga and others keeping their data private and/or it not making any difference to them in the future as social norms change. Part of it is that people are simply making things up on their pages, for the sake of attracting attention, and part of it is them not caring or not understanding that their web pages, chat transcripts, and even VOIP are mostly staying online forever. I think there are going to be a lot of interesting conflicts in the future as people start running into their past personae 5, 10, 15 years later in a societal context that hasn’t adjusted yet to perpetual transparency.

Afterwards the group broke out into smaller topical discussions. The first session I went to was on the 2-way RSS proposal from Microsoft (Simple Sharing Extensions, SSE). I’m starting to think of SSE as a way for MSFT to use an RSS container for solving the sync problem for applications like Windows Mobile syncing a device and a desktop, or Active Directory performing distributed synchronization of directory data. I’m not really seeing a federated publishing model based on this, an idea that was floated in the conversation. It really feels like it solves an application sync problem for structured data.

The session on “what to do with all the data?” quickly turned into a discussion on privacy, transparency, and DRM. I’m personally disinclined to depend on trusting anyone’s DRM system to manage my critical personal data, or to allow anyone to index my private data in a way that eventually gets exposed to the world. One point of view expressed in this discussion was that the world would be better off if everyone just got used to the idea that everything they did was recorded and visible to the world (the Global Panopticon), although I think the majority disagreed that this would actually make people behave better. Personally, I think that documenting everything would break a lot of the ambiguity in relationships and conversations that allows the formation of reasonable opinions, by forcing people into adhering to “statements” and “positions” that were nothing more than passing conversation or exploration of a topic. This was part of my thinking behind asking the college kids about privacy. In real life, there are normally various social transitions that call for stepping away from or de-emphasizing some aspects of one’s life, in favor of new ones. It doesn’t make the past behaviors and activities go away, but the combination of search engines and infinite, cheap storage is likely to keep some aspects of these folks’ “past” life in their face for a long time, which may make it harder to move forward.

Someone mentioned the idea of “privacy parity”, i.e. you can ask for my data, but I can see that you’re asking for it, sort of like being able to find out when someone has requested your credit report. This is interesting, but there are substantial asymmetries in the value of that information to each party. A bit of parity that would be very interesting would be a feed of who’s seen my site URLs and excerpts in a search results page — not the clickthrough, which I can already see, but when it’s turned up on the page at all.

A few of us continued a sidebar discussion on search, social networks, trust, and attention networks, and eventually got kicked out into the lobby where we were free to speculate on Google’s plan for world domination next to a huge globe in the SRI lobby. I haven’t bumped into anyone yet doing work on integrating the attention, social, and trust data into search. Doing this on a Google/Yahoo/Microsoft scale looks hard, because of the sheer volume of data, but I’m getting the sense that doing a custom search engine biased by the social / attention data inputs for a limited subject domain (100s to 1000s of GB) and a relatively small social / attention network (1000s of people you know or have heard of) is becoming more reasonable because of cheaper / faster / better IT hardware and because more of the data is actually becoming available now. Still chewing on this. I just came across Danah Boyd’s post on attention networks vs social networks yesterday, which concisely explains the directed vs undirected graph property which underlies part of the ranking algorithms that would be needed.
Perhaps someone’s already done this for a research project.
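To make the directed-graph idea a bit more concrete, here is a minimal sketch of a personalized PageRank over a tiny attention network. This is only an illustration under my own assumptions, not anything presented at BrainJam: the function name personalized_pagerank, the sample people, and the choice to teleport only to a "trusted" set are all hypothetical. The point is that edges are one-way (I pay attention to Alice, which says nothing about whether Alice pays attention to me), and that rank only flows outward from people I trust.

```python
def personalized_pagerank(links, trusted, damping=0.85, iterations=50):
    """Power-iteration PageRank with a teleport vector restricted to `trusted`.

    links:   dict mapping a node to the list of nodes it pays attention to
             (directed edges; an edge a -> b does not imply b -> a).
    trusted: non-empty set of nodes the searcher personally trusts.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Teleport (random jump) probability is spread only over trusted nodes.
    teleport = {n: (1.0 / len(trusted) if n in trusted else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                # Split this node's rank evenly across its outbound edges.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: hand its rank back to the trusted set.
                for t in nodes:
                    new_rank[t] += damping * rank[n] * teleport[t]
        rank = new_rank
    return rank


# Hypothetical attention graph: who pays attention to whom.
attention = {
    "me": ["alice", "bob"],
    "alice": ["bob"],
    "bob": ["alice"],
    "spammer": ["bob", "spam_page"],
}
scores = personalized_pagerank(attention, trusted={"me"})
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {score:.3f}")
```

In this toy graph nobody I trust points at "spammer", so neither it nor the page it links to picks up any rank, no matter how many links it creates. Making the edges undirected (treating every link as mutual) would throw that property away.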

If Google Desktop were open source, it might be a logical place to insert a modified ranking algorithm based on attention, tags and social networks and also to insert an SSE-style interface to allow peer-to-peer federation of local search queries and results. This would keep the search index data local to “me” and “my documents”, but allow sharing with other clients that I trust. Perhaps it’s just an age thing. The college kids didn’t seem to mind having all of their documents on public servers, are counting on robots.txt to keep them out of the global search engines, and apparently think that access controls on sites like Facebook will keep their personal postings out of the public realm. For me, I still think twice sometimes about posting to my del.icio.us bookmarks list and keep anything really critical on physical media in a safe deposit box in a vault. So while I’ve gone from being Ungoogleable to Google search stardom, there’s a good portion of my digital life which is “dark matter” to the search engines. I’d like to find a way to fix it for myself, and share information with people I trust, and refine my searches over the public internet, but without having to give Google or anyone else all of my personal data.

[Photos: Youth panel discussion; Wrap-up session]

Took a few photos; photos from others will probably turn up tagged with “brainjams”.

Update 12-04-2005 21:15 PST: Audio from the Youth Panel discussion on Chris’s blog
KRON-4 television piece on BrainJams. Looks like I missed the hula hoop part in the morning. I also seem to have mostly missed the non-profit community-oriented discussion, as you can see from my notes. Perhaps that’s what was going on when we got kicked out into the lobby for being too loud…

A reading list on PageRank and search algorithms

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.

  • Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380 Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
  • Online Reputation Systems: The Cost of Attack of PageRank (PDF)
    Andrew Clausen. A detailed look at the value and costs of reputation and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
  • SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF)
    András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized PageRank and the local PageRank distribution of linking sites.
  • Detecting Duplicate and near duplicate files – William Pugh presentation slides on US patent 6,658,423 (assigned to Google) for an approach using shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google and he doesn’t know if this particular method is being used in production (vs some other method).
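
For a feel of what shingle-based comparison looks like, here is a rough sketch. It is not the method from the patent or from the slides, just the textbook version of the idea under my own assumptions: word-level shingles, a window width I picked arbitrarily, and Jaccard similarity (shared shingles divided by total distinct shingles) as the score. The function names and sample sentences are made up for illustration.

```python
def shingles(text, width=3):
    """Return the set of sliding word windows ("shingles") of the given width."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < width:
        # Text shorter than one window: treat the whole thing as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
doc_c = "an entirely different sentence about record players and vinyl"

near_duplicate = jaccard(shingles(doc_a), shingles(doc_b))
unrelated = jaccard(shingles(doc_a), shingles(doc_c))
print(f"a vs b: {near_duplicate:.2f}")
print(f"a vs c: {unrelated:.2f}")
```

A single changed word knocks out every window that contains it, so wider windows are stricter about what counts as a near duplicate. A real system at web scale would presumably compare compact fingerprints of the shingle sets rather than the sets themselves, but that is beyond this sketch.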

I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.