These are my links for May 6th through May 7th:
- Mathematical Atlas: A gateway to Mathematics – "The Mathematical Atlas is a collection of articles about aspects of mathematics at and above the university level, but (usually) not at the level of current research. The goal of this collection is to introduce the subject areas of modern mathematics, to describe a few of the milestone results and topics, and to give pointers to some of the key resources where further information is to be found. Like any good atlas, we try to present several ways to look at each area and to show its relationship with neighboring areas and sub-areas. "
- Three Reasons Why Twitter Will NOT Index the Links You Share – ReadWriteWeb – Argues that Twitter will rely on bit.ly through partnership or acquisition to handle sentiment and semantic analysis of Twitter search and link contents.
- Tough Love For Microsoft Search – December 2008 post from Danny Sullivan on Microsoft and the search landscape.
- Annals of Innovation: How David Beats Goliath: Reporting & Essays: The New Yorker – Malcolm Gladwell, with a reporter at large on Vivek Ranadivé and his NJB girls basketball team, employing asymmetric strategies to overcome conventionally stronger teams, and a broader look at the history of insurgent strategies from David and Goliath, T.E. Lawrence, George Washington, etc.
These are my links for April 28th through April 29th:
- Inside Facebook Reports: Why Hasn’t Facebook Grown More in China? – A look at Chinese consumer internet and social media usage, QQ, 51, Xiaonei, Kaixin, and some reasons why there are only around 300,000 Facebook users in China today.
- Facebook maps the swine flu hysteria | The Web Services Report – CNET News – Visualizing interest in swine flu by mapping percentages of mentions on Facebook wall pages, using data from Lexicon.
- Develop Twitter API application in django and deploy on Google App Engine — The Uswaretech Blog – Django Web Development – Walkthrough of a sample Twitter application on Google App Engine, using Django/Python.
These are my links for April 9th from 08:07 to 17:53:
- IP address geolocation SQL database – IP address geolocation with MySQL by Marc-Andre Caron. He's done all the necessary legwork to solve this problem, putting together a free, monthly-updated MySQL dataset that will allow you to derive country, region, city, zip, latitude, and longitude from an IP address.
- Del.icio.us Finally Gets Some Respect from Yahoo – Probably Too Late – ReadWriteWeb
- In the Event That You Have Accidentally Swallowed the Higgs Boson by Michael Rottman – The Morning News – "7. Do you feel protons decaying? Grand Unification may be occurring near your vital organs. "
- FT.com / Companies / UK companies – Dotcom veterans in Twitter ‘brains trust’ – "Mr Read has brought together a “brains trust” of advisers to Twitter Partners, including Brent Hoberman and Martha Lane Fox, founders of Lastminute.com; Saul Klein, a partner at Index Ventures, the London venture capitalists; and Toby Coppel, the former European vice-president at Yahoo."
- byteonic.com » What you cannot do using Java in Google App Engine – List of some restrictions on Java code running on GAE
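For the IP geolocation item above, here is a minimal sketch of how a range-based lookup could work once the data is loaded. The table layout (start, end, country, city) and the sample rows are assumptions for illustration only, not the schema of Marc-Andre Caron's dataset, and the addresses come from reserved documentation ranges.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


# Hypothetical rows: (range_start, range_end, country, city), sorted by range_start.
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "AA", "Alpha City"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "BB", "Beta Town"),
]
STARTS = [row[0] for row in RANGES]


def locate(ip):
    """Return (country, city) for the range containing ip, or None if no range matches."""
    n = ip_to_int(ip)
    # Find the last range whose start is <= n, then check that n is inside it.
    i = bisect.bisect_right(STARTS, n) - 1
    if i < 0:
        return None
    start, end, country, city = RANGES[i]
    if n <= end:
        return (country, city)
    return None
```

In SQL the same idea is usually a `BETWEEN` query on integer range columns; the binary search here just mirrors what an index on the start column would do.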
These are my links for February 27th through February 28th:
These are my links for February 26th from 10:39 to 20:05:
These are my links for February 24th through February 25th:
- The C10K problem – On techniques for scaling to a large number of network clients (e.g. >10000).
- Yodel Anecdotal » Blog Archive » Hello, (twitter) world – List of official Yahoo twitter handles for various activities including research, geo, search, and yui.
- New AWS Public Data Sets – Economics, DBpedia, Freebase, and Wikipedia – AWS adds Freebase, DBPedia, Wikipedia extract, and US Transportation data sets.
- eigenclass – Related document discovery, without algebra – Another approach to simple related document discovery, based on tags, should work ok for small data sets.
- SVD Recommendation System in Ruby – igvita.com – A 50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.
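To give a flavor of the SVD approach mentioned in the last item, here is a rough sketch in Python with NumPy. It is not the code from the igvita.com post (which is in Ruby); the function names, the users-by-items ratings layout, and the convention that 0 means "unrated" are all assumptions made for this example.

```python
import numpy as np


def low_rank_scores(ratings, k):
    """Approximate a users x items ratings matrix using its top-k singular values."""
    u, s, vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k strongest components and rebuild the matrix from them.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def recommend(ratings, user, k=2):
    """Return the index of the unrated item with the highest approximated score.

    Assumes a rating of 0 means "not rated"; returns None if the user rated everything.
    """
    scores = low_rank_scores(ratings, k)
    unrated = np.flatnonzero(ratings[user] == 0)
    if unrated.size == 0:
        return None
    return int(unrated[np.argmax(scores[user, unrated])])
```

The low-rank reconstruction fills in plausible scores for the zero entries based on users with similar rating patterns, which is the "simple linear algebra" doing the work.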
Barry Ritholtz points out the new community sentiment feature, part of the new front page for Yahoo Finance.
Stock message boards are a fascinating place to scan through from time to time, containing a mix of informed, uninformed, and sometimes deliberately misleading posts. On average, the post volume and prevailing sentiment are probably a good contrary indicator. Part of what makes stock message boards interesting is the sheer volume of misdirection and general noise. At the same time, there are a smaller number of board posters that contribute more than blind cheerleading or bashing their chosen stocks.
I spent some time a while back looking at trying to automate the process of scanning the Yahoo Finance boards for “informed” or otherwise actionable message flow, but concluded that the project wasn’t worth the effort unless I was working for Yahoo. Part of the process I was considering would have been determining a reputation value for messages and contributors, partly based on historical outcomes and partly based on user-generated ratings. The last revision to the finance message boards incorporates a simple rating system, but what I was looking for was the ability to see the ratings from a trusted set of users, and the ability to rate the posters, and perhaps their likely context (long term holder, swing trader, or intraday trader). The other piece would have been to implement some backtesting on the various sentiment indicators to see whether they had any trading value at all.
The sentiment indicator is being generated by Collective Intellect, which says:
Using proprietary algorithms, Collective Intellect’s Media Intelligence™ service filters and ranks bloggers and posts, so you only see the most credible sources of information — and only when it’s relevant to your trading strategy.
I think it’s much easier to deal with analysis of the financial bloggers than the stock message boards and chat rooms. It will be interesting to keep an eye on how this new feature works out.
Here’s a quick snapshot of incoming search engine referrals for the past few weeks. Compare this with another post last year on search engine referral share, recently referenced in a post at Alexa noting the discrepancy between the published search engine traffic reports and anecdotal observations by webmasters.
Is it just me, or are these charts a bit goofy? Does Yahoo really still have 23% of the search market? Is Google at less than half the search market?
I don’t believe it. Any webmaster will tell you that Google represents almost ALL of the search engine traffic. Yahoo is nowhere near 23%. Just read the blogs, here, here, here and here and on countless other blogs.
Already at 82% last October, Google has increased to even more of the incoming search traffic (92%) here, largely at the expense of “Other”. In the fall, it looked like those were mostly miscellaneous Chinese search engines, so perhaps my site is not getting indexed or ranked well there anymore, or Google is picking up market share, or both.
Some of the commenters at the Alexa post noted increasing traffic from Microsoft / MSN / Live search, including one who got most of their traffic through MSN search. I’m a little surprised that I don’t see more traffic from Yahoo and Microsoft search here, but that may also be a function of who’s likely to be searching for a given topic.
See also Greg Linden’s comments on the competitiveness of Yahoo and Microsoft search efforts.
Dropped by the Yahoo Finance message boards this evening to scan through comments. The Yahoo Message Boards have been around in the same form for nearly as long as Yahoo, and for the past several weeks Yahoo has been testing a new format, which I find hard to read. Fortunately, there was a link to return to the original version, and I think it’s been popular.
Sometime over the weekend, all Yahoo Finance message boards were upgraded to the new version, with no way to get to the old version.
This post captures the sentiment of many members:
IMPORTANT MESSAGE FROM YAHOO!
Hi, I am the Project Manager for the new Yahoo Message Boards. I just wanted to let you all know that we will be adding even more new features to the stunning new boards next week:
1) Starting next Monday, all new posts will automatically be translated into Norwegian. Our Yahoo Finance development team decided that users would want this, and since we do not speak directly to users before updating our site, we will assume this is a highly desired feature. The brilliant Stanford Ph.D.’s we hired to update the site thought it would be cool! Due to disk storage issues, we regret that we will no longer be able to offer your posts in English. Please learn Norwegian, and let us know how it goes on our Feedback form! We read every form you send us!
2) As you know, we recently stopped listing stock board posts chronologically, and went to a thread-based system. We feel that stock investors do not need to clearly see the latest posts in chronological order, during the trading day, and our Ph.D. geniuses from Stanford think you would prefer to dig through threads to find old posts on your stocks. More importantly, later this month, we will be removing the dates and times from all posts, and then in September we plan to mix them all up randomly. We hope that you are not inconvenienced by reading 8 year old posts when making investment decisions. Please use the helpful Feedback form to let us know what you think. We read every one!
3) As you may know, in the Beta we switched from a simple and effective system of “recommending” posts with a click, to a more complex one in which you rate posts with stars. Of course, you can only see the number of stars for the first post in a thread, but we feel this is better even if no one ever uses it. Our Ph.D. developers informed us that “complex” equals “better”. So next month, we will replace the star system with a new even more complex one, in which you will manually calculate the cube root of how much you like the post on a scale of 3.4 to 11, and then divide by pi. Please let us know what you think on the Feedback form, which next month will only accept entries in Hexadecimal. We read every post.
Finally, just to get you excited and build some anticipation, I wanted to let you know that our Ph.D.’s are working on a new feature for 2007. Posts will be automatically scanned for keywords by our cool super-complex search technology, to determine if they would be better suited to a different stock board, and if so, they will be automatically moved. We feel the slight inconvenience of having posts moved around by the system will be outweighed by how cool the technology is!
Yahoo Finance Development Team Manager
The new version defaults to thread-oriented, has a 5-star rating system, and provides a filter to view only highly rated posts, similar to Slashdot. This seems like it could be helpful to people who want to see popular topics. Unfortunately, the new message board format makes it very difficult to view posts in time sequence. It also doesn’t have the old message numbers, so people who kept lists of “useful” posts are out of luck, and in general seems to make it difficult to get to older posts except through search.
There are typically many more readers/lurkers than writers/posters on any message board. The board revisions seem geared toward helping the occasional reader, but also seem unpopular with the current board communities who actually generate the content. There’s a lot of grumbling, and at least some early signs of migration to other boards, such as Investors Hub, Raging Bull, Silicon Investor, and Investor Village.
After some experimenting, I’ve found it a little easier to use after changing the view preferences to “expanded” and “message list”. I suspect the threaded format may eventually help separate traffic between the buy-and-hold crowd and the short term traders, if people stick with the new system. In the meantime, there appears to be a surge of people trying out the other services.
It will be interesting to see how the transition works out. I find the new format more difficult to read, and it seems to be unpopular among the existing communities. On the other hand, the new message board format may make it easier for new people to participate, and could grow new communities to replace the existing ones.
There’s some speculation on Yahoo’s intent; this post is representative:
It’s puzzling that yahoo would enforce this new system when there’s been clear evidence (from neglect of the trial format in the last weeks) that it’s not popular. I suspect there is a familiar style of coercive industrial ad mgt driving things. If you use the new format, you are coaxed/forced into playing a sort of teenage interactive popularity contest, like voting for pop stars. I assume this is to persuade advertisers that the system gooses up user enthusiasm. The problem is, as in consumer mkting, the “threads” preselect what you can see and respond to, channeling your attn the way news reporting and ads preselect the reality you can see. In effect, yahoo is part of a larger industrial paradigm in which life is a consumer decision-tree rather than a play of curiosity and discussion / analysis. Part of the frustration being felt is that you know you’re a mouse in a maze. You’re being trapped inside a teenie bopper fan magazine rating “products” rather than sharing ideas. It’s a shame to see such a useful forum crippled by childish advertising trickery.
I’m less cynical about the intent of moving to a threaded view, but it’s clearly an uncomfortable change for many participants who are more engaged there than myself. The challenge for Yahoo is that much of the value in the Finance boards is that there’s enough traffic and/or useful posts on many of them (AAPL, GOOG, TIE, most highly traded stocks) to make it worth checking out from time to time, but the interests of the active board community are different than those of a casual viewer, and for the moment there’s a disconnect in progress. Yahoo may have also picked an inopportune weekend to switch over, since posting volume is likely to be high during the next few days, reacting to events in the Middle East.
I was recently pointed at Instant Bull, a new site intended to scan multiple finance boards. Unfortunately it wants Firefox 1.5, and I’m still running 1.x for now, so I’ll have to check it out later.
Update 07-16-2006 23:00PDT: More from PaidContent, GigaOm, CNet
Update 07-18-2006 19:45PDT: There’s an impressive level of antipathy toward the new message board format. Yahoo Finance members have rapidly started setting up beachheads on other sites. One anecdote from the YHOO board:
ELN on Investor Village shows that over 800 members and over 1800 guests (probably folks checking out alternative boards) have visited in the last 24 hours! Who even heard of Investor Village before this week?
I did a search of messages on Yahoo’s ELN board and saw that there were almost 500 postings between 6AM yesterday and 6AM today (for some reason, Yahoo’s search function is not showing any results for postings after 6AM today), most of which were probably related to complaints about the new format. Then I did a search on the number of postings done since 6AM today on the ELN board on Investor Village and there’s been over 400 posts! Considering they only have 1805 posts in total on the board and considering the number of people that have visited the ELN board on Investor Village in the last 24 hours, it tells me if Yahoo doesn’t find a way to bring back something similar to the old format and the complainers on Yahoo see that their complaints are getting them nowhere, there’s gonna be a mass exodus of all these people that are now posting on or at least checking out Investor Village.
This ‘upgrade’ implementaion has been a disaster. Competitors like Investor Village are taking advantage. Even if Yahoo gains some of their traffic back, they won’t recover all of it just like Coke didn’t gain all of their market share after the New Coke fiasco. If this an indication of Yahoo’s current business model, do you really want to be an investor in this stock as future ‘upgrades’ are implemented????
Vestiges of the “old” system are still around in the non-Finance sections of Yahoo, so others have been trying to set up shop there, such as this alternative AAPL board. I suspect these may not last for long.
Tom Foremski has commented on the uproar about the new message board format in the context of user interface design: “people are creatures of habit and nobody wants to have to learn a new user interface”. I agree, but I think there’s more to it than people not wanting to change. I personally find the new format difficult to visually scan, and in retrospect I see that I tend to watch for interaction among certain individual contributors, as well as for general noise level around various topics. The new system would probably work well for topically driven forums, while many of the high volume forums border on IRC chat.
Message boards are pre-Web 1.0 social software, dating back to the days of dialup BBSes. One view might be that the users just don’t “get” the Web 2.0 fit and finish being wrapped around Yahoo Finance. However, I think the clash here has mostly been about a mismatch between the existing community of users and the use of the site as envisioned by the Yahoo Finance product management team.
I conclude that there’s either a serious gap in how the user testing and feedback process worked, or there’s been a conscious management decision to change the character of the Finance Boards product, to clean up the content and make it better behaved by making it less interactive. Historically, many of the posts are of questionable merit, laced with profanity, innuendo, misrepresentation, and other disinformation. However…if you knew that already, then the flow of the pumper/basher posts itself was a useful data point, along with posts from individual traders and investors offering up independent opinions. Looks like that’s another bit of history now.
YHOO shares dropped hard in after-hours trading today. The latest earnings matched expectations, but search monetization isn’t growing well. Ironically, it sounds like at least a few traders shorted YHOO at the close, out of a combination of spite and a sense of management distrust following the message board fiasco. Not a sound rationale for the trade, but it clearly worked out for them.
Update 07-20-2006 14:36PDT – Yahoo has added a link to the old message list view, labeled “view all messages”, next to “view all topics”. The individual posts are still formatted in the “new look” though.
I’m really curious about what effect this is having on traffic and monetization at Yahoo Finance. I recognize a number of user handles that have moved to Investor Village or Investors Hub, and there are daily notices there from the site operators on server upgrades and other steps to accommodate the unexpected boost in traffic.
Some of you may also be interested in checking out SaneBull, an example of an AJAX-based stock info scanner. via TechCrunch
Google launched Google Finance today. Lots of people have written about it already, generally nonplussed. Here’s my quick reaction.
- News events plotted on the stock chart timeline. I wish they’d add this to Yahoo Finance.
- Ajax UI for scrolling the stock chart around and changing the time window
- Recent blog search results on the right sidebar (although they seem to be a few hours behind)
I wish for:
- More charting features. There basically aren’t any right now.
- Better integration of the “More Resources” features. Things like SEC filings, institutional holders, and earning estimates are all provided by 3rd parties via outbound links, making it hard to flip through.
Technical charting and research reports are provided via Yahoo Finance, although the discussions are hosted at Google Groups.
The feature I’d really like to see is an intelligently filtered view of the Yahoo Finance discussion boards. There is some interesting and useful information there, but a far larger quantity of rants, spam, and trolling in between.
P.R.A.S.E., aka “Prase” is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.
As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.
PageRank is only used by Google, which no longer uses the original PageRank algorithm for ranking results, but it’s still interesting to see the top search results from the three major search engines laid out with PR scores to get some sense of the page linkage.
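The combined view described above is easy to picture as a merge-and-sort. Here is a small sketch, assuming each result already carries its engine, its rank on that engine, and a toolbar-style PageRank value; how Prase actually gathers or orders its data isn't something I know, so the `Result` type and the tie-breaking rule are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    engine: str    # e.g. "google", "yahoo", "msn"
    rank: int      # position in that engine's results (1 = top)
    url: str
    pagerank: int  # toolbar-style PageRank, 0-10


def merge_by_pagerank(results):
    """Combine results from several engines, highest PageRank first.

    Ties are broken by the engine's own rank, then by engine name,
    so the ordering is deterministic.
    """
    return sorted(results, key=lambda r: (-r.pagerank, r.rank, r.engine))
```

Laying the merged list out this way makes the PR0 surprises easy to spot: a page can be ranked first by one engine and still land at the bottom of the combined list.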
Last Friday I spent an hour with my daughter’s 4th grade class, helping them do online research for reports on early California explorers. They were individually assigned an explorer, and were looking for basic biographical information such as dates and places of birth and death, and notable historical achievements or other interesting items to write about. From my perspective, this turned out to be a sort of small focus group on using search engines.
I spend most of my time around people who are pretty good at using search engines and online research tools, so it was interesting to see what they would do with this assignment.
The kids are all familiar with computers to varying degrees. They have had classroom activities using the computer at least once a week since kindergarten, and most of them have some experience using computers at home (this is Palo Alto, after all). I don’t think they’ve done any organized “internet research” in school up to this point, though.
They all started with their research subject’s name written on a piece of paper and had about 20 minutes to find some useful information.
Here are some observations:
- Simply typing in the names of the explorers was challenging for many of them (“Joseph Joaquin Moraga”, “Ivan Alexandrovich Kuskov”, and others I can’t recall).
- They often tried to type the search phrase into the address bar. I also saw at least one person try to type the search phrase into a form entry field in an advertisement.
- Their default home page is set to Yahooligans!, which is kid friendly but seems to sharply limit the search results. I had the kids try their queries there first, but most of them returned zero search results.
- I then let the kids choose which search engine they wanted to use. About a third of the kids voluntarily expressed a preference for using Google, most of the rest didn’t know or care (I sent about half to Yahoo and half to Google), and one kid really wanted to use A9 (strange, I didn’t have a chance to find out why).
- None of the kids were familiar with using quote marks to specify exact phrase matching. Some of the explorers’ names contain commonly occurring components and return a large number of irrelevant results without quotes.
- None of the kids were familiar with the advanced search operators for excluding or qualifying search results. I had to help out in a couple of cases where they were having trouble finding relevant pages.
- Some of them didn’t understand the difference between page content and the ads in the headers, footers, and sidebars.
- Some of them were already familiar both with Wikipedia and with the benefit and problem that anyone can change the page. One person wanted to look exclusively on Wikipedia after the subject came up.
- The absence of a bookmarking system for the students to use tends to force them to print out pages they want to use later. This isn’t wonderful at a school lab, since the content is semi-disposable and they’re usually scrounging to conserve printer consumables like toner and paper. The kids liked having something to take back to the classroom with them, though.
- The variations in spelling for the mostly Spanish names caused problems for some queries. Google’s “did you mean” suggestions were helpful. At least one query (which I can’t recall) consisted entirely of common Hispanic names, which matched several famous people other than the intended query subject. This is similar to the problem of searching on common Asian names (like mine).
- Some students quickly clicked themselves into a rathole of completely unrelated pages, usually after clicking on an ad.
Watching the kids trying to find useful pages highlighted the differences with my usual search behavior, which is to quickly scan the search results page, then refine the query using additional keywords and/or search operators, both of which are hard for 9- and 10-year-olds to do. In “research mode” I usually open results in a new browser tab or window. The kids actually click through the link, making it hard to work through a list of candidate results.
Coincidentally, earlier this week I came across a post on Google Blogoscoped which points to a recent dissertation on search user interface design geared towards kids, by Hilary Browne Hutchinson at the University of Maryland, which has some interesting observations and ideas.
Last Friday’s announcement that Yahoo is buying del.icio.us has probably got more than a few people thinking about the future of the service and whether they want to keep using it. In any case, as with all of the interesting and useful web services out there, it’s good to take time now and then to back up your personal data, in case something goes sideways and the service becomes unavailable or unusable for whatever reason.
I’m personally planning on continuing to use del.icio.us, although there are a number of interesting tagged bookmarking alternatives out there, including running your own.
The first step is to get your personal bookmark data, which can be obtained through the del.icio.us API. You can retrieve all your saved bookmarks at del.icio.us/api/posts/all, which will return an XML file that can be saved to your local system and used as a backup or to import your bookmarks into another web application elsewhere.
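Once the XML file is saved, reading it back is straightforward. The sketch below assumes the backup is a `<posts>` element containing `<post>` elements with `href`, `description`, `tag` (space-separated), and `time` attributes, which is how I recall the API output looking; treat those names and the sample data as assumptions and check them against your own file.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of a saved posts/all response.
SAMPLE = """<posts user="example">
  <post href="http://example.com/a" description="Example A"
        tag="search pagerank" time="2005-12-09T10:00:00Z"/>
  <post href="http://example.com/b" description="Example B"
        tag="finance" time="2005-12-10T11:30:00Z"/>
</posts>
"""


def parse_backup(xml_text):
    """Turn a saved bookmarks XML document into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    bookmarks = []
    for post in root.iter("post"):
        bookmarks.append({
            "url": post.get("href"),
            "title": post.get("description", ""),
            # Tags are assumed to be space-separated in a single attribute.
            "tags": post.get("tag", "").split(),
            "time": post.get("time"),
        })
    return bookmarks
```

From a list of dicts like this it is a short step to writing out whatever import format another bookmarking tool expects.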
The next step is to decide what you want to do with the data. Some alternative tagged bookmarking solutions include:
The following services are based on open source projects, so you can (or in some cases have to) run your own bookmarking system.
Yahoo already runs MyWeb2.0, which presumably will begin to merge with del.icio.us at some point. It has a lot of interesting features, but hasn’t had enough to get me to switch over up to this point. I’ve been wanting private bookmarks and tags on del.icio.us for a while, although I think I’ll be moving those off my desktop onto a roll-your-own server solution.
Any more suggestions? Reply in the comments and I’ll pull them up to the main post.
Here’s an extensive list of free bookmark managers at lights.com (via David Beisel)
This week’s Newsweek (December 12, 2005) features an article on white hat vs black hat search engine optimization. Among other things, it’s interesting that the topic has made it into the mainstream media.
A “black hat” anecdote:
Using an illicit software program he downloaded from the Net, he forcibly injected a link to his own private-detectives referral site onto the site of Long Island’s Stony Brook University. Most search engines give a higher value to a link on a reputable university site.
The site in question appears to be “www.private-detectives.org”, still currently #1 at MSN and #4 at Yahoo for searches on “private detectives”. It appears to have been sandboxed on Google.
Another interesting post at Seomoz features comments from “randfish” and “EarlGrey”, the two SEO consultants interviewed by Newsweek on the merits of “White Hat” vs “Black Hat” search engine optimization, and gives further perspective on the motivation and outlook of the two approaches.
In some ways one can think of the difference between search engine optimization approaches as a “trading” approach vs a “building” approach to investment. The “Black Hat” approach articulated in the Seomoz article tends to focus purely on a tactical present cash return to the operator, while the “White Hat” approach presumes that the operator will realize ongoing future value by developing a useful information asset and making it visible to the search engines. This makes an implicit assumption that the site itself offers some unique and valuable information content, which can’t usually be the case in the long run.
From an information retrieval point of view, I’m obviously in the latter camp of thinking that identifying the most relevant results for the search user is a good thing. However, the black hat approach makes perfect sense if you consider it in terms of optimizing the short term value return to the publisher (cash as information), while possibly still presenting a useable information return to the search user. This is especially the case for commodity information or products, in which the actual information or goods are identical, such as affiliate sales.
I’m a little curious about the link from Stony Brook University. I took a quick look but wasn’t able to turn up a backlink. One of the problems with simply relying on trusted link sources is that they can be gamed, corrupted, or hacked.
See also: A reading list on PageRank and search algorithms
Update 12-12-2005 00:30 PST: Lots of comments on Matt Cutts’ post, plus Slashdot
Yahoo continues down the path of more tagging and more collaborative content. Having already purchased Flickr, this morning they’re acquiring del.icio.us (terms undisclosed):
From Joshua Schachter at the del.icio.us blog:
We’re proud to announce that del.icio.us has joined the Yahoo! family. Together we’ll continue to improve how people discover, remember and share on the Internet, with a big emphasis on the power of community. We’re excited to be working with the Yahoo! Search team – they definitely get social systems and their potential to change the web. (We’re also excited to be joining our fraternal twin Flickr!)
From Jeremy Zawodny at Yahoo Search Blog:
And just like we’ve done with Flickr, we plan to give del.icio.us the resources, support, and room it needs to continue growing the service and community. Finally, don’t be surprised if you see My Web and del.icio.us borrow a few ideas from each other in the future.
From Lisa McMillan, an enthusiastic user of all 3 services (comment on the del.icio.us blog):
Yahoo that’s delicious! I live here. I live in flickr. I live at yahoo. This is insane. You deserve this success dude. Just please g-d don’t let me lose my bookmarks I’m practically my own search engine. LOL
Tagged bookmarking sites such as del.icio.us can provide a rich source of input data for developing contextual and topical search. The early adopters that have used del.icio.us up to this point are unlikely to bookmark spam or very uninteresting pages, and the aggregate set of bookmarks and tags is likely to expose clustering of links and related tags which can be used to refine search results by improving estimates of user intent. Individuals are becoming their own search engine in a very personal, narrow way, which could be coupled to general purpose search engines such as Yahoo or Google.
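As a rough illustration of the kind of clustering signal I have in mind, here is a sketch that counts how often pairs of tags appear on the same bookmark and uses those counts to suggest related tags. The function names and the input shape (one list of tags per bookmark) are made up for this example, and a real system would need normalization and far more data.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(bookmarks):
    """Count how often each pair of tags appears on the same bookmark.

    bookmarks is an iterable of tag lists, one list per bookmark.
    Pairs are stored as alphabetically ordered tuples.
    """
    counts = Counter()
    for tags in bookmarks:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def related_tags(counts, tag, n=3):
    """Return up to n tags that most often co-occur with the given tag."""
    related = Counter()
    for (a, b), c in counts.items():
        if a == tag:
            related[b] += c
        elif b == tag:
            related[a] += c
    return [t for t, _ in related.most_common(n)]
```

A search engine could use something like `related_tags` to guess at intent, for example by noticing which other tags cluster around an ambiguous query term.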
I think Google needs to identify resources it can use to incorporate more user feedback into search results. Looking over the users’ shoulders via AdSense is interesting but inadequate on its own because there are a lot of sites that will never be AdSense publishers. Explicit input capturing the user’s intent, whether through tagging, voting, posting, or publishing, is a strong indication of relevance and interest by that user. I think the basic Google philosophy of letting the algorithm do everything is much more scalable, but it looks like time to capture more human input into the algorithms.
In a recent post, I pointed out some work at Yahoo on computing conditional search ranking based on user intent. The range of topics on del.icio.us tends to be predictably biased, but for the areas that it covers well, I’d be looking for some opportunities to improve search results based on what humans thought was interesting. As far as I know, Google doesn’t have any assets in this space. Maybe Blogger or Orkut, but those are very noisy inputs.
This seems like a great move by Yahoo on multiple fronts, and I am very interested to see how this plays out.
Update 12-12-2005 12:30 PST: No hard numbers, but something like $10-15MM with earnouts looks plausible. More posts, analysis, and reader comments: Om Malik, John Battelle, Paul Kedrosky.
Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.
On the probabilities of transitioning across a link in the link graph, the paper’s example on pp. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead including one derived from “web usage logs”.
Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition probabilities and the personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than what’s relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results for “PageRank”. So I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied to a significantly reduced set of relevant pages.
2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.
3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.
4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
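To make the idea concrete, here is a minimal sketch of PageRank by power iteration in which both the link-following probabilities and the personalization vector are supplied as data rather than assumed uniform. This is my own toy illustration, not code from the Langville and Meyer paper; the function name, the click counts, and the 0.85 damping factor are all assumptions chosen for the example.

```python
def pagerank(links, personalization, damping=0.85, iterations=100):
    """Power iteration with weighted links and a custom teleport vector.

    links: {page: {target: weight}}; weights could be click counts (assumed).
    personalization: {page: probability}; values must sum to 1.
    """
    pages = list(personalization)
    rank = dict(personalization)  # start from the teleport distribution
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page in pages:
            out = links.get(page, {})
            total = sum(out.values())
            if total == 0:
                # Dangling page: hand its rank out via the teleport vector.
                for p in pages:
                    new_rank[p] += damping * rank[page] * personalization[p]
            else:
                # Follow each link in proportion to its observed weight.
                for target, weight in out.items():
                    new_rank[target] += damping * rank[page] * weight / total
        # Random jump, biased by the personalization vector.
        for p in pages:
            new_rank[p] += (1.0 - damping) * personalization[p]
        rank = new_rank
    return rank


# Hypothetical usage data: readers of "a" click through to "b" nine times in ten.
clicks = {"a": {"b": 9, "c": 1}, "b": {"a": 1}, "c": {"a": 1}}
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
ranks = pagerank(clicks, uniform)
```

With these made-up numbers, "b" ends up ranked above "c" even though both have exactly one incoming link, which is the effect the usage-weighted model is meant to capture. Swapping `uniform` for a topic-biased vector would give the conditional ranking discussed in point 3.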
Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:
- Individual users’ interests and search intent change over time
- The click and viewing data available to do the personalization is limited
- Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
- Computers are often shared among multiple users with varying intent
- Queries are too short to accurately infer intent
Vivisimo (Clusty) is taking an approach in which groups of search results are clustered together and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results which they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.
Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments observing the absence of structure (in the database sense) on the web a couple of days ago. Having structured data works really well if there is a well defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.
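As a rough sketch of what that “database-like” query could look like once events are published as structured records, here is the Palo Alto example written as a simple filter. The `Event` fields, the sample events, the coordinates, the 25 km radius, and the year are all hypothetical; a real microformat parser would have to supply the records.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt


@dataclass
class Event:
    title: str
    category: str
    day: date
    lat: float
    lon: float


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def find_events(events, category, lat, lon, radius_km, start, end):
    """Filter by publisher-declared category, date window, and distance."""
    return [
        e for e in events
        if e.category == category
        and start <= e.day <= end
        and distance_km(lat, lon, e.lat, e.lon) <= radius_km
    ]


# Made-up sample records standing in for parsed event microformats.
events = [
    Event("Jazz night", "music", date(2005, 12, 5), 37.44, -122.16),
    Event("Book reading", "books", date(2005, 12, 6), 37.44, -122.16),
    Event("Rock show", "music", date(2005, 12, 10), 37.77, -122.42),
    Event("Late concert", "music", date(2005, 12, 20), 37.44, -122.16),
]

# "Music events near Palo Alto between December 1 and December 15."
matches = find_events(events, "music", 37.4419, -122.1430, 25.0,
                      date(2005, 12, 1), date(2005, 12, 15))
```

Only the first record matches here: the others fail on category, distance, or date. None of this is possible against unstructured pages without guessing at what the publisher meant.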
I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).
The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but not solve, the task of inferring search intent.
I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.
Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to a search engine to provide feedback, i.e. “this is spam”, or “this was useful”, and I were able to aggregate my ratings with other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” search results to the equivalent of the Google “supplemental results” index.
The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in the systems is likely to go up considerably, making them less useful as collaborative filters.
I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not, if I happened to come across one, though. That might be a simple place to start.
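A minimal sketch of that aggregation, assuming each rating is a +1 (“this was useful”) or -1 (“this is spam”) vote and that each rater already has a reputation weight from somewhere. The function name, the weights, and the URLs are invented for illustration; how reputation gets assigned is the hard part and is not addressed here.

```python
def classify(ratings, reputation, min_score=0.0):
    """Assign each rated URL to the main or supplemental result set.

    ratings: list of (user, url, vote) with vote +1 (useful) or -1 (spam).
    reputation: {user: weight between 0 and 1}; unknown users count as 0.
    """
    totals = {}
    for user, url, vote in ratings:
        weight = reputation.get(user, 0.0)
        score, seen = totals.get(url, (0.0, 0.0))
        totals[url] = (score + weight * vote, seen + weight)

    tiers = {}
    for url, (score, seen) in totals.items():
        # No reputable votes ("don't know") or a net spam verdict: demote.
        if seen == 0 or score / seen <= min_score:
            tiers[url] = "supplemental"
        else:
            tiers[url] = "main"
    return tiers


# Hypothetical raters: two reputable users and one unknown account.
reputation = {"alice": 0.9, "bob": 0.8}
ratings = [
    ("alice", "good.example", +1),
    ("bob", "good.example", +1),
    ("alice", "spam.example", -1),
    ("spambot", "spam.example", +1),
    ("spambot", "unknown.example", +1),
]
tiers = classify(ratings, reputation)
```

Because the unknown account carries no weight, its votes can neither rescue the spam page nor promote a page nobody reputable has seen, which is the property that keeps link-farm style collusion from working on the ratings.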
Just when I’d started getting a little bored with Google-based pincushion maps du jour, I come across something surprising built on the new Yahoo Maps API:
from Justin’s Rich Media Blog:
With the power of Flash 8, you can customize the Yahoo! Maps on your site to actually blend with the surrounding design of the site or application. Forget about a rectangular maps and default colors of the map tiles. Use ActionScript, or the IDE to add runtime filters to the map tiles themselves.
The radar “scan” is animated to rotate around, while the pirate map telescope also serves as the zoom level slider.
I’ve seen so many Google Maps applications in the past few months that the sheer novelty and utility value of new ways to access data and maps has started to wear off. These demos made me stop to take a look simply because they look so much better than what we’ve gotten used to lately, and are likely to precipitate a wave of interesting new ideas.
I’m ambivalent about requiring Flash as a client technology. It’s really neat, and is deployed on a lot of (but not all) browsers. It’s also somewhat opaque, and chews up a lot of system resources. I can usually tell when I’ve landed on a web page with Flash content somewhere because the fan in my T42 starts spinning up after a few seconds instead of running dead silent.
But in the meantime, this made my day.
Yahoo has a major update to Yahoo Maps this evening, bringing it back on par with Google Maps, and with a full set of web APIs for building mapping applications.
From the Yahoo Maps API overview:
Building Block Components
Several Yahoo! APIs help you create a powerful and useful Yahoo! Maps mashups. Use these together with the Yahoo! Maps APIs to enhance the user experience.
- Geocoding API – Pass in location data by address and receive geocoded (encoded with latitude-longitude) responses.
- Map Image API – Stitch map images together to build your own maps for usage in custom applications, including mobile and offline use.
- Traffic APIs – Build applications that take dynamic traffic report data to help you plan optimal routes and keep on top of your commute using either our REST API or Dynamic RSS Feed.
- Local Search APIs – Query against the Yahoo! Local service, which now returns longitude-latitude with every search result for easy plotting on a map. Also new is the inclusion of ratings from Yahoo! users for each establishment to give added context.
They also spell out their free service restrictions:
The Simple API that displays your map data on the Yahoo! Maps site has no rate limit, though it is limited to non-commercial use. The Yahoo! Maps Embeddable APIs (the Flash and AJAX APIs) are limited to 50,000 queries per IP per day and to non-commercial use. See the specific terms attached to each API for that API’s rate limit. See information on rate limiting.
This restriction is more interesting:
Sensor-Based Location Limit
You may use location data derived from GPS or other location sensing devices in connection with the Yahoo! Maps APIs, provided that such location data is not based on real-time (i.e., less than 6 hours) GPS or any other real-time location sensing device, the GPS or location sensing device that derives the location data cannot automatically (i.e. without human intervention) provide the end user’s location, and any such location data must be uploaded by an end-user (and not you) to the Yahoo! Maps APIs.
So uploading a track log after running or hiking is OK, but doing a live GPS ping from your notebook, PDA, or cell phone to show where you are isn’t? I think this is intended to exclude traffic and fleet tracking applications, but it seems to include geocoded blog maps by accident. I don’t think they’d actually mind that.
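My reading of the three conditions can be written down as a small check. This is only a sketch of how I interpret the quoted terms, not anything Yahoo publishes; the function and parameter names are made up, and the 6 hour window is the only number taken from the text.

```python
from datetime import datetime, timedelta

# "Real-time" is defined in the quoted terms as less than 6 hours old.
REAL_TIME_WINDOW = timedelta(hours=6)


def is_upload_allowed(recorded_at, uploaded_at, uploaded_by_end_user, automatic):
    """Return True if a sensor-derived location appears to satisfy all three conditions."""
    old_enough = (uploaded_at - recorded_at) >= REAL_TIME_WINDOW
    return old_enough and uploaded_by_end_user and not automatic


# A track log from a morning hike, uploaded by hand that evening.
track_log = is_upload_allowed(
    recorded_at=datetime(2005, 11, 5, 9, 0),
    uploaded_at=datetime(2005, 11, 5, 19, 0),
    uploaded_by_end_user=True,
    automatic=False,
)

# A phone pinging its current position every minute.
live_ping = is_upload_allowed(
    recorded_at=datetime(2005, 11, 5, 9, 0),
    uploaded_at=datetime(2005, 11, 5, 9, 1),
    uploaded_by_end_user=True,
    automatic=True,
)
```

Under this reading the track log passes and the live ping fails on two counts, age and automation.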
There are several sample applications to look at. The events map seems nicely done, pulling up locations, images, and events for venues within the search window.
To display appropriate images for events, local event output was sent into the Term Extraction API, then the term vector was given to the Image Search API. The results are often incredibly accurate.
I’ve been meaning to take a look at the Term Extraction service; it looks like it might be a handy tool for building some quick-and-dirty personal meme engines or other filters for wrangling down my ever-growing list of feeds.
Announcement at Yahoo Search Blog
More from TechCrunch, Jeremy Zawodny, Chad Dickerson
Jeremy Zawodny posted a summary of his October search referral statistics, and I thought I’d take a quick look at mine.
Nearly all of the search referrals here come through Google. I also have a relatively large number of “Other”, some of which (I think) are various Chinese search engines.
The gap between Google and Yahoo! is hard to interpret, since it doesn’t come close to matching the publicly available market share numbers. The same is true of the numbers for MSN and AOL. They should be higher.
There are two ways I can think to explain this:
1. People who use Google are more likely to be searching for content that’s on my site.
2. The market share numbers are wrong. Google actually generates more traffic than has been reported and MSN and AOL have been over-estimated.
I suspect that #1 is closer to reality. After all, I most often write about topics that are of interest to an audience that’s more technical than average. And I suspect that crowd skews toward Google in a more dramatic fashion than the general population of Internet users. If that’s true, it would seem to confirm many of the stereotypes about AOL and MSN users.
It looks like my site has even less appeal for a consumer audience than his…