Bookmarks for May 21st from 06:07 to 22:34

These are my links for May 21st from 06:07 to 22:34:

Bookmarks for April 28th from 05:35 to 14:24

These are my links for April 28th from 05:35 to 14:24:

  • Official Google Blog: Adding search power to public data – Interesting. Wonder if the underlying public data sets will eventually become available on Google App Engine as well, sort of like the public data sets available for use with Amazon EC2 applications.
  • MySQL And Search At Craigslist – Jeremy Zawodny's slides on MySQL, Sphinx, and free text search implementation at Craigslist, from last week's MySQL conference.
  • Skew, The Frontend Engineer’s Misery @ Irrational Exuberance – For mashups and the like, the distinction between an FE engineer and a web dev is rather small in terms of technical skills; they are both using the same skillset, they are both interacting with APIs, and so on. However, there are important distinctions between the two: 1. web developers tend to move in small groups or as individuals, whereas FE engineers work in larger groups, 2. web developers tend to design a product on top of an existing backend service (API, etc.), while FE engineers are usually working in parallel with the backend being developed.
  • Study: Twitter Audience Does Not Have A Return Policy – Over 60 percent of people who sign up to use the popular (and tremendously discussed) micro-blogging platform do not return to using it the following month, according to new data released by Nielsen Online. In other words, Twitter currently has just a 40 percent retention rate, up from just 30 percent in previous months, indicating an “I don’t get it” factor among new users that is reminiscent of the similarly over-hyped Second Life from a few years ago.
  • Hey Americans, Appreciate Your Freedom Of Speech : NPR – Firoozeh Dumas on the underappreciated freedoms of speech and expression we have in the US vs journalists and bloggers in Iran.

Bookmarks for April 3rd through April 7th

These are my links for April 3rd through April 7th:

  • Agile Testing: Experiences deploying a large-scale infrastructure in Amazon EC2 – Practical guidance on using cloud computing at EC2. Expect failures, automate deployment, more.
  • joshua’s blog: on url shorteners – Summary by Joshua Schachter (founder of del.icio.us) of the state of URL shorteners (tinyurl, bit.ly, etc.), and issues with third-party redirects, link sharing through Twitter, etc.
  • Control Yourself » status.net coming soon – On status.net, plans for hosting laconi.ca sites, and federating microblogging status networks
  • There must be some way out of here (Scripting News) – Comments on the rise of celebrity accounts on Twitter, increasing spam/noise, and alternative models for laconi.ca and status.net
  • Stochastic Models of User-Contributory Web Sites – Tad Hogg, Kristina Lerman 31 Mar 2009 Abstract: We describe a general stochastic processes-based approach to modeling user-contributory web sites, where users create, rate and share content. These models describe aggregate measures of activity and how they arise from simple models of individual users. This approach provides a tractable method to understand user activity on the web site and how this activity depends on web site design choices, especially the choice of what information about other users' behaviors is shown to each user. We illustrate this modeling approach in the context of user-created content on the news rating site Digg.

Bookmarks for March 9th through March 12th

These are my links for March 9th through March 12th:

Bookmarks for February 18th through February 19th

These are my links for February 18th through February 19th:

Amazon aStore – custom storefronts for Amazon affiliates

Amidst the speculation about the Amazon Unbox video download service, Amazon has quietly launched aStores, a service providing custom online storefronts for Amazon affiliates. (You may not be able to view the link unless you’re an Amazon affiliate.)

aStore by Amazon is a new Associates product that gives you the power to create a professional online store, in minutes and without the need for programming skills, that can be embedded within or linked to from your website.

Here’s a link to their demo store.

You get to pick up to nine “featured items” to put on the home page of the store, choose product categories, and add reviews and editorial content. The shopping cart and fulfillment are handled by Amazon, with standard referral fees going back to the affiliate. There’s a browser-based interface for building a store on the Amazon Affiliates site. The resulting store can be hosted by Amazon or on your own site.

This sort of functionality has been available for a while for those willing and able to customize their site using Amazon’s web services API, but the aStores program will make custom stores broadly accessible to the whole Amazon affiliate base (just in time for the holiday shopping season). I suspect we’ll see an explosion of niche shopping sites in short order, since it looks pretty easy to set one up.

Deconstructing search at Alexa

Wow! Although the basic idea is straightforward, crawling and indexing for a general purpose search engine requires huge resources. Web crawlers are effectively downloading copies of the entire internet over and over, turning them over to indexing applications which scan the contents for structure and meaning.

The sheer scale of the task is a substantial barrier to entry for anyone wanting to develop a new indexing or retrieval application. Some projects have narrowed the problem domain, which can reduce the problem scope to a manageable level, but this announcement from Alexa looks like it may offer an exciting alternative for building new search applications.

John Battelle writes:

Alexa, an Amazon-owned search company started by Bruce Gilliat and Brewster Kahle (and the spider that fuels the Internet Archive), is going to offer its index up to anyone who wants it (details are not up yet, but soon). Alexa has about 5 billion documents in its index – about 100 terabytes of data.

Anyone can also use Alexa’s servers and processing power to mine its index to discover things – perhaps, to outsource the crawl needed to create a vertical search engine, for example. Or maybe to build new kinds of search engines entirely, or …well, whatever creative folks can dream up. And then, anyone can run that new service on Alexa’s (er…Amazon’s) platform, should they wish.

The service will be priced on a usage basis: $1 per CPU hour, $1 per GB stored or uploaded, $1 per 50GB data processed.
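
As a rough back-of-the-envelope sketch, the quoted rates can be turned into a cost estimate. Only the three rates come from the announcement; the function name and the example workload are made up for illustration, and the real billing rules may well differ once the documentation is published.

```python
# Rates as quoted in the announcement; everything else here is illustrative.
DOLLARS_PER_CPU_HOUR = 1.0
DOLLARS_PER_GB_STORED = 1.0      # per GB stored or uploaded
GB_PROCESSED_PER_DOLLAR = 50.0   # $1 per 50 GB of data processed


def estimate_cost(cpu_hours, gb_stored, gb_processed):
    """Return the estimated charge in dollars for a hypothetical job."""
    return (
        cpu_hours * DOLLARS_PER_CPU_HOUR
        + gb_stored * DOLLARS_PER_GB_STORED
        + gb_processed / GB_PROCESSED_PER_DOLLAR
    )


# Hypothetical job: 10 CPU hours, 2 GB of stored results, 500 GB scanned.
print(f"${estimate_cost(10, 2, 500):.2f}")  # prints $22.00
```

By the same arithmetic, a single pass over the full index (about 100 terabytes, or roughly 100,000 GB) would come to something on the order of $2,000 in processing charges, before CPU time and storage.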

There’s no announcement posted on the Alexa or Amazon sites yet; it’s apparently due out overnight. (Updated 12-13-2005 00:25 – the site is up now)

Not every search and retrieval application is necessarily going to fit onto the way Alexa has built their crawler and indexing infrastructure, or onto any other search engine platform, for that matter. But opening up access to more of the platform should make it possible for a lot of new ideas to be tried out quickly without having to build yet another crawler for each project. Up to this point, many search ideas can’t be evaluated without working at one of the major search engines. I suspect most development teams would prefer to get access to Google’s crawl and index data, but I’m certainly looking forward to seeing what’s available at Alexa when they get their documentation online in the morning.

More from Om Malik, TechCrunch, ReadWrite Web

Building better personalized search, filtering spam blogs

Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:

  1. Individual users’ interests and search intent change over time
  2. The click and viewing data available to do the personalization is limited
  3. Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
  4. Computers are often shared among multiple users with varying intent
  5. Queries are too short to accurately infer intent

Vivisimo (Clusty) is taking an approach in which groups of search results are clustered together and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results which they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.

Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments a couple of days ago observing the absence of structure (in the database sense) on the web. Having structured data works really well if there is a well defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.
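
To make the “database-like” query concrete, here is a minimal sketch in Python. The `Event` record, its field names, the sample data, and the year are all assumptions for illustration, not any actual microformat schema, and “near Palo Alto” is simplified to an exact city match.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    """A hypothetical record, as might be extracted from event-calendar markup."""
    title: str
    category: str
    city: str
    day: date


def find_events(events, category, city, start, end):
    """Return events matching the category and city within [start, end]."""
    return [
        e for e in events
        if e.category == category and e.city == city and start <= e.day <= end
    ]


# Made-up sample data for illustration only.
events = [
    Event("Jazz trio", "music", "Palo Alto", date(2005, 12, 3)),
    Event("Book reading", "books", "Palo Alto", date(2005, 12, 5)),
    Event("String quartet", "music", "Palo Alto", date(2005, 12, 20)),
    Event("Rock show", "music", "San Jose", date(2005, 12, 10)),
]

matches = find_events(events, "music", "Palo Alto",
                      date(2005, 12, 1), date(2005, 12, 15))
print([e.title for e in matches])  # prints ['Jazz trio']
```

The point is that once the publisher has declared what each field means, the search becomes a simple filter rather than a guess about intent from free text.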

I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids’ toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).

The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but not solve, the task of inferring search intent.

I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.

Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to a search engine to provide feedback, i.e. “this is spam” or “this was useful”, and I were able to aggregate my ratings with those of other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” search results to the equivalent of the Google “supplemental results” index.
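
A minimal sketch of how that kind of aggregated feedback might be used, in Python. The thresholds, function names, and sample votes are all hypothetical, and the sketch simply assumes the votes come from users already judged “reputable”; deciding who is reputable is the hard part and is not addressed here.

```python
from collections import defaultdict

# Hypothetical thresholds; real values would need tuning.
MIN_VOTES = 2          # fewer votes than this means "don't know"
SPAM_FRACTION = 0.5    # more than this share of spam votes means "known spam"


def classify(votes):
    """Map each URL to 'useful', 'spam', or 'unknown' from (url, is_spam) votes."""
    counts = defaultdict(lambda: [0, 0])  # url -> [spam votes, total votes]
    for url, is_spam in votes:
        counts[url][0] += int(is_spam)
        counts[url][1] += 1
    labels = {}
    for url, (spam, total) in counts.items():
        if total < MIN_VOTES:
            labels[url] = "unknown"
        elif spam / total > SPAM_FRACTION:
            labels[url] = "spam"
        else:
            labels[url] = "useful"
    return labels


def split_results(results, labels):
    """Split ranked results into a main list and a supplemental list."""
    main = [u for u in results if labels.get(u, "unknown") == "useful"]
    supplemental = [u for u in results if labels.get(u, "unknown") != "useful"]
    return main, supplemental


# Made-up votes from users assumed to be reputable.
votes = [
    ("a.example", False), ("a.example", False),
    ("b.example", True), ("b.example", True), ("b.example", False),
    ("c.example", False),
]
labels = classify(votes)
main, supplemental = split_results(
    ["a.example", "b.example", "c.example", "d.example"], labels)
print(main)          # prints ['a.example']
print(supplemental)  # prints ['b.example', 'c.example', 'd.example']
```

Pages with too few votes are treated the same as known spam here, which matches the idea of pushing both into a supplemental index rather than dropping them outright.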

The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in these systems is likely to go up considerably, making them less useful as collaborative filters.

I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not, if I happened to come across one, though. That might be a simple place to start.

Ammazon Mechanikal Truk




Ammazon Mechanikal Truk:

Artificial…um…Real Smart Truk

See also: Amazon Mechanical Turk: Putting Humans in the Loop

(via Turk Lurker)

Amazon – Books by the Page

More Amazon stuff this evening:

Amazon Pages and Amazon Upgrade will provide paid access to books by the page, and the ability to “upgrade” access to the full contents of the book.

Press release:

The first program, Amazon Pages, will “un-bundle” the physical-world experience of buying and reading a book so that customers can simply and inexpensively purchase and read online just the pages they need. For example, an entrepreneur interested in marketing his or her business could purchase the relevant chapters from several best-selling business books.

The second program, Amazon Upgrade, will allow customers to “upgrade” their purchase of a physical book on Amazon.com to include complete online access. For example, a software developer who buys a Java programming book will not only get the physical book delivered to his or her home, but will also get 24×7 Web access to the complete interior text of the book. Buy a cookbook and you will not only have it on your shelf, but also be able to access it anywhere via the Web.

Personally, I like owning actual books, as I find them much easier to read and carry around than a computer or PDA. But something like this would be handy to get at my personal collection while travelling. Plus it might cut back on the volume of books I end up donating to the Palo Alto library.

This shouldn’t affect fiction book sales at all (who wants half a novel?), but could put a dent in sales of some types of reference books.

This seems a little bit like a “book” version of the old mp3.com service. If you owned the CD, they would let you stream the bits from their server. I seem to be slowly reconstructing my own private version of that service in our house, although if disk storage increases quickly enough I may just switch to duplicating the content everywhere.

More comments at TheStreet.com

Amazon Mechanical Turk – Putting Humans in the Loop

I came across a cryptic link to mturk.com on supr.c.ilio.us, asking “Isn’t that how the Matrix came to be?”

Amazon Mechanical Turk provides a web services API for computers to integrate “artificial, artificial intelligence” directly into their processing by making requests of humans. Developers use the Amazon Mechanical Turk web services API to submit tasks to the Amazon Mechanical Turk web site, approve completed tasks, and incorporate the answers into their software applications. To the application, the transaction looks very much like any remote procedure call: the application sends the request, and the service returns the results. In reality, a network of humans fuels this artificial, artificial intelligence by coming to the web site, searching for and completing tasks, and receiving payment for their work.

All software developers need to do is write normal code. The pseudo code below illustrates how simple this can be.

 read (photo);
 photoContainsHuman = callMechanicalTurk(photo);
 if (photoContainsHuman == TRUE) {
   acceptPhoto;
 }
 else {
   rejectPhoto;
 }
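
As a loose illustration, the pseudo code above might look something like this in Python. This is not the actual Mechanical Turk API: `call_mechanical_turk`, `screen_photo`, and the fake worker are hypothetical stand-ins, and the real service's request and response formats are not shown.

```python
def call_mechanical_turk(photo, ask_human):
    """Stand-in for the remote call: hand the task to a human, return the answer.

    `ask_human` is a hypothetical callable representing the human worker.
    """
    return bool(ask_human(photo))


def screen_photo(photo, ask_human):
    """Accept or reject a photo based on the human's answer."""
    photo_contains_human = call_mechanical_turk(photo, ask_human)
    return "accept" if photo_contains_human else "reject"


# A fake "worker" for demonstration: answers come from a lookup table.
fake_answers = {"beach.jpg": True, "empty_room.jpg": False}
print(screen_photo("beach.jpg", fake_answers.get))       # prints accept
print(screen_photo("empty_room.jpg", fake_answers.get))  # prints reject
```

In a real application the call would presumably be asynchronous, since a person may take minutes or hours to respond, but from the caller's side it still reads like an ordinary function call.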

Given the source of the link, I was a little skeptical at first read, but it appears to be a legitimate beta project that just launched yesterday at Amazon. At least, the documentation links point back into Amazon Web Services, and at least one person seems to know someone there.

This is an interesting idea that should find some useful applications. Spammers have supposedly been doing something like this to defeat the image-based Turing tests used to screen comment posting systems, offering access to porn in exchange for solving the puzzles, and there are other anecdotes of using low cost offshore labor for similar tasks. Having a simpler web service interface for finding a human key operator somewhere will probably allow smaller and more experimental applications to emerge.

Update 11-04-2005 08:09 PST – Slashdot, TechDirt, Google Blogoscoped on Mechanical Turk, pointer to BoingBoing on porn puzzles and spam, captcha.net

Alexa Web Information Service

Alexa Web Information Service has been in beta for a year and was officially launched this week.

The Alexa Web Information Service provides the following operations:

  • URL Information – Examples of information that can be accessed are site popularity, related sites, detailed usage/traffic stats, supported character-set/locales, and site contact information. This is most of the data that can be found on the Alexa Web site and in the Alexa toolbar, plus additional information that is being made available for the first time with this release.
  • Web Search – The Web Search operation is a brand new search index based on Alexa’s extensive Web crawl. The search query format is similar to a Google query and allows up to 1,000 results per page.
  • Browse Category – This service returns Web pages and sub-categories within a specified category. The returned URLs are filtered through the Alexa traffic data and then ordered by popularity.
  • Web Map – The Web Map operation gives developers access to links-in and links-out information for all pages in the crawl. For example, given a URL as an input, the service returns a list of all links-in and links-out to or from that URL. This Web map information can be used as inputs to search-engine ranking algorithms such as PageRank and HITS, and for Internet research.
  • Crawl Meta Data – The Crawl Meta Data operation gives developers access to metadata collected in Alexa’s Web Crawl. For example, a developer can get page size, checksum, total links, link text, images, frames, and any JavaScript-embedded URLs for any page in the crawl.

Pricing: the first 10,000 requests per month are free; additional requests are $0.00015 per request ($0.15 for 1,000 requests).
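
To give a feel for how links-out data like that from the Web Map operation could feed a ranking algorithm, here is a small textbook-style PageRank sketch in Python. It does not call the Alexa service; the `links_out` dictionary is made-up data standing in for whatever the service would return, and the damping factor and iteration count are conventional defaults rather than anything from Alexa's documentation.

```python
def pagerank(links_out, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outbound links."""
    pages = set(links_out)
    for targets in links_out.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links_out.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Made-up link data standing in for what a links-out lookup might return.
links_out = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(links_out)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most inbound links (`c.example`) ends up ranked highest, and the page nobody links to (`d.example`) ends up lowest.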

via Paul Kedrosky

Amazon A9 Maps with Block Photo View

A first version of Amazon A9’s photo mapping project is open for business at maps.a9.com.

The block-by-block view is available for selected US metro areas, and provides a street-level view of storefronts, houses, parks, and whatever else happened to be in view when they drove by.

Here are a few sample locations to try:

  • MIT Great Dome
  • 59th Street side of Central Park, New York
  • Union Square, San Francisco

Unfortunately, there’s no easy way to bookmark a location yet, so saving a particular location requires a bit of trial and error on the street address once you come across an interesting view.

via Battelle’s Searchblog

Also, at Search Engine Watch, Gary Price comments on the early coverage of Fargo, North Dakota:

So, why Fargo? A couple of weeks ago A9’s CEO, Udi Manber, told Danny:

“The reason we have Fargo is one of the engineers lives there. He took the equipment home and did the whole place in a day.”

So, I’m thinking that if you don’t have an A9 engineer living in your small town, don’t expect to see Block View imagery anytime soon. (-: