Building better personalized search, filtering spam blogs

Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing five reasons why search personalization won’t work very well. Paraphrasing his list:

  1. Individual users’ interests and search intent change over time
  2. The click and viewing data available to do the personalization is limited
  3. Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
  4. Computers are often shared among multiple users with varying intent
  5. Queries are too short to accurately infer intent

Vivisimo (Clusty) is taking an approach in which search results are clustered into groups and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results that they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.

Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments a couple of days ago observing the absence of structure (in the database sense) on the web. Structured data works really well if there is a well defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.
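A minimal sketch of what such a “database-like” query could look like once event data is structured. The `Event` fields, the sample records, and the year are made up for illustration, and “near Palo Alto” is simplified to an exact city match rather than a real distance search.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    # Hypothetical fields, loosely modeled on what an event microformat might expose.
    title: str
    category: str
    city: str
    start: date


def find_events(events, category, city, first_day, last_day):
    """Return events in the given category and city that start within [first_day, last_day]."""
    return [
        e for e in events
        if e.category == category
        and e.city == city
        and first_day <= e.start <= last_day
    ]


# Made-up sample data; the year is arbitrary.
events = [
    Event("Jazz night", "music", "Palo Alto", date(2005, 12, 3)),
    Event("Book signing", "books", "Palo Alto", date(2005, 12, 5)),
    Event("String quartet", "music", "Palo Alto", date(2005, 12, 20)),
    Event("Indie showcase", "music", "San Jose", date(2005, 12, 10)),
]

matches = find_events(events, "music", "Palo Alto", date(2005, 12, 1), date(2005, 12, 15))
for e in matches:
    print(e.title, e.start)
```

The point is that both sides have declared their intent: the publisher by labeling the fields, and the searcher by naming the category, place, and date range, so no guessing from keywords is needed.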

I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids’ toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).

The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but would not solve, the task of inferring search intent.

I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.
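One simple way to approximate a “suggestions” function over clickstream data is to count which pages are visited together in the same session. This is only a sketch under that assumption, not a description of how Amazon’s feature works; the session data and page names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(sessions):
    """Count, for each page, how often every other page appears in the same session."""
    co = defaultdict(Counter)
    for session in sessions:
        # Use a set so repeat visits within one session are counted once.
        for a, b in permutations(set(session), 2):
            co[a][b] += 1
    return co


def suggest(co, page, k=3):
    """Return up to k pages most often seen alongside the given page."""
    return [p for p, _ in co[page].most_common(k)]


# Hypothetical clickstream sessions, one list of visited pages per session.
sessions = [
    ["jazz-clubs", "concert-listings", "ticket-sales"],
    ["jazz-clubs", "concert-listings"],
    ["jazz-clubs", "concert-listings", "parking-info"],
    ["jazz-clubs", "ticket-sales"],
]

co = build_cooccurrence(sessions)
top_two = suggest(co, "jazz-clubs", k=2)
print(top_two)
```

A real system would also need the “measure of relevant outcomes” mentioned above, since visiting a page together with another is not the same as finding it useful.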

Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to a search engine to provide feedback, i.e. “this is spam”, or “this was useful”, and I were able to aggregate my ratings with those of other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” search results to the equivalent of the Google “supplemental results” index.
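A rough sketch of how that aggregation might work, assuming each vote is weighted by the rater’s reputation and unknown raters count for nothing. The reputation weights, URLs, and the zero threshold are all invented for illustration; how reputation would actually be established is left open.

```python
from collections import defaultdict


def score_urls(ratings, reputation):
    """Sum reputation-weighted votes per URL.

    ratings: iterable of (user, url, vote), where vote is +1 ("useful") or -1 ("spam").
    reputation: dict mapping user -> weight; users not listed get weight 0.
    """
    totals = defaultdict(float)
    for user, url, vote in ratings:
        totals[url] += reputation.get(user, 0.0) * vote
    return dict(totals)


def split_results(results, scores, threshold=0.0):
    """Keep positively rated URLs in the main list; move unrated and spam URLs aside."""
    main = [u for u in results if scores.get(u, 0.0) > threshold]
    supplemental = [u for u in results if scores.get(u, 0.0) <= threshold]
    return main, supplemental


# Hypothetical raters and votes.
reputation = {"alice": 1.0, "bob": 0.8, "spambot": 0.0}
ratings = [
    ("alice", "good.example", 1),
    ("bob", "good.example", 1),
    ("alice", "splog.example", -1),
    ("spambot", "splog.example", 1),
    ("spambot", "splog.example", 1),
]

results = ["splog.example", "good.example", "unknown.example"]
main, supplemental = split_results(results, score_urls(ratings, reputation))
print(main)
print(supplemental)
```

Because the spam account carries no reputation, its repeated “useful” votes do not offset a single “spam” rating from a trusted user, and pages nobody has rated land in the supplemental list alongside known spam.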

The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in these systems is likely to go up considerably, making them less useful as collaborative filters.

I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not, if I happened to come across one, though. That might be a simple place to start.

2 Responses to “Building better personalized search, filtering spam blogs”

  1. Ho John Lee's Weblog Says:

    A reading list on PageRank and search algorithms

    If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth…

  2. Greg Linden Says:

    Hi, Ho John Lee. You mentioned that “a similar [suggestions] clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.”

    You might be interested in seeing Findory.com if you haven’t seen it already. That’s essentially what I am trying to do with Findory, trying to demonstrate how personalization techniques could be applied to helping people discover relevant information.

    © 2004-2008 Ho John Lee