More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who likes playing EA Sports Baseball 2006, is a White Sox fan, and was planning on going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.

#1521 likes the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.
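For anyone who wants to look at these clusters directly, here is a minimal sketch of grouping queries by anonymized user ID. It assumes the files are tab-separated with the columns AnonID, Query, QueryTime, ItemRank, and ClickURL. The function name and the sample rows are made up for illustration and are not taken from the actual data.

```python
import csv
import io
from collections import defaultdict


def queries_by_user(lines):
    """Group query strings by anonymized user ID, in file order.

    Assumes tab-separated rows of the form
    AnonID, Query, QueryTime, ItemRank, ClickURL (an assumption about the layout).
    """
    grouped = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank or short rows and the header line.
        if len(row) < 2 or row[0] == "AnonID":
            continue
        user_id, query = row[0], row[1]
        # A query can appear on several consecutive rows (for example, one
        # row per click), so collapse consecutive repeats into one entry.
        if not grouped[user_id] or grouped[user_id][-1] != query:
            grouped[user_id].append(query)
    return dict(grouped)


# Made-up sample rows, not real entries from the AOL files.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\twhite sox schedule\t2006-03-01 10:00:00\t\t\n"
    "101\twhite sox schedule\t2006-03-01 10:00:00\t1\thttp://example.com\n"
    "101\tozzfest tickets\t2006-03-02 21:15:00\t\t\n"
    "202\tghost hunting equipment\t2006-03-01 08:30:00\t\t\n"
)

clusters = queries_by_user(io.StringIO(SAMPLE))
for user_id, queries in clusters.items():
    print(user_id, queries)
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` sample. The grouping only shows what was typed, not why, which is the whole problem with reading intent into it.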

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67), observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

Over the coming days, I expect to see more interesting nuggets mined out of the query data, along with some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT - The first online interface for exploring the AOL search query data is up at www.aolsearchdatabase.com (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT - Here’s another online interface at dontdelete.com (via Infectious Greed)

Update Wednesday 08-09-2006 19:14 PDT - The New York Times has a profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database.


One Response to “More on the America Online search query data”

  1. Tom Harrison Says:

    Ho John –

    When I was doing Direct Hit in 1998 (search engine, eventually bought by Ask Jeeves), I recall the first time we sent our search results live on a real search engine (HotBot). Each hit to our servers was a query from theirs, and I spent many hours tailing the logs, just watching queries (and referrers and so on) come in.

    It was in these moments that I realized the great experiment of the Internet was primarily an excellent way to find naked lady pictures. There were many terms indeed that none of us knew (hint: the use of Z as an alternate to S as plural was a code, e.g. “cheatz”). Later, we developed various content recognition systems, including a porn filter. We had an exceptional, if unlikely, engineer on the job: she was in her mid 50’s, had several grown kids, was married to an MIT professor, and was pretty much a wonderful lady, far from the person one might expect to see trolling logs identifying the patterns that coalesced the worst, or at least basest, of human nature.

    At Direct Hit we also embarked on a personalization project. By mining logs we parsed the paths of distinct users (cookied), and tried to identify patterns that could help us make broad generalizations about a user. For example, a person making queries that contained a geographical term, or in a particular language, or containing a frequently occurring term might tell us something about the user which we could then apply back in their subsequent searches to refine the results. In short, exactly what Google is doing in their Personalized Search project now. Bygones.

    The interesting, if preliminary, results of the data mining were that it was hard, as you suggest, to identify meaning from patterns of usage of any specific user. The first problem was simple statistics: while we had millions of queries a day, there were not that many individual users who performed a significant number of queries (hundreds or thousands at most over a few months for the power users), hardly enough. But perhaps it’s different now, 8 years later.

    But today, I look over my query history from Google personalized search, and can easily know the what’s and why’s of the various patterns. But that’s because I am me. It is a little scary to think what someone who does not have the context of … me … might conclude in looking through the same data.

    Likewise, from what we saw in our data mining, we could reasonably conclude that there were some cohorts of data that represented statistically relevant patterns. The problem was not in finding which sets were valid or what the patterns were, but instead what they actually meant or how we might apply this data/information/knowledge. I can now look at my data and tell you what (I think) they meant, but the conclusion of another person observing the same data might be very different. Hell, my observations might be wrong as well!

    That it is often possible to link queries back to a specific person, despite sort-of-anonymous data, makes this even more interesting. I recall seeing a referrer from a click a user made from a link in an email message. It had his email address, the subject of the emails, and of course IP, our cookie, and the query data. I suspect modern email clients are smart enough not to pass this information along these days, but this was only several years ago. I know my sister, for example, is still using an email client from then.

    So, this is certainly a gnarly problem. But there’s certainly some gold in that data. I wonder if we are any different now in looking at this resource than 8 years ago when we took the Internet and used it as a great reservoir of porno?

    Tom

    © 2004-2008 Ho John Lee