More on the America Online search query data
The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:
This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.
Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.
I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what look like homework projects by high school or college students.
In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Paradigm Shift has a list of queries about how to kill your wife.
#479 Looks like a student at Prairie State University who like playing EA Sports Baseball 2006, is a White Sox fan, and was planning going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.
#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.
#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.
#1521 like the free porn.
Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.
Until now, getting a good data set of user query behavior probably meant working for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).
Sav, commenting over at TechCrunch (#67), observes:
The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…
A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.
In the coming days, I expect to see more interesting nuggets mined out of the query data, along with some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs.
Update Wednesday 08-09-2006 19:14 PDT – The New York Times has a profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database.