AOL Research publishes 20 million search queries
More raw data for search engineers and SEOs, and fodder for online privacy debates: AOL Research has released a collection of roughly 20 million search queries, covering all searches done by a randomly selected set of around 500,000 users from March through May 2006.
This should be a great data set to work with if you’re doing research on search engines, but it seems problematic from a privacy perspective. The data is anonymized only in the sense that AOL user names are replaced with a numerical user ID:
The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
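As a rough illustration of what working with records of this shape might look like, here is a minimal Python sketch. I haven't looked at the actual files, so the tab-separated layout, the field order, and the names `QueryRecord` and `parse_record` are all assumptions made for this example, not AOL's documented format.

```python
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One log line, using the five fields named in the announcement."""
    user_id: int
    query: str
    query_time: str
    clicked_rank: Optional[int]            # None when no result was clicked
    destination_domain_url: Optional[str]  # None when no result was clicked


def parse_record(line: str) -> QueryRecord:
    # Assumption: fields are tab-separated and appear in the order
    # {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
    parts = line.rstrip("\r\n").split("\t")
    # Pad short lines so searches without a click still parse.
    parts += [""] * (5 - len(parts))
    user_id, query, query_time, rank, url = parts[:5]
    return QueryRecord(
        user_id=int(user_id),
        query=query,
        query_time=query_time,
        clicked_rank=int(rank) if rank else None,
        destination_domain_url=url or None,
    )
```

A line with no click-through would come back with `clicked_rank` and `destination_domain_url` set to `None`.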
I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally see people accidentally typing user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped to the domain as well, so you won’t be able to see the exact page that resulted in a click-through.
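To make the "query cluster" concern concrete: because every query carries the same numerical user ID, collecting one user's searches into a single timeline takes only a few lines. The sketch below uses invented records and a hypothetical helper name, `queries_by_user`; it assumes timestamps are strings that sort chronologically.

```python
from collections import defaultdict


def queries_by_user(records):
    """Group (user_id, query, query_time) tuples into per-user timelines."""
    clusters = defaultdict(list)
    for user_id, query, query_time in records:
        clusters[user_id].append((query_time, query))
    # Sort each user's queries by time so the sequence reads as a history.
    return {uid: [q for _, q in sorted(items)] for uid, items in clusters.items()}


# Invented sample data, for illustration only.
sample = [
    (1, "pizza delivery springfield", "2006-03-02 18:10:00"),
    (2, "cheap flights", "2006-03-01 09:00:00"),
    (1, "dentist springfield main street", "2006-03-01 08:30:00"),
]

timelines = queries_by_user(sample)
```

Here `timelines[1]` holds both of user 1's queries in time order. Place names, street names and similar details accumulating under one ID are what could make a user identifiable.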
Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.
This is the same kind of data that the DOJ wanted from Google back in March. The ruling in that case allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.
On the search application side, this is a rare look at actual user search behavior, which would otherwise be difficult to obtain without access to a high-traffic search engine or possibly a paid service.
Plentyoffish sees an opportunity for PPC and Adsense spammers:
“Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.”
I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.
More on the privacy angle from SiliconBeat, Zoli Erdos
See also: Coming soon to DVD - 1,146,580,664 common five-word sequences
Update - Sunday 08-06-2006 20:31 PDT - AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords for which users clicked through to a ringtone site, as an example of how this data can be used by SEO and spam marketers. It looks like the privacy issues are going to get the most airtime right now, but I think the keyword click-through data is going to have the most immediate effect.
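I don't know how Markus built his list, but a keyword click-through list of that kind could plausibly be produced with something like the sketch below. The function name `clicked_keywords`, the substring match on the destination domain, and the sample records are all assumptions for illustration.

```python
from collections import Counter


def clicked_keywords(records, domain_fragment):
    """Count queries whose click-through landed on a matching domain.

    records: iterable of (query, destination_domain_url) pairs, where the
    destination is None when the search produced no click.
    """
    counts = Counter()
    for query, destination in records:
        if destination and domain_fragment in destination.lower():
            # Normalize case and whitespace so near-duplicates are merged.
            counts[query.strip().lower()] += 1
    return counts.most_common()


# Invented sample data, for illustration only.
sample = [
    ("free ringtones", "http://www.ringtones.example"),
    ("Free Ringtones ", "http://www.ringtones.example"),
    ("ringtone download", None),
    ("weather", "http://www.weather.example"),
]

top = clicked_keywords(sample, "ringtone")
```

Only queries that actually led to a click on a matching domain are counted, which is what makes the result more useful to a keyword targeter than a plain list of popular queries.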
Update - Monday 08-07-2006 08:02 PDT - Some mirrors of the AOL data
Tags: search, query, data, application, user, personalization, seo, spam, aol, google, privacy, policy, research, datamining

August 7th, 2006 at 6:26 pm
[…] - Digg - AOL Releases Search Logs from 500,000 Users - TechCrunch - AOL Proudly Releases Massive Amounts of Private Data - Greg Hughes - AOL screws the pooch - or at least about 650,000 of their own users - Yardley.ca - You never had privacy anyway - Ho John Lee - AOL Research publishes 20 million search queries […]
August 7th, 2006 at 8:03 pm
More on the America Online search query data
The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew We…
August 12th, 2006 at 3:42 pm
Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..
http://data.aolsearchlogs.com/log/random.cgi