Bookmarks for February 4th through February 11th

These are my links for February 4th through February 11th:

  • Schneier on Security: Interview with a Nigerian Internet Scammer – "We had something called the recovery approach. A few months after the original scam, we would approach the victim again, this time pretending to be from the FBI, or the Nigerian Authorities. The email would tell the victim that we had caught a scammer and had found all of the details of the original scam, and that the money could be recovered. Of course there would be fees involved as well. Victims would often pay up again to try and get their money back."
  • xkcd – Frequency of Strip Versions of Various Games – n = Google hits for "strip <game name>" / Google hits for "<game name>"
  • PeteSearch: How to split up the US – Visualization of social network clusters in the US. "…information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them.

    Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South."

  • Redis: Lightweight key/value Store That Goes the Extra Mile | Linux Magazine – Sort of like memcache. "Calling redis a key/value store doesn’t quite do it justice. It’s better thought of as a “data structures” server that supports several native data types and operations on them. That’s pretty much how creator Salvatore Sanfilippo (known as antirez) describes it in the documentation. Let’s dig in and see how it works."
  • Op-Ed Contributor – Microsoft’s Creative Destruction – NYTimes.com – Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.

Bookmarks for January 30th through February 4th

These are my links for January 30th through February 4th:

  • Leonardo da Vinci’s Resume Explains Why He’s The Renaissance Man For the Job – Davinci – Gizmodo – At one time in history, even da Vinci himself had to pen a resume to explain why he was a qualified applicant. Here's a translation of his letter to the Duke of Milan, delineating his many talents and abilities. "Most Illustrious Lord, Having now sufficiently considered the specimens of all those who proclaim themselves skilled contrivers of instruments of war, and that the invention and operation of the said instruments are nothing different from those in common use: I shall endeavor, without prejudice to any one else, to explain myself to your Excellency, showing your Lordship my secret, and then offering them to your best pleasure and approbation to work with effect at opportune moments on all those things which, in part, shall be briefly noted below." The document, written when da Vinci was 30, is actually more of a cover letter than a resume; he leaves out many of his artistic achievements and instead focuses on what he can provide for the Duke in technologies of war.
  • jsMath: jsMath Home Page – The jsMath package provides a method of including mathematics in HTML pages that works across multiple browsers under Windows, Macintosh OS X, Linux and other flavors of unix. It overcomes a number of the shortcomings of the traditional method of using images to represent mathematics: jsMath uses native fonts, so they resize when you change the size of the text in your browser, they print at the full resolution of your printer, and you don't have to wait for dozens of images to be downloaded in order to see the mathematics in a web page. There are also advantages for web-page authors, as there is no need to preprocess your web pages to generate any images, and the mathematics is entered in TeX form, so it is easy to create and maintain your web pages. Although it works best with the TeX fonts installed, jsMath will fall back on a collection of image-based fonts (which can still be scaled or printed at high resolution) or unicode fonts when the TeX fonts are not available.
  • Josh on the Web » Blog Archive » Abusing the Cache: Tracking Users without Cookies – To track a user I make use of three URLs: the container, which can be any website; a shim file, which contains a unique code; and a tracking page, which stores (and in this case displays) requests. The trick lies in making the browser cache the shim file indefinitely. When the file is requested for the first – and only – time a unique identifier is embedded in the page. The shim embeds the tracking page, passing it the unique ID every time it is loaded. See the source code.

    One neat thing about this method is that JavaScript is not strictly required. It is only used to pass the message and referrer to the tracker. It would probably be possible to replace the iframes with CSS and images to gain JS-free HTTP referrer logging but would lose the ability to store messages so easily.
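
The mechanics described above can be sketched in a few lines. This is a minimal, hypothetical server-side sketch (the `/track` endpoint name and response shape are my own, not from Josh's post): the shim is generated once with a fresh ID and served with headers telling the browser to cache it indefinitely, so every later page view replays the cached copy of the ID without contacting the server again.

```python
import uuid

def make_shim_response():
    """Build the one-time shim: a tiny HTML page with a unique ID baked in,
    served with headers that tell the browser to cache it indefinitely.
    On every later visit the browser replays the cached copy -- and the
    embedded ID -- so the tracking page sees the same ID each time."""
    visitor_id = uuid.uuid4().hex
    body = (
        "<html><body>"
        # The shim embeds the tracking page, passing along the cached ID.
        # "/track" is a hypothetical endpoint name for this sketch.
        f"<iframe src='/track?id={visitor_id}'></iframe>"
        "</body></html>"
    )
    headers = {
        "Content-Type": "text/html",
        # One year, marked immutable: the browser should never re-request it.
        "Cache-Control": "max-age=31536000, immutable",
    }
    return headers, body
```

Clearing the browser cache, of course, resets the ID — which is exactly why the technique is interesting: it survives cookie deletion but not cache deletion.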

  • Panopticlick – Your browser fingerprint appears to be unique among the 342,943 tested so far.

    Currently, we estimate that your browser has a fingerprint that conveys at least 18.39 bits of identifying information.

    The measurements we used to obtain this result are listed below. You can read more about the methodology here, and about some defenses against fingerprinting here.
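
The "bits of identifying information" figure is just a logarithm: a fingerprint shared by 1 in N observed browsers conveys log2(N) bits of surprisal. A quick sanity check against the numbers quoted above:

```python
import math

def surprisal_bits(matching, total):
    """Bits of identifying information for a fingerprint shared by
    `matching` out of `total` observed browsers: -log2(matching/total)."""
    return -math.log2(matching / total)

# A fingerprint unique among the 342,943 browsers tested so far:
bits = surprisal_bits(1, 342943)
print(round(bits, 2))  # 18.39 -- matching the figure quoted above
```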

Bookmarks for January 23rd through January 30th

These are my links for January 23rd through January 30th:


  • Benlog » Don’t Hash Secrets – If I tell you that SHA1(foo) is X, then it turns out in a lot of cases to be quite easy for you to determine what SHA1(foo || bar) is. You don’t need to know what foo is. Because SHA1 is iterative and works block by block, if you know the hash of foo, then you can extend the computation to determine the hash of foo || bar.

    That means that if you know SHA1(secret || message), you can compute SHA1(secret || message || ANYTHING), which is a valid signature for message || ANYTHING. So to break this system, you just need to see one signature from SuperAnnoyingPoke, then you can impersonate SuperAnnoyingPoke for lots of other messages.

    What you should be using is HMAC: Hash-function Message Authentication Code. You don’t need to know exactly how it works; you just need to know that HMAC is specifically built for message authentication codes and the use case of SuperAnnoyingPoke/MyFace. Under the hood, what’s approximately going on is two hashes, with the secret combined after the first hash.
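
The attack surface and the fix are both easy to demonstrate with Python's standard library. A sketch (the secret and message values are invented for illustration): the naive construction hashes secret || message directly, which is exactly the length-extendable form the post warns about, while `hmac` performs the nested, keyed construction.

```python
import hashlib
import hmac

secret = b"supersecret"      # hypothetical key shared by MyFace and SuperAnnoyingPoke
message = b"poke bob"

# Naive MAC: SHA1(secret || message). Anyone who sees this digest can use
# length extension to forge a valid digest for message || ANYTHING,
# without ever learning the secret.
naive_tag = hashlib.sha1(secret + message).hexdigest()

# HMAC: the nested, keyed construction built for exactly this use case.
good_tag = hmac.new(secret, message, hashlib.sha1).hexdigest()

def verify(key, msg, tag):
    """Check a tag with a constant-time comparison (never compare MACs with ==)."""
    expected = hmac.new(key, msg, hashlib.sha1).digest()
    return hmac.compare_digest(expected, bytes.fromhex(tag))

print(verify(secret, message, good_tag))        # True
print(verify(secret, b"poke bob!!", good_tag))  # False
```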

  • Data.gov – Featured Datasets: Open Government Directive Agency – Datasets required under the Open Government Directive through the end of the day, January 22, 2010. Freedom of Information Act request logs, Treasury TARP and derivative activity logs, crime, income, agriculture datasets.

Bookmarks for December 31st through January 17th

These are my links for December 31st through January 17th:

  • Khan Academy – The Khan Academy is a not-for-profit organization with the mission of providing a high quality education to anyone, anywhere.

    We have 1000+ videos on YouTube covering everything from basic arithmetic and algebra to differential equations, physics, chemistry, biology and finance which have been recorded by Salman Khan.

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al.: The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in an update of the 1988 volume, eds. Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November 2003; see Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment, 8th International Command and Control Research and Technology Symposium, National Defense University, Washington, D.C., June 17-19, 2003.
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
      o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
      o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
      o Well, maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields: Agriculture, Chemistry, Crime, Demography, Ecology, Finance, Health, Hydrology, Industry, Labour Market, Macro-Economics, Meteorology, Micro-Economics, Miscellaneous, Physics, Production, Sales, Simulated series, Sport, Transport & Tourism, Tree-rings, Utilities."
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages ("tweets") posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft –

    void Rlyeh(int mene[], int wgah, int nagl)
    {
        int Ia, fhtagn;

        if (wgah >= nagl)
            return;
        /* swap() exchanges mene[i] and mene[j], as defined in K&R §4.10 */
        swap(mene, wgah, (wgah + nagl) / 2);
        fhtagn = wgah;
        for (Ia = wgah + 1; Ia <= nagl; Ia++)
            if (mene[Ia] < mene[wgah])
                swap(mene, ++fhtagn, Ia);
        swap(mene, wgah, fhtagn);
        Rlyeh(mene, wgah, fhtagn - 1);
        Rlyeh(mene, fhtagn + 1, nagl);
    }   // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art. There is a great paper and talk about personalization in Google News: they use content to form topical clusters, and then user click streams to provide personalization, i.e. to recommend specific articles within each topical cluster. The issue is that content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on networks, you might consider clustering users based on co-visitation (they have co-liked some of the documents). This scales well, and it internationalizes well. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph.
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."

Bookmarks for May 30th through May 31st

These are my links for May 30th through May 31st:

Bookmarks for May 19th from 08:04 to 19:24

These are my links for May 19th from 08:04 to 19:24:

  • List of Really Useful Free Tools For JavaScript Developers | W3Avenue
  • When Korean Culture Flourished – WSJ.com – In the geography of the Metropolitan Museum of Art, the gallery devoted to Korea acts as a sort of land bridge between China and South Asia that all too often serves as passage rather than destination. The first in a series of shows to be held over the next 10 to 15 years, "Art of the Korean Renaissance, 1400-1600" may change this. With only 47 objects(!), the exhibition explores a fertile 200-year period in Korea's cultural history, revealing as much through its choice of works as it does through the order in which it displays them. The show's modest size makes the point that, sadly, little has survived from this period, when the Joseon — or Fresh Dawn — dynasty (1392-1910) united the Korean peninsula militarily, established Confucianism as the national ideology and introduced a phonetic alphabet.
  • Axiis : Data Visualization Framework – Axiis provides both pre-built visualization components as well as abstract layout patterns and rendering classes that allow you to create your own unique visualizations. Axiis is built upon the Degrafa graphics framework and Adobe Flex 3.
  • Report: Mint Considers Selling Anonymized Data from Its Users – ReadWriteWeb – A lot of people would be interested in that dataset. Tricky to balance data exposure with consumer privacy.
  • Lendingclub.com: A De-anonymization Walkthrough « 33 Bits of Entropy – Step by step look at de-anonymizing a consumer data set. Given alternate sources, you can fill in a lot of gaps.

Bookmarks for April 12th from 17:02 to 19:13

These are my links for April 12th from 17:02 to 19:13:

Bookmarks for April 11th through April 12th

These are my links for April 11th through April 12th:

  • Wordle – Beautiful Word Clouds – Wordle is a toy for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes.
  • The dark side of Dubai – Johann Hari, Commentators – The Independent – "Dubai was meant to be a Middle-Eastern Shangri-La, a glittering monument to Arab enterprise and western capitalism. But as hard times arrive in the city state that rose from the desert sands, an uglier story is emerging."
  • Topless Robot – Hot Girls Have Lightsaber Strip-Fight for Your Viewing Pleasure – Star Wars CGI meets fake body spray ad
  • Poll Result: Best VPN to leap China’s Great Firewall? – Thomas Crampton
    * Witopia – Undisputed winner. Quality of service, speed of surfing, though it is said to be relatively expensive at US$50 to US$60 per year.
    * Hotspot Shield – Bandwidth limits can be painful. Forces you to wait until the next month if you use it too much.
    * Ultrasurf
    * StrongVPN
  • InfoQ: Facebook: Science and the Social Graph – In this presentation filmed during QCon SF 2008 (November 2008), Aditya Agarwal discusses Facebook’s architecture, more exactly the software stack used, presenting the advantages and disadvantages of its major components: LAMP (PHP, MySQL), Memcache, Thrift, Scribe.
  • The Running Man, Revisited § SEEDMAGAZINE.COM – a handful of scientists think that these ultra-marathoners are using their bodies just as our hominid forebears once did, a theory known as the endurance running hypothesis (ER). ER proponents believe that being able to run for extended lengths of time is an adapted trait, most likely for obtaining food, and was the catalyst that forced Homo erectus to evolve from its apelike ancestors.

Genius, in search of lab coat


Didn’t attend ETech this week, but thanks to a Twitter pointer from Gene Becker, I did take a few breaks to participate in a collaborative future forecasting experiment at the event, organized by Institute For the Future / Signtific Labs. The general idea is to enlist game players to offer Twitter-like short notes with outlier ideas regarding a scenario under discussion, in this case the consequences of inexpensive ($100) 1kg microsatellites (“CubeSats”) capable of high speed networking and remote sensing. The same game framework could be used for any scenario, though. Bonus points are awarded to “Super-Interesting” ideas and ideas that result in additional discussion, which helped me out on the scoreboard.

Gene (“ubik“) won a “Feynman” award on the first day, and I managed to end up with a high score at ETech, thus winning a lab coat to go with my “Genius” label.

Some of my favorite future forecast contributions from “What will you do when space is as cheap and accessible as the Web is today?” (slide summary here):

Jurisdiction-free data haven built with csats full of rad-hard flash memory, hbase-style distributed replication across multiple nodes. Subpoena-proof anonymizers, for better or worse. Alternative, universal internet currency evolves, outside any government’s central bank control. Following forced disclosure of banking client list, Swiss government recognizes anonymous cSat net IDs, followed by Cayman, Bermuda etc.

CSats deorbited in vacant areas of oceans as impulse input to passive sonar imaging. Oceanographers get great maps, submarines lose stealth. Depending on how accurately you can drop a CSat, you can effectively “ping” a region and listen to the return signal through existing arrays. This really messes with strategic deterrence since now subs are vulnerable to first strike. But CSat deorbit is cheap WMD for all. On the positive side, detailed acoustic propagation data leads to new insights on ocean dynamics – bathymetrics, thermoclines, currents, etc. A similar version of dropping CSats on land might yield useful seismic imaging. But these would all be surface impulse, not at depth.

Csat data networks circumvent the Great Firewall of China and other govt access controls, leading to broader/safer citizen engagement online

CSat operating interface is marketed as a toy, like a Tamagotchi. Recharge, collect interesting data, avoid mean csats, team with friends. Organizations might post cash prizes/rewards for things like locating missing ships, oil/trash dumping at sea, smokestack emissions, etc.

Commodity traders are early adopters of CSat operator networks. Looking for crop yield data, mine production volumes, freight shipments etc. Among other things, CSat observations could give a more accurate estimate of “floating” oil parked in tankers as well as ongoing demand. Similarly, you’d get a decent idea of iron ore production by watching BHP’s railway in Australia, and the demand side in China, Korea etc. CSat data could improve the market visibility into supply/demand. But one might start creating Potemkin mining/farming operations etc… Sadly, credit derivative risk is not observable via CSat.

Ubiquitous, near real time satellite surveillance. No more privacy outdoors. But really good Google Maps. Ultra high resolution terrain maps of the world synthesized from multiple satellite passes/viewing aspects. Long term studies of effects of erosion, farming, development, earthquakes, flooding, drought, etc. Insurgents, militias, and terrorists get real time tactical data feeds, make use of homebrew UAVs, sensors, and in-field dispatch from afar. Turf wars among poppy and marijuana growers who now know where each other’s fields are. All vehicles – car, truck, rail, container, airplanes, etc – get a sky-facing ID plate. Maybe these should just be really big QR codes with an authoritative registry to foil car thieves from painting on bogus “plates”.

Now I need to figure out how to collect that lab coat.

What would Bill and Dave Do?

HP has a culture problem.

Put aside for a moment the (probably illegal) methods used to obtain the personal phone records of the HP board members.

Yes, HP’s private detectives were using social engineering and pretexting, but honestly, does it surprise you to hear that a senior executive got carried away trying to identify their secret “enemies”? Didn’t think so.

The surprising part is that this wasn’t an Oracle (sending private investigators out dumpster diving for evidence) or an Apple (filing lawsuits and requesting subpoenas to learn the names of leakers), or some other Valley company built around tightly controlling founders.

The surprising part is that this was at HP, the company formerly known as “Hewlett-Packard”, where by tradition Bill Hewlett left his change on his desk, demonstrating his trust in his co-workers. This is like expecting Gerald Ford and getting Richard Nixon.

HP has been getting its act back together for the past year. Less talking, more doing. This affair won’t have any short term effect on the operation of the company. But how it is resolved (or not) will have a long lasting effect on the internal values of the organization and the external perception of the company by partners, customers, and competitors.

If Patty Dunn worked for Mark Hurd, I think he would be nearly obligated to fire her at this point, or at least move her to the “penalty box” of sidelined executives. However, board directors aren’t exactly employees, and she’s the chair. It’s difficult to fire your boss.

But…would you want to do business with (or work for) a company whose management thinks it’s OK to conduct illegal searches because it thinks you did something it doesn’t like?

What would Bill and Dave do? (After they stop spinning in their graves.)

Yes, I know it’s a vastly different company now. That’s a good thing. This is still wrong.

More at Newsweek, MSNBC, Smoking Gun, TechDirt, Fred’s House, Infectious Greed, Intuitive Life

More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.
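
Hunting for entries like those is straightforward to script. A hedged sketch, assuming the released logs are tab-separated with AnonID and Query as the first two columns (check your copy of the data before relying on this layout; the sample lines below are invented, apart from echoing the well-known Lilburn landscaper query):

```python
import re

# SSN-shaped pattern: three digits, two digits, four digits, dash-separated.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def flag_sensitive(lines):
    """Yield (user_id, query) pairs whose query contains an SSN-shaped number.
    Assumes tab-separated lines with AnonID and Query as the first two fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        user_id, query = fields[0], fields[1]
        if SSN_LIKE.search(query):
            yield user_id, query

# Invented sample rows in the assumed format:
sample = [
    "4417749\tlandscapers in lilburn ga\t2006-03-01 10:34:12\t\t",
    "479\tmy ssn is 123-45-6789\t2006-03-02 11:00:00\t\t",
]
print(list(flag_sensitive(sample)))  # [('479', 'my ssn is 123-45-6789')]
```

The same skeleton works for driver's license patterns, street addresses, or any other regex; grouping rows by AnonID is what turns isolated queries into the per-user profiles people have been posting.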

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who likes playing EA Sports Baseball 2006, is a White Sox fan, and was planning on going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.

#1521 likes the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67) observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

I expect to see more interesting nuggets mined out of the query data, and some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs in the coming days.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT – The first online interface for exploring the AOL search query data is up at www.aolsearchdatabase.com (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT – Here’s another online interface at dontdelete.com (via Infectious Greed)

Update Wednesday 08-09-2006 19:14 PDT – A profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database in the New York Times.

AOL Research publishes 20 million search queries

More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries, comprising all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing in user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
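As a rough sketch of what working with the released logs looks like (the tab-separated layout here is my assumption, not anything AOL documented; the field names are the ones quoted above), grouping queries by the anonymized user ID is all it takes to start forming per-user query clusters:

```python
import csv
from collections import defaultdict

# Field names from the post; the tab-separated layout is an assumption.
FIELDS = ["UserID", "Query", "QueryTime", "ClickedRank", "DestinationDomainUrl"]

def queries_by_user(lines):
    """Group raw query strings by anonymized user ID.

    `lines` is any iterable of tab-separated records (a file object works).
    """
    clusters = defaultdict(list)
    for row in csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t"):
        clusters[row["UserID"]].append(row["Query"])
    return dict(clusters)
```

Even with screen names stripped, each cluster is a browsable profile, which is exactly the privacy problem.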

Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.

Adam D’Angelo says:

This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.

Plentyoffish sees an opportunity for PPC and Adsense spammers:

Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.

I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.

More on the privacy angle from SiliconBeat, Zoli Erdos

See also: Coming soon to DVD – 1,146,580,664 common five-word sequences

Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords which users clicked through to a ringtone site as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.

Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data

Del.icio.us adds private bookmarks

Del.icio.us is testing out private bookmarks now.

I’ve been playing with a private instance of Scuttle ever since del.icio.us was purchased by Yahoo a few months back, but have continued using del.icio.us for posting public links anyway.

My del.icio.us links are automatically posted here (except when one end or the other is out of service for some reason); I don’t know whether that would include the private ones or not. I also don’t know exactly where the private bookmarks might be visible, aside from in one’s own account. I’ll have to give it a try.

BrainJam, December 2005, search, privacy, transparency

Spent a few hours this afternoon at Chris Heuer’s BrainJam event. Wasn’t able to make it to the morning sessions, but arrived in time for the end of lunch and the “youth user panel”, consisting of four college students. They all love Facebook. Not sure how representative they are of the general student demographic, since two of them are trying to put together a web startup. They all use free online music and movie access, mostly through sharing within the dorm networks.

During the Q&A I asked for the panel members’ thoughts on privacy and about having their college lives online in perpetuity. They’re vaguely concerned, but I don’t think the topic is really raising red flags for them. I think the high school and college users have more confidence in Facebook, MySpace, Xanga and others keeping their data private and/or it not making any difference to them in the future as social norms change. Part of it is that people are simply making things up on their pages, for the sake of attracting attention, and part of it is them not caring or not understanding that their web pages, chat transcripts, and even VOIP are mostly staying online forever. I think there are going to be a lot of interesting conflicts in the future as people start running into their past personae 5, 10, 15 years later in a societal context that hasn’t adjusted yet to perpetual transparency.

Afterwards the group broke out into smaller topical discussions. The first session I went to was on the 2-way RSS proposal from Microsoft (Simple Sharing Extensions, SSE). I’m starting to think of SSE as a way for MSFT to use an RSS container for solving the sync problem for applications like Windows Mobile syncing a device and a desktop, or Active Directory performing distributed synchronization of directory data. I’m not really seeing a federated publishing model based on this, an idea that was floated in the conversation. It really feels like it solves an application sync problem for structured data.

The session on “what to do with all the data?” quickly turned into a discussion on privacy, transparency, and DRM. I’m personally disinclined to trust anyone’s DRM system to manage my critical personal data, or to allow anyone to index my private data in a way that eventually gets exposed to the world. One point of view expressed in this discussion was that the world would be better off if everyone just got used to the idea that everything they did was recorded and visible to the world (the Global Panopticon), although I think the majority disagreed that this would actually make people behave better. Personally, I think that documenting everything would break a lot of the ambiguity in relationships and conversations that allows the formation of reasonable opinions, by forcing people into adhering to “statements” and “positions” that were nothing more than passing conversation or exploration of a topic. This was part of my thinking behind asking the college kids about privacy. In real life, there are normally various social transitions that call for stepping away or de-emphasizing some aspects of one’s life, in favor of new ones. It doesn’t make the past behaviors and activities go away, but the combination of search engines and infinite, cheap storage is likely to keep some aspects of these folks’ “past” life in their face for a long time, which may make it harder to move forward.

Someone mentioned the idea of “privacy parity”, i.e. you can ask for my data, but I can see that you’re asking for it, sort of like being able to find out when someone has requested your credit report. This is interesting, but there are substantial asymmetries in the value of that information to each party. A bit of parity that would be very interesting would be a feed of who’s seen my site URLs and excerpts in a search results page — not the clickthrough, which I can already see, but when it’s turned up on the page at all.

A few of us continued a sidebar discussion on search, social networks, trust, and attention networks, and eventually got kicked out into the lobby, where we were free to speculate on Google’s plan for world domination next to a huge globe in the SRI lobby. I haven’t bumped into anyone yet doing work on integrating the attention, social, and trust data into search. Doing this at Google/Yahoo/Microsoft scale looks hard, but I’m getting the sense that doing a custom search engine biased by the social / attention data inputs for a limited subject domain (100s-1000s of GB) and a relatively small social / attention network (1000s of people you know or have heard of) is becoming more reasonable because of cheaper / faster / better IT hardware and because more of the data is actually becoming available now. Still chewing on this. I just came across Danah Boyd’s post on attention networks vs social networks yesterday, which concisely explains the directed vs undirected graph property which underlies part of the ranking algorithms that would be needed.
Perhaps someone’s already done this for a research project.

If Google Desktop were open source, it might be a logical place to insert a modified ranking algorithm based on attention, tags and social networks, and also to insert an SSE-style interface to allow peer-to-peer federation of local search queries and results. This would keep the search index data local to “me” and “my documents”, but allow sharing with other clients that I trust. Perhaps it’s just an age thing. The college kids didn’t seem to mind having all of their documents on public servers, are counting on robots.txt to keep them out of the global search engines, and apparently think that access controls on sites like Facebook will keep their personal postings out of the public realm. For me, I still think twice sometimes about posting to my del.icio.us bookmarks list and keep anything really critical on physical media in a safe deposit box in a vault. So while I’ve gone from being Ungoogleable to Google search stardom, there’s a good portion of my digital life which is “dark matter” to the search engines. I’d like to find a way to fix it for myself, and share information with people I trust, and refine my searches over the public internet, but without having to give Google or anyone else all of my personal data.
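The kind of modified ranking I have in mind could be sketched minimally like this (all names, weights, and data shapes here are invented for illustration, not any real engine’s API): take each result’s base relevance score and boost it by how many people in a small trust network have bookmarked that URL.

```python
# Toy sketch: bias ranking toward URLs endorsed by a small trust network.
# `results` is a list of (url, base_score); `bookmarks` maps user -> set of URLs.

def rerank(results, bookmarks, trusted, boost=0.5):
    """Re-sort results, adding `boost` per trusted user who bookmarked the URL."""
    def score(item):
        url, base = item
        endorsements = sum(1 for u in trusted if url in bookmarks.get(u, set()))
        return base + boost * endorsements
    return sorted(results, key=score, reverse=True)
```

The interesting (hard) part is sourcing the `bookmarks` and `trusted` data without shipping it all to a central index, which is where the SSE-style federation would come in.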

(Photos: youth panel discussion; wrap-up session)

Took a few photos; photos from others will probably turn up tagged with “brainjams”.

Update 12-04-2005 21:15 PST: Audio from the Youth Panel discussion on Chris’s blog
KRON-4 television piece on BrainJams. Looks like I missed the hula hoop part in the morning. I also seem to have mostly missed the non-profit community-oriented discussion, as you can see from my notes. Perhaps that’s what was going on when we got kicked out into the lobby for being too loud…

Better Eavesdropping with Microwaves


Although there’s no working system described in any articles I can find about this, the patent application that goes with this is filed on behalf of NASA, so it might not be total vaporware.

From Audio DesignLine:

At last, you think that you have a secure room for conversations. No windows to bounce laser beams off as a means to eavesdrop. The doors are sealed and air tight. But don’t rest too easy. Now there’s a new way of snooping using Gigahertz waves.

Reflected electromagnetic signals can be used to detect audible sound. Electromagnetic radiation reflected by a vibrating object includes an amplitude modulated component that represents the object’s vibrations. The new audio interception method works by illuminating an object with an RF beam that does not include any amplitude modulation. Reflections of the RF beam include amplitude modulation that provide information about vibrations or movements of the object. Audio information can be extracted from the amplitude modulated information and used to reproduce any sound pressure waves striking the object. Interestingly enough, the object can be something as unlikely as a piece of clothing. Thus, something as intensely personal as your heart beat can be intercepted by reflected RF waves in addition to audio sounds.
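The amplitude-modulation principle described in the quote can be sketched as a numerical toy (all the frequencies, rates, and window sizes below are arbitrary illustrative choices, not anything from the patent): a vibrating surface scales the reflected carrier, and the audio comes back out as the envelope, recovered here by rectifying and averaging the carrier away.

```python
import math

def reflect(audio, carrier_hz=1000.0, rate=16000.0):
    """Simulate the reflected beam: an unmodulated carrier scaled by (1 + vibration)."""
    return [(1.0 + a) * math.cos(2 * math.pi * carrier_hz * n / rate)
            for n, a in enumerate(audio)]

def envelope(signal, window=32):
    """Crude AM demodulation: rectify, then moving-average out the carrier."""
    rect = [abs(s) for s in signal]
    return [sum(rect[max(0, n - window):n + 1]) / (n + 1 - max(0, n - window))
            for n in range(len(rect))]
```

A louder vibration produces a proportionally larger envelope, which is all the eavesdropper needs: the envelope *is* the sound pressure wave striking the object.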

More from New Scientist, discussion at Slashdot, Bruce Schneier (see comments)

Decoding the hidden ID tracker in your printer output


via BoingBoing:

Many color laser printers hide information about your printer’s serial number and the date and time of your print job in every job you print. It’s believed that this is done to get your equipment to incriminate you without your knowledge. Now EFF has decoded the information-hiding scheme on the Xerox Docucolor series, by getting EFF supporters to print out pages from their printers and mail them to our researchers, who examined them under magnification and special light and cracked the code.

EFF: Is Your Printer Spying On You?:

Imagine that every time you printed a document, it automatically included a secret code that could be used to identify the printer – and potentially, the person who used it. Sounds like something from an episode of “Alias,” right?

Unfortunately, the scenario isn’t fictional. In a purported effort to identify counterfeiters, the US government has succeeded in persuading some color laser printer manufacturers to encode each page with identifying information.

They have a longer discussion and an online pattern decoder for reading the tracking output from a Xerox Docucolor 12 on the EFF site.

Update 10-29-2005 21:10 PDT – EFF has a list of printers which include visible tracking.

Ungoogleable to #1 in six months

Despite being online for a very long time by today’s standards (~1980), I have been difficult to find in search engines until fairly recently.

There are basically four reasons for this:

  1. The components of my name, “Ho”, “John”, and “Lee” are all short and common in several different contexts, so there are a vast number of indexed documents with those components.
  2. Papers I’ve published are listed under “Lee, H.J.” or something similar, lumping them together with the thousands of other Korean “Lee, H.J.”s. Something like 14% of all Koreans have the “Lee” surname, and “Ho” and “Lee” are both common surnames in Chinese as well. Various misspellings, manglings and transcriptions mean that old papers don’t turn up in searches even when they do eventually make it online.
  3. Much of the work that I’ve done resides behind various corporate firewalls, and is unlikely to be indexed, ever. A fair amount of it is on actual paper, and not digitized at all.
  4. I’ve generally been conscious that everything going into the public space gets recorded or logged somewhere, so even back in the Usenet days I have tended to stay on private networks and e-mail lists rather than posting everything to “world”.

Searching for “Ho John Lee” (no quotes) at the beginning of 2005 would have gotten you a page full of John Lee Hooker and Wen Ho Lee articles. Click here for an approximation. With quotes, you would have seen a few citations here and there from print media working its way online, along with miscellaneous RFCs.

Among various informal objectives for starting a public web site, one was to make myself findable again, especially for people I know but haven’t stayed in contact with. After roughly six months, I’m now the top search result for my name, on all search engines.

As Steve Martin says in The Jerk (upon seeing his name in the phone book for the first time), “That really makes me somebody! Things are going to start happening to me now…”

Wired this month on people who are Ungoogleable:

As the internet makes greater inroads into everyday life, more people are finding they’re leaving an accidental trail of digital bread crumbs on the web — where Google’s merciless crawlers vacuum them up and regurgitate them for anyone who cares to type in a name. Our growing Googleability has already changed the face of dating and hiring, and has become a real concern to spousal-abuse victims and others with life-and-death privacy needs.

But despite Google’s inarguable power to dredge up information, some people have succeeded — either by luck, conscious effort or both — in avoiding the search engine’s all-seeing eye.

Korea’s plans for Ubicomp City

Korea has amazingly high penetration rates for broadband and cellular service. It’s cheap, fast, and widely available, and has been for several years now. This has made Korea a lead market for trying out new wireless and online services. Streaming broadcast and video-on-demand for all national networks is the norm. Next up: building a centrally planned, wired city called New Songdo, which will implement many of the ubiquitous / pervasive computing ideas that have been floating around for a while but never attempted at this scale:

New York Times:

A ubiquitous city is where all major information systems (residential, medical, business, governmental and the like) share data, and computers are built into the houses, streets and office buildings. New Songdo, located on a man-made island of nearly 1,500 acres off the Incheon coast about 40 miles from Seoul, is rising from the ground up as a U-city.

In the West, ubiquitous computing is a controversial idea that raises privacy concerns and the specter of a surveillance society. (They’ll know whether I recycled my Coke bottle?!) But in Asia the concept is viewed as an opportunity to show off technological prowess and attract foreign investment.

“New Songdo sounds like it will be one big Petri dish for understanding how people want to use technology,” said B. J. Fogg, the director of the Persuasive Technology Lab at Stanford University.

If so, it is an experiment much easier to do in Asia than in the West.

“Much of this technology was developed in U.S. research labs, but there are fewer social and regulatory obstacles to implementing them in Korea,” said Mr. Townsend, who consulted on Seoul’s own U-city plan, known as Digital Media City. “There is an historical expectation of less privacy. Korea is willing to put off the hard questions to take the early lead and set standards.”

I think projects like these are going to need something like the AttentionTrust Recorder, or at least an OFF button, to let people see what’s being monitored about themselves and to manage how the information is made available. Without it, this might be a really cool place to visit but not somewhere you’d want to live.

(via TechDirt)

Tagging and Searching: How transparent do you want to be?

This note captures some thoughts in progress, feel free to chip in with your comments…

Here’s a feature wish list for link tagging:

  • Private-only links – only I can see them at all
  • Group-only links – only members of the group can see them
  • Group-only tags – only members of the group can see my application of a set of tags
  • Unattributed links – link counts and tags are visible to the public, but not the contributor or comments
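One hypothetical data model for this wish list (every name here is invented for illustration, not any real service’s API): a visibility level per bookmark, with group membership checked for group-only links and the contributor hidden for unattributed ones.

```python
from dataclasses import dataclass, field

# Hypothetical visibility levels matching the wish list above.
PRIVATE, GROUP, UNATTRIBUTED, PUBLIC = "private", "group", "unattributed", "public"

@dataclass
class Bookmark:
    owner: str
    url: str
    tags: set
    visibility: str = PUBLIC
    group: set = field(default_factory=set)  # members, for GROUP visibility

def visible_to(bm, viewer):
    """Can `viewer` see this bookmark at all?"""
    if bm.visibility == PRIVATE:
        return viewer == bm.owner
    if bm.visibility == GROUP:
        return viewer == bm.owner or viewer in bm.group
    return True  # public and unattributed links are visible to everyone

def attributed_owner(bm, viewer):
    """Owner shown to `viewer`; unattributed links hide the contributor."""
    if bm.visibility == UNATTRIBUTED and viewer != bm.owner:
        return None
    return bm.owner
```

Group-only *tags* (the third wish) would need the same check applied per-tag rather than per-bookmark, but the shape of the model is the same.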

Tagged bookmarking services such as del.icio.us allow individuals to save and organize their own collection of web links, along with user-defined short descriptions and tags. This is already convenient for the individual user, but the interesting part comes from being able to search the entire universe of saved bookmarks by user-defined tags as an alternative or adjunct to conventional search engines.

Bits of collective wisdom embodied in a community can be captured through aggregating user actions representing their attention, i.e. the click streams, bookmarks, tags, and other incremental choices that are incidental to whatever they happened to be doing online. The results of a tag search are typically much smaller, but are often more focused or topically relevant than a search on Google or Yahoo.

It’s also interesting to browse the bookmarks of other people who have tagged or saved similar items. To some extent the bookmark and tag collection can be treated as a proxy for that person’s set of interests and attention.

In a similar fashion, clicking on a link (or actually purchasing an item) can be treated as an indication of interest. This is part of what makes Google Adsense, Yahoo Publisher Network, and Amazon’s Recommendations work. The individual decisions are incidental to any one person’s experience, and taken on their own have little value, but can be combined to form information sets which are mutually beneficial to the individual and the aggregator. Web 2.0 thrives on the sharing of “privately useless but socially valuable” information, the contribution of individuals toward a shared good.
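A minimal illustration of that aggregation (the data shapes are invented): each user’s saved (url, tags) pairs are individually trivial, but counting them across all users yields a ranked answer for a tag query.

```python
from collections import Counter

def rank_for_tag(tag, user_bookmarks):
    """Rank URLs for a tag by how many saved bookmarks carry that tag.

    `user_bookmarks` maps user -> list of (url, set_of_tags) pairs.
    """
    counts = Counter(url
                     for marks in user_bookmarks.values()
                     for url, tags in marks
                     if tag in tags)
    return [url for url, _ in counts.most_common()]
```

This is the “privately useless, socially valuable” exchange in miniature: no single user’s list means much, but the aggregate is a usable search index.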

In the case of bookmarking services, the exchange of value is: I get a convenient way to save my links, and del.icio.us gets my link and tag data to be shared with other users.

One problem I run into regularly is that everything is public on del.icio.us. For most links I add, I am happy to share them, along with the fact that I looked at them, cared to save them, and any comments and tags I might add. Del.icio.us starts out with the assumption that everyone who bookmarked something there would want to share. As I use it more regularly, though, I sometimes find situations where I want to save something, but not necessarily in public. Typically I either

      a) don’t want to make the URL visible to the public, or
      b) don’t mind sharing the link, but don’t want to leave a detailed trail open to the public.

The first case, in which I’d like to save a link for my private use, is arguably just private information and shouldn’t actually be in a “social bookmarks” system to begin with. However, there is a social variant of the private link, which is when I’d like to share my link data with a group, but not all users. This might be people such as members of a project team, or family or friends. It’s analogous to the various photo sharing models, in which photos are typically shared to the public, or with varying systems of restrictions.

The second case, in which I’m willing to share my link data, but would like to do so without attribution, is interesting. In thinking about my link bookmarking, I find that I’m actually willing to share my link, and possibly my tag and comment data, but don’t want to have someone browse my bookmark list and find the aggregated collection there, as it probably introduces too much transparency into what I’m working on. At some point in time, it’s also likely that I would be happy to make the link data fully visible, tags, comments, and all, perhaps after some project or activity is completed and the presence of that information is no longer as sensitive.

The feature wish list above would address some of the not-quite-public link data problems, while continuing to accrete community contributed data. In the meantime, I’m still accumulating links back behind the firewall.

Another useful change to existing systems would be to aggregate tag or search results based on a selected set of users to improve relevance. This is along the lines of Memeorandum, which uses a selected set of more-authoritative blogs as a starting point to gauge relevance of blog posts. In the tagged search case, it would be interesting if I could select a number of people as “better” or “more relevant” at generating useful links, and return search results with ranking biased toward search nodes that were in the neighborhood of links that were tagged by my preferred community of taggers.

It’s possible to subscribe to specific tags or users on del.icio.us, but what I had in mind was more like being able to tag the users as “favorites” or by topic and then rank my search results based on their link and tag neighborhoods. I don’t actually want to look at all of their bookmarks all the time.

Something similar might also work with search result page clickthroughs. These sorts of approaches seem attractive, but also seem too messy to scale very well.

Unattributed links may be too vulnerable to spamming to be useful. One possible fix could be to filter unattributed links based on the authority of the source, without disclosing the source to the public.

I was at the Techcrunch meetup last night, didn’t have a chance to talk with the del.icio.us folks who were apparently around somewhere, but Ofer Ben-Shachar from Raw Sugar did mention that they were looking at providing some sort of group-only access option for their tagging system.

A lot of this could be hacked onto the existing systems to solve the end user problem easily, but some of the initial approaches that come to mind start to break the social value creation, and I think those could be preserved while making better provisions for “private” or “group” restrictions by working on the platform side.

Cell phone tracking service

An interesting thread on Google Answers, regarding what services are available to track the current location of a cell phone. (via del.icio.us).

For about $200.00 ICU, Inc. offers to locate a cellular telephone by pinging the phone – a kind of triangulation process similar to the one I mentioned earlier. Ms. Landers explained that the cell phone appears as a “blip” on a screen. They provide the service 24 hours a day, 7 days a week in order to help locate missing persons, fugitives, cheating spouses, etc. They regularly serve bondsmen, authorities, investigators and many others. You will receive the results within 7 to 10 minutes of a successfully completed ping that will indicate within approximately 50 feet, where the phone was located at the time of the ping.

I.C.U. Inc.
http://www.tracerservices.com/cpl.htm
http://www.tracerservices.com/cplfaqs.htm

Aside from the cell phone tracing, the list of services on the I.C.U. Inc web site makes for fascinating reading.

Update: 08-15-2005 23:59 – Came across the CellTrack project, which is developing a free, open source cell phone tracking system (presently for GSM). It requires installing a client application on the phone, however, so it’s not useful for finding someone who doesn’t want to be found. (screenshots here)

Also came across this paranoia-inducing clip at Instapundit:

THEY CAN HEAR YOU NOW: When I was in Beirut in April one of the leaders of the Cedar Revolution, Nabil Abou-Charaf, told me that Syrian intelligence agents used cell phones to “spy” on people.

“You mean they monitor your phone conversations,” I said.

“No,” he said. “They can listen to us all the time even when we’re not using the phone.” He could tell I didn’t believe him. “We know as a fact they can do this.”

Still, I didn’t believe what he said about spies using his cell phone as a bug. If the cell phone is off or just sitting there it isn’t transmitting a signal.

Looks like I was wrong. Julian Sanchez at Hit and Run points out this chilling excerpt from a story in last week’s Guardian.

The main means of tracking terrorist suspects down has been the monitoring of mobile phone conversations. Not only can operators pinpoint users to within yards of their location by “triangulating” the signals from three base stations, but – according to a report in the Financial Times – the operators (under instructions from the authorities) can remotely install software onto a handset to activate the microphone even when the user is not making a call.
I’m sure the police love this feature. Police states apparently love it, as well.
