Slides from the Social Graph Symposium panel

Some introductory slides from a panel session at the Social Graph Symposium.

Social Graph Symposium Panel – May 2010 – Presentation Transcript

1. Social Graph Symposium Panel
Ho John Lee | Principal Program Manager | Bing Social Search
2. About me:
Ho John Lee
hojohn . lee @ microsoft . com
twitter.com/hjl
Past: Bing Twitter (v1), SocialQuant, trading, investing/consulting (China, India)
HP Labs, MIT, Stanford, Harvard
Current: Bing Social Search – graph and time series analysis, data mining
Twitter, Facebook, new products, technical planning
3. What can we do by observing social networks?
On the internet, no one knows you’re a dog.
But in social networks, we can tell if you act like a dog, what groups you belong to, and some of your interests
4. How many Twitter users are there?
from a search on twopular, May 2009
5. Graph analysis for relevance and ranking
Spam marketing campaign
(teeth whitening)
Naturally connected community (#smx)
Real time relevance needs data mining to filter and rank based on history
Spammy communities can be highly visible
Social graph, topic/concept graph, and behavior/gesture graphs are all useful tools
6. Information diffusion in the graph
Observed incidence network of retweets in Twitter
Kwak, Lee, et al, What is Twitter, a Social Network or a News Media? WWW2010
Information flow and behaviors form an implicit interaction graph
7. Topic / sentiment range, volume, trend analysis
What is the baseline rate of mentions / sentiment per unit time?
Look for changes in attention flow around a subject, location, topic
Watch for correlated signals from multiple sources
Consider source relevance and authority as well
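
One way to make the "baseline rate" question on slide 7 concrete is a trailing-window z-score over mention counts. This is only a minimal sketch and not the method used in the talk; the function names, the 6-bucket window, the threshold of 3.0, and the sample counts are all illustrative assumptions.

```python
from statistics import mean, stdev

def spike_scores(counts, window=6):
    """Score each bucket against the trailing `window` buckets (z-score).

    counts: mentions per unit time (e.g. per hour), oldest first.
    Buckets without a full trailing window get None.
    """
    scores = []
    for i, count in enumerate(counts):
        if i < window:
            scores.append(None)
            continue
        history = counts[i - window:i]
        baseline = mean(history)   # the baseline rate of mentions
        spread = stdev(history)
        # A perfectly flat history has zero spread; treat it as "no change".
        scores.append((count - baseline) / spread if spread > 0 else 0.0)
    return scores

def flag_spikes(counts, window=6, threshold=3.0):
    """Return the indices of buckets that sit well above their baseline."""
    scores = spike_scores(counts, window)
    return [i for i, z in enumerate(scores) if z is not None and z >= threshold]

# Hypothetical hourly mention counts for one topic.
hourly_mentions = [10, 12, 11, 9, 10, 12, 11, 10, 48, 12]
print(flag_spikes(hourly_mentions))
```

The same scoring could be run per source and the flagged buckets compared, which is one simple reading of "correlated signals from multiple sources".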
8. Applying graph analysis
Attention flow vs information flow
Leads to utility functions, cost functions
Variable diffusion rates by actor / network / info type
Predicting interests and affiliations
Content creation follows attention
Self-organized communities of attention
If there’s no content, you can ask for some
Observable propagation of information
9. Clustering and fuzzing properties and identities
* Frequently used terms can identify interests, affinities, latent query intent
* But can potentially be used to identify likely individual users!
* Infochaff – fuzzing out identity, behavior, properties
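
The point on slide 9 about frequently used terms can be illustrated with a toy term-frequency comparison. This is a sketch under assumptions, not anything shown in the talk: the user names, the sample messages, and the choice of cosine similarity over raw word counts are all hypothetical.

```python
import math
from collections import Counter

def term_profile(messages):
    """Count lowercase, whitespace-separated terms across a set of messages."""
    return Counter(word for msg in messages for word in msg.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def closest_user(unknown_messages, known_profiles):
    """Return the known user whose term profile best matches the unknown text."""
    probe = term_profile(unknown_messages)
    return max(known_profiles, key=lambda user: cosine(probe, known_profiles[user]))

# Hypothetical users and messages.
known = {
    "alice": term_profile(["graph mining at scale", "graph ranking papers"]),
    "bob": term_profile(["surf report waikiki", "waikiki beach bar"]),
}
print(closest_user(["new graph ranking idea"], known))
```

The same profiles that suggest interests also narrow down who wrote an unattributed message, which is why "infochaff" (deliberately adding noise terms) would lower these similarity scores.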
10. Thank You
Ho John Lee
hojohn . lee @ microsoft . com
twitter.com/hjl

RESEARCH: Insights from the latest social graph studies
Moderator: Eric Siegel – President at Prediction Impact and Conference Chair at Predictive Analytics World
Speakers:
Sharad Goel – Research Scientist at Yahoo
Ho John Lee – Principal Program Manager at Microsoft
DJ Patil – Chief Scientist at LinkedIn
Marc Smith – Chief Social Scientist at Connected Action Consulting Group

Bookmarks for February 4th through February 11th

These are my links for February 4th through February 11th:

  • Schneier on Security: Interview with a Nigerian Internet Scammer – "We had something called the recovery approach. A few months after the original scam, we would approach the victim again, this time pretending to be from the FBI, or the Nigerian Authorities. The email would tell the victim that we had caught a scammer and had found all of the details of the original scam, and that the money could be recovered. Of course there would be fees involved as well. Victims would often pay up again to try and get their money back."
  • xkcd – Frequency of Strip Versions of Various Games – n = Google hits for "strip <game name>" / Google hits for "<game name>"
  • PeteSearch: How to split up the US – Visualization of social network clusters in the US. "information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them.

    Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South."

  • Redis: Lightweight key/value Store That Goes the Extra Mile | Linux Magazine – Sort of like memcache. "Calling redis a key/value store doesn’t quite due it justice. It’s better thought of as a “data structures” server that supports several native data types and operations on them. That’s pretty much how creator Salvatore Sanfilippo (known as antirez) describes it in the documentation. Let’s dig in and see how it works."
  • Op-Ed Contributor – Microsoft’s Creative Destruction – NYTimes.com – Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.

Bookmarks for January 17th through January 20th

These are my links for January 17th through January 20th:

  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.
  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for December 31st through January 17th

These are my links for December 31st through January 17th:

  • Khan Academy – The Khan Academy is a not-for-profit organization with the mission of providing a high quality education to anyone, anywhere.

    We have 1000+ videos on YouTube covering everything from basic arithmetic and algebra to differential equations, physics, chemistry, biology and finance which have been recorded by Salman Khan.

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields: Agriculture, Chemistry, Crime, Demography, Ecology, Finance, Health, Hydrology, Industry, Labour Market, Macro-Economics, Meteorology, Micro-Economics, Miscellaneous, Physics, Production, Sales, Simulated series, Sport, Transport & Tourism, Tree-rings, Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (“tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art. There is a great paper and talk about personalization in Google News; they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on networks, you might consider clustering users based on co-visitation (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph.
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."
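
The co-visitation idea in the "Beyond PageRank" link above (clustering users who have co-liked documents) can be sketched in a few lines. This is an illustration under assumptions and not the approach from the linked post: it uses Jaccard overlap with an arbitrary 0.5 threshold and treats connected components as clusters, and the users and document ids are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap between two sets of liked document ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def covisitation_clusters(likes, threshold=0.5):
    """Group users whose liked-document sets overlap by at least `threshold`.

    likes: dict of user -> set of document ids.
    Returns a list of sorted user lists (connected components).
    """
    neighbors = {user: set() for user in likes}
    for u, v in combinations(likes, 2):
        if jaccard(likes[u], likes[v]) >= threshold:
            neighbors[u].add(v)
            neighbors[v].add(u)

    clusters, seen = [], set()
    for start in likes:
        if start in seen:
            continue
        # Depth-first walk over the overlap graph from this user.
        stack, component = [start], []
        seen.add(start)
        while stack:
            user = stack.pop()
            component.append(user)
            for other in neighbors[user] - seen:
                seen.add(other)
                stack.append(other)
        clusters.append(sorted(component))
    return clusters

# Hypothetical users and the documents they liked.
likes = {
    "ann": {"d1", "d2", "d3"},
    "ben": {"d2", "d3"},
    "cat": {"d7", "d8"},
    "dan": {"d8"},
}
print(covisitation_clusters(likes))
```

Nothing here looks at the document text, which is why this kind of approach carries over between languages; the all-pairs comparison is quadratic in the number of users, so a real system would need a cheaper way to find candidate pairs.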

When you come to a fork in the road…

Crossroads of the World at the Beach Bar, Waikiki

As some of you know, I have been exploring a variety of paths forward for SocialQuant, my real time social search and analytics project. My family, friends, and colleagues have given me much support, patience, and advice during this process, which has reached a crossroads, and as Yogi Berra says, “When you come to a fork in the road, take it!”

The rise of Twitter, Facebook, and other social media, combined with web-based applications, smartphones, and cloud computing, have all set the stage for new applications and use models based on social discovery, collaboration, and communications, in addition to traditional search. What we’re all calling “real time search” lately isn’t exactly real time, nor is it exactly search, in which you find a definitive/authoritative answer. Much of the opportunity revolves around discovering people, discussions, and events that are relevant to you and bringing them to your attention in a timely, actionable fashion. Information streams from social media are transient, unreliable, and noisy. At the same time, the sheer volume of data can help provide the basis for building better filters. As an added bonus, you can ask questions of people in the social graph itself, and there are numerous examples of communities of interest forming around current events such as Barack Obama’s inauguration, the Iran elections, or even Michael Jackson’s funeral, all of which help surface information content, opinion, and sentiment that were previously inaccessible online. One interesting aspect of real time social media is that it’s not just algorithmic, it’s based on human connections and emotions. So a message that “feels right” from people you trust can be more relevant than one that is “correct” at times.

The challenge then is in filtering and ranking the massive flow of information so that it directs the user’s limited (and non-expanding) time and attention to what’s most valuable to them. With today’s information technology, amazing things are possible with limited resources. I personally have more computing and storage resources than the facility we launched HP’s original photo site with (for millions of dollars), at a fraction of the cost, routinely pushing around datasets of millions of rows on the local development servers. Unfortunately, that’s just the ante to get started on the problem. Running ranking, clustering, and semantic analysis for filtering the ever-growing stream of social media eventually requires web scale computing, even with careful problem selection and data pruning. The bar is also going up every day as the social media user base grows, and as well funded teams make progress on their platforms (+Google). So very shortly, being competitive in real time social search and discovery is going to require access to lots of data and either getting a datacenter or working with someone who has one.

In my case, I have recently chosen the latter path, and will be joining the Microsoft Bing search team, focusing on real time and social search. Microsoft itself has been showing signs of a renaissance, with search relaunching, Windows 7 looking leaner, Azure becoming non-vaporous, more web APIs getting published, core online applications starting to turn up, and a cool Office 2010 video. Even Mini-Microsoft is getting positive recently. And Google is starting to have “bigness” issues.

I look forward to working with Sean Suchter and the Microsoft Bing search team (and likely expanding their carbon footprint) in pursuit of new applications and services as the social media and online application space evolves.

You can follow along on Twitter (@hjl). As always, any and all opinions here are solely mine and do not reflect the position of any past, present, or future employer, partner, or business associate.

Bookmarks for June 6th through June 8th

These are my links for June 6th through June 8th:

  • Latin motto generator: make your own catchy slogans! – Create your own life mottos and slogans in Latin! (Learning Latin not required, some vague idea for a desired motto a plus)
  • A Map Of Social (Network) Dominance – Using Alexa and Google Trend data, Cosenza color-coded the map based on which social network is the most popular in each country. All of the light green countries belong to Facebook. But there are still pockets of resistance in Russia (where V Kontakte rules), China (QQ), Brazil and India (Orkut), Central America, Peru, Mongolia, and Thailand (hi5), South Korea (Cyworld), Japan (Mixi), the Middle East (Maktoob), and the Philippines (Friendster).
  • Microsoft Releases Bing API – With No Usage Quotas – Updated search API, with no quotas and some improvements.
    * Developers can now request data in JSON and XML formats. The SOAP interface that the Live Search API required has also been retained.
    * Requested data can be narrowed to one of the following source types: web, news, images, phonebook, spell-checker, related queries, and Encarta instant answer.
    * It is now possible to send requests in OpenSearch-compliant RSS format for web, news, image and phonebook queries.
    * Client applications will be able to combine any number of different data source types into a single request with a single query string.
  • Twitter Limits Getting Ridiculous! « Verwon’s Blog – Anecdotal reports of Twitter users running into problems with rate limiting, either API or max posts/tweets/follows/directs.
  • flot – Google Code – Flot is a pure Javascript plotting library for jQuery. It produces graphical plots of arbitrary datasets on-the-fly client-side. The focus is on simple usage (all settings are optional), attractive looks and interactive features like zooming and mouse tracking. The plugin is known to work with Internet Explorer 6/7/8, Firefox 2.x+, Safari 3.0+, Opera 9.5+ and Konqueror 4.x+. If you find a problem, please report it. Drawing is done with the canvas tag introduced by Safari and now available on all major browsers, except Internet Explorer where the excanvas Javascript emulation helper is used.

Bookmarks for May 30th through May 31st

These are my links for May 30th through May 31st:

Bookmarks for May 8th through May 12th

These are my links for May 8th through May 12th:

Bookmarks for May 5th through May 6th

These are my links for May 5th through May 6th:

Bookmarks for April 28th through April 29th

These are my links for April 28th through April 29th:

Bookmarks for April 12th from 17:02 to 19:13

These are my links for April 12th from 17:02 to 19:13:

Bookmarks for April 11th through April 12th

These are my links for April 11th through April 12th:

  • Wordle – Beautiful Word Clouds – Wordle is a toy for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes.
  • The dark side of Dubai – Johann Hari, Commentators – The Independent – "Dubai was meant to be a Middle-Eastern Shangri-La, a glittering monument to Arab enterprise and western capitalism. But as hard times arrive in the city state that rose from the desert sands, an uglier story is emerging."
  • Topless Robot – Hot Girls Have Lightsaber Strip-Fight for Your Viewing Pleasure – Star Wars CGI meets fake body spray ad
  • Poll Result: Best VPN to leap China’s Great Firewall? – Thomas Crampton
    * Witopia – Undisputed winner. Quality of service, speed of surfing, though it is said to be relatively expensive at US$50 to US$60 per year.
    * Hotspot Shield – Bandwidth limits can be painful. Force you to wait until the next month if you use it too much.
    * Ultrasurf
    * StrongVPN
  • InfoQ: Facebook: Science and the Social Graph – In this presentation filmed during QCon SF 2008 (November 2008), Aditya Agarwal discusses Facebook’s architecture, more exactly the software stack used, presenting the advantages and disadvantages of its major components: LAMP (PHP, MySQL), Memcache, Thrift, Scribe.
  • The Running Man, Revisited § SEEDMAGAZINE.COM – a handful of scientists think that these ultra-marathoners are using their bodies just as our hominid forbears once did, a theory known as the endurance running hypothesis (ER). ER proponents believe that being able to run for extended lengths of time is an adapted trait, most likely for obtaining food, and was the catalyst that forced Homo erectus to evolve from its apelike ancestors.

Bookmarks for March 16th through April 2nd

These are my links for March 16th through April 2nd:

Bookmarks for March 9th through March 12th

These are my links for March 9th through March 12th:

Bookmarks for February 27th through February 28th

These are my links for February 27th through February 28th:

Why I’m not connected to you on Facebook or LinkedIn (but do follow on Twitter and Friendfeed)

Here are my current informal policies for using Facebook, LinkedIn, Twitter, and Friendfeed. Short version: I use Facebook and LinkedIn for people I know personally; on Twitter and Friendfeed, any interesting input is welcome.

Facebook: This has been rapidly going mainstream lately. I had a mostly unused account for a long time, which has become more interesting/active as people I know sign up.  I presently only link to people I know in real life. Facebook is interesting because there are people I haven’t interacted with for years (high school friends etc) as well as people that live next door (literally) and colleagues from past work projects all mixed together, and they all get to eavesdrop and engage in casual/passive interaction. I currently have my Twitter feed linked to update my Facebook status, which means my messages are probably cryptic to about half the readers at any given time.

LinkedIn: I originally only linked to people I worked with and knew very well. I have broadened out the criteria over the years, and at this point I will link to people that I haven’t worked with but have at least actually met and had a conversation with. I basically don’t link to people I don’t know and haven’t met, though. I’d like to at least be able to recognize people I’m linked to, and have a clue about what they’re like. So no “LinkedIn Open Networking” for me.

Twitter: I look for interesting (to me) streams, whether or not I know the author. Most of my Twitter feed is people I haven’t met in person. I follow people I know in real life, and also discover people who have commented on something that turned up in a conversation or a search. I don’t auto follow, although I do try to take a look at who’s on my follower list periodically to see if there is someone I should add. Twitter has also been the most interesting for making new connections with people in real life, as you can get a sense of the topics people are thinking about and what they’re more generally like. I use Twitter for scanning a range of topics, so I’m a little less interested in people with huge follower counts and more interested in people kicking out uncorrelated but interesting ideas and data. I’m working on tools for scanning and filtering status and sentiment streams, so in theory a bigger source network is better, if you can make effective use of it.

Friendfeed: Sometimes I feel like Friendfeed is the Robert Scoble/Louis Gray channel, but I have seeded it with my Twitter feed and have gradually added people as they are exposed through the “friend of” feature. I always have the feeling that I’m not making the best use of Friendfeed. I like the conversations that pop up on posted items, but wish for the range of input that comes from the huge user bases on Twitter and Facebook. Then again, maybe not Facebook inputs here; I also enjoy the relative skew towards content from early adopters that persists for now on Friendfeed.

If I know you in real life, feel free to send me a Facebook or LinkedIn request; there have been a lot of people signing up lately and I’ve been enjoying reconnecting with people I haven’t heard from in a while. If I don’t know you (yet), you’re welcome to follow on Twitter (@hjl) or Friendfeed (hjl).

Bookmarks for February 20th through February 21st

These are my links for February 20th through February 21st:

BrainJam, December 2005, search, privacy, transparency

Spent a few hours this afternoon at Chris Heuer’s BrainJam event. Wasn’t able to make it to the morning sessions, but arrived in time for the end of lunch and the “youth user panel”, consisting of four college students. They all love Facebook. Not sure how representative they are of the general student demographic, since two of them are trying to put together a web startup. They all use free online music and movie access, mostly through sharing within the dorm networks.

During the Q&A I asked for the panel members’ thoughts on privacy and about having their college lives online in perpetuity. They’re vaguely concerned, but I don’t think the topic is really raising red flags for them. I think the high school and college users have more confidence in Facebook, MySpace, Xanga and others keeping their data private and/or it not making any difference to them in the future as social norms change. Part of it is that people are simply making things up on their pages, for the sake of attracting attention, and part of it is them not caring or not understanding that their web pages, chat transcripts, and even VOIP are mostly staying online forever. I think there’s going to be a lot of interesting conflicts in the future as people start running into their past personae 5, 10, 15 years later in a societal context that hasn’t adjusted yet to perpetual transparency.

Afterwards the group broke out into smaller topical discussions. The first session I went to was on the 2-way RSS proposal from Microsoft (Simple Sharing Extensions, SSE). I’m starting to think of SSE as a way for MSFT to use an RSS container for solving the sync problem for applications like Windows Mobile syncing a device and a desktop, or Active Directory performing distributed synchronization of directory data. I’m not really seeing a federated publishing model based on this, an idea that was floated in the conversation. It really feels like it solves an application sync problem for structured data.

The session on “what to do with all the data?” quickly turned into a discussion on privacy, transparency, and DRM. I’m personally disinclined to depend on trusting anyone’s DRM system to manage my critical personal data, or to allow anyone to index my private data in a way that eventually gets exposed to the world. One point of view expressed in this discussion was that the world would be better off if everyone just got used to the idea that everything they did was recorded and visible to the world (the Global Panopticon), although I think the majority disagreed that this would actually make people behave better. Personally, I think that documenting everything would break a lot of the ambiguity in relationships and conversations that allow the formation of reasonable opinions, by forcing people into adhering to “statements” and “positions” that were nothing more than passing conversation or exploration of a topic. This was part of my thinking behind asking the college kids about privacy. In real life, there are normally various social transitions that call for stepping away or de-emphasizing some aspects of one’s life, in favor of new ones. It doesn’t make the past behaviors and activities go away, but the combination of search engines and infinite, cheap storage is likely to keep some aspects of these folks’ “past” life in their face for a long time, which may make it harder to move forward.

Someone mentioned the idea of “privacy parity”, i.e. you can ask for my data, but I can see that you’re asking for it, sort of like being able to find out when someone has requested your credit report. This is interesting, but there are substantial asymmetries in the value of that information to each party. A bit of parity that would be very interesting would be a feed of who’s seen my site URLs and excerpts in a search results page — not the clickthrough, which I can already see, but when it’s turned up on the page at all.

A few of us continued a sidebar discussion on search, social networks, trust, and attention networks, and eventually got kicked out into the lobby, where we were free to speculate on Google’s plan for world domination next to a huge globe in the SRI lobby. I haven’t bumped into anyone yet doing work on integrating attention, social, and trust data into search. Doing this at Google/Yahoo/Microsoft scale looks hard because of the sheer volume of data, but I’m getting the sense that a custom search engine biased by social / attention data inputs, for a limited subject domain (100s to 1000s of GB) and a relatively small social / attention network (1000s of people you know or have heard of), is becoming more reasonable because of cheaper / faster / better IT hardware and because more of the data is actually becoming available now. Still chewing on this. I just came across Danah Boyd’s post on attention networks vs social networks yesterday, which concisely explains the directed vs undirected graph property that underlies part of the ranking algorithms that would be needed.
Perhaps someone’s already done this for a research project.
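
To make the directed vs undirected point concrete, here is a minimal sketch of how an attention graph might bias a small search engine's ranking. Everything in it is an assumption for illustration: the names `attention_rank` and `rerank`, the use of a personalized PageRank-style walk, the damping value, and the simple linear blend with a text score. None of it comes from the discussion itself or from any existing search product.

```python
from collections import defaultdict


def attention_rank(edges, seed, damping=0.85, iterations=50):
    """Personalized PageRank-style scores over directed attention edges.

    edges: iterable of (reader, source) pairs meaning "reader pays
    attention to source". The direction matters: attention flows from
    reader to source, never the reverse. (Hypothetical helper.)
    """
    out_links = defaultdict(list)
    nodes = {seed}
    for reader, source in edges:
        out_links[reader].append(source)
        nodes.update((reader, source))

    # Start with all weight on the person doing the searching.
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        nxt[seed] += 1.0 - damping  # restart at the seed
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Nobody followed from here: return the weight to the seed.
                nxt[seed] += damping * rank[n]
        rank = nxt
    return rank


def rerank(results, rank, weight=0.5):
    """Blend a text-relevance score with the author's attention score.

    results: list of (doc_id, author, text_score) tuples.
    Returns (doc_id, blended_score) pairs, best first. (Hypothetical helper.)
    """
    scored = [
        (doc, (1 - weight) * score + weight * rank.get(author, 0.0))
        for doc, author, score in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# "me" reads alice, alice reads bob, and carol reads "me".
edges = [("me", "alice"), ("alice", "bob"), ("carol", "me")]
rank = attention_rank(edges, seed="me")
results = [("d1", "carol", 0.6), ("d2", "alice", 0.5)]
ranked = rerank(results, rank)
```

In this toy graph carol pays attention to me, but I pay none to her, so her score stays at zero and alice's document moves ahead of carol's despite a lower text score. An undirected "friends" graph would treat the two relationships as the same.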

If Google Desktop were open source, it might be a logical place to insert a modified ranking algorithm based on attention, tags and social networks, and also to insert an SSE-style interface to allow peer-to-peer federation of local search queries and results. This would keep the search index data local to “me” and “my documents”, but allow sharing with other clients that I trust. Perhaps it’s just an age thing. The college kids didn’t seem to mind having all of their documents on public servers, are counting on robots.txt to keep them out of the global search engines, and apparently think that access controls on sites like Facebook will keep their personal postings out of the public realm. For me, I still think twice sometimes about posting to my del.icio.us bookmarks list, and I keep anything really critical on physical media in a safe deposit box in a vault. So while I’ve gone from being Ungoogleable to Google search stardom, there’s a good portion of my digital life which is “dark matter” to the search engines. I’d like to find a way to fix it for myself, and share information with people I trust, and refine my searches over the public internet, but without having to give Google or anyone else all of my personal data.

Youth panel discussion Wrap up session

Took a few photos; photos from others will probably turn up tagged with “brainjams”.

Update 12-04-2005 21:15 PST: Audio from the Youth Panel discussion on Chris’s blog
KRON-4 television piece on BrainJams. Looks like I missed the hula hoop part in the morning. I also seem to have mostly missed the non-profit community-oriented discussion, as you can see from my notes. Perhaps that’s what was going on when we got kicked out into the lobby for being too loud…