Warning: Parameter 1 to ace_where() expected to be a reference, value given in /home/hjlee/hjl/hojohnlee.com/weblog/wp-includes/plugin.php on line 227

Warning: Parameter 1 to ace_join() expected to be a reference, value given in /home/hjlee/hjl/hojohnlee.com/weblog/wp-includes/plugin.php on line 227
Ho John Lee’s Weblog » google

Bookmarks for January 23rd from 15:00 to 15:00

These are my links for January 23rd from 15:00 to 15:00:

Bookmarks for January 20th through January 23rd

These are my links for January 20th through January 23rd:

  • Data.gov – Featured Datasets: Open Government Directive Agency – Datasets required under the Open Government Directive through the end of the day, January 22, 2010. Freedom of Information Act request logs, Treasury TARP and derivative activity logs, crime, income, agriculture datasets.
  • All Your Twitter Bot Needs Is Love – The bot’s name? Jason Thorton. He’s been humming along for months now, sending out over 1250 tweets to some 174 followers. His tweets, while not particularly creative, manage to be both believable and timely. And he’s powered by a single word: Love.

    Thorton is the creation of developer Ryan Merket, who built him as a side project in around three hours. Merket has just posted the code that powers him, and has also divulged how he made Thorton seem somewhat realistic: the bot looks for tweets with the word “love” in them and tweets them as its own.

  • Building a Twitter Bot – "Meet Jason Thorton. To people who know Jason, he is a successful entrepreneur in San Francisco who tweets 4-5 times a day. But Jason has a secret, he’s not really a human, he’s the product of my simple algorithm in PHP

    Jason tweets A LOT about the word “love” – that’s because Jason actually steals tweets from the public timeline that contain the word “love” and posts them as his own

    Jason also @replies to people who use the word “love” in their tweets, and asks them random questions or says something arbitrary

    It took me about 3 hours to code Jason, imagine what a real engineer could do with real AI algorithms? Now realize that it’s already a reality. Sites like Twitter are full of side projects, company initiatives, spambots and AI robots. When the free flow of information becomes open, the amount of disinformation increases. Theres a real need for someone to vet the people we ‘meet’ on social sites – will be interesting to see how this market grows in the next year

  • Website monitoring status – Public API Status – Health monitor for 26 APIs from popular Web services, including Google Search, Google Maps, Bing, Facebook, Twitter, SalesForce, YouTube, Amazon, eBay and others
  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.

Bookmarks for January 20th through January 22nd

These are my links for January 20th through January 22nd:

  • All Your Twitter Bot Needs Is Love – The bot’s name? Jason Thorton. He’s been humming along for months now, sending out over 1250 tweets to some 174 followers. His tweets, while not particularly creative, manage to be both believable and timely. And he’s powered by a single word: Love.

    Thorton is the creation of developer Ryan Merket, who built him as a side project in around three hours. Merket has just posted the code that powers him, and has also divulged how he made Thorton seem somewhat realistic: the bot looks for tweets with the word “love” in them and tweets them as its own.

  • Building a Twitter Bot – "Meet Jason Thorton. To people who know Jason, he is a successful entrepreneur in San Francisco who tweets 4-5 times a day. But Jason has a secret, he’s not really a human, he’s the product of my simple algorithm in PHP

    Jason tweets A LOT about the word “love” – that’s because Jason actually steals tweets from the public timeline that contain the word “love” and posts them as his own

    Jason also @replies to people who use the word “love” in their tweets, and asks them random questions or says something arbitrary

    It took me about 3 hours to code Jason, imagine what a real engineer could do with real AI algorithms? Now realize that it’s already a reality. Sites like Twitter are full of side projects, company initiatives, spambots and AI robots. When the free flow of information becomes open, the amount of disinformation increases. Theres a real need for someone to vet the people we ‘meet’ on social sites – will be interesting to see how this market grows in the next year

  • Website monitoring status – Public API Status – Health monitor for 26 APIs from popular Web services, including Google Search, Google Maps, Bing, Facebook, Twitter, SalesForce, YouTube, Amazon, eBay and others
  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.

Bookmarks for January 20th from 10:26 to 14:12

These are my links for January 20th from 10:26 to 14:12:

  • Website monitoring status – Public API Status – Health monitor for 26 APIs from popular Web services, including Google Search, Google Maps, Bing, Facebook, Twitter, SalesForce, YouTube, Amazon, eBay and others
  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.

Bookmarks for January 20th from 10:26 to 10:26

These are my links for January 20th from 10:26 to 10:26:

  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.

Bookmarks for January 17th through January 20th

These are my links for January 17th through January 20th:

  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.
  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for January 17th through January 20th

These are my links for January 17th through January 20th:

  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.
  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for January 17th through January 18th

These are my links for January 17th through January 18th:

  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for January 17th through January 18th

These are my links for January 17th through January 18th:

  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for January 17th through January 18th

These are my links for January 17th through January 18th:

  • Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.

    "John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.

    That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."

  • Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.

    The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.

    Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.

    The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for January 17th from 21:37 to 21:53

These are my links for January 17th from 21:37 to 21:53:

  • Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, and partial downtime and information inaccessibility, based upon the public posts on Twitters blog.

    http://status.twitter.com/archive

    I did my best to not double count any problems, but it was difficult since many of the problems occur so frequently, and it is often difficult to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed, from that of a new emergence of a similar or same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which the user expectation would be most aligned with the product’s promise of Page Load Time. "

  • Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own

Bookmarks for December 31st through January 17th

These are my links for December 31st through January 17th:

  • Khan Academy – The Khan Academy is a not-for-profit organization with the mission of providing a high quality education to anyone, anywhere.

    We have 1000+ videos on YouTube covering everything from basic arithmetic and algebra to differential equations, physics, chemistry, biology and finance which have been recorded by Salman Khan.

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields.Agriculture Chemistry Crime Demography Ecology Finance Health Hydrology Industry Labour Market Macro-Economics Meteorology Micro-Economics Miscellaneous Physics Production Sales Simulated series Sport Transport & Tourism Tree-rings Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (”tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art There is a great paper and talk about personalization in Google News they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on Networks, you might consider clustering users based on co-visitaion (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."

Bookmarks for December 31st through January 16th

These are my links for December 31st through January 16th:

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields.Agriculture Chemistry Crime Demography Ecology Finance Health Hydrology Industry Labour Market Macro-Economics Meteorology Micro-Economics Miscellaneous Physics Production Sales Simulated series Sport Transport & Tourism Tree-rings Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (”tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art There is a great paper and talk about personalization in Google News they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on Networks, you might consider clustering users based on co-visitaion (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."

Bookmarks for December 31st through January 17th

These are my links for December 31st through January 17th:

  • Khan Academy – The Khan Academy is a not-for-profit organization with the mission of providing a high quality education to anyone, anywhere.

    We have 1000+ videos on YouTube covering everything from basic arithmetic and algebra to differential equations, physics, chemistry, biology and finance which have been recorded by Salman Khan.

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields.Agriculture Chemistry Crime Demography Ecology Finance Health Hydrology Industry Labour Market Macro-Economics Meteorology Micro-Economics Miscellaneous Physics Production Sales Simulated series Sport Transport & Tourism Tree-rings Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (”tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art There is a great paper and talk about personalization in Google News they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on Networks, you might consider clustering users based on co-visitaion (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."

Bookmarks for December 31st through January 17th

These are my links for December 31st through January 17th:

  • Khan Academy – The Khan Academy is a not-for-profit organization with the mission of providing a high quality education to anyone, anywhere.

    We have 1000+ videos on YouTube covering everything from basic arithmetic and algebra to differential equations, physics, chemistry, biology and finance which have been recorded by Salman Khan.

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields.Agriculture Chemistry Crime Demography Ecology Finance Health Hydrology Industry Labour Market Macro-Economics Meteorology Micro-Economics Miscellaneous Physics Production Sales Simulated series Sport Transport & Tourism Tree-rings Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (”tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art There is a great paper and talk about personalization in Google News they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on Networks, you might consider clustering users based on co-visitaion (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."

Bookmarks for June 13th through January 16th

These are my links for June 13th through January 16th:

  • StarCraft AI Competition | Expressive Intelligence Studio – AI bot warfare competition using a hacked API to run StarCraft, will be held at AIIDE2010 in October 2010.
    The competition will use StarCraft Brood War 1.16.1. Bots for StarCraft can be developed using the Broodwar API, which provides hooks into StarCraft and enables the development of custom AI for StarCraft. A C++ interface enables developers to query the current state of the game and issue orders to units. An introduction to the Broodwar API is available here. Instructions for building a bot that communicates with a remote process are available here. There is also a Forum. We encourage submission of bots that make use of advanced AI techniques. Some ideas are:
    * Planning
    * Data Mining
    * Machine Learning
    * Case-Based Reasoning
  • Measuring Measures: Learning About Statistical Learning – A "quick start guide" for statistical and machine learning systems, good collection of references.
  • Berkowitz et al : The use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems (2006) – Berkowitz, Steven D., Woodward, Lloyd H., & Woodward, Caitlin. (2006). Use of formal methods to map, analyze and interpret hawala and terrorist-related alternative remittance systems. Originally intended for publication in updating the 1988 volume, eds., Wellman and Berkowitz, Social Structures: A Network Approach (Cambridge University Press). Steve died in November, 2003. See Barry Wellman’s “Steve Berkowitz: A Network Pioneer has passed away,” in Connections 25(2), 2003. It has not been possible to add the updating of references or of the quality of graphics that might have been possible if Berkowitz were alive. An early version of the article appeared in the Proceedings of the Session on Combating Terrorist Networks: Current Research in Social Network Analysis for the New War Fighting Environment. 8th International Command and Control Research and Technology Symposium. National Defense University, Washington, D.C June 17-19, 2003
  • SSH Tunneling through web filters | s-anand.net – Step by step tutorial on using Putty and an EC2 instance to set up a private web proxy on demand.
  • PyDroid GUI automation toolkit – GitHub – What is Pydroid?

    Pydroid is a simple toolkit for automating and scripting repetitive tasks, especially those involving a GUI, with Python. It includes functions for controlling the mouse and keyboard, finding colors and bitmaps on-screen, as well as displaying cross-platform alerts.
    Why use Pydroid?

    * Testing a GUI application for bugs and edge cases
    o You might think your app is stable, but what happens if you press that button 5000 times?
    * Automating games
    o Writing a script to beat that crappy flash game can be so much more gratifying than spending hours playing it yourself.
    * Freaking out friends and family
    o Well maybe this isn't really a practical use, but…

  • Time Series Data Library – More data sets – "This is a collection of about 800 time series drawn from many different fields.Agriculture Chemistry Crime Demography Ecology Finance Health Hydrology Industry Labour Market Macro-Economics Meteorology Micro-Economics Miscellaneous Physics Production Sales Simulated series Sport Transport & Tourism Tree-rings Utilities"
  • How informative is Twitter? » SemanticHacker Blog – "We undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of tweets over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (”tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL."
  • Gremlin – a Turing-complete, graph-based programming language – GitHub – Gremlin is a Turing-complete, graph-based programming language developed in Java 1.6+ for key/value-pair multi-relational graphs known as property graphs. Gremlin makes extensive use of the XPath 1.0 language to support complex graph traversals. This language has applications in the areas of graph query, analysis, and manipulation. Connectors exist for the following data management systems:

    * TinkerGraph in-memory graph
    * Neo4j graph database
    * Sesame 2.0 compliant RDF stores
    * MongoDB document database

    The documentation for Gremlin can be found at this location. Finally, please visit TinkerPop for other software products.

  • The C Programming Language: 4.10 – by Kernighan & Ritchie & Lovecraft – void Rlyeh
    (int mene[], int wgah, int nagl) {
    int Ia, fhtagn;
    if (wgah>=nagl) return;
    swap (mene,wgah,(wgah+nagl)/2);
    fhtagn = wgah;
    for (Ia=wgah+1; Ia<=nagl; Ia++)
    if (mene[Ia]<mene[wgah])
    swap (mene,++fhtagn,Ia);
    swap (mene,wgah,fhtagn);
    Rlyeh (mene,wgah,fhtagn-1);
    Rlyeh (mene,fhtagn+1,nagl);

    } // PH'NGLUI MGLW'NAFH CTHULHU!

  • How to convert email addresses into name, age, ethnicity, sexual orientation – This is so Meta – "Save your email list as a CSV file (just comma separate those email addresses). Upload this file to your facebook account as if you wanted to add them as friends. Voila, facebook will give you all the profiles of all those users (in my test, about 80% of my email lists have facebook profiles). Now, click through each profile, and because of the new default facebook settings, which makes all information public, about 95% of the user info is available for you to harvest."
  • Microsoft Security Development Lifecycle (SDL): Tools Repository – A collection of previously internal-only security tools from Microsoft, including anti-xss, fuzz test, fxcop, threat modeling, binscope, now available for free download.
  • Analytics X Prize – Home – Forecast the murder rate in Philadelphia – The Analytics X Prize is an ongoing contest to apply analytics, modeling, and statistics to solve the social problems that affect our cities. It combines the fields of statistics, mathematics, and social science to understand the root causes of dysfunction in our neighborhoods. Understanding these relationships and discovering the most highly correlated variables allows us to deploy our limited resources more effectively and target the variables that will have the greatest positive impact on improvement.
  • PeteSearch: How to find user information from an email address – FindByEmail code released as open-source. You pass it an email address, and it queries 11 different public APIs to discover what information those services have on the user with that email address.
  • Measuring Measures: Beyond PageRank: Learning with Content and Networks – Conclusion: learning based on content and network data is the current state of the art There is a great paper and talk about personalization in Google News they use content for this purpose, and then user click streams to provide personalization, i.e. recommend specific articles within each topical cluster. The issue is content filtering is typically (as we say in research) "way harder." Suppose you have a social graph, a bunch of documents, and you know that some users in the social graph like some documents, and you want to recommend other documents that you think they will like. Using approaches based on Networks, you might consider clustering users based on co-visitaion (they have co-liked some of the documents). This scales great, and it internationalizes great. If you start extracting features from the documents themselves, then what you build for English may not work as well for the Chinese market. In addition, there is far more data in the text than there is in the social graph
  • mikemaccana’s python-docx at master – GitHub – MIT-licensed Python library to read/write Microsoft Word docx format files. "The docx module reads and writes Microsoft Office Word 2007 docx files. These are referred to as 'WordML', 'Office Open XML' and 'Open XML' by Microsoft. They can be opened in Microsoft Office 2007, Microsoft Mac Office 2008, OpenOffice.org 2.2, and Apple iWork 08. The module was created when I was looking for a Python support for MS Word .doc files, but could only find various hacks involving COM automation, calling .net or Java, or automating OpenOffice or MS Office."
  • Handy one-liners for SED – Sed expressions are powerful, but somewhat obscure and easy to screw up. A handy cheat sheet for common tasks.

My slides from the Real Time Search Panel at SES Chicago last week

Although real time search is fairly new, as we end 2009, the ability to index and search fresh results is rapidly becoming a commodity, with Bing, various startups, and now Google all integrating status feeds from social networking services. The next set of challenges in 2010 will be around providing better relevance, information discovery, and topic exploration for social search, using signals from the dynamic behavior of users and their interaction with the social and topic graphs.

I gave a short talk on real time and social search for a panel at SES Chicago last week. I’ve been heads down for the past few months working on Bing Twitter Search, so now that the first launch is out the door it was a nice chance to talk with people about some of the work we’re doing. There was a lot of interest in the sentiment, trend, and social graph analysis slides (9 and 10). I will write about those in a separate post, but wanted to get the presentation up for those who have been asking about it.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09

View more presentations from Ho John Lee.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09 – Presentation Transcript

  1. What’s different about real time and social search?
    Ho John Lee
    Principal Program Manager
    Bing Social Search
    Search Engine Strategies
    Chicago – December 7, 2009
  2. What’s Real Time Search Good For, Anyway?
  3. Twitter is Great for Watching Uninformed Panics Unfold Live
    …or finding balloons
    http://xkcd.com/574/
  4. Some characteristics of Twitter / Social media
    Immediacy, Sentiment, Brevity
    Not always accurate
    Feelings, reactions, impressions
    Context is often essential to determine meaning
    Gestural – @user, #hashtag, RT, favorites, follows
    Self-organizing communities of attention and authority
    Content follows attention
    People talk about what others are talking about
    Observations and commentary from everywhere
    If there’s no content, you can ask for some
    Extreme head and tail coverage
    Low relevance “noise” can become “signal” in aggregate
  5. Your product or brand could suddenly be at the center of a huge conversation
    Tiger Woods
    Balloon Boy
    Breaking Story
    Persistent Story
    Big Story
    Bigger Story
  6. Some characteristics of Real time / Social Search
    • Real time and social search is qualitatively different from traditional web search
    • Differences in ranking, relevance, use model
    • Social graph, user behavior, location, event correlation and other input signals
    • Real time search is frequently about discovery, not search per se
    • “what is everyone talking about”, followed by “what are people saying about ”
    • Top real time and social search results will usually differ from top web search results
  7. Bing Twitter Search at a glance
    Top Tweets
    Top Shared Links
    Tweets/Sentiment per link
    Adult /Spam filter; Tweets/Links ranking & relevance
  8. Bing Fall 2009: Twitter vertical, News, MSN, Maps
    MSN Local Edition
    Page 2: Tweets or Links
    Page 1: Tweets & Links
    Twitter Answer on News SERP
    MSN Hot Topics
  9. Topic / sentiment range, volume, trend analysis
    What is the baseline rate of mentions / sentiment per unit time?
    Changes in attention flow around a subject, location, topic
    Watch for correlated signals from multiple sources
    Consider source relevance and authority as well
  10. Graph analysis for relevance and ranking
    Spam marketing campaign
    Naturally connected community
    Spammy communities are highly visible – don’t be part of one!
  11. Bing Twitter Maps Demo
  12. To rise above the noise, there is more to do as search gets more social
    Plus…
  13. Thank You
    Ho John Lee
    hojohn . lee @ microsoft.com
    twitter.com/hjl
The session was moderated by Barbara Coll, CEO, WebMama.com Inc., with panelists Bill Fischer, Co-Founder & Director, Workdigital, Ltd., Rob Walk, Managing Partner, NovaRising, Nathan Stoll, Co-Founder, Aardvark, and  Ho John Lee, Principal Program Manager, Social and Real Time Search, Microsoft Bing.

My slides from the Real Time Search Panel at SES Chicago last week

Although real time search is fairly new, as we end 2009, the ability to index and search fresh results is rapidly becoming a commodity, with Bing, various startups, and now Google all integrating status feeds from social networking services. The next set of challenges in 2010 will be around providing better relevance, information discovery, and topic exploration for social search, using signals from the dynamic behavior of users and their interaction with the social and topic graphs.

I gave a short talk on real time and social search for a panel at SES Chicago last week. I’ve been heads down for the past few months working on Bing Twitter Search, so now that the first launch is out the door it was a nice chance to talk with people about some of the work we’re doing. There was a lot of interest in the sentiment, trend, and social graph analysis slides (9 and 10). I will write about those in a separate post, but wanted to get the presentation up for those who have been asking about it.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09

View more presentations from Ho John Lee.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09 – Presentation Transcript

  1. What’s different about real time and social search?
    Ho John Lee
    Principal Program Manager
    Bing Social Search
    Search Engine Strategies
    Chicago – December 7, 2009
  2. What’s Real Time Search Good For, Anyway?
  3. Twitter is Great for Watching Uninformed Panics Unfold Live
    …or finding balloons
    http://xkcd.com/574/
  4. Some characteristics of Twitter / Social media
    Immediacy, Sentiment, Brevity
    Not always accurate
    Feelings, reactions, impressions
    Context is often essential to determine meaning
    Gestural – @user, #hashtag, RT, favorites, follows
    Self-organizing communities of attention and authority
    Content follows attention
    People talk about what others are talking about
    Observations and commentary from everywhere
    If there’s no content, you can ask for some
    Extreme head and tail coverage
    Low relevance “noise” can become “signal” in aggregate
  5. Your product or brand could suddenly be at the center of a huge conversation
    Tiger Woods
    Balloon Boy
    Breaking Story
    Persistent Story
    Big Story
    Bigger Story
  6. Some characteristics of Real time / Social Search
    • Real time and social search is qualitatively different from traditional web search
    • Differences in ranking, relevance, use model
    • Social graph, user behavior, location, event correlation and other input signals
    • Real time search is frequently about discovery, not search per se
    • “what is everyone talking about”, followed by “what are people saying about ”
    • Top real time and social search results will usually differ from top web search results
  7. Bing Twitter Search at a glance
    Top Tweets
    Top Shared Links
    Tweets/Sentiment per link
    Adult /Spam filter; Tweets/Links ranking & relevance
  8. Bing Fall 2009: Twitter vertical, News, MSN, Maps
    MSN Local Edition
    Page 2: Tweets or Links
    Page 1: Tweets & Links
    Twitter Answer on News SERP
    MSN Hot Topics
  9. Topic / sentiment range, volume, trend analysis
    What is the baseline rate of mentions / sentiment per unit time?
    Changes in attention flow around a subject, location, topic
    Watch for correlated signals from multiple sources
    Consider source relevance and authority as well
  10. Graph analysis for relevance and ranking
    Spam marketing campaign
    Naturally connected community
    Spammy communities are highly visible – don’t be part of one!
  11. Bing Twitter Maps Demo
  12. To rise above the noise, there is more to do as search gets more social
    Plus…
  13. Thank You
    Ho John Lee
    hojohn . lee @ microsoft.com
    twitter.com/hjl
The session was moderated by Barbara Coll, CEO, WebMama.com Inc., with panelists Bill Fischer, Co-Founder & Director, Workdigital, Ltd., Rob Walk, Managing Partner, NovaRising, Nathan Stoll, Co-Founder, Aardvark, and  Ho John Lee, Principal Program Manager, Social and Real Time Search, Microsoft Bing.

My slides from the Real Time Search Panel at SES Chicago last week

Although real time search is fairly new, as we end 2009, the ability to index and search fresh results is rapidly becoming a commodity, with Bing, various startups, and now Google all integrating status feeds from social networking services. The next set of challenges in 2010 will be around providing better relevance, information discovery, and topic exploration for social search, using signals from the dynamic behavior of users and their interaction with the social and topic graphs.

I gave a short talk on real time and social search for a panel at SES Chicago last week. I’ve been heads down for the past few months working on Bing Twitter Search, so now that the first launch is out the door it was a nice chance to talk with people about some of the work we’re doing. There was a lot of interest in the sentiment, trend, and social graph analysis slides (9 and 10). I will write about those in a separate post, but wanted to get the presentation up for those who have been asking about it.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09

View more presentations from Ho John Lee.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09 – Presentation Transcript

  1. What’s different about real time and social search?
    Ho John Lee
    Principal Program Manager
    Bing Social Search
    Search Engine Strategies
    Chicago – December 7, 2009
  2. What’s Real Time Search Good For, Anyway?
  3. Twitter is Great for Watching Uninformed Panics Unfold Live
    …or finding balloons
    http://xkcd.com/574/
  4. Some characteristics of Twitter / Social media
    Immediacy, Sentiment, Brevity
    Not always accurate
    Feelings, reactions, impressions
    Context is often essential to determine meaning
    Gestural – @user, #hashtag, RT, favorites, follows
    Self-organizing communities of attention and authority
    Content follows attention
    People talk about what others are talking about
    Observations and commentary from everywhere
    If there’s no content, you can ask for some
    Extreme head and tail coverage
    Low relevance “noise” can become “signal” in aggregate
  5. Your product or brand could suddenly be at the center of a huge conversation
    Tiger Woods
    Balloon Boy
    Breaking Story
    Persistent Story
    Big Story
    Bigger Story
  6. Some characteristics of Real time / Social Search
    • Real time and social search is qualitatively different from traditional web search
    • Differences in ranking, relevance, use model
    • Social graph, user behavior, location, event correlation and other input signals
    • Real time search is frequently about discovery, not search per se
    • “what is everyone talking about”, followed by “what are people saying about ”
    • Top real time and social search results will usually differ from top web search results
  7. Bing Twitter Search at a glance
    Top Tweets
    Top Shared Links
    Tweets/Sentiment per link
    Adult /Spam filter; Tweets/Links ranking & relevance
  8. Bing Fall 2009: Twitter vertical, News, MSN, Maps
    MSN Local Edition
    Page 2: Tweets or Links
    Page 1: Tweets & Links
    Twitter Answer on News SERP
    MSN Hot Topics
  9. Topic / sentiment range, volume, trend analysis
    What is the baseline rate of mentions / sentiment per unit time?
    Changes in attention flow around a subject, location, topic
    Watch for correlated signals from multiple sources
    Consider source relevance and authority as well
  10. Graph analysis for relevance and ranking
    Spam marketing campaign
    Naturally connected community
    Spammy communities are highly visible – don’t be part of one!
  11. Bing Twitter Maps Demo
  12. To rise above the noise, there is more to do as search gets more social
    Plus…
  13. Thank You
    Ho John Lee
    hojohn . lee @ microsoft.com
    twitter.com/hjl
The session was moderated by Barbara Coll, CEO, WebMama.com Inc., with panelists Bill Fischer, Co-Founder & Director, Workdigital, Ltd., Rob Walk, Managing Partner, NovaRising, Nathan Stoll, Co-Founder, Aardvark, and  Ho John Lee, Principal Program Manager, Social and Real Time Search, Microsoft Bing.

My slides from the Real Time Search Panel at SES Chicago last week

Although real time search is fairly new, as we end 2009, the ability to index and search fresh results is rapidly becoming a commodity, with Bing, various startups, and now Google all integrating status feeds from social networking services. The next set of challenges in 2010 will be around providing better relevance, information discovery, and topic exploration for social search, using signals from the dynamic behavior of users and their interaction with the social and topic graphs.

I gave a short talk on real time and social search for a panel at SES Chicago last week. I’ve been heads down for the past few months working on Bing Twitter Search, so now that the first launch is out the door it was a nice chance to talk with people about some of the work we’re doing. There was a lot of interest in the sentiment, trend, and social graph analysis slides (9 and 10). I will write about those in a separate post, but wanted to get the presentation up for those who have been asking about it.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09

View more presentations from Ho John Lee.

What’s Different about Real Time and Social Search – HJL Slides For SES Chicago Dec 09 – Presentation Transcript

  1. What’s different about real time and social search?
    Ho John Lee
    Principal Program Manager
    Bing Social Search
    Search Engine Strategies
    Chicago – December 7, 2009
  2. What’s Real Time Search Good For, Anyway?
  3. Twitter is Great for Watching Uninformed Panics Unfold Live
    …or finding balloons
    http://xkcd.com/574/
  4. Some characteristics of Twitter / Social media
    Immediacy, Sentiment, Brevity
    Not always accurate
    Feelings, reactions, impressions
    Context is often essential to determine meaning
    Gestural – @user, #hashtag, RT, favorites, follows
    Self-organizing communities of attention and authority
    Content follows attention
    People talk about what others are talking about
    Observations and commentary from everywhere
    If there’s no content, you can ask for some
    Extreme head and tail coverage
    Low relevance “noise” can become “signal” in aggregate
  5. Your product or brand could suddenly be at the center of a huge conversation
    Tiger Woods
    Balloon Boy
    Breaking Story
    Persistent Story
    Big Story
    Bigger Story
  6. Some characteristics of Real time / Social Search
    • Real time and social search is qualitatively different from traditional web search
    • Differences in ranking, relevance, use model
    • Social graph, user behavior, location, event correlation and other input signals
    • Real time search is frequently about discovery, not search per se
    • “what is everyone talking about”, followed by “what are people saying about ”
    • Top real time and social search results will usually differ from top web search results
  7. Bing Twitter Search at a glance
    Top Tweets
    Top Shared Links
    Tweets/Sentiment per link
    Adult /Spam filter; Tweets/Links ranking & relevance
  8. Bing Fall 2009: Twitter vertical, News, MSN, Maps
    MSN Local Edition
    Page 2: Tweets or Links
    Page 1: Tweets & Links
    Twitter Answer on News SERP
    MSN Hot Topics
  9. Topic / sentiment range, volume, trend analysis
    What is the baseline rate of mentions / sentiment per unit time?
    Changes in attention flow around a subject, location, topic
    Watch for correlated signals from multiple sources
    Consider source relevance and authority as well
  10. Graph analysis for relevance and ranking
    Spam marketing campaign
    Naturally connected community
    Spammy communities are highly visible – don’t be part of one!
  11. Bing Twitter Maps Demo
  12. To rise above the noise, there is more to do as search gets more social
    Plus…
  13. Thank You
    Ho John Lee
    hojohn . lee @ microsoft.com
    twitter.com/hjl
The session was moderated by Barbara Coll, CEO, WebMama.com Inc., with panelists Bill Fischer, Co-Founder & Director, Workdigital, Ltd., Rob Walk, Managing Partner, NovaRising, Nathan Stoll, Co-Founder, Aardvark, and  Ho John Lee, Principal Program Manager, Social and Real Time Search, Microsoft Bing.
Page 3 of 7812345...102030...Last »