Crossroads of the World at the Beach Bar, Waikiki
As some of you know, I have been exploring a variety of paths forward for SocialQuant, my real time social search and analytics project. My family, friends, and colleagues have given me much support, patience, and advice during this process, which has reached a crossroads, and as Yogi Berra says, “When you come to a fork in the road, take it!”
The rise of Twitter, Facebook, and other social media, combined with web-based applications, smartphones, and cloud computing, has set the stage for new applications and use models based on social discovery, collaboration, and communications, in addition to traditional search. What we’re all calling “real time search” lately isn’t exactly real time, nor is it exactly search, in which you find a definitive/authoritative answer. Much of the opportunity revolves around discovering people, discussions, and events that are relevant to you and bringing them to your attention in a timely, actionable fashion. Information streams from social media are transient, unreliable, and noisy. At the same time, the sheer volume of data can help provide the basis for building better filters. As an added bonus, you can ask questions of people in the social graph itself, and there are numerous examples of communities of interest forming around current events such as Barack Obama’s inauguration, the Iran elections, or even Michael Jackson’s funeral, all of which help surface information content, opinion, and sentiment that were previously inaccessible online. One interesting aspect of real time social media is that it’s not just algorithmic, it’s based on human connections and emotions. So a message that “feels right” from people you trust can at times be more relevant than one that is “correct”.
The challenge then is in filtering and ranking the massive flow of information so that the user’s limited (and non-expanding) time and attention go where they are most valuable. With today’s information technology, amazing things are possible with limited resources. I personally have more computing and storage resources than the facility we launched HP’s original photo site with (for millions of dollars), at a fraction of the cost, and routinely push around datasets of millions of rows on the local development servers. Unfortunately, that’s just the ante to get started on the problem. Running ranking, clustering, and semantic analysis for filtering the ever-growing stream of social media eventually requires web scale computing, even with careful problem selection and data pruning. The bar is also going up every day as the social media user base grows, and as well funded teams make progress on their platforms (+Google). So very shortly, being competitive in real time social search and discovery is going to require access to lots of data, and either getting a datacenter or working with someone who has one.
In my case, I have recently chosen the latter path, and will be joining the Microsoft Bing search team, focusing on real time and social search. Microsoft itself has been showing signs of a renaissance, with search relaunching, Windows 7 looking leaner, Azure becoming non-vaporous, more web APIs getting published, core online applications starting to turn up, and a cool Office 2010 video. Even Mini-Microsoft is getting positive recently. And Google is starting to have “bigness” issues.
I look forward to working with Sean Suchter and the Microsoft Bing search team (and likely expanding their carbon footprint) in pursuit of new applications and services as the social media and online application space evolves.
You can follow along on Twitter (@hjl). As always, any and all opinions here are solely mine and do not reflect the position of any past, present, or future employer, partner, or business associate.
These are my links for June 11th through June 12th:
These are my links for June 9th through June 10th:
- Announcing the Yahoo! Distribution of Hadoop (Hadoop and Distributed Computing at Yahoo!) – Yahoo releases its internal version of Hadoop, a source-only distribution of Apache Hadoop tested and used in production at Yahoo.
- Google Fusion Tables FAQ – Sort of like extra-large Google Docs spreadsheets, up to 100MB per table, 250MB per user. One interesting wrinkle is that it doesn't actually delete your dataset when you "delete" it, so the data is still available for derived tables that other users have built.
- Filesystem Performance from a Database Perspective – Presentation on performance benchmarks of Linux filesystems (ext2, ext3, reiserfs, xfs, etc.).
- What Assumptions Make: Filesystem I/O from a database perspective – Slide presentation comparing Linux filesystem performance across various formats (ext2, ext3, etc.), RAID configurations, and readahead buffer sizes.
- MySQL – Common Queries Tree – A collection of common queries implemented in MySQL
These are my links for June 3rd through June 4th:
These are my links for May 30th through May 31st:
- Scaling Twitter: Making Twitter 10000 Percent Faster | High Scalability – Collection of links to presentations and interviews regarding Twitter's architecture, implementation plans, and performance issues, from spring 2009.
- The Last Psychiatrist: The Difference Between An Amateur, A Scientist, And A Genius – An amateur is full of wonder and speculation, tinkering towards the truth but suffering from a lack of knowledge and idleness; he's not even sure if someone else has already made these discoveries. "Is this a worthwhile pursuit?"
A scientist performs experiments to confirm or disprove a hypothesis, and in that way he grinds out the truth.
A genius has three abilities, which are actually the union of amateur and scientist: 1. to know the state of the art, what is known and what is not known. 2. To be able to think "out of the box". 3. To be disciplined enough to concentrate on the tedium of a formal investigation of his wondrous speculations.
- PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing – Research paper on sort of "super healing brush" for manipulating digital images, allows splicing together different sections of the image and automatically selecting similar textures to make the seam transitions work better.
- Light Blue Touchpaper » Blog Archive » Attack of the Zombie Photos – Social networking and sharing sites have challenges implementing and managing access control policies at large scale, and content delivery networks add another wrinkle.
- Map of all Google data center locations | Royal Pingdom – Where in the world is your search being served from? An attempt to assemble a list of known Google data centers worldwide.
These are my links for May 29th from 05:17 to 12:45:
- Some stats from Twitter conference compared to… – Robert Scoble – FriendFeed – Anecdotal data from 140tc this week. 200 tweets/second at peak. Didn't see an estimate of current user account population though, I keep seeing site unique visitor estimates, which aren't useful.
- Microsoft Silverlight vs Google Wave: Why Karma Matters | Zoho Blogs – "The real interesting contrast to us, as independent software developers, is the way developers responded to Silverlight as opposed to the reaction yesterday to Google Wave. Both Silverlight and Wave are aimed at taking the internet experience to the next level. To be perfectly honest, Silverlight is a great piece of technology. Google Wave, as yet, is not much more than a concept and an announcement. It is easy to dismiss all this with "Oh, the press just loves to hype everything Google, and loves to hate Microsoft," but that cannot explain why even competitors like us are willing to embrace Google's innovations, but stay away from perfectly good innovations from Microsoft, such as Silverlight? It comes down to one word: karma."
- makerfaire.com: Maker Faire – This weekend at San Mateo Expo Center
- Google Wave Federation Protocol
- Google Wave API Overview – Google Wave API – Google Code – APIs for Google Wave email / bbs / wiki / chat / collaboration / communications mashup platform introduced yesterday.
- What Emacs Commands Do You Use Most and Find Most Useful? : programming – Reddit thread discussing favorite Emacs commands.
These are my links for May 24th through May 27th:
- Formulas and game mechanics – WoWWiki – Your guide to the World of Warcraft – Formulas and game mechanics rules and guidelines for developing role playing games
- Manchester United’s Park Has the Endurance to Persevere – NYTimes.com – Korean soccer player Park Ji-Sung – On Wednesday night in Rome, Park is expected to become the first Asian player to participate in the European Champions League final when Manchester United faces Barcelona.
- mloss.org – Machine Learning Open Source Software – Big collection of open source packages for machine learning, data mining, statistical analysis
- The Datacenter as Computer – Luiz André Barroso and Urs Hölzle 2009 (PDF) – 120 pages on large scale computing lessons from Google. "These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base."
- Geeking with Greg: The datacenter is the new mainframe – Pointer to a paper by Googlers Luiz André Barroso and Urs Hölzle on the evolution of warehouse scale computing and the management and use of computing resources in a contemporary datacenter.
These are my links for May 14th through May 15th:
- Congratulations, Google staff: $210k in profit per head in 2008 | Royal Pingdom – Google had $209,624 in profit per employee in 2008, which beats all the other large tech companies we looked at, including big hitters like Microsoft ($194K), Apple ($151K), Intel ($64K) and IBM ($30K).
- Statistical Data Mining Tutorials – A nice collection of presentations reviewing topics in data mining and machine learning. e.g. "HillClimbing, Simulated Annealing and Genetic Algorithms. Some very useful algorithms, to be used only in case of emergency." These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and cased-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.
- Dare Obasanjo aka Carnage4Life – Why Twitter’s Engineers Hate the @replies feature – Looking at the infrastructure overhead required for Twitter's attempted change to @reply behavior.
- Scratch Helps Kids Get With the Program – Gadgetwise Blog – NYTimes.com – On my candidate list for 7th grade introductory programming and analysis. "Scratch, an M.I.T.-developed computer-programming language for children, is the focus of worldwide show-and-tell sessions this Saturday. "
These are my links for May 6th through May 7th:
- Mathematical Atlas: A gateway to Mathematics – "The Mathematical Atlas is a collection of articles about aspects of mathematics at and above the university level, but (usually) not at the level of current research. The goal of this collection is to introduce the subject areas of modern mathematics, to describe a few of the milestone results and topics, and to give pointers to some of the key resources where further information is to be found. Like any good atlas, we try to present several ways to look at each area and to show its relationship with neighboring areas and sub-areas. "
- Three Reasons Why Twitter Will NOT Index the Links You Share – ReadWriteWeb – Argues that Twitter will rely on bit.ly, through partnership or acquisition, to handle sentiment and semantic analysis of Twitter search and link contents.
- Tough Love For Microsoft Search – December 2008 post from Danny Sullivan on Microsoft and the search landscape.
- Annals of Innovation: How David Beats Goliath: Reporting & Essays: The New Yorker – Malcolm Gladwell, with a reporter at large on Vivek Ranadivé and his NJB girls basketball team, employing asymmetric strategies to overcome conventionally stronger teams, and a broader look at the history of insurgent strategies from David and Goliath, T.E. Lawrence, George Washington, etc.
These are my links for May 5th through May 6th:
- Coding Horror: I Just Logged In As You: How It Happened – On good password management, why forums should mostly not be storing user passwords in general, and how re-use of passwords on multiple sites can lead to vulnerability on other sites.
- Arc Forum | Arc – Arc is a version of Lisp. Among other things it is used to implement Hacker News.
- John Graham-Cumming: Can you trust Paul Graham with your password? – On best practices for storing password hashes to resist attacks on compromised password files and the use of rainbow tables, in a look at how Hacker News implements passwords.
- Deliberate Ambiguity: How *not* to rate a search engine – Search engines have very simple user interfaces, but are used in many different contexts, most of which don't resemble the way people often try out a new search engine.
- The Slow Erosion of Google Search – Bokardo – On changes in internet user behaviors over time, more social media (ask your Twitter friends) vs directed search (send a keyword query) etc.
- Brynn Marie Evans » Why social search won’t topple Google (anytime soon) – On differences between searching through social media such as Twitter, Facebook etc, vs Google etc.
- The Financial Services Club’s Blog: Stock picking with real-time news – Looking at real time social media trends for trading ideas.
- Lisp’s reputation is so bad that many people don’t even take a look at Lisp | International Lisp Conference 2009 – I haven't touched Lisp in years, except maybe for configuring emacs. A list of possible reasons why Lisp is not more widely used, e.g. "Lisp is old and moldy. It must be primitive by today's standards.", "The exciting languages to learn now are Python, Ruby, Groovy, etc."
- Peering into North Korea – The Big Picture – Boston.com – A collection of recent photos of scenes from North Korea.
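
The two password links above are about storing password hashes rather than passwords, and about making precomputed (rainbow table) attacks impractical. Here is a minimal sketch of that idea in Python. It is not taken from either linked post: the choice of PBKDF2, the iteration count, and the function names `hash_password` and `verify_password` are my own assumptions for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor for illustration; tune for your own hardware.
ITERATIONS = 100_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password, generating a random salt if none is given."""
    if salt is None:
        # A fresh random salt per user means identical passwords get different
        # digests, so a single precomputed table cannot cover all users.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))
    print(verify_password("wrong guess", salt, digest))
```

Only the salt and digest would be stored. The slow, iterated hash is there to raise the cost of guessing if the password file leaks.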
These are my links for May 4th through May 5th:
- Influential Nodes in a Diffusion Model for Social Networks (icalp05-inf.pdf) – Kempe, Kleinberg, Tardos. Algorithm for greedy approximation of most influential nodes in social network (63% of optimal) under various conditions.
- Maximizing the Spread of Influence through a Social Network (kdd03-inf.pdf) – Kempe, Kleinberg, Tardos. Maximizing propagation by selecting most influential nodes is NP-hard, but a greedy approximation can work well (63% of optimal) under various conditions.
- Notification Strategies for Social Networks – Discussion on approaches to maximizing use of a limited number of notifications within social networks e.g. Facebook
- James Smith • loopj.com » Blog Archive » jQuery Plugin: Tokenizing Autocomplete Text Entry – Looks handy – "This is a jQuery plugin to allow users to select multiple items from a predefined list, using autocompletion as they type to find each item. You may have seen a similar type of text entry when filling in the recipients field sending messages on facebook."
- Google Code FAQ – Using cURL to interact with Google data services – Step by step tutorial on using curl with Google data APIs.
- Behind The Business Plan Of Pirates Inc. : NPR – It takes around $250K to fund a Somali pirate operation. About 20 percent goes to pay off officials who look the other way. About 50 percent is for expenses and payroll. The leader of an attack makes $10,000 to $20,000 (the average Somali family lives on $500 a year). The initial investor — who put in $250,000 of seed capital — gets 30 percent, sometimes up to $500,000.
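
The Kempe, Kleinberg, and Tardos links above describe picking influential nodes greedily: repeatedly add the node that most increases the expected spread. The 63% figure is the 1 − 1/e approximation bound. Below is a rough sketch of that greedy loop, not the authors' code. It assumes an independent cascade model with a single activation probability for every edge and estimates spread by Monte Carlo simulation, so the bound only holds approximately here. The graph format (a dict of adjacency lists) and all function names are hypothetical.

```python
import random


def simulate_cascade(graph, seeds, prob, rng):
    """Run one independent-cascade simulation and return the set of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                # Each newly active node gets one chance to activate each neighbor.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


def expected_spread(graph, seeds, prob, trials, rng):
    """Estimate the expected number of activated nodes by averaging simulations."""
    total = sum(len(simulate_cascade(graph, seeds, prob, rng)) for _ in range(trials))
    return total / trials


def greedy_seeds(graph, k, prob=0.1, trials=200, seed=0):
    """Pick up to k seed nodes, each time adding the node with the best estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for node in sorted(nodes - set(chosen)):
            spread = expected_spread(graph, chosen + [node], prob, trials, rng)
            if spread > best_spread:
                best_node, best_spread = node, spread
        if best_node is None:
            break  # fewer than k nodes in the graph
        chosen.append(best_node)
    return chosen


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c", "d"], "b": [], "e": ["f"]}
    print(greedy_seeds(toy_graph, k=2, prob=0.5))
```

This re-simulates every candidate on every round, which is fine for a toy graph but gets expensive quickly on anything large.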
These are my links for April 28th through April 29th:
- Inside Facebook Reports: Why Hasn’t Facebook Grown More in China? – A look at Chinese consumer internet and social media usage, QQ, 51, Xiaonei, Kaixin, and some reasons why there are only around 300,000 Facebook users in China today.
- Facebook maps the swine flu hysteria | The Web Services Report – CNET News – Visualizing interest in swine flu by mapping percentages of mentions on Facebook wall pages, using data from Lexicon.
- Develop Twitter API application in django and deploy on Google App Engine — The Uswaretech Blog – Django Web Development – Walkthrough of a sample Twitter application on Google App Engine, using Django/Python.
These are my links for April 28th from 05:35 to 14:24:
- Official Google Blog: Adding search power to public data – Interesting. Wonder if the underlying public data sets will eventually become available on Google App Engine as well, sort of like the public data sets available for use with Amazon EC2 applications.
- MySQL And Search At Craigslist – Jeremy Zawodny's slides on MySQL, Sphinx, and free text search implementation at Craigslist, from last week's MySQL conference.
- Skew, The Frontend Engineer’s Misery @ Irrational Exuberance – For mashups and the like, the distinction between a FE engineer and web dev is rather small in terms of technical skills; they are both using the same skillset, they are both interacting with APIs, and so on. However, there are important distinctions between the two: 1. web developers tend to move in small groups or as individuals, whereas fe engineers work in larger groups, 2. web developers tend to design a product on top of an existing backend service (api, etc), while fe engineers are usually working in parallel with the backend being developed.
- Study: Twitter Audience Does Not Have A Return Policy – Over 60 percent of people who sign up to use the popular (and tremendously discussed) micro-blogging platform do not return to using it the following month, according to new data released by Nielsen Online. In other words, Twitter currently has just a 40 percent retention rate, up from just 30 percent in previous months–indicating an “I don’t get it factor” among new users that is reminiscent of the similarly-over hyped Second Life from a few years ago.
- Hey Americans, Appreciate Your Freedom Of Speech : NPR – Firoozeh Dumas on the underappreciated freedoms of speech and expression we have in the US vs journalists and bloggers in Iran.
These are my links for April 12th through April 13th:
- High Performance Web Sites :: don’t use @import – Summary – use LINK instead of @import for stylesheet references. "Using @import within a stylesheet adds one more roundtrip to the overall download time of the page. Using @import in IE causes the download order to be altered. This may cause stylesheets to take longer to download, which hinders progress rendering making the page feel slower."
- Learn Korean Language :The Official Korea Tourism Guide Site – Flash-based Korean language lessons, from KBS World Radio.
- Korea rate of obesity ranks lowest among OECD nations – INSIDE JoongAng Daily – Korea has lowest obesity rate among 30 OECD countries, at 3.5%, vs the US (#30) at 34.3%.
- FT.com / Weekend / Reportage – Is a high IQ a burden as much as a blessing? – “High cognitive ability is very often a mixed blessing,” Patrick O’Shea, the president of the International Society for Philosophical Enquiry (ISPE), told me. Too wide a deviation from the mean IQ of 100 brings with it an inherent isolation. “If you have an IQ of 160 or higher,” O’Shea explained, “you’re probably able to connect well with less than 1 per cent of the population.”
These are my links for April 9th from 08:07 to 17:53:
- IP address geolocation SQL database – IP address geolocation with MySQL by Marc-Andre Caron. He's done all the necessary legwork to solve this problem, putting together a free, monthly-updated MySQL dataset that will allow you to derive country, region, city, zip, latitude, and longitude from an IP address.
- Del.icio.us Finally Gets Some Respect from Yahoo – Probably Too Late – ReadWriteWeb
- In the Event That You Have Accidentally Swallowed the Higgs Boson by Michael Rottman – The Morning News – "7. Do you feel protons decaying? Grand Unification may be occurring near your vital organs. "
- FT.com / Companies / UK companies – Dotcom veterans in Twitter ‘brains trust’ – "Mr Read has brought together a “brains trust” of advisers to Twitter Partners, including Brent Hoberman and Martha Lane Fox, founders of Lastminute.com; Saul Klein, a partner at Index Ventures, the London venture capitalists; and Toby Coppel, the former European vice-president at Yahoo."
- byteonic.com » What you cannot do using Java in Google App Engine – List of some restrictions on Java code running on GAE
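
The IP geolocation link above is about mapping an IP address to a location through a table of address ranges. The usual trick is to convert the dotted address to an integer and find the range that contains it. Here is a small sketch of that lookup in Python. The schema of the linked MySQL dataset is not reproduced here: the rows below are made up, using reserved documentation address ranges and placeholder labels, and `lookup` is a hypothetical name.

```python
import bisect
import ipaddress
from typing import Optional


def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


# Hypothetical rows: (range_start, range_end, label), sorted by range_start.
# The ranges are reserved documentation blocks, not real geolocation data.
RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "example-region-a"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "example-region-b"),
    (ip_to_int("203.0.113.0"), ip_to_int("203.0.113.255"), "example-region-c"),
]
STARTS = [row[0] for row in RANGES]


def lookup(ip: str) -> Optional[str]:
    """Return the label of the range containing ip, or None if no range matches."""
    value = ip_to_int(ip)
    # Find the last range whose start is <= value, then check its end.
    idx = bisect.bisect_right(STARTS, value) - 1
    if idx < 0:
        return None
    start, end, label = RANGES[idx]
    return label if start <= value <= end else None


if __name__ == "__main__":
    print(lookup("198.51.100.42"))
    print(lookup("10.0.0.1"))
```

In a database the same idea would typically be a range query against integer start and end columns, with an index on the start column.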
These are my links for March 16th through April 2nd:
- Google uncloaks once-secret server | Business Tech – CNET News – Photo and more comments on the Google data center server configuration (12V DC only, local battery) shown at yesterday's data center power conference.
- Google’s Custom Web Server, Revealed « Data Center Knowledge – 1:30 video of current server configuration, from Google Data Center Energy Summit, April 1, 2009. Open shelf, power supply with built in battery (per-unit UPS) rather than centralized UPS.
- HerHotSpot Uses Facebook Connect to Block Boys Out – Relies on Facebook profile data to limit boys' access to a site targeting girls only. Uses FBConnect as the exclusive login method.
- SandHill.com | Opinion : Cloud Computing Ecosystem Map v1.0: Standing on the Shoulders of Giants – Collection of pointers to maps of the cloud computing ecosystem, and a merged map, as of March 2009
- Penny Arcade! – Le Twittre
These are my links for March 9th through March 12th:
- Google Friend Connect APIs – Google Code
- Geek And Poke – Mostly Twitter and cloud computing themed cartoons.
- Official Google Blog: Here comes Google Voice – GrandCentral makes a comeback, after disappearing into Google a while back. Now with voice transcription, SMS folders, and integration with GMail address book.
- Amazon Web Services Blog: Announcing Amazon EC2 Reserved Instances – AWS introduces pricing structure for longer term, reserved capacity. Upfront payment, plus a (lower) incremental hourly charge, net savings for continuous 24×7 clients, and guaranteed availability of instances for backup or surge capacity.
- How To Monetize a Social Network: MySpace and Facebook Should Follow TenCent « abovethecrowd.com – Bill Gurley on the case for virtual goods and casual gaming as revenue vehicles on US-based social networking sites, in a look at China-based QQ / TenCent.
- Too Big Has Failed – Thomas Hoenig, Kansas City Federal Reserve Bank, March 6, 2009 (PDF) – Hoenig argues that too-big-to-fail institutions have failed, US banks will require some form of nationalization eventually.
These are my links for March 2nd from 10:48 to 21:40:
These are my links for February 27th through February 28th:
These are my links for February 26th from 10:39 to 20:05: