These are my links for February 4th through February 11th:
- Schneier on Security: Interview with a Nigerian Internet Scammer – "We had something called the recovery approach. A few months after the original scam, we would approach the victim again, this time pretending to be from the FBI, or the Nigerian Authorities. The email would tell the victim that we had caught a scammer and had found all of the details of the original scam, and that the money could be recovered. Of course there would be fees involved as well. Victims would often pay up again to try and get their money back."
- xkcd – Frequency of Strip Versions of Various Games – n = Google hits for "strip <game name>" / Google hits for "<game name>"
- PeteSearch: How to split up the US – Visualization of social network clusters in the US. "information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them.
Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South."
- Redis: Lightweight key/value Store That Goes the Extra Mile | Linux Magazine – Sort of like memcache. "Calling redis a key/value store doesn’t quite due it justice. It’s better thought of as a “data structures” server that supports several native data types and operations on them. That’s pretty much how creator Salvatore Sanfilippo (known as antirez) describes it in the documentation. Let’s dig in and see how it works."
- Op-Ed Contributor – Microsoft’s Creative Destruction – NYTimes.com – Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.
These are my links for June 9th through June 10th:
- Announcing the Yahoo! Distribution of Hadoop (Hadoop and Distributed Computing at Yahoo!) – Yahoo releases its internal version of Hadoop, a source-only distribution of Apache Hadoop tested and used in production at Yahoo.
- Google Fusion Tables FAQ – Sort of like extra-large Google Docs spreadsheets, up to 100MB per table, 250MB per user. One interesting wrinkle is that it doesn't actually delete your dataset when you "delete" it, so the data is still available for derived tables that other users have built.
- Filesystem Performance from a Database Perspective – Presentation on performance benchmarks on linux filesystems (ext2, ext3, reiserfs, xfs, etc)
- What Assumptions Make: Filesystem I/O from a database perspective – Slide presentation comparing linux file system performance across various formats (ext2, ext3, etc), RAID configurations, readahead buffer sizes
- MySQL – Common Queries Tree – A collection of common queries implemented in MySQL
These are my links for May 30th through May 31st:
- Scaling Twitter: Making Twitter 10000 Percent Faster | High Scalability – Collection of links to presentations and interviews regarding Twitter's architecture, implementation plans, and performance issues, from spring 2009.
- The Last Psychiatrist: The Difference Between An Amateur, A Scientist, And A Genius – An amateur is full of wonder and speculation, tinkering towards the truth but suffering from a lack of knowledge and idleness; he's not even sure if someone else has already made these discoveries. "Is this a worthwhile pursuit?"
A scientist performs experiments to confirm or disprove a hypothesis, and in that way he grinds out the truth.
A genius has three abilities, which are actually the union of amateur and scientist: 1. to know the state of the art, what is known and what is not known. 2. To be able to think "out of the box". 3. To be disciplined enough to concentrate on the tedium of a formal investigation of his wondrous speculations.
- PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing – Research paper on sort of "super healing brush" for manipulating digital images, allows splicing together different sections of the image and automatically selecting similar textures to make the seam transitions work better.
- Light Blue Touchpaper » Blog Archive » Attack of the Zombie Photos – Social networking and sharing sites have challenges implementing and managing access control policies at large scale, and content delivery networks add another wrinkle.
- Map of all Google data center locations | Royal Pingdom – Where in the world is your search being served from? An attempt to assemble a list of known Google data centers worldwide.
These are my links for May 24th through May 27th:
- Formulas and game mechanics – WoWWiki – Your guide to the World of Warcraft – Formulas and game mechanics rules and guidelines for developing role playing games
- Manchester United’s Park Has the Endurance to Persevere – NYTimes.com – Korean soccer player Park Ji-Sung – On Wednesday night in Rome, Park is expected to become the first Asian player to participate in the European Champions League final when Manchester United faces Barcelona.
- mloss.org – Machine Learning Open Source Software – Big collection of open source packages for machine learning, data mining, statistical analysis
- The Datacenter as Computer – Luiz André Barroso and Urs Hölzle 2009 (PDF) – 120 pages on large scale computing lessons from Google. "These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base."
- Geeking with Greg: The datacenter is the new mainframe – Pointer to a paper by Googlers Luiz Andre Barroso and Urs Holzle on the evolution of warehouse scale computing and the management and use of computing resources in a contemporary datacenter.
These are my links for April 20th through April 23rd:
- What I’ve Learned from Hacker News – Paul Graham on social dynamics and managing Hacker News, user submitted comments and ranking (voting up/down) , editorial intervention and moderators, project goals.
- SEOmoz | Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! – Looking at variations on algorithms for ranking items on social news aggregators
- NGINX + PHP-FPM + APC = Awesome – Walkthrough on setting up cached PHP web server on nginx with apc.
- Particletree » PHP Quick Profiler – Lightweight tool for profiling PHP code.
- MySQL’s Full-Text Formulas – Database Journal –
- http://www.acapela-group.com/text-to-speech-interactive-demo.html – Online text-to-speech demo, with various male and female speakers, plus a few translations.
- Dealing with Duplicate Person Data – Proud to Use Perl – Classifying likely duplicate entries in name/address contact data using Levenshtein distance and tables of nickname synonym and assigned distance weights.
- Web Security Horror Stories: The Director’s Cut at <head> – Presentation slides from a talk by Simon Willison on cross site scripting, SQL injection, referer forgery, and clickjacking attacks on web applications.
These are my links for April 13th through April 15th:
These are my links for April 12th from 17:02 to 19:13:
These are my links for April 11th through April 12th:
- Wordle – Beautiful Word Clouds – Wordle is a toy for generating “word clouds” from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes.
- The dark side of Dubai – Johann Hari, Commentators – The Independent – "Dubai was meant to be a Middle-Eastern Shangri-La, a glittering monument to Arab enterprise and western capitalism. But as hard times arrive in the city state that rose from the desert sands, an uglier story is emerging."
- Topless Robot – Hot Girls Have Lightsaber Strip-Fight for Your Viewing Pleasure – Star Wars CGI meets fake body spray ad
- Poll Result: Best VPN to leap China’s Great Firewall? – Thomas Crampton – - Witopia – Undisputed winner. Quality of service, speed of surfing, though it is said to be relatively expensive at US$50 to US$60 per year. Hotspot Shield – Bandwidth limits can be painful. Force you to wait until the next month if you use it too much. – Ultrasurf – StrongVPN
- InfoQ: Facebook: Science and the Social Graph – In this presentation filmed during QCon SF 2008 (November 2008), Aditya Agarwal discusses Facebook’s architecture, more exactly the software stack used, presenting the advantages and disadvantages of its major components: LAMP (PHP, MySQL), Memcache, Thrift, Scribe.
- The Running Man, Revisited § SEEDMAGAZINE.COM – a handful of scientists think that these ultra-marathoners are using their bodies just as our hominid forbears once did, a theory known as the endurance running hypothesis (ER). ER proponents believe that being able to run for extended lengths of time is an adapted trait, most likely for obtaining food, and was the catalyst that forced Homo erectus to evolve from its apelike ancestors.
These are my links for April 7th through April 9th:
These are my links for March 16th through April 2nd:
- Google uncloaks once-secret server | Business Tech – CNET News – Photo and more comments on the Google data center server configuration, 12vdc only, local battery, shown at yesterday's data center power conference.
- Google’s Custom Web Server, Revealed « Data Center Knowledge – 1:30 video of current server configuration, from Google Data Center Energy Summit, April 1, 2009. Open shelf, power supply with built in battery (per-unit UPS) rather than centralized UPS.
- HerHotSpot Uses Facebook Connect to Block Boys Out – Relies on Facebook profile data to limit boys access to site targeting girls only. Uses FBConnect as the exclusive login method.
- SandHill.com | Opinion : Cloud Computing Ecosystem Map v1.0: Standing on the Shoulders of Giants – Collection of pointers to maps of the cloud computing ecosystem, and a merged map, as of March 2009
- Penny Arcade! – Le Twittre –
These are my links for February 27th through February 28th:
These are my links for February 26th through February 27th:
These are my links for February 25th through February 26th:
These are my links for February 24th through February 25th:
- The C10K problem – On techniques for scaling to large number of network clients (e.g. >10000).
- Yodel Anecdotal » Blog Archive » Hello, (twitter) world – List of official Yahoo twitter handles for various activities including research, geo, search, and yui.
- New AWS Public Data Sets – Economics, DBpedia, Freebase, and Wikipedia – AWS adds Freebase, DBPedia, Wikipedia extract, and US Transportation data sets.
- eigenclass – Related document discovery, without algebra – Another approach to simple related document discovery, based on tags, should work ok for small data sets.
- SVD Recommendation System in Ruby – igvita.com – A 50 line SVD recommendation / collaborative filtering system for a Rails app. with the help of some simple linear algebra.
These are my links for February 21st from 13:59 to 21:55:
- Non Sequitur — Gocomics.com – "Hi. My name is Bob, and I'm a Twitter addict…"
- A Tutorial on Support Vector Machines for Pattern Recognition – Christopher J.C. Burges (PDF) – Appeared in: Data Mining and Knowledge Discovery 2, 121-167, 1998. The tutorial starts with an overview of the concepts of VC dimension and structural risk
minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable
data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss
when SVM solutions are unique and when they are global. We describe how support vector training can
be practically implemented, and discuss in detail the kernel mapping technique which is used to construct
SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large
(even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian
radial basis function kernels. While very high VC dimension would normally bode ill for generalization
performance, there are several arguments which support the observed high accuracy of SVMs,
which we review.
- Data Mining Research – dataminingblog.com: Data Miners on Twitter – A list of data mining people on twitter.
- YouTube – The Crisis of Credit Visualized – Part 1 – Nice animated video attempting to present a simplified explanation of the credit crisis and the relationship between home mortgage lending, bank leverage, and risk.
- “10 Obstacles to Cloud Computing” by UC Berkeley & How GoGrid Hurdles Them | GoGrid Blog – Another commentary on the recent UCB cloud computing overview paper
These are my links for February 20th through February 21st:
- xkcd – A Webcomic – Online Communities – A map of online communities (circa 2007?)
- State of OpenSocial – weekend Apps Feb 20 2009 – Google Docs – Kevin Marks overview of OpenSocial as of February 2009.
- Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org – Sample dataset for research on social graphs. "The infochimps have gathered a massive scrape of the Twitter friend graph. Right now it weighs in at about 2.7M users, 10M tweets, 58M edges."
- getting theinfo: data sets (theinfo) – Another list of publicly accessible data collections online
- Some Datasets Available on the Web » Data Wrangling Blog – List of many research datasets and resources related to data analysis available online, last updated February 2009.
- ICWSM 2009 – International AAAI Conference on Weblogs and Social Media – May 17 – 20, 2009, San Jose, California. This interdisciplinary conference brings together researchers and industry leaders interested in creating and analyzing social media. Past conferences have included technical papers from areas such as computer science, linguistics, psychology, statistics, sociology, multimedia and semantic web technologies.
These are my links for February 18th through February 19th:
- Single Google Query uses 1000 Machines in 0.2 seconds – Google Fellow Jeff Dean says from 1999-2009, while both search queries and processing power have gone up by a factor of 1000, latency has gone down from around 1000ms to 200ms. Crawler updates now take minutes compared to months in 1999. 1000 machines handle a single query, all in memory.
- Government 2.0: Tweeting the Talk, Walking the Walk « Adriel Hampton – List of twitter users in various government organizations.
- The Absurdly Artificial Divide Between Pure and Applied Research – Olivia Judson – NYTimes.com – I used to explain myself as an "applied research" guy, small "r", not big "R" pure research. Love theory and analysis but want to see it get used for something eventually.
- Amazon Web Services Developer Community : Load data into S3 via hard drives? – Amazon asks for feedback regarding the FedEx option for bulk data transfer. "We have heard a number of requests about sending hard drives to AWS to load into S3. If such a service would benefit your business, we’d like to learn more about your use case."
- Local Media in a Postmodern World, Part XCI, Advertising Loses Its Balance – On the shifts in supply and demand, buyers and sellers in advertising markets as media moves from 1-to-many to niche-oriented, many-to-many and sellers take control of their own online media and advertising campaigns
These are my links for February 15th through February 16th:
- Berkeley cloud report gets mixed reviews | The Wisdom of Clouds – CNET News – James Urqhardt commentary on UCB paper, "The paper begins by setting a definition of Cloud Computing that will be considered controversial by many, as it is firmly in the "there is no cloud computing inside enterprise data centers" camp."
- Above the Clouds: Above the Clouds Released – UC Berkeley RAD Lab starts a new blog and publishes their take on the state of cloud computing.
- Forget Dunbar’s Number, Our Future Is in Scoble’s Number « I’m Not Actually a Geek – A look at changing interaction styles enabled by growing use of online social networks and applications. "If Dunbar’s Number is defined at 150 connections, perhaps we can term the looser connection of thousands as Scoble’s Number. "
- What really happened at Ma.gnolia and lessons learned – Video podcast with Larry Halff describing how Ma.gnolia was implemented (Ruby on Rails), its ongoing operation leading up to the failure of the (1/2 TB) MySQL database a few weeks ago.
- Infrastructure for Modern Web Sites « random($foo) – An overview of packages, services, and approaches for building web systems, circa January 2009. With assorted comments.
- Online Mind Mapping – MindMeister – Web-based, embeddable mind mapping software, sort of like MindJet, wiki-style collaborative editing.
- Jean-Lou Dupont’s WEBlog: Cloud Computing Mind Map – A mind map of companies and projects in the cloud computing space.