These are my links for January 23rd through January 30th:
- Leonardo da Vinci’s Resume Explains Why He’s The Renaissance Man For the Job – Davinci – Gizmodo – At one time in history, even da Vinci himself had to pen a resume to explain why he was a qualified applicant. Here's a translation of his letter to the Duke of Milan, delineating his many talents and abilities. "Most Illustrious Lord, Having now sufficiently considered the specimens of all those who proclaim themselves skilled contrivers of instruments of war, and that the invention and operation of the said instruments are nothing different from those in common use: I shall endeavor, without prejudice to any one else, to explain myself to your Excellency, showing your Lordship my secret, and then offering them to your best pleasure and approbation to work with effect at opportune moments on all those things which, in part, shall be briefly noted below." The document, written when da Vinci was 30, is actually more of a cover letter than a resume; he leaves out many of his artistic achievements and instead focuses on what he can provide for the Duke in technologies of war.
- jsMath: jsMath Home Page – The jsMath package provides a method of including mathematics in HTML pages that works across multiple browsers under Windows, Macintosh OS X, Linux and other flavors of unix. It overcomes a number of the shortcomings of the traditional method of using images to represent mathematics: jsMath uses native fonts, so they resize when you change the size of the text in your browser, they print at the full resolution of your printer, and you don't have to wait for dozens of images to be downloaded in order to see the mathematics in a web page. There are also advantages for web-page authors, as there is no need to preprocess your web pages to generate any images, and the mathematics is entered in TeX form, so it is easy to create and maintain your web pages. Although it works best with the TeX fonts installed, jsMath will fall back on a collection of image-based fonts (which can still be scaled or printed at high resolution) or unicode fonts when the TeX fonts are not available.
- Josh on the Web » Blog Archive » Abusing the Cache: Tracking Users without Cookies – To track a user I make use of three URLs: the container, which can be any website; a shim file, which contains a unique code; and a tracking page, which stores (and in this case displays) requests. The trick lies in making the browser cache the shim file indefinitely. When the file is requested for the first – and only – time a unique identifier is embedded in the page. The shim embeds the tracking page, passing it the unique ID every time it is loaded. See the source code.
- Panopticlick – Your browser fingerprint appears to be unique among the 342,943 tested so far.
Currently, we estimate that your browser has a fingerprint that conveys at least 18.39 bits of identifying information.
The measurements we used to obtain this result are listed below. You can read more about the methodology here, and about some defenses against fingerprinting here.
- Benlog » Don’t Hash Secrets – If I tell you that SHA1(foo) is X, it turns out in a lot of cases to be quite easy for you to determine what SHA1(foo || bar) is. You don’t need to know what foo is: because SHA1 is iterative and works block by block, if you know the hash of foo, you can extend the computation to determine the hash of foo || bar.
That means that if you know SHA1(secret || message), you can compute SHA1(secret || message || ANYTHING), which is a valid signature for message || ANYTHING. So to break this system, you just need to see one signature from SuperAnnoyingPoke, then you can impersonate SuperAnnoyingPoke for lots of other messages.
What you should be using is HMAC: Hash-function Message Authentication Code. You don’t need to know exactly how it works; just know that HMAC is specifically built for message authentication codes and the use case of SuperAnnoyingPoke/MyFace. Under the hood, what’s approximately going on is two hashes, with the secret combined after the first hash.
- Data.gov – Featured Datasets: Open Government Directive Agency – Datasets required under the Open Government Directive through the end of the day, January 22, 2010. Freedom of Information Act request logs, Treasury TARP and derivative activity logs, crime, income, agriculture datasets.
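The length-extension problem from the Don’t Hash Secrets link above is easy to see in code. Here’s a minimal Python sketch (the secret and message values are made-up placeholders) contrasting the vulnerable construction with the stdlib hmac module:

```python
import hashlib
import hmac

secret = b"shared-secret"             # made-up placeholder values
message = b"user=alice&action=poke"

# Vulnerable: SHA1(secret || message). Because SHA1 is iterative and
# works block by block, anyone who sees this tag can extend it to a
# valid tag for message || padding || ANYTHING without knowing secret.
naive_tag = hashlib.sha1(secret + message).hexdigest()

# Safe: HMAC nests two keyed hashes, which defeats length extension.
hmac_tag = hmac.new(secret, message, hashlib.sha1).hexdigest()

def verify(msg, tag):
    """Recompute the HMAC and compare in constant time."""
    expected = hmac.new(secret, msg, hashlib.sha1).hexdigest()
    return hmac.compare_digest(expected, tag)
```

Note that hmac.compare_digest is used instead of == so the comparison doesn’t leak timing information.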
These are my links for January 17th through January 20th:
- PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.
- Twitter.com vs The Twitter Ecosystem – Fred Wilson comments on some data from John Borthwick indicating Twitter ecosystem use = 3-5x Twitter.com directly.
"John's chart estimates that Twitter.com is about 20mm uvs a month in the US (comScore has it at 60mm uvs worldwide) and the Twitter ecosystem at about 60mm uvs in the US.
That says that across all web services, not just AVC, the Twitter ecosystem is about 3x Twitter.com. And on this blog, whose audience is certainly power users, that ratio is 5x."
- Chris Walshaw :: Research :: Partition Archive – Welcome to the University of Greenwich Graph Partitioning Archive. The archive consists of the best partitions found to date for a range of graphs and its aim is to provide a benchmark, against which partitioning algorithms can be tested, and a resource for experimentation.
The partition archive has been in operation since the year 2000 and includes results from most of the major graph partitioning software packages. Researchers developing experimental partitioning algorithms regularly submit new partitions for possible inclusion.
Most of the test graphs arise from typical partitioning applications, although the archive also includes results computed for a graph-colouring test suite [Wal04] contained in a separate annex.
The archive was originally set up as part of a research project into very high quality partitions and authors wishing to refer to the partitioning archive should cite the paper [SWC04].
- Twitter’s Crawl « The Product Guy – "A list of incidents that affected the Page Load Time of the Twitter product, distinguishing between total downtime, partial downtime, and information inaccessibility, based upon the public posts on Twitter's blog.
I did my best not to double count any problems, but it was difficult since many of the problems occur so frequently, and it is often hard to distinguish, from these status blog posts alone, between a persisting problem being experienced or fixed and a new emergence of a similar or the same problem. Furthermore, I also excluded the impact on Page Load Time arising from scheduled maintenance/downtime – periods of time over which user expectation would be most aligned with the product’s promise of Page Load Time."
- Soundboard.com – Soundboard.com is the web's largest catalog of free sounds and soundboards – in over 20 categories, for mobile or PC. 252,858 free sounds on 17,171 soundboards from movies to sports, sound effects, television, celebrities, history and travel. Or build, customize, embed and manage your own.
These are my links for May 30th through May 31st:
- Scaling Twitter: Making Twitter 10000 Percent Faster | High Scalability – Collection of links to presentations and interviews regarding Twitter's architecture, implementation plans, and performance issues, from spring 2009.
- The Last Psychiatrist: The Difference Between An Amateur, A Scientist, And A Genius – An amateur is full of wonder and speculation, tinkering towards the truth but suffering from a lack of knowledge and idleness; he's not even sure if someone else has already made these discoveries. "Is this a worthwhile pursuit?"
A scientist performs experiments to confirm or disprove a hypothesis, and in that way he grinds out the truth.
A genius has three abilities, which are actually the union of amateur and scientist: 1. to know the state of the art, what is known and what is not known. 2. To be able to think "out of the box". 3. To be disciplined enough to concentrate on the tedium of a formal investigation of his wondrous speculations.
- PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing – Research paper on sort of "super healing brush" for manipulating digital images, allows splicing together different sections of the image and automatically selecting similar textures to make the seam transitions work better.
- Light Blue Touchpaper » Blog Archive » Attack of the Zombie Photos – Social networking and sharing sites have challenges implementing and managing access control policies at large scale, and content delivery networks add another wrinkle.
- Map of all Google data center locations | Royal Pingdom – Where in the world is your search being served from? An attempt to assemble a list of known Google data centers worldwide.
These are my links for May 24th through May 27th:
- Formulas and game mechanics – WoWWiki – Your guide to the World of Warcraft – Formulas and game-mechanics rules and guidelines for role-playing games
- Manchester United’s Park Has the Endurance to Persevere – NYTimes.com – Korean soccer player Park Ji-Sung – On Wednesday night in Rome, Park is expected to become the first Asian player to participate in the European Champions League final when Manchester United faces Barcelona.
- mloss.org – Machine Learning Open Source Software – Big collection of open source packages for machine learning, data mining, statistical analysis
- The Datacenter as Computer – Luiz André Barroso and Urs Hölzle 2009 (PDF) – 120 pages on large scale computing lessons from Google. "These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base."
- Geeking with Greg: The datacenter is the new mainframe – Pointer to a paper by Googlers Luiz Andre Barroso and Urs Holzle on the evolution of warehouse scale computing and the management and use of computing resources in a contemporary datacenter.
These are my links for May 14th through May 15th:
- Congratulations, Google staff: $210k in profit per head in 2008 | Royal Pingdom – Google had $209,624 in profit per employee in 2008, which beats all the other large tech companies we looked at, including big hitters like Microsoft ($194K), Apple ($151K), Intel ($64K) and IBM ($30K).
- Statistical Data Mining Tutorials – A nice collection of presentations reviewing topics in data mining and machine learning. e.g. "HillClimbing, Simulated Annealing and Genetic Algorithms. Some very useful algorithms, to be used only in case of emergency." These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.
- Dare Obasanjo aka Carnage4Life – Why Twitter’s Engineers Hate the @replies feature – Looking at the infrastructure overhead required for Twitter's attempted change to @reply behavior.
- Scratch Helps Kids Get With the Program – Gadgetwise Blog – NYTimes.com – On my candidate list for 7th grade introductory programming and analysis. "Scratch, an M.I.T.-developed computer-programming language for children, is the focus of worldwide show-and-tell sessions this Saturday."
These are my links for April 20th through April 23rd:
- What I’ve Learned from Hacker News – Paul Graham on social dynamics and managing Hacker News: user-submitted comments and ranking (voting up/down), editorial intervention and moderators, project goals.
- SEOmoz | Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! – Looking at variations on algorithms for ranking items on social news aggregators
- NGINX + PHP-FPM + APC = Awesome – Walkthrough on setting up cached PHP web server on nginx with apc.
- Particletree » PHP Quick Profiler – Lightweight tool for profiling PHP code.
- MySQL’s Full-Text Formulas – Database Journal
- http://www.acapela-group.com/text-to-speech-interactive-demo.html – Online text-to-speech demo, with various male and female speakers, plus a few translations.
- Dealing with Duplicate Person Data – Proud to Use Perl – Classifying likely duplicate entries in name/address contact data using Levenshtein distance and tables of nickname synonym and assigned distance weights.
- Web Security Horror Stories: The Director’s Cut at <head> – Presentation slides from a talk by Simon Willison on cross site scripting, SQL injection, referer forgery, and clickjacking attacks on web applications.
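The duplicate-detection approach in the Proud to Use Perl link above combines edit distance with a nickname table. A rough Python sketch of the idea — the nickname table and threshold below are illustrative placeholders, not the article’s actual data or weights:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical nickname-synonym table, standing in for the article's
# much larger lookup of common name variants.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}

def likely_duplicate(name1, name2, threshold=2):
    """Canonicalize nicknames, then flag near-identical names."""
    n1 = NICKNAMES.get(name1.lower(), name1.lower())
    n2 = NICKNAMES.get(name2.lower(), name2.lower())
    return levenshtein(n1, n2) <= threshold
```

In a real pipeline you’d score each field (name, street, city) separately and combine the weighted distances, as the article describes.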
These are my links for April 18th through April 19th:
- Why Programmers Suck at CSS Design – Stefano’s Linotype – A practical approach to CSS for non-designers (programmers).
- The Art & Science of Seductive Interactions – Presentation slides on improving application user experience by making apps more game-like (points, levels, scarcity), adding social interaction, and other ideas.
- Stephen Marsland – Python code from "Machine Learning: An Algorithmic Perspective", assorted clustering and estimation algorithms.
- Firediff – In Case of Stairs – Firediff implements a change monitor that records all of the changes made by Firebug and the application itself to CSS and the DOM. This provides insight into the functionality of the application, as well as a record of the changes that were required to debug and tweak the page’s display.
- Crowdsourcing the semantic web | lexanderA – "Currently, all attempts at providing semantic metadata require server-side changes which means that we need to rely on page authors to implement them. This, of course, is a major obstacle. But what if we could change that? What if we could bypass page authors and have the crowd add semantic metadata to existing pages?"
- Just How Important is the Valley? Let’s Look at some Data. – Tony Wright dot com – Is the Silicon Valley entrepreneurship model specific to SV? List of acquisitions in 2007 and 2008.
These are my links for April 9th through April 10th:
These are my links for February 23rd through February 24th:
Google Research is publishing a huge n-gram dataset distilled from trillions of words perused by Google’s vast search spidering effort:
We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
This looks like just the thing for developing some interesting predictive text applications, or just random data mining. The 6-DVD set will be distributed by the Linguistic Data Consortium, which collects and distributes interesting speech and text databases and training sets. Some other items in their collection include transcribed speech from 3000 speakers, a mapping between Chinese and English place, organization, and corporate names, and a transcription of colloquial Levantine Arabic speech.
Update Sunday 08-06-2006 16:41 PDT: See also AOL Research publishes 20 million search queries
The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.
Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)
This book could serve as an excellent introduction to search algorithms for someone with a programming or mathematics background, covering PageRank at length, along with some discussion of HITS, SALSA, and antispam approaches. Some current topics, such as clustering, personalization, and reputation (TrustRank/SpamRank) are not covered here, although they are mentioned briefly. The bibliography and web resources provide a comprehensive source list for further research (up through around 2004), which will help point motivated readers in the right direction. I’m sure it will be popular at Google and Yahoo, and perhaps at various SEO agencies as well.
Those with less interest in the innards of search technology may enjoy a more casual summer read about Google: try John Battelle’s The Search. Or get Langville and Meyer’s book, skip the math, and just read the sidebars.
See also: A Reading List on PageRank and Search Algorithms, my del.icio.us links on search algorithms
I sometimes click on a random “deep” search result page to see if anything interesting turns up, because of the limitations of popularity and PageRank for some queries.
Paul Kedrosky points at a recent paper from CMU which suggests randomly mixing in some low ranking pages may improve search results over time.
Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have few visits and/or in-links. Worse, the process by which new, high-quality pages accumulate popularity is actually inhibited by search engines. Since search engines dole out a limited number of clicks per unit time among a large number of pages, always listing highly popular pages at the top, and because users usually focus their attention on the top few results, newly-created but high-quality pages are “shut out.”
We propose a simple and elegant solution to this problem: the introduction of a controlled amount of randomness into search result ranking methods. Doing so offers new pages a chance to prove their worth, although clearly using too much randomness will degrade result quality and annul any benefits achieved. Hence there is a tradeoff between exploration to estimate the quality of new pages and exploitation of pages already known to be of high quality. We study this tradeoff both analytically and via simulation, in the context of an economic objective function based on aggregate result quality amortized over time. We show that a modest amount of randomness leads to improved search results.
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
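The exploration/exploitation tradeoff the paper describes can be illustrated with a simple epsilon-style shuffle. This is my own toy sketch of “a controlled amount of randomness,” not the authors’ actual ranking method:

```python
import random

def randomized_ranking(scored_pages, epsilon=0.1, rng=random):
    """Rank (page, score) pairs mostly by score, but with probability
    epsilon promote one low-ranked page toward the top, giving new
    pages a chance to accumulate clicks and prove their worth."""
    ranked = sorted(scored_pages, key=lambda p: p[1], reverse=True)
    if ranked and rng.random() < epsilon:
        # Pull one page from the bottom half into the top half.
        idx = rng.randrange(len(ranked) // 2, len(ranked))
        page = ranked.pop(idx)
        ranked.insert(rng.randrange(0, max(1, len(ranked) // 2)), page)
    return ranked
```

With epsilon=0 this degrades to plain score ordering; tuning epsilon trades off result quality today against better quality estimates for new pages over time, which is exactly the objective the paper analyzes.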
P.R.A.S.E., aka “Prase” is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.
As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.
PageRank is only used by Google, which no longer uses the original PageRank algorithm for ranking results, but it’s still interesting to see the top search results from the three major search engines laid out with PR scores to get some sense of the page linkage.
I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.
It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have an unrealistic importance. The author then goes on to explain a new algorithm, with the same complexity as the original PageRank algorithm, that solves this problem.
After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.
This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.
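For concreteness, here’s a toy power-iteration PageRank showing the conventional dangling-node fix (redistributing sink-page rank uniformly across the graph). This is the standard textbook workaround, not Tal-Ezer’s proposed algorithm, and the three-page graph is made up:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration over an adjacency dict {page: [outlinks]}.
    Assumes every link target is also a key in `links`. Sink pages
    (no outlinks) spread their rank uniformly, the usual fix for
    the dangling-node problem."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by sinks, to be redistributed uniformly.
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new = {p: (1 - damping) / n + damping * sink_mass / n
               for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += damping * rank[p] / len(links[p])
        rank = new
    return rank
```

Drop the sink_mass term and a sink page quietly leaks rank out of the system each iteration, which is the family of anomaly the paper is poking at: pages not strongly connected to the main graph distort the scores.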
See also: A reading list on PageRank and Search Algorithms
Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list and is looking into some of the same areas I’ve been thinking about.
On the probabilities of transitioning across a link in the link graph, the paper’s example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead, including one derived from “web usage logs”.
Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.
In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
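Here’s a small sketch of what a usage-derived personalization vector looks like in a toy power iteration. The teleport distribution below is made up, standing in for jump probabilities estimated from actual usage logs:

```python
def personalized_pagerank(links, teleport, damping=0.85, iters=50):
    """Like standard PageRank over {page: [outlinks]}, but random
    jumps land according to `teleport`, a probability distribution
    over pages (e.g. derived from usage data), instead of the
    uniform vector. Assumes link targets are keys in `links`."""
    pages = list(links)
    rank = {p: teleport[p] for p in pages}
    for _ in range(iters):
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new = {p: (1 - damping) * teleport[p]
                  + damping * sink_mass * teleport[p]
               for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += damping * rank[p] / len(links[p])
        rank = new
    return rank
```

Setting teleport back to the uniform vector recovers ordinary PageRank; skewing it toward heavily visited pages is exactly the “best estimate of searcher interest” reading of the paper described above.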
1. The goal of the search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition and personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is it creates a fast positive feedback loop which may only reflect what’s well publicized rather than relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example is that the Langville and Meyer paper is quite good, but references to it are buried deep in the search results page for “PageRank”. So…I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might also give useful results if it were applied on a significantly reduced set of relevant pages.
2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.
3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.
4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.
If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.
- Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380, Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
- Online Reputation Systems: The Cost of Attack of PageRank (PDF) – Andrew Clausen. A detailed look at the value and costs of reputation, and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
- SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF) – András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized PageRank and the local PageRank distribution of linking sites.
- Detecting Duplicate and near duplicate files – William Pugh presentation slides on US patent 6,658,423 (assigned to Google) for an approach using shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google and he doesn’t know if this particular method is being used in production (vs some other method).
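The shingling idea from the Pugh slides is simple to sketch: slide a k-word window over the text and compare the resulting sets. A toy Python version using word-level shingles and plain Jaccard resemblance — the patented method surely differs in detail (hashing, sampling, window definition):

```python
def shingles(text, k=4):
    """Set of k-word sliding-window fragments from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k])
            for i in range(max(0, len(words) - k + 1))}

def resemblance(a, b, k=4):
    """Jaccard similarity of the two documents' shingle sets:
    identical texts score 1.0, unrelated texts near 0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

In practice you’d hash each shingle and keep only a small sample (a “sketch”) per document so billions of pages can be compared, but the resemblance measure is the core of it.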
I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.