Ho John Lee | February 11th, 2010 | Comments are closed
These are my links for February 4th through February 11th:
- Schneier on Security: Interview with a Nigerian Internet Scammer – "We had something called the recovery approach. A few months after the original scam, we would approach the victim again, this time pretending to be from the FBI, or the Nigerian Authorities. The email would tell the victim that we had caught a scammer and had found all of the details of the original scam, and that the money could be recovered. Of course there would be fees involved as well. Victims would often pay up again to try and get their money back."
- xkcd – Frequency of Strip Versions of Various Games – n = Google hits for "strip <game name>" / Google hits for "<game name>"
- PeteSearch: How to split up the US – Visualization of social network clusters in the US. "information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them.
Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South."
- Redis: Lightweight key/value Store That Goes the Extra Mile | Linux Magazine – Sort of like memcache. "Calling redis a key/value store doesn’t quite due it justice. It’s better thought of as a “data structures” server that supports several native data types and operations on them. That’s pretty much how creator Salvatore Sanfilippo (known as antirez) describes it in the documentation. Let’s dig in and see how it works."
- Op-Ed Contributor – Microsoft’s Creative Destruction – NYTimes.com – Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.
site admin | May 6th, 2009 | Comments are closed
These are my links for May 5th through May 6th:
- Coding Horror: I Just Logged In As You: How It Happened – On good password management, why forums should generally avoid storing user passwords, and how reusing a password across sites lets a breach at one site compromise accounts on others.
- Arc Forum | Arc – Arc is a version of Lisp. Among other things it is used to implement Hacker News.
- John Graham-Cumming: Can you trust Paul Graham with your password? – On best practices for storing password hashes to resist attacks on compromised password files, including rainbow tables, in a look at how Hacker News stores passwords.
- Deliberate Ambiguity: How *not* to rate a search engine – Search engines have very simple user interfaces, but are used in many different contexts, most of which don't resemble the way people often try out a new search engine.
- The Slow Erosion of Google Search – Bokardo – On changes in internet user behaviors over time, more social media (ask your Twitter friends) vs directed search (send a keyword query) etc.
- Brynn Marie Evans » Why social search won’t topple Google (anytime soon) – On differences between searching through social media such as Twitter, Facebook etc, vs Google etc.
- The Financial Services Club’s Blog: Stock picking with real-time news – Looking at real time social media trends for trading ideas.
- Lisp’s reputation is so bad that many people don’t even take a look at Lisp | International Lisp Conference 2009 – I haven't touched Lisp in years, except maybe for configuring emacs. A list of possible reasons why Lisp is not more widely used, e.g. "Lisp is old and moldy. It must be primitive by today's standards.", "The exciting languages to learn now are Python, Ruby, Groovy, etc."
- Peering into North Korea – The Big Picture – Boston.com – A collection of recent photos of scenes from North Korea.
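The password-storage items above come down to keeping a salted, deliberately slow hash instead of the password itself. A minimal sketch of that idea using only the Python standard library follows. The function names, the 16-byte salt, and the iteration count are illustrative assumptions, not details taken from the linked posts or from Hacker News.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Illustrative work factor (an assumption); tune it for your own hardware.
ITERATIONS = 100_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        # A fresh random salt per user defeats precomputed (rainbow) tables.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

Because each user gets a different salt, two users with the same password end up with different stored digests, so a leaked password file cannot be attacked with a single precomputed table.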
site admin | May 2nd, 2009 | Comments are closed
These are my links for April 30th through May 2nd:
- FusionCharts Free – Animated Flash Charts and Graphs for ASP, PHP, ASP.NET, JSP, RoR and other web applications – Flash charting component that can be used to render data-driven & animated charts for your web applications and presentations. It is a cross-browser and cross-platform solution that can be used with PHP, Python, Ruby on Rails, ASP, ASP.NET, JSP, ColdFusion, simple HTML pages or even PowerPoint Presentations to deliver interactive and powerful flash charts. You do NOT need to know anything about Flash to use FusionCharts. All you need to know is the language you're programming in.
- Raphaël—JavaScript Library – Raphaël is a small JavaScript library that should simplify your work with vector graphics on the web. If you want to create your own specific chart or image crop and rotate widget, for example, you can achieve it simply and easily with this library. Raphaël uses the SVG W3C Recommendation and VML as a base for creating graphics. This means every graphical object you create is also a DOM object, so you can attach JavaScript event handlers or modify them later. Raphaël’s goal is to provide an adapter that will make drawing vector art compatible cross-browser and easy.
- A Really Gentle Introduction to Data Mining | Regular Geek – List of data mining blogs and related resources.
- BlackBerry SSH Tutorial: Connect to Unix Server using MidpSSH for Mobile Devices – Notes on using MidpSSH on Blackberry for remote access to servers. Seems to work, although big network lag on my BlackBerry Bold / AT&T.
- Country Reports on Terrorism 2008 – U.S. law requires the Secretary of State to provide Congress, by April 30 of each year, a full and complete report on terrorism with regard to those countries and groups meeting criteria set forth in the legislation. This annual report is entitled Country Reports on Terrorism. Beginning with the report for 2004, it replaced the previously published Patterns of Global Terrorism.
- DIY: How To Find Authoritative Twitter Users Plus 100 To Get You Started | Ignite Social Media – Some comments on recommendation metrics for Twitter, trying to use "favorites" mark as an indicator.
- SIGUSR2 > The Power That is GNU Emacs – "If you've never been convinced before that Emacs is the text editor in which dreams are made from, or that inside Emacs there are unicorns manipulating your text, don't expect me to convince you."
site admin | April 23rd, 2009 | Comments are closed
These are my links for April 20th through April 23rd:
- What I’ve Learned from Hacker News – Paul Graham on social dynamics and managing Hacker News: user-submitted comments and ranking (voting up/down), editorial intervention and moderators, and project goals.
- SEOmoz | Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! – Looking at variations on algorithms for ranking items on social news aggregators
- NGINX + PHP-FPM + APC = Awesome – Walkthrough on setting up a cached PHP web server on nginx with APC.
- Particletree » PHP Quick Profiler – Lightweight tool for profiling PHP code.
- MySQL’s Full-Text Formulas – Database Journal
- http://www.acapela-group.com/text-to-speech-interactive-demo.html – Online text-to-speech demo, with various male and female speakers, plus a few translations.
- Dealing with Duplicate Person Data – Proud to Use Perl – Classifying likely duplicate entries in name/address contact data using Levenshtein distance and tables of nickname synonym and assigned distance weights.
- Web Security Horror Stories: The Director’s Cut at <head> – Presentation slides from a talk by Simon Willison on cross site scripting, SQL injection, referer forgery, and clickjacking attacks on web applications.
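The duplicate-person item above combines Levenshtein distance with nickname tables. A rough sketch of that combination follows. The nickname entries, the helper names, and the threshold of 2 are hypothetical, and the linked article's weighted distances are simplified here to a plain canonical-name lookup.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def canonical(name: str) -> str:
    """Lower-case the name and map a known nickname to its formal form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def likely_duplicates(first: str, second: str, threshold: int = 2) -> bool:
    """Flag two names as probable duplicates when their edit distance is small."""
    return levenshtein(canonical(first), canonical(second)) <= threshold
```

With this sketch, `likely_duplicates("Bill", "William")` is true because both names map to the same canonical form, while unrelated names stay well above the threshold.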
site admin | April 19th, 2009 | Comments are closed
These are my links for April 18th through April 19th:
- Why Programmers Suck at CSS Design – Stefano’s Linotype – A practical approach to CSS for non-designers (programmers).
- The Art & Science of Seductive Interactions – Presentation slides on improving application user experience by making them more game like (points, levels, scarcity), social interaction, and other ideas.
- Stephen Marsland – Python code from "Machine Learning: An Algorithmic Perspective", assorted clustering and estimation algorithms.
- Firediff – In Case of Stairs – Firediff implements a change monitor that records all of the changes made by Firebug and the application itself to CSS and the DOM. This provides insight into the functionality of the application and a record of the changes that were required to debug and tweak the page’s display.
- Crowdsourcing the semantic web | lexanderA – "Currently, all attempts at providing semantic metadata require server-side changes which means that we need to rely on page authors to implement them. This, of course, is a major obstacle. But what if we could change that? What if we could bypass page authors and have the crowd add semantic metadata to existing pages?"
- Just How Important is the Valley? Let’s Look at some Data. – Tony Wright dot com – Is the Silicon Valley entrepreneurship model specific to Silicon Valley? Includes a list of acquisitions in 2007 and 2008.
site admin | April 17th, 2009 | Comments are closed
These are my links for April 15th through April 17th:
- Paul Buchheit: Make your site faster and cheaper to operate in one easy step – Compress text files with gzip to reduce file size and bandwidth; the incremental CPU cost is usually low relative to the performance gain from lower network cost. FriendFeed uses nginx in front of its main web servers for this.
- Jabbify – Free Comet web service and browser client for simple chat and streaming status applications.
- TinEye Image Search Engine – Idée Inc. – The Visual Search Company – Finds references to images online, starting with an original image. Attempts to use image analysis to be independent of scaling, cropping, and other common manipulations.
- All That Twitters Isn’t Gold: A Popular Web Application in Search of a Business Plan – Knowledge@Wharton – Business school take on Twitter and high growth, non-revenue consumer web startups.
- Almost Viral: A Hybrid Acquisition Strategy – "By being almost viral you can grow very cheaply, control your rate of growth and demographics, and get enough traffic to conduct meaningful experiments. Need to grow more slowly? Just decrease your daily ad spend. Need statistically significant results more quickly? Increase your daily ad spend. With a viral coefficient of 0.9 you’ve dealt with your acquisition risk. Rather than going fully viral and dealing with the operational difficulties, it might be worth your time to deal with other market risks: retention, engagement, and monetization. "
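To get a feel for the gzip item above, the following sketch compresses a repetitive text payload with Python's standard `gzip` module and reports the size change. The sample markup and the function name are made up for illustration, and actual savings depend on the content being served.

```python
import gzip


def compression_summary(text: str) -> dict:
    """Compare raw and gzip-compressed sizes for a text payload."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw)
    return {
        "raw_bytes": len(raw),
        "gzip_bytes": len(compressed),
        # Guard against division by zero for an empty payload.
        "ratio": len(compressed) / len(raw) if raw else 1.0,
    }


# Hypothetical payload: repetitive markup, as in many HTML or CSS responses.
sample_html = '<li class="item">hello world</li>\n' * 200
summary = compression_summary(sample_html)
print(summary)
```

Repetitive text like HTML, CSS, and JavaScript tends to compress well, which is why the bandwidth saving usually outweighs the extra CPU work.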
site admin | April 15th, 2009 | Comments are closed
These are my links for April 13th through April 15th:
site admin | April 14th, 2009 | Comments are closed
These are my links for April 12th through April 13th:
- Google App Engine Blog: Many languages, and in the runtime bind them – Now that App Engine has a Java environment, there are a lot of possibilities for running other languages on top of the JVM. This is an all-singing, all-dancing shell interpreter demo providing a switchable command line interface to Beanshell, Clojure, Groovy, JavaScript, Python, Ruby, Scala, and Scheme.
- High Performance Web Sites :: don’t use @import – Summary – use LINK instead of @import for stylesheet references. "Using @import within a stylesheet adds one more roundtrip to the overall download time of the page. Using @import in IE causes the download order to be altered. This may cause stylesheets to take longer to download, which hinders progress rendering making the page feel slower."
- Learn Korean Language :The Official Korea Tourism Guide Site – Flash-based Korean language lessons, from KBS World Radio.
- Korea rate of obesity ranks lowest among OECD nations – INSIDE JoongAng Daily – Korea has lowest obesity rate among 30 OECD countries, at 3.5%, vs the US (#30) at 34.3%.
- FT.com / Weekend / Reportage – Is a high IQ a burden as much as a blessing? – “High cognitive ability is very often a mixed blessing,” Patrick O’Shea, the president of the International Society for Philosophical Enquiry (ISPE), told me. Too wide a deviation from the mean IQ of 100 brings with it an inherent isolation. “If you have an IQ of 160 or higher,” O’Shea explained, “you’re probably able to connect well with less than 1 per cent of the population.”
site admin | April 12th, 2009 | Comments are closed
These are my links for April 12th from 17:02 to 19:13:
site admin | April 9th, 2009 | Comments are closed
These are my links for April 7th through April 9th:
site admin | February 26th, 2009 | Comments are closed
These are my links for February 26th from 10:39 to 20:05:
site admin | February 26th, 2009 | Comments are closed
These are my links for February 25th through February 26th:
site admin | February 25th, 2009 | Comments are closed
These are my links for February 24th through February 25th:
- The C10K problem – On techniques for scaling to large number of network clients (e.g. >10000).
- Yodel Anecdotal » Blog Archive » Hello, (twitter) world – List of official Yahoo twitter handles for various activities including research, geo, search, and yui.
- New AWS Public Data Sets – Economics, DBpedia, Freebase, and Wikipedia – AWS adds Freebase, DBPedia, Wikipedia extract, and US Transportation data sets.
- eigenclass – Related document discovery, without algebra – Another approach to simple related document discovery, based on tags, should work ok for small data sets.
- SVD Recommendation System in Ruby – igvita.com – A 50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.
site admin | February 16th, 2009 | Comments are closed
These are my links for February 15th through February 16th:
- Berkeley cloud report gets mixed reviews | The Wisdom of Clouds – CNET News – James Urquhart's commentary on the UCB paper: "The paper begins by setting a definition of Cloud Computing that will be considered controversial by many, as it is firmly in the "there is no cloud computing inside enterprise data centers" camp."
- Above the Clouds: Above the Clouds Released – UC Berkeley RAD Lab starts a new blog and publishes their take on the state of cloud computing.
- Forget Dunbar’s Number, Our Future Is in Scoble’s Number « I’m Not Actually a Geek – A look at changing interaction styles enabled by growing use of online social networks and applications. "If Dunbar’s Number is defined at 150 connections, perhaps we can term the looser connection of thousands as Scoble’s Number. "
- What really happened at Ma.gnolia and lessons learned – Video podcast with Larry Halff describing how Ma.gnolia was implemented (Ruby on Rails) and its ongoing operation leading up to the failure of the (1/2 TB) MySQL database a few weeks ago.
- Infrastructure for Modern Web Sites « random($foo) – An overview of packages, services, and approaches for building web systems, circa January 2009. With assorted comments.
- Online Mind Mapping – MindMeister – Web-based, embeddable mind mapping software, sort of like MindJet, wiki-style collaborative editing.
- Jean-Lou Dupont’s WEBlog: Cloud Computing Mind Map – A mind map of companies and projects in the cloud computing space.
Ho John Lee | August 6th, 2006 | 3 comments
More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.
This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:
The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.
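As a rough illustration of working with records of that shape, here is a small Python sketch that parses lines into the five fields listed above and groups queries by user ID. The tab-delimited layout, the timestamp format, and the sample rows are assumptions made up for this example; none of them come from the actual AOL files.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class SearchRecord:
    user_id: int
    query: str
    query_time: str                        # kept as text; format is an assumption
    clicked_rank: Optional[int]            # None when no result was clicked
    destination_domain_url: Optional[str]  # None when no result was clicked


def parse_record(line: str) -> SearchRecord:
    """Parse one line, assuming tab-separated fields in the order listed above."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad rows that have no click data
    user_id, query, query_time, rank, url = fields[:5]
    return SearchRecord(
        user_id=int(user_id),
        query=query,
        query_time=query_time,
        clicked_rank=int(rank) if rank else None,
        destination_domain_url=url or None,
    )


def queries_by_user(records: Iterable[SearchRecord]) -> Dict[int, List[str]]:
    """Collect each user's queries in the order they appear."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for record in records:
        grouped[record.user_id].append(record.query)
    return dict(grouped)


# Invented sample rows, for illustration only.
sample_lines = [
    "1001\tcheap flights\t2006-03-01 10:15:00\t2\thttp://example.com",
    "1001\tweather seattle\t2006-03-02 08:00:00\t\t",
    "2002\tpython tutorial\t2006-03-05 21:30:00\t1\thttp://example.org",
]
records = [parse_record(line) for line in sample_lines]
print(queries_by_user(records))
```

Grouping by the numeric user ID is the point: even without names, every query a person made over the three months sits under a single key.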
I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing in user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.
Adam D’Angelo says:
This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.
On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.
Plentyoffish sees an opportunity for PPC and Adsense spammers:
Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.
I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.
More on the privacy angle from SiliconBeat, Zoli Erdos
See also: Coming soon to DVD – 1,146,580,664 common five-word sequences
Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords which users clicked through to a ringtone site as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.
Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data