Bookmarks for February 4th through February 11th

These are my links for February 4th through February 11th:

  • Schneier on Security: Interview with a Nigerian Internet Scammer – "We had something called the recovery approach. A few months after the original scam, we would approach the victim again, this time pretending to be from the FBI, or the Nigerian Authorities. The email would tell the victim that we had caught a scammer and had found all of the details of the original scam, and that the money could be recovered. Of course there would be fees involved as well. Victims would often pay up again to try and get their money back."
  • xkcd – Frequency of Strip Versions of Various Games – n = Google hits for "strip <game name>" / Google hits for "<game name>"
  • PeteSearch: How to split up the US – Visualization of social network clusters in the US. "… information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them.

    Looking at the network of US cities, it's been remarkable to see how groups of them form clusters, with strong connections locally but few contacts outside the cluster. For example Columbus, OH and Charleston WV are nearby as the crow flies, but share few connections, with Columbus clearly part of the North, and Charleston tied to the South."

  • Redis: Lightweight key/value Store That Goes the Extra Mile | Linux Magazine – Sort of like memcache. "Calling redis a key/value store doesn’t quite due it justice. It’s better thought of as a “data structures” server that supports several native data types and operations on them. That’s pretty much how creator Salvatore Sanfilippo (known as antirez) describes it in the documentation. Let’s dig in and see how it works."
  • Op-Ed Contributor – Microsoft’s Creative Destruction – NYTimes.com – Unlike other companies, Microsoft never developed a true system for innovation. Some of my former colleagues argue that it actually developed a system to thwart innovation. Despite having one of the largest and best corporate laboratories in the world, and the luxury of not one but three chief technology officers, the company routinely manages to frustrate the efforts of its visionary thinkers.

Bookmarks for January 20th through January 23rd

These are my links for January 20th through January 23rd:

  • Data.gov – Featured Datasets: Open Government Directive Agency – Datasets required under the Open Government Directive through the end of the day, January 22, 2010. Freedom of Information Act request logs, Treasury TARP and derivative activity logs, crime, income, agriculture datasets.
  • All Your Twitter Bot Needs Is Love – The bot’s name? Jason Thorton. He’s been humming along for months now, sending out over 1250 tweets to some 174 followers. His tweets, while not particularly creative, manage to be both believable and timely. And he’s powered by a single word: Love.

    Thorton is the creation of developer Ryan Merket, who built him as a side project in around three hours. Merket has just posted the code that powers him, and has also divulged how he made Thorton seem somewhat realistic: the bot looks for tweets with the word “love” in them and tweets them as its own.

  • Building a Twitter Bot – "Meet Jason Thorton. To people who know Jason, he is a successful entrepreneur in San Francisco who tweets 4-5 times a day. But Jason has a secret, he’s not really a human, he’s the product of my simple algorithm in PHP

    Jason tweets A LOT about the word “love” – that’s because Jason actually steals tweets from the public timeline that contain the word “love” and posts them as his own

    Jason also @replies to people who use the word “love” in their tweets, and asks them random questions or says something arbitrary

    It took me about 3 hours to code Jason, imagine what a real engineer could do with real AI algorithms? Now realize that it’s already a reality. Sites like Twitter are full of side projects, company initiatives, spambots and AI robots. When the free flow of information becomes open, the amount of disinformation increases. Theres a real need for someone to vet the people we ‘meet’ on social sites – will be interesting to see how this market grows in the next year

  • Website monitoring status – Public API Status – Health monitor for 26 APIs from popular Web services, including Google Search, Google Maps, Bing, Facebook, Twitter, SalesForce, YouTube, Amazon, eBay and others
  • PG&E Electrical System Outage Map – This map shows the current outages in our 70,000-square-mile service area. To see more details about an outage, including the cause and estimated time of restoration, click on the color-coded icon associated with that outage.

Follow suggested users, attract instant spamcloud

Despite Twitter’s amazing growth rate, there is general agreement that the Suggested Users List and the new-user experience have shortcomings. As an experiment, I created a new Twitter account. I wanted to see what the experience might look like for someone interested in, but otherwise completely unfamiliar with, the service. During the signup process, Twitter automatically picks some suggested users (apparently at random); I selected all of them, about a dozen or so. Then it asked for my email credentials to check for other people I know on Twitter, which I declined, since I generally don’t give web applications access to my email services. Then I went back to “Suggested Users” under the “Find People” section, and selected all of them. In total, the Suggested Users list got me up to 237 friends in my incoming stream.

Within a few minutes of completing this process, I already had 13 spam followers offering affiliate links for cameras, porn, and twitter followers. A day later I was up to 41 spam followers, plus 4 follow-backs from accounts I followed in addition to the Suggested Users List.

There are two different issues here: 1) finding a set of interesting / relevant people for new users to follow, and 2) limiting the impact of spam and affiliate marketers, who appear to be scanning the follower lists of the Suggested Users to identify new accounts to spam.

Benin is the new Nigeria (for spam campaigns)

Spring seems to have brought on a new variant of the Nigerian “419” spam fraud campaign, substituting Benin for Nigeria. Going through the e-mail that came in during spring break week, I’m seeing a lot of e-mail with titles like:

“FINAL NOTIFICATION OF RECEIVING YOUR HERITANCE FUND IN ATM MASTER CARD”

“CONTACT YOUR ATM MASETR CARD”

“CONTACT EMS IMMEDIATLY ON +234 8022856155”

“CONTACT FedEX EXPRESS COURIER COMPANY LIMITED FOR YOUR CONSIGNMENT IMMEDIATLY”

“CONTACT REV DR.KENNETH OKOM DIRECTOR OF ATM CARD BANK”

“CONTACT MR FRED IKEM FOR YOUR $950,000.00”

The general theme in this sort of spam is “We’re waiting for you to confirm your bank information and send a small processing fee so we can send you a lot of money.” This campaign mostly mentions a program from the Republic of Benin to give away money through funded ATM/Mastercard accounts for various reasons ranging from inheritance to payment for previous services. Some of these have an interesting wrinkle though:

THIS IS TO OFFICIALLY INFORM YOU THAT WE HAVE VERIFIED YOUR CONTRACT /INHERITANCE FILE AND FOUND OUT THAT WHY YOU HAVE NOT RECEIVED YOUR PAYMENT IS BECAUSE YOU HAVE NOT FULFILLED THE OBLIGATIONS GIVEN TO YOU IN RESPECT OF YOUR CONTRACT / INHERITANCE PAYMENT. SECONDLY WE HAVE BEEN INFORMED THAT YOU ARE STILL DEALING WITH THE NONE OFFICIALS IN THE BANK ALL YOUR ATTEMPT TO SECURE THE RELEASE OF THE FUND TO YOU. WE WISH TO ADVICE YOU THAT SUCH AN ILLEGAL ACT LIKE THIS HAVE TO STOP IF YOU WISHES TO RECEIVE YOUR PAYMENT SINCE WE HAVE DECIDED TO BRING A SOLUTION TO YOUR PROBLEM.

Maybe this would sound plausible to someone who had already responded to a previous scam email? “The reason you haven’t been paid yet is because you have been illegally dealing with the wrong officials, so please send us the money instead?” Perhaps this reflects a finely tuned understanding of the likely responders to this campaign…

Links: 419 Scam: Advance Fee Fraud and Fake Lotteries, Nigerian Fraud E-mail Gallery, Michigan CyberSecurity – Example of Email Fraud

Hacked by keymachine.de

I just noticed that my WordPress installation got hacked by a search engine spam injection attack sometime in the past few weeks. This particular one inserts invisible text with lots of keywords in footer.php. The changes to the file were made using the built-in theme editor, originating from ns.km20725.keymachine.de, which is currently at 84.19.188.144. The spam campaign automatically updates the spam payload every day or so. The links point to a variety of servers that have also been hacked to host the spam content. Here is a sample: http://www.nanosolar.com/feb3/talk.php?28/82138131762.html
I’ve sent an e-mail to Nanosolar, so they’ll probably have that content cleaned up before long. But the automated SEO spam campaign updates the keyword and link payload regularly, so any affected WordPress sites will be updated to point at the new hosting victims.
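
If you want to check your own install for this kind of injection, a quick-and-dirty scan for hidden, link-stuffed blocks in theme files might look like this. The style patterns, the link-density threshold, and the theme path are assumptions based on what I saw in footer.php here, not a general-purpose detector:

```python
import re
from pathlib import Path

# Scan WordPress theme files for hidden text blocks dense with links,
# the pattern used by this particular injection. Patterns and threshold
# are guesses based on one attack, not a general-purpose detector.
HIDDEN = re.compile(r'style="[^"]*(?:display:\s*none|visibility:\s*hidden)', re.I)

def suspicious_theme_files(theme_dir):
    hits = []
    for path in Path(theme_dir).rglob("*.php"):
        text = path.read_text(errors="replace")
        for m in HIDDEN.finditer(text):
            # look at the text following the hidden style attribute
            chunk = text[m.start():m.start() + 2000]
            if chunk.count("<a ") >= 5:  # hidden block stuffed with links
                hits.append(str(path))
                break
    return hits

print(suspicious_theme_files("wp-content/themes"))
```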

From a quick check on Google, it looks like keymachine.de is a regular offender.

More on the America Online search query data

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

I pulled down a copy of the data last night before the link went down, but didn’t get around to actually looking it over until this evening. In a casual glance at random sections of the data, I see a surprising (to me) number of people typing in complete URLs, a range of sex-related queries (some of which I don’t actually understand), shopping-related queries, celebrity-related queries, and a lot of what looks like homework projects by high school or college students.

In the meantime, many other people have found interesting / problematic entries among the data, including probable social security numbers, driver’s license numbers, addresses, and other personal information. Here’s a list of queries about how to kill your wife from Paradigm Shift.

More samples culled from the data here, here, and here.

#479 Looks like a student at Prairie State University who like playing EA Sports Baseball 2006, is a White Sox fan, and was planning going to Ozzfest. When nothing else is going on, he likes to watch Nip/Tuck.

#507 likes to bargain on eBay, is into ghost hunting, currently drives a 2001 Dodge, but plans on getting a Mercedes. He also lives in the Detroit area.

#1021 is unemployed and living in New Jersey. But that didn’t get him down because with his new found time, he’s going to finally get to see the Sixers.

#1521 like the free porn.

Based on my own eclectic search patterns, I’d be reluctant to infer specific intent based only on a series of search queries, but it’s still interesting, puzzling, and sometimes troubling to see the clusters of queries that appear in the data.

Up to this point, in order to have a good data set of user query behavior, you’d probably need to work for one of the large search engines such as Google or Yahoo (or perhaps a spyware or online marketing company). I still think sharing the data was well-intentioned in spirit (albeit a massive business screwup).

Sav, commenting over at TechCrunch (#67) observes:

The funny part here is that the researchers, accustomed to looking at data like this every day, didn’t realize that you could identify people by their search queries. (Why would you want to do that? We’ve got everyone’s screenname. We’ll just hide those for the public data.) The greatest discoveries in research always happen by accident…

A broader issue in the privacy context is that all this information and more is already routinely collected by search engines, search toolbars, assorted desktop widget/pointer/spyware downloads, online shopping sites, etc. I don’t think most people have internalized how much personal information and behavioral data is already out there in private data warehouses. Most of the time you have to pay something to get at it, though.

I expect to see more interesting nuggets mined out of the query data, and some vigorous policy discussion regarding the collection and sharing of personal attention gestures such as search queries and link clickthroughs in the coming days.

See also: AOL Research publishes 20 million search queries

Update Tuesday 08-08-2006 05:58 PDT – The first online interface for exploring the AOL search query data is up at www.aolsearchdatabase.com (via TechCrunch).

Update Tuesday 08-08-2006 14:18 PDT – Here’s another online interface at dontdelete.com (via Infectious Greed)

Update Wednesday 08-09-2006 19:14 PDT – A profile of user 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, GA, along with a discussion of the AOL query database in the New York Times.

AOL Research publishes 20 million search queries

More raw data for search engineers and SEOs, and fodder for online privacy debates – AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
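
For anyone poking at the logs, here’s a minimal parsing sketch. I’m assuming tab-separated fields in the order listed above, with the rank and URL fields empty when there was no click-through; the exact file layout is my assumption, not AOL’s documentation:

```python
import csv
import io

# Minimal query-log parser, assuming tab-separated fields in the order
# {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}, with
# the last two empty when no result was clicked. The layout is an
# assumption, not AOL's published spec; the sample rows are invented.
sample = (
    "1234\thow to plant tomatoes\t2006-03-01 10:15:00\t2\thttp://example.com\n"
    "1234\tcheap flights\t2006-03-02 08:00:00\t\t\n"
)

def parse_log(fh):
    for user, query, qtime, rank, url in csv.reader(fh, delimiter="\t"):
        yield {
            "user": int(user),
            "query": query,
            "time": qtime,
            "rank": int(rank) if rank else None,  # None: no result clicked
            "url": url or None,
        }

events = list(parse_log(io.StringIO(sample)))
print(events[0]["query"])  # how to plant tomatoes
print(events[1]["rank"])   # None
```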

Haven’t taken a look at the actual data yet, but I’m glad I’m not an AOL user.

Adam D’Angelo says:

This is the same data that the DOJ wanted from Google back in March. This ruling allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.

On the search application side, this is a rare look at actual user search behavior, which would be difficult to obtain without access to a high traffic search engine or possibly through a paid service.

Plentyoffish sees an opportunity for PPC and Adsense spammers:

Google/ AOL have just given some of the worlds biggest spammers a breakdown of high traffic terms its just a matter of weeks now until google gets mega spammed with made for adsense sites and other kind of spam sites targetting keywords contained in this list.

I think it’s great that AOL is trying to open up more and engage with the research community, and it looks like there are some other interesting data collections on the AOL Research site — but I suspect they’re about to take a lot of heat on the privacy front, judging from the mix of initial reactions on Techmeme. Hope it doesn’t scare them away and they find a way to publish useful research data without causing a privacy disaster.

More on the privacy angle from SiliconBeat, Zoli Erdos

See also: Coming soon to DVD – 1,146,580,664 common five-word sequences

Update – Sunday 08-06-2006 20:31 PDT – AOL Research appears to have taken down the announcement and the log data in the past few hours in response to a growing number of blog posts, mostly critical, and mostly focused on privacy. Markus at Plentyoffish has also used the data to generate a list of ringtone search keywords which users clicked through to a ringtone site as an example of how this data can be used by SEO and spam marketers. Looks like the privacy issues are going to get the most airtime right now, but I think the keyword clickthrough data is going to have the most immediate effect.

Update Monday 08-07-2006 08:02 PDT: Some mirrors of the AOL data

The Long Tail of Invalid Clicks and other Google click fraud concepts

Some fine weekend reading for search engineers, SEOs, and spam network operators:

A 47-page independent report on Google Adwords / Adsense click fraud, filed yesterday as part of a legal dispute between Lane’s Gifts and Google, provides a great overview of the history and current state of click fraud, invalid clicks of all types, and the four-layered filtering process that Google uses to detect them.

Google has built the following four “lines of defense” against invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.

An interesting observation is that most click fraud can be eliminated through simple filters. Alexander Tuzhilin, author of the report, speculates that less common, more sophisticated attacks form a Long Tail of invalid clicks following a Zipf distribution, and observes:

Despite its current reasonable performance, this situation may change significantly in the future if new attacks will shift towards the Long Tail of the Zipf distribution by becoming more sophisticated and diverse. This means that their effects will be more prominent in comparison to the current situation and that the current set of simple filters deployed by Google may not be sufficient in the future. Google engineers recognize that they should remain vigilant against new possible types of attacks and are currently working on the Next Generation filters to address this problem and to stay “ahead of the curve” in the never-ending battle of detecting new types of invalid clicks.
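
To make the “simple filters” idea concrete, here’s a toy version of a rule-based pre-filter. This is nothing like Google’s actual (and deliberately secret) rules, just an illustration of the kind of duplicate-click check that catches the unsophisticated head of the distribution:

```python
# Toy rule-based pre-filter: discard repeat clicks on the same ad from
# the same IP within a short window. An illustration of the kind of
# simple filter the report describes, not Google's actual rules.
WINDOW = 60  # seconds

def filter_clicks(clicks):
    """clicks: iterable of (timestamp, ip, ad_id), assumed sorted by time."""
    last_seen = {}
    valid = []
    for ts, ip, ad in clicks:
        key = (ip, ad)
        if key in last_seen and ts - last_seen[key] < WINDOW:
            continue  # too soon after the previous click: likely invalid
        last_seen[key] = ts
        valid.append((ts, ip, ad))
    return valid

clicks = [(0, "1.2.3.4", "ad1"), (5, "1.2.3.4", "ad1"),
          (10, "5.6.7.8", "ad1"), (120, "1.2.3.4", "ad1")]
print(filter_clicks(clicks))  # the 5-second repeat is dropped
```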

He also highlights the irreducible problem of click fraud in a PPC model:

  • Click fraud and invalid clicks can be defined conceptually, but the only working definition is an operationally defined one.
  • The operational definition of invalid clicks cannot be fully disclosed to the general public, because it would lead to massive click fraud.
  • If the operational definition is not disclosed to some degree, advertisers cannot verify or dispute why they have been charged for certain clicks.

The court settlement asks for an independent evaluation of whether Google’s efforts to combat click fraud are reasonable, which Tuzhilin believes they are. The more interesting question is whether they will continue to be sufficient as time progresses and the Long Tail of click fraud expands.

Links:

Google’s PageRank and Beyond – summer reading for search hackers

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)

This book could serve as an excellent introduction to search algorithms for someone with a programming or mathematics background, covering PageRank at length, along with some discussion of HITS, SALSA, and antispam approaches. Some current topics, such as clustering, personalization, and reputation (TrustRank/SpamRank) are not covered here, although they are mentioned briefly. The bibliography and web resources provide a comprehensive source list for further research (up through around 2004), which will help point motivated readers in the right direction. I’m sure it will be popular at Google and Yahoo, and perhaps at various SEO agencies as well.

Those with less interest in the innards of search technology may enjoy a more casual summer read about Google: try John Battelle’s The Search. Or get Langville and Meyer’s book, skip the math, and just read the sidebars.

See also: A Reading List on PageRank and Search Algorithms, my del.icio.us links on search algorithms

Gold farming spam and game economy inflation

This evening, I received a spam e-mail offering to sell virtual gold for World of Warcraft.

We are a company which offer WOW gold both US and EU sever, cheap and quickly. Only 49.5 pounds(75EROUS) for 1000 gold EU sever and 75 dallors for 1000 gold US sever. You can visite our website

and pay with paypal.

We also accept Western Union and other payment method.

If you have any question,Please directly relates with us.
Email: deleted
Website: deleted

This is a first for me; I wonder if gold farming is scaling up in low-cost-but-wired labor markets. The Wikipedia entry on gold farming notes:

Gold farmers are most notably characterized by performing the same tasks repeatedly for long periods of time. Especially on English language servers, gold farmers operating from another country are often observed to speak poor or broken English

At today’s exchange rate, $1 US = €0.83527 = £0.57300, which means the US price is a much better deal than the euro- or pound-denominated prices, at roughly 15% less. I gather that World of Warcraft players can’t migrate between servers, so perhaps there’s some other economic aspect at work here.
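
For the record, here’s the arithmetic behind that comparison, using the prices quoted in the spam and the exchange rates above:

```python
# Back-of-the-envelope check of the price comparison, using the quoted
# prices (75 USD, EUR 75, GBP 49.5 per 1000 gold) and the exchange
# rates from the post ($1 US = EUR 0.83527 = GBP 0.57300).
price_us = 75.00            # USD, US servers
price_eu = 75.00 / 0.83527  # EUR 75 converted to USD
price_uk = 49.50 / 0.57300  # GBP 49.5 converted to USD

print(f"EU price in USD: {price_eu:.2f}")                    # ~89.79
print(f"UK price in USD: {price_uk:.2f}")                    # ~86.39
print(f"US discount vs EU: {1 - price_us / price_eu:.1%}")   # ~16.5%
print(f"US discount vs UK: {1 - price_us / price_uk:.1%}")   # ~13.2%
```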

I’ve avoided diving into World of Warcraft and other MMORPGs in fear of the time sink, so I don’t have a good sense of how big the problem is for players and game developers, or how long it takes to accumulate gold in normal game play.

Since “gold” functions as currency in the game economy, injecting farmed gold into the system should gradually cause inflation (if prices are not controlled by game policy) similar to when government monetary policy creates excess currency. The traditional real world hedge against inflation is physical gold, which is one of the reasons you see dollar denominated gold prices generally going up lately. I’m not sure what an inflation hedge would look like in World of Warcraft. Maybe a stockpile of valuable artifacts?

Nick Yee has an extensive article on gold farming, with many links and comments at The Daedalus Project.

Reverse engineering a referer spam campaign

It looks like someone’s launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114 somewhere in the Ukraine.

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comments links from increasing Google rank for the linked items, but the links are still there for people to click on.

Referrer spam is more indirect. It is created by making an HTTP request with the REFERER header set to the URL being promoted. Most of the time, this will only be visible in the web server log.

Here is a typical HTTP log entry:

87.219.8.210 [04/Feb/2006:15:20:35 -0800] "GET /weblog/archives/2005/09/15/google-blog-search-referrers-working-now HTTP/1.1" 403 - "http://every-search.com"
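
If you want to see whether a campaign like this is hitting your own site, counting referring hosts in the access log is usually enough; a burst of unrelated client IPs all claiming the same referrer is the giveaway. A minimal sketch, assuming Apache combined log format (the sample lines are invented, in the shape of the entry above):

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Count referring hosts in an Apache combined-format access log.
# Many different client IPs all claiming the same referrer is the
# signature of a referrer spam campaign.
LOG_RE = re.compile(r'"[A-Z]+ [^"]*" \d{3} \S+ "(?P<ref>[^"]*)"')

sample = '''\
87.219.8.210 - - [04/Feb/2006:15:20:35 -0800] "GET /weblog/ HTTP/1.1" 403 - "http://every-search.com" "Mozilla/4.0"
10.0.0.1 - - [04/Feb/2006:15:21:02 -0800] "GET /weblog/ HTTP/1.1" 200 5120 "http://every-search.com" "Mozilla/4.0"
10.0.0.2 - - [04/Feb/2006:15:22:10 -0800] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/4.0"
'''

referrers = Counter()
for line in sample.splitlines():
    m = LOG_RE.search(line)
    if m and m.group("ref") not in ("-", ""):
        referrers[urlparse(m.group("ref")).netloc] += 1

print(referrers.most_common())  # [('every-search.com', 2)]
```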

Some blogs and other web sites post an automatically generated list of “recent referrers” on their home page or on a sidebar. In normal use, this would show a list of the sites that had linked to the site being viewed. Recent referrer lists are less common now, because of the rise of referrer spam.

Referrer spam will also show up in web site statistic and traffic summaries. These are usually private, but are sometimes left open to the public and to search engines.

One presumed objective of a link spam campaign is to increase the target site’s search engine ranking. In general this requires building a collection of valid inbound links, preferably without the “nofollow” attribute. Referrer spam may be more effective for generating inbound links, since recent referrer lists and web site reports typically don’t wrap their links with nofollow.

The landing pages for the links in this campaign are interesting in that they don’t contain advertising at all. This suggests that this campaign is trying to build a sort of PageRank farm to promote something else.

The actual pages are all built on the same blog template, and contain a combination of gibberish and sidebar links to subdomains based on “valuable” keywords. Using the blog format automatically provides a lot of site interlinking, and they also have “recent” and “top referer” lists, which are all from other spam sites in the network.

It looks like the content text should be easy to identify as spam based on frequency analysis. Perhaps having a very large cloud of spam sites linking to each other, along with a dispersed set of incoming referrer spam links, makes the sites look more plausible to a search engine? These sites don’t appear to link out to anything legitimate, but I have come across other spam sites and comment spam posts with links to non-spam sites such as .gov and .edu sites, perhaps trying to look more credible to a search engine ranking algorithm. All the sites being on the same subnet makes them easier to spot, though.

Given that there aren’t that many public web site stat pages and recent referrer lists around, I’m surprised that referrer spamming is worth the effort. If the spam network can achieve good ranking in Google and the other search engines, it can probably boost the ranking of a selected target site by pruning back some of its initial links and adding links pointing at the sites it wants to promote. Affiliate links to porn, gambling, or online pharmacy sites must pay reasonably well for this to work out for the spammers.

More reading: A list of references on PageRank and link spam detection.

If you’re having referrer spam problems on your site, you may find my notes on blocking referer spam useful.

Here’s some sample text from “search-buy.com”:

I search-buy over least and and next train. Ne so at cruelty the search-buy in after anaesthesia difficulty general urinating. T pastry a ben for search-buy boy. An refuses trip search-buy romances seemed azusa pacific university ca. Stoc of my is and search-buy direct having sex teen titans. Kid philadelphiaa would and york search-buy. G search-buy wore shed i dads. obstacles future search-buy right had satire nineteenth. The that i ups this on search-buy least finds audio express richmond. have this window been wonderful me search-buy so. Surel in actually search-buy our boy deep franklin notions. An search-buy it of my has of. To at head boy that a search-buy. O james search-buy everywhere of but. Alread originate search-buy good about since.
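
The keyword stuffing in a sample like that is easy to quantify: one “valuable” term accounts for a wildly outsized share of the words. A crude frequency check (the sample text is a shortened version of the gibberish above, and the comparison prose is just filler of my own):

```python
import re
from collections import Counter

# Crude frequency check: in keyword-stuffed spam, a single promoted
# term takes an outsized share of all words compared to normal prose.
def top_term_density(text):
    words = re.findall(r"[a-z][a-z\-]+", text.lower())
    term, count = Counter(words).most_common(1)[0]
    return term, count / len(words)

spam = ("I search-buy over least and and next train. "
        "Ne so at cruelty the search-buy in after anaesthesia. "
        "T pastry a ben for search-buy boy.")
prose = ("The past few evenings I have been working through a review "
         "copy of a new book about search engines and ranking.")

print(top_term_density(spam))   # the stuffed keyword dominates
print(top_term_density(prose))
```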

Here are a few spam sites from this campaign and their IP addresses:

bikini-now.com          A       85.255.114.212
babestrips.com          A       85.255.114.229
search-biz.biz          A       85.255.114.245
bustytart.com           A       85.255.114.250
cjtalk.net              A       85.255.114.227
search-galaxy.org       A       85.255.114.252
moresearch.org          A       85.255.114.237

Here is the WHOIS output for that netblock:

% Information related to '85.255.112.0 - 85.255.127.255'

inetnum:        85.255.112.0 - 85.255.127.255
netname:        inhoster
descr:          Inhoster hosting company
descr:          OOO Inhoster, Poltavskij Shliax 24, Kharkiv, 61000, Ukraine
remarks:        -----------------------------------
remarks:        Abuse notifications to: abuse@inhoster.com
remarks:        Network problems to: noc@inhoster.com
remarks:        Peering requests to: peering@inhoster.com
remarks:        -----------------------------------
country:        UA
org:            ORG-EST1-RIPE
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
tech-c:         FWHS1-RIPE
status:         ASSIGNED PI
mnt-by:         RIPE-NCC-HM-PI-MNT
mnt-lower:      RIPE-NCC-HM-PI-MNT
mnt-by:         RECIT-MNT
mnt-routes:     RECIT-MNT
mnt-domains:    RECIT-MNT
mnt-by:         DAV-MNT
mnt-routes:     DAV-MNT
mnt-domains:    DAV-MNT
source:         RIPE # Filtered

organisation:   ORG-EST1-RIPE
org-name:       INHOSTER
org-type:       NON-REGISTRY
remarks:        *************************************
remarks:        * Abuse contacts: abuse@inhoster.com *
remarks:        *************************************
address:        OOO Inhoster
address:        Poltavskij Shliax 24, Xarkov,
address:        61000, Ukraine
phone:          +38 066 4633621
e-mail:         support@inhoster.com
admin-c:        AK4026-RIPE
tech-c:         AK4026-RIPE
mnt-ref:        DAV-MNT
mnt-by:         DAV-MNT
source:         RIPE # Filtered

person:         Andrei Kislizin
address:        OOO Inhoster,
address:        ul.Antonova 5, Kiev,
address:        03186, Ukraine
phone:          +38 044 2404332
nic-hdl:        AK4026-RIPE
source:         RIPE # Filtered

person:       Fast Web Hosting Support
address:      01110, Ukraine, Kiev, 20Á, Solomenskaya street. room 201.
address:      UA
phone:        +357 99 117759
e-mail:       support@fwebhost.com
nic-hdl:      FWHS1-RIPE
source:       RIPE # Filtered

Why Link Farms (used to) Work

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have unrealistically high importance. The author then goes on to describe a new algorithm, with the same complexity as the original PageRank algorithm, that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.
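To make the sink effect concrete, here’s a toy power-iteration PageRank in Python. This is my own sketch, not the paper’s algorithm, and all of the page names are invented: a small farm of pages all pointing at one page with no outbound links is enough to pile rank onto the target.

```python
# Toy power-iteration PageRank illustrating the 'sink page' effect:
# a farm of pages all linking to one target page with no outbound links.

def pagerank(links, damping=0.85, iters=100):
    """links: dict page -> list of outbound pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling (sink) page: the usual fix spreads its rank
                # uniformly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Five farm pages pointing at a single 'target' with no outlinks,
# plus one ordinary page linking into the farm.
farm = {f"farm{i}": ["target"] for i in range(5)}
farm["target"] = []            # sink: no outbound links
farm["normal"] = ["farm0"]
ranks = pagerank(farm)
# The sink ends up with far more rank than any individual farm page.
assert ranks["target"] > max(ranks[f"farm{i}"] for i in range(5))
```

Even in this tiny graph, the sink collects the damped rank of every farm page on each iteration while giving nothing back through links, which is exactly the imbalance the link farms exploited.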

This technique is less effective now than in the past, because Google has continued to update its indexing and ranking algorithms in response to the success of link spam and other ranking manipulation. Analysis of link patterns (SpamRank, link mass) and site reputation (Hilltop) can substantially reduce the effect described here. Nonetheless, it’s nice to see a quantitative description of the problem.

See also: A reading list on PageRank and Search Algorithms

Personalization, Intent, and modifying PageRank calculations

Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.

On the probabilities of transitioning across a link in the link graph, the paper’s example on p. 338 assumes that surfers are equally likely to click on links anywhere in the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead, including one derived from “web usage logs”.

Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.

In fact, at least to my reading, the paper seems to imply that it would be ideal for both of these — the probability of following a link and the personalization vector’s probability of jumping to a page — to be based on actual usage data. They seem to suggest that this would yield a PageRank that would be the best estimate of searcher interest in a page.
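A rough sketch of what those two substitutions might look like, using hypothetical click-log counts (the page names and numbers below are invented): weight each link by its observed click share instead of uniformly, and build the teleport (“personalization”) vector from observed jump destinations.

```python
# Usage-weighted PageRank sketch: link transition probabilities come
# from observed clicks, and the teleport vector from observed jumps.

def usage_pagerank(out_clicks, jumps, damping=0.85, iters=100):
    """out_clicks: page -> {destination: observed click count}.
       jumps: page -> count of direct arrivals (typed URL, search jump)."""
    pages = sorted(set(out_clicks) | set(jumps) |
                   {q for outs in out_clicks.values() for q in outs})
    total_jumps = sum(jumps.values())
    # Personalization vector from observed jump destinations.
    v = {p: jumps.get(p, 0) / total_jumps for p in pages}
    rank = dict(v)
    for _ in range(iters):
        new = {p: (1 - damping) * v[p] for p in pages}
        for p in pages:
            outs = out_clicks.get(p, {})
            clicks = sum(outs.values())
            if clicks:
                for q, c in outs.items():
                    # Each link weighted by its click share, not uniformly.
                    new[q] += damping * rank[p] * (c / clicks)
            else:
                # Dangling page: assume the surfer jumps per the vector.
                for q in pages:
                    new[q] += damping * rank[p] * v[q]
        rank = new
    return rank

# Hypothetical logs: surfers on 'home' click 'popular' 9x as often as 'niche'.
clicks = {"home": {"popular": 90, "niche": 10}, "popular": {"home": 50}}
jumps = {"home": 80, "popular": 15, "niche": 5}
r = usage_pagerank(clicks, jumps)
# 'popular' outranks 'niche' despite both getting exactly one link from 'home'.
```

The structure is the same as ordinary PageRank; only the two probability distributions change, which is why the paper can claim “any suitable probability distribution” slots in without changing the computation’s complexity.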

Some thoughts:

1. The goal of search ranking is to identify the most relevant results for the input query. Putting aside the question of scaling for a moment, it seems like there are good opportunities to incorporate information about intent, context, and reputation through the transition probabilities and the personalization vector. We don’t actually care about the “PageRank” per se, but rather about getting the relevant result in front of the user. A hazard in using popularity alone (traffic data on actual clicked links) is that it creates a fast positive feedback loop which may only reflect what’s well publicized rather than what’s relevant. Technorati is particularly prone to this effect, since people click on the top queries just to see what they are about. Another example: the Langville and Meyer paper is quite good, but references to it are buried deep in the search results for “PageRank”. So…I think we can make good use of actual usage data, but only some applications (such as “buzz trackers”) can rely on usage data only (or mostly). A conditional or personalized ranking would be expensive to compute on a global basis, but might give useful results if it were applied to a significantly reduced set of relevant pages.

2. In a reputation- and context-sensitive search application, the untraversed outgoing links may still help indicate what “neighborhood” of information is potentially related to the given page. I don’t know how much of this is actually in use already. I’ve been seeing vast quantities of incoming comment spam with gibberish links to actual companies (Apple, Macromedia, BBC, ABC News), which doesn’t make much sense unless the spammers think it will help their content “smell better”. Without links to “mainstream content”, the spam content is detectable by linking mostly to other known spam content, which tends not to be linked to by real pages.

3. If you assume that search users have some intent driving their choice of links to follow, it may be possible to build a conditional distribution of page transitions rather than the uniformly random one. Along these lines, I came across a demo (“Mindset”) and paper from Yahoo on a filter for indicating preference for “commercial” versus “non-commercial” search results. I think it might be practical to build much smaller collections of topic-domain-specific pages, with topic-specific ranking, and fall back to the generic ranking model for additional search results.

4. I think the search engines have been changing the expected behavior of the users over time, making the uniformly random assumption even more broken. When users exhaust their interest in a given link path, they’re likely to jump to a personally-well-known URL, or search again and go to another topically-driven search result. This should skew the distribution further in favor of a conditional ranking model, rather than simply a random one.

A reading list on PageRank and search algorithms

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.

  • Deeper Inside PageRank (PDF) – Internet Mathematics Vol. 1, No. 3: 335-380 Amy N. Langville and Carl D. Meyer. Detailed 46-page overview of PageRank and search analysis. This is the best technical introduction I’ve come across so far, and it has a long list of references which are also worth checking out.
  • Online Reputation Systems: The Cost of Attack of PageRank (PDF)
    Andrew Clausen. A detailed look at the value and costs of reputation, and some speculation on how much it costs to purchase higher ranking through spam, link brokering, etc. Somewhere in this paper or a related note he argues that raising search ranking is theoretically too expensive to be effective, which turned out not to be the case, but the basic ideas around reputation are interesting.
  • SpamRank – Fully Automatic Link Spam Detection – Work in progress (PDF)
    András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher. Proposes a SpamRank metric based on personalized PageRank and the local PageRank distribution of linking sites.
  • Detecting Duplicate and Near-Duplicate Files – William Pugh’s presentation slides on US patent 6,658,423 (assigned to Google), covering an approach that uses shingles (sliding windowed text fragments) to compare content similarity. This work was done during an internship at Google, and he doesn’t know whether this particular method is being used in production (vs. some other method).

I’m looking at a fairly narrow search application at the moment, but the general idea of using subjective reputation to personalize search results and to filter out spammy content seems fundamentally sound, especially if a network of trust (social or professionally edited) isn’t too big.

Building better personalized search, filtering spam blogs

Battelle’s Searchblog mentions an article by Raul Valdes-Perez of Vivisimo citing 5 reasons why search personalization won’t work very well. Paraphrasing his list:

  1. Individual users’ interests and search intent change over time
  2. The click and viewing data available to do the personalization is limited
  3. Inferring user intent from pages viewed after search can be misleading because the click is driven by a snippet in search results, not the whole page
  4. Computers are often shared among multiple users with varying intent
  5. Queries are too short to accurately infer intent

Vivisimo (Clusty) is taking an approach in which groups of search results are clustered together and presented to the user for further exploration. The idea is to allow the user to explicitly direct the search towards results which they find relevant, and I have found it can work quite well for uncovering groups of search results that I might otherwise overlook.

Among other things, general purpose search engines are dealing with ambiguous intent on the part of the user, and also with unstructured data in the pages being indexed. Brad Feld wrote some comments observing the absence of structure (in the database sense) on the web a couple of days ago. Having structured data works really well if there is a well-defined schema that goes with it (which is usually coupled with application intent). So things like microformats for event calendars and contact information seem like they should work pretty well, because the data is not only cleaned up, but allows explicit linkage of the publisher’s intent (“this is my event information”) and the search user’s intent (“please find music events near Palo Alto between December 1 and December 15”). The additional information about publisher and user intent makes a much more “database-like” search query possible.

I encounter problems with “assumed user intent” all the time on Amazon, which keeps presenting me with pages of kids toys and books every time I get something for my daughter, sometimes continuing for weeks after the purchase. On the other hand, I find that Amazon does a much better job of searching than Google, Yahoo, or other general purpose search engines when my intent is actually to look for books, music, or videos. Similarly, I get much better results for patent searches at USPTO, or for SEC filings at EDGAR (although they’re slow and have difficult user interfaces).

The AttentionTrust Recorder is supposed to log your browser activity and click stream, allowing individuals to accumulate and control access to their personal data. This could help with, but not solve, the task of inferring search intent.

I think a useful approach to take might be less search personalization based on your individual search and browsing habits, and more based on the people and web sites that you’re associated with, along with explicitly stated intent. Going back to the example at Amazon, I’ve already indicated some general intent simply by starting out at their site. The “suggestions” feature often works in a useful way to identify other products that may be interesting to you based on the items the system thinks you’ve indicated interest in. A similar clustering function for generalized search would be interesting, if the input data (clickstreams, and some measure of relevant outcomes) could be obtained.

Among other things, this could generally reduce the visibility of spam blogs. Although organized spam blogs can easily build links to each other, it’s unlikely that many “real” (or at least well-trained) internet users would either link or click through to a spam blog site. If there were an additional bit of input back to the search engine to provide feedback, i.e. “this is spam” or “this was useful”, and I were able to aggregate my ratings with other “reputable” users, the ratings could be used to filter search results, and perhaps move the “don’t know” or “known spam” results to the equivalent of Google’s “supplemental results” index.
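Here’s a minimal sketch of what that feedback aggregation might look like: reputation-weighted votes per URL, with low-scoring pages tagged for a supplemental-style index. The users, weights, and cutoffs are all made up for illustration.

```python
# Reputation-weighted spam/useful voting sketch. Votes from low-reputation
# (or unknown) users count for little, so a spammer voting up his own site
# has minimal effect.

from collections import defaultdict

def classify(votes, reputation, spam_cutoff=-0.5, useful_cutoff=0.5):
    """votes: list of (user, url, +1 useful / -1 spam).
       reputation: user -> weight in [0, 1]."""
    score = defaultdict(float)
    weight = defaultdict(float)
    for user, url, v in votes:
        w = reputation.get(user, 0.0)   # unknown voters count for nothing
        score[url] += w * v
        weight[url] += w
    result = {}
    for url in score:
        s = score[url] / weight[url] if weight[url] else 0.0
        result[url] = ("known spam" if s <= spam_cutoff
                       else "useful" if s >= useful_cutoff
                       else "don't know")
    return result

votes = [("alice", "a.com", +1), ("bob", "a.com", +1),
         ("alice", "b.biz", -1), ("mallory", "b.biz", +1)]
rep = {"alice": 0.9, "bob": 0.8, "mallory": 0.1}
print(classify(votes, rep))
# b.biz scores (0.9*-1 + 0.1*1) / 1.0 = -0.8, so it lands in "known spam"
# despite mallory's up-vote.
```

The interesting part is bootstrapping the reputation weights themselves, which is essentially the problem the SpamRank and trust-network papers above are chewing on.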

The various bookmarking services on the web today serve as simple vote-based filters to identify “interesting” content, in that the user communities are relatively small and well trained compared with the general population of the internet, and it’s unusual to see spammy links get more than a handful of votes. As the user base expands, the noise in these systems is likely to go up considerably, making them less useful as collaborative filters.

I don’t particularly want to share my click stream with the world, or any search engine, for that matter. I would be quite happy to share my opinion about whether a given page is spammy or not, if I happened to come across one, though. That might be a simple place to start.

Spammers want donations for better hosting?

I haven’t noticed getting one of these in my e-mail before:

Becouse of a lot of complaints about our malings
we need to buy expensive balk bullet-prof hosting
for our sites. It costs a lot, please, send us
small donation to:

Nordea Bank AB, Sweden, Surte, SWIFT: NDEASESS
to Isa Dzhabrailov, account number: SE 163 000000000 6510032599

I guess they don’t take Paypal…

Model Portfolio of Hot Stock Tips from Spam E-mail

Catching up on the backlog of paper newspapers in my office, I came across an article in this week’s Barron’s about a guy who’s been saving his incoming “hot stock tip” spam e-mail and set up a model portfolio tracker to see how it would have done.

On May 5th, 2005 (05/05/05 spooky!) I set out to determine just how much money I could lose by trusting SPAM.

What if I purchased 1000 shares of stock from EVERY stock tip mentioned in a SPAM email? Could we all really be missing out on a great opportunity?

Of course, I don’t have the money to actually waste on an experiment like this. I made this little web site to keep track of the value of those stocks… without my actually purchasing anything.

In other words I haven’t bought any of the stocks listed here. This is just pretend. BUT if I did actually buy them, this is how much money I could be making or losing as of today.

The model assumes that he purchased 1000 shares of each hot tip when it came in. As of this weekend, his theoretical investment of $17,405 would have a current value of $9,897.90, for a net loss of $7,507.10.

The one winner among the 37 stocks in the portfolio is Sniffex, Inc., up 188% from $1.17 to $3.37 since June 27, 2005.
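The tracker’s arithmetic is simple enough to sketch: “buy” 1000 shares at the price when the tip arrived, then value the position at today’s price. The tickers and prices below are illustrative (the 1.17/3.37 pair echoes the Sniffex numbers quoted above; “JUNK” is made up).

```python
# Spam-portfolio arithmetic: 1000 shares of each tip at arrival price,
# marked to the current price.

SHARES = 1000

def portfolio_value(positions, current):
    """positions: ticker -> price when the spam tip arrived.
       current: ticker -> latest price."""
    cost = sum(price * SHARES for price in positions.values())
    value = sum(current[t] * SHARES for t in positions)
    return cost, value, value - cost

positions = {"SNFX": 1.17, "JUNK": 0.50}   # illustrative tickers and prices
current = {"SNFX": 3.37, "JUNK": 0.05}
cost, value, gain = portfolio_value(positions, current)
# One lucky pick can't offset a portfolio of pump-and-dump losers at scale.
```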

More from thestreet.com

Temporary Fix for Referrer Spam

I have a temporary fix for blocking the referrer spam that started a couple of weeks ago. The volume of referrer spam here has steadily been increasing since then, and the number of source IP addresses is also continuing to expand.

The main problem I’m having is that the conditional rewrite rules I want to use in .htaccess don’t seem to be working on my current WordPress setup at Dreamhost. Regular rewrites seem to work fine, but none of the conditional ones are working for me. The initial IP blocklists stopped most of it for a few days, but new spam IP addresses are appearing more quickly now than a few days ago.

In the meantime, the Dreamhost support knowledge base suggests using SetEnvIfNoCase to define patterns to be blocked. This does work at Dreamhost, and I’ve blocked most of the current spam run with the following:

SetEnvIfNoCase Referer ".*\.get\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.drop\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.hey\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.go\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.dive\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.switch\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.come\.to" BadReferrer
SetEnvIfNoCase Referer ".*\.mysite\.de" BadReferrer

order deny,allow
deny from env=BadReferrer
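As a quick sanity check on those patterns, the matching can be approximated in Python: SetEnvIfNoCase does a case-insensitive regex match against the header, which re.search with re.I reproduces closely enough for testing (the sample referrer URLs are invented).

```python
# Sanity-check which Referer values the SetEnvIfNoCase patterns would flag.
import re

patterns = [r'\.get\.to', r'\.drop\.to', r'\.hey\.to', r'\.go\.to',
            r'\.dive\.to', r'\.switch\.to', r'\.come\.to', r'\.mysite\.de']
bad = re.compile('|'.join(patterns), re.I)

assert bad.search("http://pills.Go.To/")          # blocked, case-insensitive
assert not bad.search("http://example.com/goto")  # ".go.to" must be dotted
```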

Combined with the IP blocklist from a few days ago, this has made a huge reduction in the outgoing bandwidth. For a while the spam was all HEAD requests, but lately they have all been GET requests on full pages. A few days ago it passed 10,000 spam requests for the day.

Today it looks like we’ll end up around 35,000 blocked referrer spam requests.

I’m a little busy lately so I haven’t tried chasing down the reason the conditional rewrites aren’t working. In the meantime, this is keeping the spam overhead down a bit.

See also: Blocking Referrer Spam, Referrer Spammer IP Blocklist

Referrer Spammer IP Blocklist

Here’s a list of IP addresses that have been sending me referrer spam this week. I haven’t seen a major attack like this before; I’ve been getting something like 10,000 requests per day since last week.

Most of the bad referrers point to “.go.to”, “.drop.to”, “.hey.to”, “.dive.to”, “.come.to”, “.switch.to”, and other “.to” TLD sites. The originating IPs are all over the internet. The typical pattern seems to be a few requests from each IP, rather than a stream from a single IP. The user-agent strings are all different, so perhaps these are individual PCs that have been hacked into a botnet for spamming purposes.
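For counting, here’s roughly how the per-IP totals for this run could be pulled out of an Apache combined-format access log in Python. The “.to” referrer pattern matches this particular spam run, and the sample log lines are invented.

```python
# Count requests per source IP whose Referer matches this run's ".to" spam.
import re
from collections import Counter

# Combined log format: capture the client IP and the Referer field.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "([^"]*)"')
SPAM_REF = re.compile(r'\.(?:go|drop|hey|dive|come|switch|get)\.to', re.I)

def spam_ip_counts(lines):
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and SPAM_REF.search(m.group(2)):
            hits[m.group(1)] += 1
    return hits

sample = [
    '1.2.3.4 - - [06/Oct/2005:10:00:00 -0700] "HEAD / HTTP/1.1" 403 - '
    '"http://pills.drop.to/" "Mozilla/4.0"',
    '5.6.7.8 - - [06/Oct/2005:10:00:01 -0700] "GET / HTTP/1.1" 200 1234 '
    '"http://example.com/" "Mozilla/5.0"',
]
for ip, n in spam_ip_counts(sample).most_common():
    print(f"deny from {ip}  # {n} requests")
```

Feeding the real log through something like this would generate the deny list below mechanically instead of by hand.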

If your IP address is on this list, you’re temporarily blocked here, and your computer probably needs to be checked for viruses.

I’m still working on getting the regular expressions working in .htaccess. They’re not behaving consistently, which is too bad since this batch of referrer spam would easily be blocked that way.

See also: Blocking Referrer Spam


# Bad IP blocklist 10-2-2005
deny from 64.193.62.232
deny from 70.84.211.130
deny from 69.28.242.87

# Bad IP blocklist 10-6-2005
deny from 66.246.218.114
deny from 71.57.133.162
deny from 67.186.112.106
deny from 84.139.88.151
deny from 172.202.144.111
deny from 172.206.206.111
deny from 210.213.132.240
deny from 195.252.85.29
deny from 200.116.118.149
deny from 83.109.41.39
deny from 68.228.171.28
deny from 71.57.17.237
deny from 211.30.20.3
deny from 65.1.135.21
deny from 85.140.26.144
deny from 60.228.205.13
deny from 172.195.205.18
deny from 218.111.180.243
deny from 194.158.220.138
deny from 24.239.174.55
deny from 84.110.62.170
deny from 84.58.193.189
deny from 221.97.4.165
deny from 220.137.197.52
deny from 201.8.242.11
deny from 202.81.183.165
deny from 201.240.21.13
deny from 211.223.170.139
deny from 82.229.255.13
deny from 213.250.5.19
deny from 193.144.203.77
deny from 81.4.130.215
deny from 202.138.169.17
deny from 195.210.209.115
deny from 85.220.2.62
deny from 201.29.222.32
deny from 60.42.186.227
deny from 86.193.134.182
deny from 84.25.70.194
deny from 200.172.43.67
deny from 213.66.174.126
deny from 203.92.47.66
deny from 172.207.120.78
deny from 87.3.211.214
deny from 68.107.174.200
deny from 172.199.172.194
deny from 201.140.73.130
deny from 83.244.23.109
deny from 172.152.183.58
deny from 172.178.14.8
deny from 84.58.145.157
deny from 200.110.140.116
deny from 84.227.139.211
deny from 71.129.52.170
deny from 213.219.95.143
deny from 172.132.189.121
deny from 85.202.149.22

# added 10-07-2005
deny from 69.243.192.12
deny from 202.188.11.198
deny from 80.193.21.24
deny from 67.83.162.159
deny from 68.218.172.174
deny from 217.69.246.176
deny from 68.191.140.212
deny from 65.197.39.136
deny from 71.56.28.100
deny from 200.45.239.242
deny from 81.206.15.69
deny from 172.145.127.57
deny from 60.231.122.178
deny from 220.139.58.234
deny from 83.99.169.78
deny from 172.148.63.162
deny from 172.216.76.225
deny from 172.134.233.108
deny from 61.206.107.252
deny from 218.111.40.5
deny from 213.94.235.219
deny from 71.50.252.214
deny from 82.122.69.26
deny from 85.18.136.77
deny from 218.237.94.161
deny from 203.160.1.39
deny from 219.198.40.228
deny from 82.125.202.48
deny from 172.197.232.34
deny from 172.148.71.151
deny from 219.8.135.17
deny from 71.96.28.185
deny from 66.24.44.67
deny from 69.232.100.136
deny from 218.111.203.111
deny from 70.118.248.68
deny from 202.150.96.38
deny from 60.231.218.253
deny from 138.130.48.174
deny from 172.193.178.40
deny from 220.208.94.40
deny from 69.63.50.211
deny from 24.207.35.31
deny from 219.95.215.106
deny from 202.70.206.34
deny from 4.247.49.125
deny from 24.6.128.195
deny from 175.155.157.72
deny from 70.116.132.154
deny from 70.171.34.93
deny from 172.176.231.93
deny from 172.188.232.156
deny from 209.62.198.28
deny from 172.163.161.250
deny from 193.230.181.249
deny from 24.83.50.167
deny from 221.152.50.145
deny from 221.188.182.33
deny from 206.248.94.3

## added 10-08-2005
deny from 203.186.238.239
deny from 82.44.39.104
deny from 172.195.168.35
deny from 69.181.144.132
deny from 220.35.108.250
deny from 85.117.39.130
deny from 172.216.15.107
deny from 69.145.15.24
deny from 195.146.112.130
deny from 203.186.238.240
deny from 70.30.235.27
deny from 64.147.167.130
deny from 172.187.185.145
deny from 68.7.68.46
deny from 64.39.152.1
deny from 172.171.237.162
deny from 80.235.89.91
deny from 64.110.109.253
deny from 172.207.231.89
deny from 61.214.91.105
deny from 70.30.227.73
deny from 69.174.197.196
deny from 172.170.50.116
deny from 86.133.42.86
deny from 65.69.88.137
deny from 24.185.78.84
deny from 221.140.103.198
deny from 85.192.22.6
deny from 212.127.163.23
deny from 85.140.102.49
deny from 84.166.115.140
deny from 212.33.81.18
deny from 64.230.24.143
deny from 84.184.114.158
deny from 80.235.67.123
deny from 83.248.24.217
deny from 81.15.167.146
deny from 200.72.175.16
deny from 172.153.161.35
deny from 213.172.254.45
deny from 209.34.34.33
deny from 193.77.173.58
deny from 68.218.8.107

Blocking Referrer Spam

This afternoon, I’ve noticed there’s a steady stream of HTTP referrer (aka referer) spam originating from a few IP addresses, so I’m finally getting around to making some updates to reduce the volume of spam traffic. In the past I’ve been getting a few spam referrers here and there, but today there are thousands in just a few hours, and these changes are a bit overdue.

Here are the IP addresses sending me spam today:

64.193.62.232
70.84.211.130
69.28.242.87

All of the HTTP requests are HEAD only, not GET. Here’s a typical one:

64.193.62.232 - - [02/Oct/2005:14:34:34 -0700]
    "HEAD / HTTP/1.1" 403 - "http://cheap-vicodin.none.pl"
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Notice the 403 Forbidden status code. That’s because I’ve added a section to .htaccess to block referrers with spammy keywords, and also to manually block IP addresses. Here’s an abbreviated version:

deny from 64.193.62.232
deny from 70.84.211.130
deny from 69.28.242.87

RewriteEngine on
RewriteCond %{HTTP_REFERER} ^(http://)?(www\.)?.*(-|\.)vicodin(-|\.).*$ [NC,OR]
< ...lots of other rules go here...>
RewriteRule .* - [F,L]

One convenient aspect of having non-stop incoming spam today is being able to make changes and immediately observe the effect. It’s modestly gratifying to see all of the “200 OK” responses turn into “403 Forbidden”.

The current block list I’m using for .htaccess is mostly from a list maintained by Aaron Logan.

I also looked through suggestions for .htaccess changes and block lists for referrer spam by Joe Maller, Dave Child, and Mike Healan.

Unfortunately, all of these approaches, especially the IP blocking, are manual processes. I’ve been meaning to get Bad Behavior implemented here, but this was a quick fix for today.

Update 10-06-2005 08:25 PDT: Still getting lots of incoming spam traffic, plus many new IP addresses showing up now. Here’s the revised block list, all of these addresses are actively sending spam.

deny from 64.193.62.232
deny from 70.84.211.130
deny from 69.28.242.87
deny from 66.246.218.114
deny from 71.57.133.162
deny from 67.186.112.106
deny from 84.139.88.151
deny from 172.202.144.111
deny from 172.206.206.111
deny from 210.213.132.240
deny from 195.252.85.29
deny from 200.116.118.149
deny from 83.109.41.39
deny from 68.228.171.28
deny from 71.57.17.237
deny from 211.30.20.3
deny from 65.1.135.21
deny from 85.140.26.144
deny from 60.228.205.13
deny from 172.195.205.18
deny from 218.111.180.243
deny from 194.158.220.138
deny from 24.239.174.55
deny from 84.110.62.170
deny from 84.58.193.189
deny from 221.97.4.165
deny from 220.137.197.52
deny from 201.8.242.11
deny from 202.81.183.165
deny from 201.240.21.13
deny from 211.223.170.139
deny from 82.229.255.13