Google search results and DMOZ editorializing?

May 11th, 2008 6:58pm

I’ve never seen a search result page like this before. The meta text “Conservative think tank claiming to report about events and nations strategically important to the United States” doesn’t appear anywhere in the referenced page, which doesn’t contain any useful <META> content. Searching for that text, it looks like the text originated from the DMOZ directory listing.

Another entry from the same DMOZ list, the Kensington Review, also returns the DMOZ meta text, this time in place of the <META> text in the actual page. DMOZ says “An e-magazine of political and social commentary. When the left says the glass is half full and the right says it is half empty, Kensington suggests that it might be too big.” Kensington’s own META says “An electronic journal of political, financial and social commentary”. The DMOZ text is the more interesting description, but again it does not originate from the content itself.

Hacked by keymachine.de

April 2nd, 2008 6:15pm

I just noticed that my WordPress installation got hacked by a search engine spam injection attack sometime in the past few weeks. This particular one inserts invisible text with lots of keywords in footer.php. The changes to the file were made using the built-in theme editor, originating from ns.km20725.keymachine.de, which is currently at 84.19.188.144. The spam campaign automatically updates the spam payload every day or so. The links point to a variety of servers that have also been hacked to host the spam content. Here is a sample: http://www.nanosolar.com/feb3/talk.php?28/82138131762.html
I’ve sent an e-mail to Nanosolar, so they’ll probably have that content cleaned up before long. But the automated SEO spam campaign updates the keyword and link payload regularly, so any affected WordPress sites will be updated to point at the new hosting victims.
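
One way to check a theme for this kind of injection is to scan the template source for hidden elements stuffed with links. The sketch below is only a rough heuristic. It assumes the spam is hidden with inline CSS (display:none or visibility:hidden), which this particular attack may or may not have used, and the function name and link threshold are made up for illustration.

```python
import re

# Assumption: the spam block is hidden with inline CSS. Other hiding tricks
# (off-screen positioning, tiny fonts, external stylesheets) are not detected.
HIDDEN_STYLE = re.compile(
    r"<(\w+)[^>]*style\s*=\s*[\"'][^\"']*"
    r"(display\s*:\s*none|visibility\s*:\s*hidden)[^\"']*[\"'][^>]*>",
    re.IGNORECASE,
)
LINK = re.compile(r"<a\s[^>]*href\s*=", re.IGNORECASE)


def suspicious_lines(template_source, min_links=3):
    """Return (line_number, link_count) for lines that open a hidden
    element and also contain at least min_links anchor tags."""
    hits = []
    for number, line in enumerate(template_source.splitlines(), start=1):
        if HIDDEN_STYLE.search(line):
            link_count = len(LINK.findall(line))
            if link_count >= min_links:
                hits.append((number, link_count))
    return hits
```

Feeding it the contents of footer.php (and the other theme files) gives a short list of lines to inspect by hand. It only looks at one line at a time, so a payload spread over several lines would slip past it.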

Ms. Dewey - Stylish search, with whips, guns, and dating tips

October 29th, 2006 8:36pm


It’s been a while since I’ve come across something I haven’t seen before online. Ms. Dewey fits the bill. It is a Flash-based application combining video clips of actress Janina Gavankar with Windows Live search.

As a search application, it’s fat, slow, and the query results aren’t great. However, as John Battelle observes, “clearly, search ain’t the point.” This is search with a flirty attitude, where the speed and quality of the results aren’t at the top of the priority list.

As short-attention-span theater goes, it’s quite entertaining.

If you can’t think of anything to search for, Ms. Dewey will fidget for a while and eventually reach out and tap on the screen. “Helloooo…type something here…”

More on the America Online search query data

August 7th, 2006 7:58pm

The search query data that America Online posted over the weekend has been removed from their site following a blizzard of posts regarding the privacy issues. AOL officially regards this as “a screw up”, according to spokesperson Andrew Weinstein, who responded in comments on several sites:

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

AOL Research publishes 20 million search queries

August 6th, 2006 3:45pm

More raw data for search engineers and SEOs, and fodder for online privacy debates - AOL Research has released a collection of roughly 20 million search queries which include all searches done by a randomly selected set of around 500,000 users from March through May 2006.

This should be a great data set to work with if you’re doing research on search engines, but seems problematic from a privacy perspective. The data is anonymized, so AOL user names are replaced with a numerical user ID:

The data set includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl}.

I suspect it may be possible to reverse engineer some of the query clusters to identify specific users or other personal data. If nothing else, I occasionally observe people accidentally typing in user names or passwords into search boxes, so there are likely to be some of those in the mix. “Anonymous” in the comments over at Greg Linden’s blog thinks there will be a lot of those. The destination URLs have apparently been clipped as well, so you won’t be able to see the exact page that resulted in a click-through.
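
To see why replacing user names with a numeric ID only goes so far, here’s a minimal sketch that groups queries under each ID. It assumes the records are tab-separated with the five fields in the order listed above (the actual file layout may differ), and the sample rows used to try it out would have to be invented rather than taken from the AOL data.

```python
import csv
import io
from collections import defaultdict

# Field names as listed in the data set description; tab-separated layout is an assumption.
FIELDS = ["UserID", "Query", "QueryTime", "ClickedRank", "DestinationDomainUrl"]


def queries_by_user(tsv_text):
    """Group the distinct query strings under the numeric user ID that issued them."""
    grouped = defaultdict(list)
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # queries may contain stray quote characters
    )
    for row in reader:
        if row["Query"] not in grouped[row["UserID"]]:
            grouped[row["UserID"]].append(row["Query"])
    return dict(grouped)
```

Once all of one ID’s queries sit side by side (a home town, a surname, a mistyped login), the cluster can say a lot more than any single query does.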

Coming soon to DVD - 1,146,580,664 common five-word sequences

August 5th, 2006 8:58pm

Google Research is publishing a huge n-gram dataset distilled from trillions of words perused by Google’s vast search spidering effort:

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

This looks like just the thing for developing some interesting predictive text applications, or just random data mining. The 6-DVD set will be distributed by the Linguistic Data Consortium, which collects and distributes interesting speech and text databases and training sets. Some other items in their collection include transcribed speech from 3000 speakers, a mapping between Chinese and English place, organization, and corporate names, and a transcription of colloquial Levantine Arabic speech.
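
For a feel of what the counts represent, here’s a toy version of n-gram counting with a frequency cutoff. It is only a sketch: it holds everything in memory, works on a pre-tokenized list, and skips the rare-word filtering mentioned in the announcement. The function name and parameters are my own.

```python
from collections import Counter


def ngram_counts(tokens, n=5, min_count=40):
    """Count n-word sequences and keep only those seen at least min_count times."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

The defaults mirror the five-word length and the 40-occurrence threshold quoted above; on a small test corpus you’d want a much lower min_count.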

Update Sunday 08-06-2006 16:41 PDT: See also AOL Research publishes 20 million search queries

Google is having problems this evening?

July 26th, 2006 8:09pm

For the past half hour or so this evening (20:30 - 21:00 PDT) I’ve been getting slow responses or connection timeouts from Google. Usually this means that the local network is having problems, but other major sites (Yahoo, CNN) are running as quickly as ever, along with various SSH sessions around the world, so it seems to be specific to Google.

So far I get slow or no response from the main search page, Gmail, Adsense, Adwords, Analytics, and Finance.

Pages that do respond are coming back in 10+ seconds, and some pages are loading without graphics or with templates only and no content.

Anyone else seeing these problems? This is the first time I’ve seen Google unusable for more than a minute or two. (Unlike this site, which has been bouncing up and down due to problems at Dreamhost lately).

Search referrals - July 2006 snapshot

July 24th, 2006 9:07pm


Here’s a quick snapshot of incoming search engine referrals for the past few weeks. Compare this with another post last year on search engine referral share, recently referenced in a post at Alexa noting the discrepancy between the published search engine traffic reports and anecdotal observations by webmasters.

Is it just me, or are these charts a bit goofy? Does Yahoo really still have 23% of the search market? Is Google at less than half the search market?

I don’t believe it. Any webmaster will tell you that Google represents almost ALL of the search engine traffic. Yahoo is nowhere near 23%. Just read the blogs, here, here, here and here and on countless other blogs.

Already at 82% last October, Google has increased to even more of the incoming search traffic (92%) here, largely at the expense of “Other”. In the fall, it looked like those were mostly miscellaneous Chinese search engines, so perhaps my site is not getting indexed or ranked well there anymore, or Google is picking up market share, or both.
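
Percentages like these can be tallied from the referrer URLs in a server log. The following is a rough sketch, not the tool that produced the numbers above. It assumes the input has already been narrowed down to search engine referrals, and the hostname fragments used to recognize each engine are my own guesses.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: hostname fragments that identify each engine; anything else is "Other".
ENGINES = {"google.": "Google", "yahoo.": "Yahoo", "msn.": "MSN"}


def referral_share(referrers):
    """Return each engine's percentage share of the given referrer URLs."""
    counts = Counter()
    for url in referrers:
        host = urlparse(url).hostname or ""
        label = next((name for frag, name in ENGINES.items() if frag in host), "Other")
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}
```

Matching on hostname fragments is crude, but it copes with country domains such as google.co.uk without listing each one.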

The Long Tail of Invalid Clicks and other Google click fraud concepts

July 22nd, 2006 7:23pm

Some fine weekend reading for search engineers, SEOs, and spam network operators:

A 47-page independent report on Google Adwords / Adsense click fraud, filed yesterday as part of a legal dispute between Lane’s Gifts and Google, provides a great overview of the history and current state of click fraud, invalid clicks of all types, and the four-layered filtering process that Google uses to detect them.

Google has built the following four “lines of defense” against invalid clicks: pre-filtering, online filtering, automated offline detection and manual offline detection, in that order. Google deploys different detection methods in each of these stages: the rule-based and anomaly-based approaches in the pre-filtering and the filtering stages, the combination of all the three approaches in the automated offline detection stage, and the anomaly-based approach in the offline manual inspection stage. This deployment of different methods in different stages gives Google an opportunity to detect invalid clicks using alternative techniques and thus increases their chances of detecting more invalid clicks in one of these stages, preferably proactively in the early stages.
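
To make “rule-based” a little more concrete, here is a sketch of one simple rule of that general kind: treat a repeat click by the same visitor on the same ad within a few seconds as invalid. The rule, the window length, and the record layout are all hypothetical. They are not taken from the report, and this is not a description of Google’s actual filters.

```python
def filter_repeat_clicks(clicks, window_seconds=10):
    """Split clicks into (valid, invalid) using one hypothetical rule:
    a click is invalid if the same visitor had a valid click on the same
    ad less than window_seconds earlier.

    Each click is a (timestamp_seconds, visitor, ad) tuple.
    """
    last_valid = {}
    valid, invalid = [], []
    for timestamp, visitor, ad in sorted(clicks):
        key = (visitor, ad)
        if key in last_valid and timestamp - last_valid[key] < window_seconds:
            invalid.append((timestamp, visitor, ad))
        else:
            valid.append((timestamp, visitor, ad))
            last_valid[key] = timestamp
    return valid, invalid
```

An anomaly-based check would look at aggregate behavior instead (for example, a visitor whose click volume is far outside the norm), which is why running several kinds of checks in sequence catches more than any single one.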

Google’s PageRank and Beyond - summer reading for search hackers

July 11th, 2006 7:31pm

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical subject material in the original paper, and adds many anecdotes and observations in numerous sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample Matlab code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources and elsewhere to learn about web crawlers and large scale computation, which is not the focus here.)
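
For anyone who hasn’t seen the core idea, here is a minimal power-iteration sketch of PageRank. It is written in Python rather than the Matlab used in the book, it is not the book’s code, and the damping value of 0.85 is just the commonly cited default. The function and variable names are my own.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank
```

On a real web graph the same iteration is done with sparse matrices and a convergence test rather than a fixed number of passes.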

Del.icio.us adds private bookmarks

March 19th, 2006 7:58pm

Del.icio.us is testing out private bookmarks now.

I’ve been playing with a private instance of Scuttle ever since del.icio.us was purchased by Yahoo a few months back, but have continued using del.icio.us for posting public links anyway.

My del.icio.us links are automatically posted here (except when one end or the other is out of service for some reason), but I don’t know whether that would include the private ones. I also don’t know exactly where the private bookmarks might be visible, aside from in one’s own account. I’ll have to give it a try.

More tea leaves from Google’s analyst day presentation

March 7th, 2006 4:31pm

It seems that a lot of the interesting content from last week’s analyst event at Google is in the speaker notes from the PowerPoint slide deck. Greg Linden and others have already pointed out the notes about Google’s storage plans (GDrive, Lighthouse on slide 19).

This afternoon there’s another blip on CNBC about accidental communications in the slides.

The previously undisclosed notes stated that Google’s core advertising business was expected to grow by nearly 60 percent to $9.5 billion in 2006 but that profit margins in its mainstay AdSense business could be squeezed this year and beyond.

I didn’t remember seeing a revenue forecast in there, so I went back and looked to see what it actually said (slide 14).

Our ads business for the moment is healthy and growing and we’re on a strong trajectory
projected to grow from $6bn this year to $9.5bn next year based purely on trends in traffic and monetization growth
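
As a quick check on the arithmetic, growing from $6bn to $9.5bn works out to a little over 58 percent, which fits the “nearly 60 percent” in the CNBC item. A tiny sketch (the function name is mine):

```python
def growth_percent(start, end):
    """Percentage growth from start to end."""
    return (end - start) / start * 100.0


# Figures in billions of dollars, from the slide notes quoted above.
print(f"{growth_percent(6.0, 9.5):.1f}%")
```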

Randomly exploring the long tail of search results

March 6th, 2006 7:19pm

I sometimes click on a random “deep” search result page to see if anything interesting turns up, because of the limitations of popularity and PageRank for some queries.

Paul Kedrosky points at a recent paper from CMU which suggests randomly mixing in some low ranking pages may improve search results over time.

Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have few visits and/or in-links. Worse, the process by which new, high-quality pages accumulate popularity is actually inhibited by search engines. Since search engines dole out a limited number of clicks per unit time among a large number of pages, always listing highly popular pages at the top, and because users usually focus their attention on the top few results, newly-created but high-quality pages are “shut out.”
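
Here’s a toy sketch of the general idea of mixing a few low-ranking pages into a result list. It is my own simplification, not the scheme from the paper: the top few results stay fixed, and below them each slot has a small chance of having a randomly chosen page from the tail inserted ahead of it. All names and default values are illustrative.

```python
import random


def mix_in_low_ranked(ranked, tail_pool, start=2, probability=0.1, rng=None):
    """Return a list the same length as ranked in which the first `start`
    results are unchanged and, below that, random pages from tail_pool
    may be inserted ahead of the regular results."""
    rng = rng or random.Random()
    pool = [p for p in tail_pool if p not in ranked]
    rng.shuffle(pool)
    result = list(ranked[:start])
    for page in ranked[start:]:
        if pool and rng.random() < probability:
            result.append(pool.pop())
        result.append(page)
    # Inserted pages push the lowest regular results off the end.
    return result[:len(ranked)]
```

Keeping the first couple of results fixed protects the answers most people click on, while the occasional promoted page gets a chance to pick up the visits it would need to earn a ranking of its own.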

Will Google grow at this rate forever? No? Then DIE!!

February 28th, 2006 9:39pm

Today was a moderately exciting or irritating day to be an investor in public technology companies. Google’s CFO, George Reyes, apparently forgot that he was webcasting to a public group of investors rather than conferencing with an in-house team at the Googleplex during the Q&A session at the Merrill Lynch Internet, Advertising, Information, & Education conference: (Yahoo/AP News)

Q: Looking back to Q3 2005, was there anything in there that was maybe sort of one-time in nature that accounted for such strong revenue growth…?

A: So we went through a period of probably 18 months where we thought we had…well, let me characterize it…we had what was called a RevForce initiative–Revenue Force–which was really a team of really very bright technical engineers that were trying to tweak and optimize the ad system, and not–you know in very very responsible ways [Don’t Be Evil!]–and that sort of paid off nicely with the fruits of that labor.

Google and magazine covers as a contrary indicator

February 12th, 2006 2:27pm

Is Google headed for a downturn? Not only is it featured in a generally negative cover article in this week’s Barron’s, but now it’s featured on the cover of Time as well. These magazines cater to very different audiences, so turning up on both at the same time could be considered a sign that Google is reaching a peak of sorts on both the financial and general cultural fronts.

There’s a long tradition of things going badly for companies and people after getting this sort of high profile magazine cover treatment. If Google turns up next on the cover of People or Entertainment Weekly they’re probably doomed…

Update 02-12-2006 18:31 PST: John Battelle suggests that having made the cover of Time, Google has “jumped the shark”, while Matt Cutts offers a recent historical perspective of Google’s non-shark-jumping behavior while simultaneously demonstrating effective link baiting technique.

Reverse engineering a referer spam campaign

February 4th, 2006 4:28pm

It looks like someone’s launched a new referrer spam campaign today; there’s a huge uptick in traffic here. The incoming requests are from all over the internet, presumably from a botnet of hijacked PCs, but it looks like all of the links point to a class C network at 85.255.114 somewhere in the Ukraine.
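
Spotting that kind of clustering is easy to automate once the spammed hostnames have been resolved to addresses. A minimal sketch, assuming a list of IPv4 address strings is already in hand (the resolution step is left out, and the function name is mine):

```python
import ipaddress
from collections import Counter


def top_networks(ip_strings, prefix_length=24):
    """Count addresses per network, most common first.

    A /24 prefix corresponds to the old "class C" sized block.
    """
    counts = Counter()
    for text in ip_strings:
        network = ipaddress.ip_network(f"{text}/{prefix_length}", strict=False)
        counts[str(network)] += 1
    return counts.most_common()
```

If nearly all of the linked hosts land in a single /24 while the requests themselves come from everywhere, the hosting block is the useful thing to look at, not the individual requesters.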

It’s interesting to think a little about link spam campaigns and what opportunity the operators hope to exploit. Two major types of link spam on blogs are comment spam and referrer spam. My perception is that comment spam is more common. Most blogs now wrap outgoing links in reader comments with “rel=nofollow” to prevent comment links from increasing Google rank for the linked items, but the links are still there for people to click on.

P.R.A.S.E. - PageRank assisted search engine - compare ranking on Google, Yahoo, and MSN

January 17th, 2006 11:01pm

P.R.A.S.E., aka “Prase” is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
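
The combined listing is straightforward to picture in code. This is only a guess at the general approach, not how Prase is actually implemented: the result lists and the PageRank lookup table are assumed to be available already, and a URL returned by more than one engine simply shows up once per engine.

```python
def merge_by_pagerank(results_by_engine, pagerank_of):
    """Combine per-engine result lists into one list sorted by PageRank.

    results_by_engine maps an engine name to its ordered list of URLs.
    pagerank_of maps a URL to its PageRank; unknown URLs count as 0.
    Returns (pagerank, url, engine, result_rank) tuples.
    """
    merged = []
    for engine, urls in results_by_engine.items():
        for position, url in enumerate(urls, start=1):
            merged.append((pagerank_of.get(url, 0), url, engine, position))
    # Highest PageRank first; ties broken by the better (smaller) result rank.
    merged.sort(key=lambda item: (-item[0], item[3]))
    return merged
```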

I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.

As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.

Tagnautica - fun Flickr tag navigator

January 15th, 2006 11:01pm

Tagnautica is a fun and interesting Flash user interface for exploring and navigating among tags, in this case on Flickr. After keying in an initial tag, related tags are displayed in a circle, with a sample image from each tag category displayed in a representative size.

When you move the cursor over a tag bubble, it temporarily becomes larger so you can get a look at it. The other bubbles keep resizing as well, giving the interface a very fluid appearance. When you find something you like, you can click on the Tagnautica bubble to view the tag page over at Flickr.

I always enjoy these sorts of user interfaces for semi-random exploration. I’ve noticed that I don’t really use any of the cool visualization tools when I actually want to find something, though. Not sure if that’s because they don’t represent a useful set of questions as implemented yet, or simply because my brain doesn’t work that way.

Watching 4th graders use search engines

January 15th, 2006 4:37pm

Last Friday I spent an hour with my daughter’s 4th grade class, helping them do online research for reports on early California explorers. They were individually assigned an explorer, and were looking for basic biographical information such as dates and places of birth and death, and notable historical achievements or other interesting items to write about. From my perspective, this turned out to be a sort of small focus group on using search engines.

I spend most of my time around people who are pretty good at using search engines and online research tools, so it was interesting to see what they would do with this assignment.

The kids are all familiar with computers to varying degrees. They have had classroom activities using the computer at least once a week since kindergarten, and most of them have some experience using computers at home (this is Palo Alto, after all). I don’t think they’ve done any organized “internet research” in school up to this point, though.

SearchSIG - January 2006

January 10th, 2006 10:35pm

This evening’s SearchSIG featured a panel discussion on tagging and social bookmarking.

L-R: Joshua Schachter (del.icio.us), Kevin Rose (Digg), Michael Tanne (Wink), Manish Chandra (Kaboodle)

Charlene Li (from Forrester) moderated.

The room at Yahoo was full — standing room only. A quick show of hands indicated nearly everyone in the room had used tagging services before.

There was some discussion about “how can we trust the tags”, tag spam (Charlene’s term was “spag”), discerning intent from user tagging and other actions, the problems of tagging users, and the range of social gestures built into the various systems.

Joshua used the example of receiving LinkedIn connection requests from someone whose name you don’t recognize. You don’t want to accept it, because you don’t know who it is. You don’t want to reject it, because it would be rude, and you might actually know them. So he has a huge backlog of random connection requests piling up in his inbox.

    • You are currently browsing the archives for the Search Engines category.


    © 2004-2008 Ho John Lee