Coming soon to DVD - 1,146,580,664 common five-word sequences

August 5th, 2006 8:58pm

Google Research is publishing a huge n-gram dataset distilled from trillions of words perused by Google’s vast search spidering effort:

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

This looks like just the thing for developing some interesting predictive text applications, or simply for random data mining. The 6-DVD set will be distributed by the Linguistic Data Consortium, which collects and distributes interesting speech and text databases and training sets. Some other items in their collection include transcribed speech from 3000 speakers, a mapping between Chinese and English place, organization, and corporate names, and a transcription of colloquial Levantine Arabic speech.
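To make the idea concrete, here is a minimal sketch of how counts like these could be built and used for next-word prediction on a small scale. The function names (ngram_counts, predict_next) are my own, and this is only a toy in-memory illustration under those assumptions, not a description of how Google produced the dataset.

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=40):
    """Count every n-word sequence and keep those seen at least min_count times.

    The defaults mirror the thresholds quoted above (five-word sequences,
    at least 40 occurrences); everything else here is an illustrative assumption.
    """
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_count}

def predict_next(counts, context):
    """Return the most frequent word following `context` (n-1 words), or None."""
    context = tuple(context)
    candidates = {gram[-1]: c for gram, c in counts.items() if gram[:-1] == context}
    return max(candidates, key=candidates.get) if candidates else None

# Toy usage with trigrams and a low threshold so a tiny corpus still works.
text = "the cat sat on the mat the cat sat on the rug the cat sat on the mat"
counts = ngram_counts(text.split(), n=3, min_count=2)
suggestion = predict_next(counts, ("on", "the"))
```

With the real dataset the counts would not fit comfortably in a Python dict, so some kind of on-disk index would be needed; the lookup logic would stay the same.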

Update Sunday 08-06-2006 16:41 PDT: See also AOL Research publishes 20 million search queries

Google’s PageRank and Beyond - summer reading for search hackers

July 11th, 2006 7:31pm

The past few evenings I’ve been working through a review copy of Google’s PageRank and Beyond, by Amy Langville and Carl Meyer. Unlike some recent books on Google, this isn’t exactly an easy and engaging summer read. However, if you have an interest in search algorithms, applied math, or search engine optimization, or are considering building your own search engine, this is a book for you.

Students of search and information retrieval literature may recognize the authors, Langville and Meyer, from their review paper, Deeper Inside PageRank. Their new book expands on the technical material in the original paper, and adds many anecdotes and observations in sidebars throughout the text. The side notes provide some practical, social, and recent historical context for the math being presented, including topics such as “PageRank and Link Spamming”, “How Do Search Engines Make Money?”, “SearchKing vs Google”, and a reference to Jeremy Zawodny’s PageRank is Dead post. There is also some sample MATLAB code and pointers to web resources related to search engines, linear algebra, and crawler implementations. (The aspiring search engine builder will want to explore some of these resources, and look elsewhere as well, to learn about web crawlers and large-scale computation, which are not the focus here.)

Randomly exploring the long tail of search results

March 6th, 2006 7:19pm

I sometimes click on a random “deep” search result page to see if anything interesting turns up, because of the limitations of popularity and PageRank for some queries.

Paul Kedrosky points at a recent paper from CMU which suggests that randomly mixing in some low-ranking pages may improve search results over time.

Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have few visits and/or in-links. Worse, the process by which new, high-quality pages accumulate popularity is actually inhibited by search engines. Since search engines dole out a limited number of clicks per unit time among a large number of pages, always listing highly popular pages at the top, and because users usually focus their attention on the top few results, newly-created but high-quality pages are “shut out.”
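Here is a rough sketch of what “mixing in” could look like. To be clear, this is my own simplified illustration and not the algorithm from the paper: the function name mix_in_random, the k_fixed and epsilon parameters, and the slot-by-slot coin flip are all assumptions made for the example.

```python
import random

def mix_in_random(ranked, pool, k_fixed=1, epsilon=0.1, rng=None):
    """Interleave randomly chosen low-ranking pages into a popularity-ranked list.

    ranked:  pages in normal (popularity) order
    pool:    low-ranking or new pages that would otherwise be "shut out"
    k_fixed: number of top results that always stay in place
    epsilon: chance that any later slot is given to a random page from the pool
    """
    rng = rng or random.Random()
    popular = list(ranked[k_fixed:])
    fresh = list(pool)
    rng.shuffle(fresh)                      # random order among the newcomers
    result = list(ranked[:k_fixed])
    while popular or fresh:
        # Take a newcomer if the coin flip says so, or if nothing else is left.
        take_fresh = fresh and (not popular or rng.random() < epsilon)
        result.append(fresh.pop() if take_fresh else popular.pop(0))
    return result

ranked = ["a", "b", "c", "d"]               # hypothetical page ids
pool = ["x", "y"]
mixed = mix_in_random(ranked, pool, k_fixed=2, epsilon=0.5, rng=random.Random(0))
```

The popular pages keep their relative order, so the occasional newcomer gets a chance to collect clicks without the top of the list being scrambled.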

P.R.A.S.E. - PageRank assisted search engine - compare ranking on Google, Yahoo, and MSN

January 17th, 2006 11:01pm

P.R.A.S.E., aka “Prase”, is a new web tool for examining the PageRank assigned to top search results at Google, Yahoo, and MSN Search. Search terms are entered in the usual way, but a combined list of results from the three search engines is presented in PageRank order, from highest to lowest, along with the search engine and result rank.
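The combined listing is easy to picture in code. This is only a guess at the general shape of the merge, not how Prase is actually implemented; the function name, the input dictionaries, and the example URLs and PageRank values are all made up for illustration.

```python
def combine_by_pagerank(results_by_engine, pagerank):
    """Merge per-engine result lists into one list sorted by PageRank, high to low.

    results_by_engine: {engine name: [urls in that engine's result order]}
    pagerank:          {url: toolbar-style PageRank}; unknown urls count as 0
    Returns rows of (pagerank, engine, rank within engine, url).
    """
    rows = []
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            rows.append((pagerank.get(url, 0), engine, rank, url))
    # Highest PageRank first; ties broken by engine rank, then engine name.
    rows.sort(key=lambda row: (-row[0], row[2], row[1]))
    return rows

# Hypothetical data for illustration only.
results = {
    "Google": ["a.example", "c.example"],
    "Yahoo": ["b.example", "a.example"],
}
pr = {"a.example": 7, "b.example": 0, "c.example": 5}
combined = combine_by_pagerank(results, pr)
```

In this made-up data the top Yahoo result lands at the bottom of the combined list because its PageRank is 0.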

I tried a few search queries, such as “web 2.0”, “palo alto”, “search algorithm”, “martin luther king”, and was surprised to see how quickly the PageRank 0 pages start turning up in the search results. For “web 2.0”, the top result on Yahoo is the Wikipedia entry on Web 2.0, which seems reasonable, but it’s also a PR0 page, which is surprising to me.

As a further experiment, I tried a few keywords from this list of top paying search terms, with generally similar results.

Why Link Farms (used to) Work

December 22nd, 2005 2:58pm

I tripped over a reference to an interesting paper on PageRank hacking while looking at some unrelated rumors at Ian McAllister’s blog. The undated paper is titled “Faults of PageRank / Something is Wrong with Google’s Mathematical Model”, by Hillel Tal-Ezer, a professor at the Academic College of Tel-Aviv Yaffo.

It points out a fault in Google’s PageRank algorithm that causes ‘sink’ pages that are not strongly connected to the main web graph to have an unrealistic importance. The author then goes on to explain a new algorithm with the same complexity of the original PageRank algorithm that solves this problem.

After a quick read through this, it appears to describe one of the techniques that had been popular among some search engine optimizers a while back, in which link farms would be constructed pointing at a single page with no outbound links, in an effort to artificially raise the target page’s search ranking.
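To see the effect on a toy graph, here is a small power-iteration sketch using the textbook PageRank formulation, with rank from pages that have no outbound links spread evenly over all pages. This is neither Tal-Ezer’s proposed algorithm nor whatever Google runs in production, and the graph, page names, and farm size are invented for the example.

```python
def pagerank(links, damping=0.85, iters=100):
    """Textbook PageRank by power iteration.

    links: {page: [pages it links to]}. Pages with no outbound links
    ("sinks") have their rank redistributed uniformly on each step.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        sink_mass = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * sink_mass / n for v in nodes}
        for v in nodes:
            outs = links.get(v, [])
            for t in outs:
                new[t] += damping * rank[v] / len(outs)
        rank = new
    return rank

# A tiny "main web" (a <-> b) plus a target page with no outbound links.
base = {"a": ["b"], "b": ["a"], "target": []}
# The same graph with a hypothetical five-page link farm pointing at the target.
farmed = dict(base, **{f"farm{i}": ["target"] for i in range(5)})

before = pagerank(base)
after = pagerank(farmed)
```

Comparing before["target"] with after["target"] shows the target’s share of the total rank going up once the farm is added, even though nothing in the “main web” links to it.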

Personalization, Intent, and modifying PageRank calculations

December 8th, 2005 4:00pm

Greg Linden took a look at Langville and Meyer’s Deeper Inside PageRank, one of the papers on my short PageRank reading list, and is looking into some of the same areas I’ve been thinking about.

On the probabilities of transitioning across a link in the link graph, the paper’s example on p. 338 assumes that surfers are equally likely to click on links anywhere on the page, clearly a questionable assumption. However, at the end of that page, they briefly state that “any suitable probability distribution” can be used instead, including one derived from “web usage logs”.

Similarly, section 6.2 describes the personalization vector — the probabilities of jumping to an unconnected page in the graph rather than following a link — and briefly suggests that this personalization vector could be determined from actual usage data.
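Both knobs can be shown in a few lines. The sketch below is my own illustration of the general idea, not code from the paper: click_weights stands in for link-following frequencies that might come from usage logs, personalization is the jump distribution, and sending all rank from pages without outbound links through the personalization vector is one common choice that I am assuming here. Every link target is expected to appear as a key in personalization.

```python
def personalized_pagerank(click_weights, personalization, damping=0.85, iters=100):
    """PageRank with non-uniform link probabilities and a personalization vector.

    click_weights:   {page: {linked page: weight}}, e.g. observed click counts
    personalization: {page: weight} used when the surfer jumps instead of clicking
    """
    nodes = sorted(personalization)
    total_v = sum(personalization.values())
    v = {p: personalization[p] / total_v for p in nodes}   # normalize to sum to 1
    rank = dict(v)
    for _ in range(iters):
        new = {p: 0.0 for p in nodes}
        jump_mass = 0.0
        for p in nodes:
            out = click_weights.get(p, {})
            total = sum(out.values())
            if total == 0:
                jump_mass += rank[p]            # no links: the surfer must jump
                continue
            jump_mass += (1 - damping) * rank[p]
            for q, w in out.items():
                new[q] += damping * rank[p] * w / total   # follow links by weight
        for p in nodes:
            new[p] += jump_mass * v[p]          # jumps land according to v
        rank = new
    return rank

# Hypothetical data: from page a, 9 of 10 clicks go to b and 1 goes to c.
clicks = {"a": {"b": 9, "c": 1}, "b": {"a": 1}, "c": {"a": 1}}
uniform = personalized_pagerank(clicks, {"a": 1, "b": 1, "c": 1})
biased = personalized_pagerank(clicks, {"a": 1, "b": 1, "c": 8})
```

With uniform jumps, b outranks c purely because of the click weights; biasing the personalization vector toward c raises c’s rank without touching the link structure.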

A reading list on PageRank and search algorithms

December 1st, 2005 1:00pm

If you’re subscribed to the full feed, you’ll notice I collected some background reading on PageRank, search crawlers, search personalization, and spam detection in the daily links section yesterday. Here are some references that are worth highlighting for those who have an interest in the innards of search in general and Google in particular.


 
    © 2004-2008 Ho John Lee