Coming soon to DVD - 1,146,580,664 common five-word sequences
Google Research is publishing a huge n-gram dataset distilled from the roughly one trillion words of text gathered by Google’s search crawlers:
We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
This looks like just the thing for developing predictive text applications, or for open-ended data mining. The 6-DVD set will be distributed by the Linguistic Data Consortium, which collects and distributes speech and text databases and training sets. Other items in their collection include transcribed speech from 3,000 speakers, a mapping between Chinese and English place, organization, and corporate names, and a transcription of colloquial Levantine Arabic speech.
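To make the predictive text idea concrete, here is a minimal Python sketch of how five-word counts could drive next-word suggestions. It is not Google's pipeline or the dataset's file format: the function names (`count_ngrams`, `build_predictor`, `predict_next`), the whitespace tokenization, and the toy sentence are all assumptions for illustration, and the cutoff is 2 rather than the 40 used for the real data.

```python
from collections import Counter, defaultdict

def count_ngrams(tokens, n=5, min_count=1):
    """Count every n-word sequence and keep those seen at least min_count times."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}

def build_predictor(ngram_counts):
    """Map each (n-1)-word context to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for gram, c in ngram_counts.items():
        table[gram[:-1]][gram[-1]] += c
    return table

def predict_next(table, context):
    """Return the most frequent continuation of context, or None if unseen."""
    followers = table.get(tuple(context))
    return followers.most_common(1)[0][0] if followers else None

# Toy corpus (hypothetical); the real dataset covers about a trillion words.
text = ("the cat sat on the mat and the cat sat on the sofa "
        "and the cat sat on the mat")
tokens = text.split()

# Keep only five-word sequences seen at least twice (the real cutoff is 40).
counts = count_ngrams(tokens, n=5, min_count=2)
table = build_predictor(counts)

print(predict_next(table, ["cat", "sat", "on", "the"]))
```

With the cutoff applied, "cat sat on the sofa" (seen once) is dropped, so the only surviving continuation of "cat sat on the" is "mat". The same trade-off applies at full scale: a frequency cutoff shrinks the table considerably at the cost of never suggesting rare continuations.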
Update Sunday 08-06-2006 16:41 PDT: See also AOL Research publishes 20 million search queries
Tags: search, algorithms, research, datamining, data, computational, linguistics, resources


























