|9 Nov 2003 @ 12:36, by Roger Eaton|
Two or three principal finds plus the sheer fun of words have made the search through academia worthwhile.
WordNet is a fascinating site for word bugs. It is an English dictionary/thesaurus developed at Princeton over the last 20 years or so to help automate semantic searches. WordNet lists English nouns, verbs, adjectives and adverbs in all of their senses, placing each sense in a "synset" of synonyms. In addition, antonyms, hypernyms and hyponyms are tracked, so each word is enmeshed in its relations with other words. WordNet is available both online and for download under a liberal license. For more details, see the Wikipedia article.
See Jaap Kamps' and Maarten Marx's Words with Attitude article for an illustration of the fun that can be had with WordNet. These two professors have explored the synonyms of three pairs of adjectives in WordNet: good/bad, active/passive and strong/weak. These three pairs are the result of earlier work by Osgood, the discoverer of semantic space. There are 21,365 adjectives in WordNet 1.6. Of these, there is a subset of 5,410 adjectives which can be reached by linking through synonym sets beginning with the three pairs. The surprise is that the exact same 5,410 adjectives are reached this way from each of the three pairs of starter adjectives. Words with attitude! A lovely result.
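The "reached by linking through synonym sets" idea can be sketched as a breadth-first search over a graph in which two words are neighbors whenever they share a synset. This is only a minimal sketch: the TOY_SYNSETS data and the function names are invented for illustration and are not taken from WordNet or from the Kamps and Marx article.

```python
from collections import deque

# Hypothetical toy synsets; real WordNet data would replace this list.
TOY_SYNSETS = [
    {"good", "sound"},
    {"sound", "heavy", "strong"},
    {"heavy", "big"},
    {"big", "bad"},
    {"strong", "active"},
    {"bad", "weak"},
    {"weak", "passive"},
    {"purple", "violet"},  # a separate, small component
]


def build_synonym_graph(synsets):
    """Map each word to the set of words it shares at least one synset with."""
    graph = {}
    for synset in synsets:
        for word in synset:
            graph.setdefault(word, set()).update(synset - {word})
    return graph


def reachable(graph, seeds):
    """Return every word reachable from the seed words via synonym links."""
    seen = {seed for seed in seeds if seed in graph}
    queue = deque(seen)
    while queue:
        word = queue.popleft()
        for neighbor in graph[word]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


graph = build_synonym_graph(TOY_SYNSETS)
from_good_bad = reachable(graph, ["good", "bad"])
from_active_passive = reachable(graph, ["active", "passive"])
from_strong_weak = reachable(graph, ["strong", "weak"])

# In this toy data, all three starter pairs reach the same set of words.
print(from_good_bad == from_active_passive == from_strong_weak)
print(sorted(from_good_bad))
```

In the toy data the three starter pairs land in the same set, while "purple" and "violet" stay outside it, which mirrors the shape of the result described above on a very small scale.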
Kamps continues in a fascinating paper called The Structure of Meaning. Those 5,410 adjectives that make up words with attitude are only one "component", where a component is a set of words that all link to each other through the WordNet synsets. It turns out that there is one giant adjectival component, one giant noun component, and one giant verb component. These three giants are much bigger than any other component. Zeroing in on the Words with Attitude component, these words can be divided very evenly into words with positive and negative connotations, allowing an article to be automatically rated as giving a positive or negative spin to its subject matter. We could do this for our English voh articles. Someday anyway. And don't miss Kamps' wonderful jiggling display of "good" and "bad".
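As a rough illustration of rating an article's spin, here is a simple count-based sketch. It is not necessarily how the paper does it: the POSITIVE and NEGATIVE word lists and the spin_score function are hypothetical stand-ins, and a real version would draw its lists from the Words with Attitude component.

```python
import re

# Hypothetical, tiny word lists used only for illustration.
POSITIVE = {"good", "strong", "lovely", "wonderful"}
NEGATIVE = {"bad", "weak", "awful", "poor"}


def spin_score(text):
    """Return a score in [-1, 1]: above 0 leans positive, below 0 leans negative.

    Only words found in one of the two lists count; if none are found,
    the score is 0.0.
    """
    words = re.findall(r"[a-z]+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    total = positive_hits + negative_hits
    if total == 0:
        return 0.0
    return (positive_hits - negative_hits) / total


print(spin_score("A lovely, wonderful result"))
print(spin_score("A bad plan with weak support but good intentions"))
```

Counting hits ignores negation ("not good") and word sense, so this would be at best a first approximation.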
At some point, we may want to use WordNet and the wordnets for other languages that are springing up to add some sophistication to our voice of humanity search methods. Even now, we may want to use it as an English "stemmer". WordNet does have the capability to stem all its vocabulary. So you can feed it "steamrolled" and it will return "steamroll". Most likely, we will not use stemming in the initial voh implementation. Somewhere I read that it does not really help that much, plus we would need a different stemmer for each language.
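To make the stemming idea concrete, here is a toy dictionary-backed stemmer: it strips a suffix and accepts the result only if it appears in a known vocabulary. The VOCABULARY set, the SUFFIX_RULES list and the stem function are assumptions made up for this sketch, not WordNet's actual data or interface.

```python
# Hypothetical toy vocabulary standing in for a real dictionary.
VOCABULARY = {"steamroll", "search", "link", "rate", "occur"}

# (suffix, replacement) pairs tried in order; chosen for illustration only.
SUFFIX_RULES = [
    ("ing", ""),
    ("ing", "e"),
    ("ed", ""),
    ("ed", "e"),
    ("es", ""),
    ("s", ""),
]


def stem(word, vocabulary=VOCABULARY):
    """Return a base form found in the vocabulary, or the word unchanged."""
    word = word.lower()
    if word in vocabulary:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word


for example in ["steamrolled", "rating", "searches", "occurred"]:
    print(example, "->", stem(example))
```

Note that "occurred" comes back unchanged because none of these simple rules undoes the doubled consonant. Gaps like that, and the need for a separate rule set per language, are part of why stemming is more work than it first looks.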
Here's a good overview of the statistical keyword applications from Prof Belew of UC San Diego. The big revelation here is the Inverse Document Frequency article. Idf is an easily applied keyword weight that much improves document retrieval via keyword. The theory is that keywords that occur with the highest and lowest frequency are not useful, so we throw them out. Amongst those in the middle, it is those keywords that bunch up into fewer articles for the same number of occurrences overall that work best as keywords. Idf measures the bunching-up-ness. For instance, a paper by Kenneth Church and William Gale of Bell Labs, Inverse Document Frequency (IDF): A Measure of Deviations from Poisson, compares the words "boycott" and "somewhat". Both words occur the same number of times in a set of Associated Press articles (the "corpus" as they say), but "boycott" occurs in 676 of the corpus articles, while "somewhat" occurs in 979. Our intuition tells us that "boycott" is a better keyword than "somewhat" and here we have a way to capture that intuition for automated use. Another article, this time from Kishore Papineni of IBM's Watson Research Center, tells us idf has been proved to be the best measure. Kevin Prey, James C. French and Allison L. Powell, a team out of the University of Virginia, along with Charles Viles from UNC Chapel Hill, have shown that idf applies well even to very large corpora, such as subsets of the www. This one find makes the academic overview worthwhile.
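A common way to write idf is log2(N / df), where N is the number of documents in the corpus and df is the number of documents containing the word. The sketch below applies that formula to the two document counts quoted above. The corpus size is not given here, so CORPUS_SIZE is a made-up placeholder; the gap between the two scores does not depend on it.

```python
import math


def idf(document_frequency, corpus_size):
    """Inverse document frequency: log2(N / df)."""
    if document_frequency <= 0 or corpus_size <= 0:
        raise ValueError("document_frequency and corpus_size must be positive")
    return math.log2(corpus_size / document_frequency)


# Hypothetical corpus size; the real number of AP articles is not stated above.
CORPUS_SIZE = 100_000

# Document counts quoted in the text.
DOC_FREQ = {"boycott": 676, "somewhat": 979}

scores = {word: idf(df, CORPUS_SIZE) for word, df in DOC_FREQ.items()}
for word, score in scores.items():
    print(f"{word}: idf = {score:.2f}")

# The difference equals log2(979 / 676) whatever corpus size is assumed.
gap = scores["boycott"] - scores["somewhat"]
print(f"boycott scores higher by {gap:.2f} bits")
```

Because "boycott" is bunched into fewer articles, it gets the higher weight, which matches the intuition that it is the better keyword.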
Something else worthwhile that turned up was a free fast tool for distinguishing the language of a document on the fly.
If the reader knows of other such gems, please add a comment at the bottom of this article or email firstname.lastname@example.org.