Wicked. Evil. Foul. Bad. Those words mean essentially the same thing, but we don't talk about "wicked weather," "foul witches" or the "forces of bad." Understanding such subtle differences in usage comes naturally because our brains remember the millions of words we have processed over our lifetimes and which ones go together. But people learning English don't have that repository. So, Mark Davies is volunteering to be their English brains.
The Brigham Young University linguist loves words so much that he sorts through them for days – tens of millions at a time. He has built a searchable Web site that can spit out exactly how often any word appears in English usage, along with the words that most often accompany it and many other factors. Want to know which word is most commonly associated with "slippery"? Davies can tell you in less than a second. (You guessed it – "slope.") He can also break down usage by genre – the most commonly used adjective in British tabloids turns out to be "boozy."
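For readers curious how a lookup like the "slippery" one might work in principle, here is a minimal sketch in Python. It is not the site's actual implementation, which the article does not describe; the function name `collocates`, the `window` parameter and the toy sentence are all invented for illustration. It simply counts which words fall within a fixed distance of a target word.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words found within `window` positions of each `node` occurrence.

    Hypothetical helper for illustration only; a real corpus tool would also
    handle tagging, lemmas and statistical weighting.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[tokens[j]] += 1
    return counts


# Invented toy text, standing in for a 100-million-word corpus.
tokens = (
    "the slippery slope argument is a slippery slope "
    "and the road was slippery when wet"
).split()

# Look only at immediate neighbors of "slippery".
counts = collocates(tokens, "slippery", window=1)
print(counts.most_common(1))
```

On this toy text the top neighbor is "slope"; on a real corpus the ranking would depend on the corpus contents and on the window size chosen.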
With no fanfare, his site, view.byu.edu, has spread to thousands of users in 83 countries, many of them teaching or learning English. Other users include a sitcom writer looking for new puns, a psychiatrist at Columbia University's medical school who is developing cognitive tests for Alzheimer's patients, and a ("boozy"?) regular at a British pub seeking material for homemade trivia contests.
"The site is sufficiently sophisticated for us egghead academics, but also easy enough for language learners and others who just think language is fun," Davies explained, counting faculty at Stanford, Michigan and Swarthmore and dozens of top international universities among regular visitors.
The "VIEW" in view.byu.edu stands for Variation In English Words and phrases, and the site uses as its database the 100-million-word British National Corpus. Davies is among a rare breed who loves to gather millions of words of written and spoken communication and catalog them into a collection called a "corpus." In addition to building an interface for the material provided by the University of Oxford authors of the British corpus, Davies has built his own corpus for Spanish and is putting the finishing touches on his Portuguese version. Those two projects were funded by grants from the National Endowment for the Humanities totaling more than $300,000.
It's not simply a matter of dumping words into a database. "You want the corpus to represent the range of types of usage, so you need to first determine that you want a certain percentage from newspapers, a certain portion from books, another portion from speeches, and so on," Davies said. "And then within books, you balance that between fiction and nonfiction, and then within those, between westerns and romance and engineering and history, for example."
The entries then must be tagged as particular parts of speech and organized in an architecture and interface that allows them to be accessed easily. That's Davies' specialty and the reason that he was given access to the British National Corpus. He's already on tap to build the interface for the first American National Corpus, currently under construction. And he's building himself the largest historical corpus of English (the British entries are all post-1970), which will include a quarter-billion words produced from 1500 to 1900. That project will enable study of how the usage and meaning of words have changed over time.
"Imagine a word like 'market,'" Davies said. "At one point it would be most commonly associated with words like 'pig' or 'corn.' Now it would be more commonly found with 'stock' or 'international.'"
Although the self-effacing linguist believes most folks would rather use his tool to play around with searches on their names and favorite words, he does use it for rigorous linguistic research, making three to four scholarly presentations a year and publishing a similar number of scholarly papers. Recently he teamed with departmental colleague Dee Gardner to publish a study of phrasal verbs – those combinations like "burned down" and "put up" that come naturally to native speakers but drive learners up the wall.
Davies also recently completed a dictionary containing the 5,000 most commonly used Spanish words, in order of frequency. "If you're learning Spanish, you don't want to start just picking up words willy-nilly; you want to start with the most frequent ones," he said. The same publisher has him working on a similar "frequency dictionary" for Portuguese.
Understanding frequency turns out to be helpful for other reasons. A company that develops predictive text interfaces for cell phones and devices for the disabled came to Davies for his Spanish corpus, because knowing which words are most commonly used helps its software more accurately "guess" which word is being entered. Attorneys for a Fortune 500 company used Davies' tool to show that their client's product name was being confused with another, more commonly used term at issue in a lawsuit.
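The idea behind frequency-based "guessing" can be sketched in a few lines of Python. This is an illustration only, not the company's software or Davies' corpus: the functions `build_frequencies` and `suggest` and the short Spanish sample are assumptions made up for the example. Given the letters typed so far, the sketch offers the matching words that occur most often in the sample.

```python
from collections import Counter


def build_frequencies(text):
    """Count how often each whitespace-separated word occurs (lowercased)."""
    return Counter(text.lower().split())


def suggest(prefix, freqs, k=3):
    """Return up to k words starting with `prefix`, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    """
    matches = [(word, n) for word, n in freqs.items() if word.startswith(prefix)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in matches[:k]]


# Invented sample text, standing in for a large Spanish corpus.
sample = "la casa es grande y la casa es de la calle donde la cama es nueva"
freqs = build_frequencies(sample)

# A user who has typed "ca" is offered the most frequent matches first.
print(suggest("ca", freqs))
```

In this sample "casa" appears twice and so is offered ahead of "calle" and "cama"; with a full corpus behind it, the same ranking principle would favor whichever words real speakers use most.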
Davies isn't surprised by the growth in popularity and utility of his site.
"There's a real need for non-native English speakers who want to know how English is authentically used," he said. "And for native speakers, it's just fun to get on there and immerse ourselves in this wonderful data."