Ngrams: the history of words

Lately, statistical lexical analysis tools, or N-gram viewers, have sparked a heated debate over their significance and impact on the Humanities (see a recent NYTimes article on the subject). The Microsoft Web N-Gram Service provides "access [to] petabytes of data via a cloud-based platform to drive discovery and innovation in web search, natural language processing, speech, and related areas" (source). Microsoft's approach is to build tools on real-world web data. Google's Ngram viewer, on the other hand, relies on its vast collection of scanned books, 15 million and counting. So far, Google has provided public access to a subset of this data, some 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. Users can trace the use of words and phrases in print over time and see how their appeal and, no doubt, their meaning have changed. There are certainly concerns about the accuracy and extent of this corpus. Here is a review of the analysis of a very common word and how it may have fooled these sophisticated tools. And yet, the ease of use and analytical power these techniques offer are undeniable.
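For readers curious about what "tracing a phrase over time" means in practice, here is a minimal sketch in Python. It is not how Google's or Microsoft's tools are implemented. The function names (`ngrams`, `relative_frequency_by_year`) and the tiny two-year corpus are invented purely for illustration, and the tokenizer is a plain whitespace split. The idea is to split each text into n-word sequences, count them per year, and report what share of that year's n-grams match the phrase of interest.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Map each year to the share of its n-grams that equal the phrase.

    `corpus` is a dict of year -> list of texts (a hypothetical layout
    chosen for this sketch, not a real data format).
    """
    n = len(phrase.split())
    target = phrase.lower()
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Naive tokenization: lowercase, split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Toy corpus, made up for this example only.
toy_corpus = {
    1900: ["the telegraph office was busy", "send it by telegraph"],
    2000: ["send it by email", "the email server was busy"],
}

telegraph_trend = relative_frequency_by_year(toy_corpus, "telegraph")
email_trend = relative_frequency_by_year(toy_corpus, "email")
```

In this toy corpus, "telegraph" accounts for 2 of the 9 words from 1900 and none from 2000, while "email" shows the reverse. Dividing by the yearly total is a deliberate choice in the sketch: it keeps years with more text from dominating the trend. A real corpus also has to contend with scanning errors and inconsistent spelling, which is where the accuracy concerns mentioned above come in.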