I just put another tool on GitHub.
The tool, called CountNGrams, uses Apache Hadoop to count the number of occurrences of n-grams in a given set of text files. In case you aren't familiar, an n-gram is a list of n words, which are adjacent to each other in a text file. For example, if the text file contained the text "one two three", there would be three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three"). Knowing which n-grams are in a set of files, such as all the bug reports of Eclipse, and their frequency over time, can shed light onto the major development trends in the project.