Wednesday, 24 October 2012

CountNGrams - Using Apache Hadoop to Count N-Grams

I just put another tool on GitHub.

The tool, called CountNGrams, uses Apache Hadoop to count the number of occurrences of n-grams in a given set of text files. In case you aren't familiar, an n-gram is a sequence of n words that are adjacent to each other in a text file. For example, if the text file contained the text "one two three", there would be three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three"). Knowing which n-grams are in a set of files, such as all the bug reports of Eclipse, and how their frequencies change over time, can shed light on the major development trends in the project.
I used the Apache Hadoop framework in order to easily support the analysis of big data: millions of files and hundreds of gigabytes or more. Hadoop takes care of distributing the workload across all available machines in your cluster, making the analysis fastfastfast and the implementation easyeasyeasy. Hadoop is truly awesome; learn all about it elsewhere.
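
To make the n-gram definition concrete, here is a minimal single-machine sketch in shell. It is not how CountNGrams is implemented; it only mimics the same shape (emit every n-gram, group identical ones, count each group) that Hadoop would spread across a cluster. The function names ngrams and count_ngrams are made up for illustration, and the sketch assumes whitespace-separated words and n-grams that never span a line break.

```shell
# ngrams N: read text on stdin and print every n-gram, one per line,
# with the words joined by "_". Assumption: n-grams are built within
# each input line only and words are separated by whitespace.
ngrams() {
  awk -v n="$1" '{
    for (i = 1; i + n - 1 <= NF; i++) {
      gram = $i
      for (j = 1; j < n; j++) gram = gram "_" $(i + j)
      print gram
    }
  }'
}

# count_ngrams N: count how often each n-gram occurs, most frequent first.
# sort groups identical n-grams so that uniq -c can count them.
count_ngrams() {
  ngrams "$1" | sort | uniq -c | sort -rn
}

# Example from the text above: the two 2-grams of "one two three".
printf 'one two three\n' | ngrams 2
```

The last line prints one_two and two_three, matching the example above. For a frequency table, pipe text into count_ngrams instead, e.g. cat *.txt | count_ngrams 2.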

I haven't published any papers that use CountNGrams, but one is in the works.
Check out the project for more details and to make your own changes. 

Saturday, 8 September 2012

lscp - A lightweight source code preprocesser

I've just released lscp, a lightweight source code preprocesser. Check it out on GitHub.

This is one of the many tools I've written to conduct my research on using IR models on software repositories. So many people have asked me for a copy of the tool that I decided to clean it up a bit and make it accessible to the world.

Check out the GitHub page for a detailed description and how to use it. Feel free to fork it and extend it, or add any bugs or feature requests you find to the issue tracker.

Wednesday, 5 September 2012

A Linux one-liner to find all the acronyms in your LaTeX files

At the beginning of my PhD thesis, I include a List of Acronyms. Of course, I would like to be sure that my list is comprehensive. I don't want any strange acronyms to appear in the text of my thesis without first appearing in my list of acronyms. But how can I easily identify all of the acronyms in the LaTeX source, without having to read all 244 pages manually?

grep to the rescue, again

Like most other areas of my life, this problem can be easily solved with a Linux one-liner centered around grep:

cat *.tex | grep -wo "[A-Z]\{2,10\}" | sort | uniq -c | sort -gr

Let's take a look at the pipeline:
  • The cat *.tex outputs all my LaTeX to standard output.
  • The grep -wo "[A-Z]\{2,10\}" matches whole words (the -w flag) that consist of between 2 and 10 upper case letters. The -o flag returns only the match, not the entire line.
  • The first sort sorts the acronyms, which is useful for the next step.
  • The uniq gets rid of duplicates, but retains a counter because of the -c flag.
  • Finally, the second sort sorts the entries numerically (-g) and reverses the results (-r).

Here's the output on my thesis:

        292 IR
        241 LDA
        166 LSI
        125 VSM
         87 TCP
         80 EM
         35 APFD
         34 SUT
         29 HSD
         22 TOPIC
         22 II
         18 MALLET
         16 LOC
         14 PS
         14 OR
         14 IDE
         12 CALLG
         11 ICA
         10 RNDM
         10 MAP
         10 KL
         10 CS

Note that this command works with any text file; it is not unique to LaTeX. Just change the cat command.
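
To get from the frequency table back to my original question (which acronyms are missing from my list?), the same grep can feed a small filter. This is only a sketch under assumptions: acronyms.txt is a hypothetical plain-text file with one acronym per line (not the LaTeX source of the list), and the function names find_acronyms and missing_acronyms are made up for illustration.

```shell
# find_acronyms FILE...: print each distinct whole word made of
# 2 to 10 upper case letters, sorted. LC_ALL=C keeps [A-Z] limited
# to ASCII upper case regardless of the current locale.
find_acronyms() {
  cat "$@" | LC_ALL=C grep -wo '[A-Z]\{2,10\}' | LC_ALL=C sort -u
}

# missing_acronyms LIST FILE...: acronyms used in the FILEs that do not
# appear in LIST. LIST is a hypothetical file with one acronym per line.
# -F treats LIST entries as fixed strings, -x requires a whole-line match,
# and -v keeps only the lines that match none of them.
missing_acronyms() {
  list=$1
  shift
  find_acronyms "$@" | grep -vxF -f "$list"
}

# Usage (assuming acronyms.txt exists next to the .tex files):
#   missing_acronyms acronyms.txt *.tex
```

Anything it prints is either an acronym that still needs an entry or a false positive (like TOPIC or OR in the output above) that can be ignored.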