Wednesday, 24 October 2012

CountNGrams - Using Apache Hadoop to Count N-Grams

I just put another tool on GitHub.

The tool, called CountNGramsuses Apache Hadoop to count the number of occurrences of n-grams in a given set of text files. In case you aren't familiar, an n-gram is a list of n words, which are adjacent to each other in a text file. For example, if the text file contained the text "one two three", there would be three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three"). Knowing which n-grams are in a set of files, such as all the bug reports of Eclipse, and their frequency over time, can shed light onto the major development trends in the project. 
I used the Apache Hadoop framework in order to easily support the analysis of big data: millions of files and hundreds of gigabytes or more. Hadoop takes care of distributing the workload across all available machines in your cluster, making the analysis fastfastfast and the implementation easyeasyeasy. Hadoop is truly awesome; learn all about it elsewhere.

I haven't published any papers that use CountNGrams, but one is in the works.
Check out the project for more details and to make your own changes. 

Saturday, 8 September 2012

lscp - A lightweight source code preprocesser

I've just released lscp, a lightweight source code preprocesser. Check it out on GitHub.

This is one of the many tools I've written to conduct my research in using IR models on software repositories. So many people have asked me for a copy of the tool, that I decided to clean it up a bit and make it accessible to the world.

Check out the GitHub page for a detailed description and how to use it. Feel free to fork it and extend it, or add any bugs or feature requests you find to the issue tracker.

Wednesday, 5 September 2012

A Linux one-liner to find all the acronyms in your Latex files

At the beginning of my PhD thesis, I include a List of Acronyms. Of course, I would like to be sure that my list is comprehensive. I don't want any strange acronyms to appear in the text of my thesis, without first appearing in my list of acronyms. But how can I easily identify all of the acronyms in the Latex source, without having to read all 244 pages manually?

grep to the rescue, again

Like most other areas of my life, this problem can be easily solved with a Linux one-liner centered around grep:

cat *.tex | grep -wo "[A-Z]\+\{2,10\}" | sort | uniq -c | sort -gr

Let's take a look at the pipeline:
  • The cat *.tex outputs all my Latex to standard output.
  • The grep -wo  "[A-Z]\+\{2,10\}" matches whole words (the -w flag) that contain between 2 and 10 upper case letters. The -o flag returns only the match, not the entire line.
  • The first sort sorts the acronyms, which is useful for the next step.
  • The uniq gets rid of duplicates, but retains a counter because of the -c flag.
  • Finally, the second sort sorts the entries numerically (-g) and reverses the results (-r).

  • Here's the output on my thesis:

    
        292 IR
        241 LDA
        166 LSI
        125 VSM
         87 TCP
         80 EM
         35 APFD
         34 SUT
         29 HSD
         22 TOPIC
         22 II
         18 MALLET
         16 LOC
         14 PS
         14 OR
         14 IDE
         12 CALLG
         11 ICA
         10 RNDM
         10 MAP
         10 KL
         10 CS
         ...

Note that this command works with any text file; it is not unique to Latex. Just change the cat command.

Saturday, 25 August 2012

My favorite computer science, statistics, and research blogs

If you haven't seen these before, they're worth a look.

Did I miss any?

Thursday, 26 July 2012

A Linux one-liner to count the unique words in a set of files

Say you have a bunch of text files, and you want to find out all the unique words in them, and how often each words appears. Using the brilliance of Linux commands, this can be achieved in a few strokes of the finger:

$ cat *.txt | tr " " "\n" | sort | uniq

Use tr to put all the words (separated by spaces, but you can make this more complex if you need) into their own line. sort will then sort all these words. Finally, uniq will get rid of duplicate lines, and the -c flag will add counts. I'll run this on a small set of bug reports from the Eclipse Bugzilla repository:

$ cat * | tr " " "\n" | sort | uniq -c
      1 ^^^^
      6 able
     12 about
      1 aboutdialog
      1 abovebackground
      1 absolute
     18 abstract
      2 abstractannotationprocessormanager
      4 abstractcompletiontest
      1 abstractdebugeventhandler
    [...]

The words will be displayed in alphabetical order. (Yes, the word "^^^^" actually appeared somewhere in the bug reports.) To instead display them by usage in decreasing order, run it all through sort one more time:

$ cat * | tr " " "\n" | sort | uniq -c | sort -gr
   1118 java
   1006 eclipse
   1005 at
    956 org
    825 the
    546 internal
    361 to
    356 jdt
    320 ui
    318 in
   [...]

The -g option sorts by numeric values, instead of alphanumeric, and the -r option reverses the sort. VoilĂ !

State transition diagrams in Latex



Using the tikz package in Latex, I recently created some pretty cool state transition diagrams:

Apache flow chart.
PostgreSQL flow chart.

These diagrams show four states (the circles), and the transition probability between them (the lines and associated numbers). They also encode some meta data:
  • The name of the states (the text in the circles)
  • The percentage of time spent in each state (the numbers in the circles)
  • The transition probabilities, shown both as the numbers next to the lines, and the thickness of the line, to provide extra visual indication
  • The colors of the circles also align with the colors of other diagrams (not shown here) in my article, for more visual reinforcement and to ease the burden on the reader
All of this created with some simple Latex code:

\begin{tikzpicture}[->,>=stealth,shorten >=1pt,auto,node distance=2.9cm, semithick] 
\tikzstyle{every state}=[fill=gray,draw=black,text=white] 
 
\node[state, fill=white, text=black, inner sep = -3pt]      (1) {\small \begin{tabular}{c} 
    Inactiv. \\ 
    {\footnotesize .09} 
\end{tabular}}; 
\node[state, fill=lightgray,text = black, inner sep = -3pt]         (2) [right of=1] {\small \begin{tabular}{c} 
   Discus. \\ 
    {\footnotesize .16} \end{tabular}}; 
\node[state, fill=black, inner sep = -3pt]         (3) [below of=1] {\small \begin{tabular}{c} 
    Active \\ 
    {\footnotesize .55}\end{tabular}}; 
\node[state, fill=gray, inner sep = -3pt]         (4) [below of=2] {\small \begin{tabular}{c} 
    Imple. \\ 
    {\footnotesize .21} \end{tabular}}; 
 
\path 
(1) edge[line width=1.20pt,loop left] node {\small .25} (1) 
(1) edge[line width=1.20pt,bend left=20] node {\small .25} (2) 
(1) edge[line width=1.20pt] node {\small .25} (3) 
(1) edge[line width=1.20pt,bend left=20] node {\small .25} (4) 
(2) edge[line width=0.60pt,loop right] node {\small .10} (2) 
(2) edge[line width=1.53pt,bend left=20] node {\small .33} (3) 
(2) edge[line width=0.87pt,bend left=20] node {\small .17} (4) 
(3) edge[line width=0.96pt,bend left=20] node {\small .19} (2) 
(3) edge[line width=2.42pt,loop left] node {\small .56} (3) 
(3) edge[line width=0.87pt] node {\small .17} (4) 
(4) edge[line width=0.70pt,bend left=20] node {\small .12} (3) 
(4) edge[line width=2.87pt,loop right] node {\small .67} (4) 
 
 
; 
\end{tikzpicture} 


What's great about using Latex to create these diagrams, as opposed to using a graphical editor or drawing program, is that the Latex code can be created programmatically. That is, you can create a simple tool that outputs the Latex code given an arbitrary data set (in this case, the numbers in the diagram). This is especially great if: 
  • Your data is likely to change often, and you don't want to keep manually changing the diagram
  • You need to create many diagrams, and you hate manual, repetitive work as much as I do
  • You value making your processes reproducible by others
  • You want to impress your wife
In my case, I first manually wrote the Latex code to position and color the states, since those won't change depending on the data. (But you can easily make these data-dependent, as well.) Then,  I used R to output the Latex code for the state transitions and their thicknesses:


###########################################################################
###########################################################################
# Given an object a, which contains an member object tran.median, which is
# itself a 4 x 4 matrix of transition values between states, output tikz Latex
# code. 
printStateTransitions = function(a){

    opts=c(
    ",loop left",
    ",bend left=20",
    "", 
    ",bend left=20",
    "",     
    ",loop right",
    ",bend left=20",
    ",bend left=20",
    ",bend left=20",
    ",bend left=20",
    ",loop left", 
    "", 
    ",bend left=20",
    "",     
    ",bend left=20",
    ",loop right"
    )           
            
    counter<-1
    for (i in 1:4){
        for (j in 1:4){
            t <- a$tran.median[i,j]
    
            if (t>0.03){
                # 2.5 = max, 0 = min
                w <- t*4 + 0.2 
                cat(sprintf("(%d) edge[line width=%.2fpt%s] node {\\small .%02.0f} (%d)\n", i, w, opts[counter], t*100, j))
            }
            counter <- counter+1
        }
    }
}   


Here, I've hard coded the size of the input matrix (4x4) and played with some option values to make the diagram pretty to my eye, but you can easily roll your own flavor.

Conclusion

State transition diagrams are a visually-appealing way to spice up any publication, report, or presentation. The tikz package in Latex makes creating state transition diagrams a breeze, and you won't need to rely on third party software. Finally, programmatically generating diagrams is like hiring a maid: less work, more fun.

Wednesday, 25 July 2012

Topic stability diagrams in R


In this post I will show you how to create topic stability diagrams in R. If you already know what a topic model is and want to get right to the good stuff, click here.

Introduction to Topic Models

Topic models, such as latent Dirichlet allocation (LDA), are a fun and powerful way to analyze huge messes of unstructured data. I've used topic models throughout my Ph.D. career and have followed the related research closely for many moons.

Unstructured data? Topic models to the rescue!

Briefly, topic models automatically infer the underlying topics within an unlabeled, unstructured collection of text documents. For example, topics models could automatically tell me that Wikipedia has topics about "films", "war", "computer systems", "mathematics", and "diseases", amongst many others. More importantly, topic models will also tell me which articles contain which topics. For example, say I wanted to find all articles on "mathematics". Topic models will tell me just that! All I have to do is select all documents that contain that particular topic, according to the topic model. Topic models do this without any help from humans, no labeling of the original Wikipedia articles, do not require predefined topic lists or taxonomies of any kind. Instead, they operate directly on the statistical word counts of the raw text, and use machine learning techniques (read: complicated mathematics) to just figure it out on their own. Topic models can be applied to any collection of documents in the world, and they are fast, fast, fast.

Before using topics models on a collection of documents, you need to specify a few parameters:
  • K: the number of topics you want
  • a, b: some smoothing parameters for the underlying mathematics
  • I: number of sampling iterations (again, for the underlying mathematics)
There are an infinite number of values that each parameter can take, so how do you know if the ones you've chosen are any good? This has been a subject of much debate and research. In some cases, topic models are used as part of a larger process, so you can tell which parameters are better because they make the whole process better. For example, if you used topic models as a way to predict bugs in software (as we did in a recent research article), then you can try a bunch of different parameter values and see which ones result in the most bugs predicted. But sometimes, you want to find topics just for topic's sake (i.e., just to explore and organize the documents), so you must find another way to determine if your chosen values are good or not.

Topic Stability

One way to determine how good your parameter choices are is to determine how stable the resulting topics are, compared to other parameter choices. (What if I had chosen 44 topics, instead of 45? Would everything be completely different, or mostly the same?) The strategy for determining stability is simple. First, run the topic model twice, each time with different parameters. Then, see how different the topics are.

But how can you determine how different a set of topics are? Is "war" and "army" similar? What about "war" and "government"? "War" and "pottery"? Luckily, since topic models only think of topics as a bunch of numbers (instead of words, like humans do), we have a robust and unbiased way to determine how similar two topics are, called the Kullback-Leibler (KL) divergence:


[See our Science of Computer Programming article for a description of notation.]

For example, the "war" and "army" topics will have a low divergence, whereas "war" and "pottery" will have a high divergence. KL divergence is a simple and popular way to compare any two probability distributions, and works especially nicely with topics.

To determine how similar two sets of topics are, as we need to do here, we can calculate the KL divergence between all the topics in set A against all the topics in set B. If each topic in set A has a low divergence to exactly one topic in set B, and a high divergence to all other topics in set B, then we can be confident that the sets of topics are indeed similar. The farther we are from this ideal, the less similar the topics in the sets.

Topic Stability Diagrams in R

Let's look at a real-world example. JHotDraw is a simple open-source graphics framework whose source code I've studied using topic models several times in the past. I set the parameters of the topic model to K=45, a=b=0.02, and I=10,000. The topic model gave me topics like "bezier paths", "XML elements", and "input streams". Indeed, these are all important concepts in the source code. After all, JHotDraw allows Bezier paths to be drawn, it represents drawing objects internally using XML, and it can read and write various file formats. 

This is all good and well, but now I wanted to find out how stable these topics were, given the parameter choices I made. First, as a sanity check, I use the KL-divergence methodology on the original JHotDraw topics against themselves, just to see what would happen. In theory, the similarities should be very high, since we are comparing the topics against themselves. I visualize the results of my topic similarity methodology using a topic stability diagram, motivated by Steyvers and Griffiths:  


JHotDraw topics against themselves.
[See the Appendix for the R code.]

This diagram shows a heatmap of the similarity matrix of the topics on the x-axis against the topics on the y-axis. Each dot in the heatmap visually shows the KL divergence between two topics. A darker dot at x=i and y=j mean that topics i and j are more similar (i.e., low divergence), while a lighter dot mean they are less similar (high divergence). Since in this case the topics we are comparing are exactly the same, we see a nice diagonal of black dots, and lighter dots everywhere else. This tells us that each topic on the x-axis has exactly one matching partner topic on the y-axis. Excellent. Just as expected.

Let's try another experiment, where we compare the topics from JHotDraw with the topics from a completely different data source, say the source code of jEdit, a software system that has nothing to do with JHotDraw. I'll create a topic model on jEdit using the exact same parameters, but we shouldn't expect the topics to be similar to JHotDraw's because the two systems are in completely different domains:

JHotDraw topics against jEdit topics.

Indeed, this diagram shows a lack of a strong diagonal line, and is mostly a muddled mix of grey squares. This indicates that the topics in JHotDraw are not very similar to the topics in jEdit, which is exactly what we expect. Another good sign.

Now, let's get down to business: how stable are my topics in JHotDraw? Let's see what happens when I change one of the parameters, say K, to a higher and lower value:


In Subfigure (a), I reduced the number of topics from 45 to 30. The new topics (all 30 of them) are shown on the x-axis, and the old topics are shown on the y-axis. First, notice that the heatmap is no longer square, because the number of topics on the axes are different. But more importantly, notice the strong diagonal line, which indicates that the topics are still similar, despite the different value for K! The same is true in Subfigure (b), where I increased the number of topics to 60. These somewhat surprising result means that the topics in JHotDraw are stable to the exact value of K, and that I needn't worry too much. (I'll spare you the details of performing similar analysis for the other three parameters: the topics are stable there, too.)

Conclusion

Topic stability diagrams are a quick and easy way to visually determine the stability of your topics. Use topic diagrams to find out how sensitive your data is to the parameters of the topic model, or to impress your wife with interesting-looking heatmaps.

Appendix: R Code

# First, we need to compute the similarity matrix between the two topics
# j30.hall and j45.hall are topic model instances, with two key data structures:
# $topics.p: A K by |V| matrix containing the word probabilities for each
#                 topic, where |V| is the length of the vocabulary.
# $vocab: a |V| by 1 matrix of character arrays, representing the vocabulary

s.30.45   = computeTopicStabilityMatrix(j30.hall, j45.hall)
s.30.45.s = reorderMatrix(s.30.45)

# Use the image function as our main workhorse
image(1:(nrow(s.30.45.s)+1), 1:(ncol(s.30.45.s)+1), 
       s.30.45.s, col=gray.colors(100,start=0,end=1),
       xlab="Topic IDs (run 2)", ylab="Topic IDs (run 1)", 
       cex.axis=1.8, cex.lab=2.2)


#############################################################
#############################################################
# Given two topic model instances s1, s2, return the similarity matrix between all
# pairs of topics. Builds the matrix iteratively
computeTopicStabilityMatrix = function(s1, s2){
    sims = matrix(nrow=s1$K, ncol=s2$K)
    for (i in 1:s1$K){
        cat(sprintf("%d of %d\n", i, s1$K))
        for (j in 1:s2$K){
            sims[i,j] = compareTopicsFromDifferentModels(s1, s2, i, j)
        }
    }    
    sims
}   

#############################################################
#############################################################
# Given two topic model instances, a, b, (with possibly different vocabularies and 
# word orders), computes the topic similarity between topic ak in a and bk in b
# Note: deals with varying vocabs between the topic models
compareTopicsFromDifferentModels = function(a, b, ak, bk){
    a.newT = as.vector(a$topics.p[ak,])
    names(a.newT) = as.character(a$vocab[,])
    b.newT = as.vector(b$topics.p[bk,])
    names(b.newT) = as.character(b$vocab[,])

    # First, need to sort the vocabularies
    a.newT = a.newT[sort(names(a.newT), index.return=T)$ix]
    b.newT = b.newT[sort(names(b.newT), index.return=T)$ix]

    # Switch indices around, based on the new sorted-ness
    ix = which(! (names(a.newT) %in% names(b.newT)))
    if (length(ix)>0){
        a.newT = a.newT[-ix]
    }
    ix = which(! (names(b.newT) %in% names(a.newT)))
    if (length(ix)>0){
        b.newT = b.newT[-ix]
    }

    sim  = computeTopicSimilarity(as.numeric(a.newT),as.numeric(b.newT))
    sim
}

#############################################################
#############################################################
# Given two topics a and b, compute the KL divergence between them.
computeTopicSimilarity = function(a, b){
    library(flexmix)
    kl = KLdiv(cbind(a,b), overlap=F)
    kl[2,1]
}

#############################################################
#############################################################
# Reorders the similarity matrix so that strongest similarities are on the diagonal
# (only used for visual ease)
reorderMatrix = function(s2){
    s = s2 
    d = min(nrow(s),ncol(s))
    mins = apply(s, 1, min)
    ix = sort(mins, index.return=T)$ix
    s = s[ix,]

    used = vector(length=d)
    # Sort by row 
    for (i in 1:d){
        idx = sort(s[i,], index.return=T)$ix
        j = 1        
        while ((idx[j] %in% used)){
            j = j+1
        }
        used[i] = i
        s[,c(i,idx[j])] = s[,c(idx[j],i)]
    }
    s
}



Tuesday, 24 July 2012

Measuring effect size with the Vargha-Delaney A measure

Sometimes, you want to compare the performance of two techniques. That is, you want to quantify how much better one technique is than the other, in a statistically-sound fashion. In this post, I will show you how to do just that, using the Varga-Delaney A measure.The A measure is awesome because it is: agnostic to the underlying distribution of the data; easy to interpret; fast and easy to compute; and compatible with real numbers (continuous or not) and ordinal data alike.

For example, say you were trying to kill zombies that were trying to come in your house and eat your brains. You dream up two different techniques for defending your homestead:
  • Technique T1: Rocket launcher to their faces
  • Technique T2: Hurling fruit cake at their faces
Which technique can eliminate more zombies?

vs. 

The Experiment

To evaluate your two techniques, you conduct a simple experiment. First, you invite 100 sets of zombies to try and break into your house. Let's say that each set contains exactly 10 zombies. For the sake of simplicity, let's assume that each set of zombies comes to your house twice, no matter what happens to them the first time. The first time they come, you unleash the rocket launcher (technique T1); the second time, the fruit cake (T2). After each set comes and you try each technique, you tally up the number of dead zombies.

When the dust settles, you will have a simple dataset that looks something like:

1, 3, 5
2, 9, 5
3, 0, 5
[97 more rows] 

[Download the sample dataset here.]

The first column shows the zombie set ID, an integer between 1 and 100. The second column shows the number of zombies eliminated by technique T1, and the third column shows the number of zombies eliminated by technique T2.

Which Technique is Better?

One way to determine which set is better is to simply compare the means and medians of columns 2 and 3: how many zombies did each technique eliminate, on average? In R, we could do something like this:


> zombies = read.csv("zombies.dat", sep=",", header=F)
> summary(zombies[,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    3.00    5.00    4.79    7.00    9.00 
> summary(zombies[,3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    5.00    6.00    6.48    8.00   10.00 

Here, we see that the rocket launchers killed about 4.8 zombies on average, whereas the fruit cake killed about 6.5. However, averages are easily skewed by extreme values, and as a result, can be misleading. In addition, they cannot handle ordinal data nor do they do well with non-continuous variables.

Instead, we could use the Vargha-Delaney A measure, which tells us how often, on average, one technique outperforms the other. When applied to two populations (like the results of our two techniques), the A measure is a value between 0 and 1: when the A measure is exactly 0.5, then the two techniques achieve equal performance; when A is less than 0.5, the first technique is worse; and when A is more than 0.5, the second technique is worse. The closer to 0.5, the smaller the difference between the techniques; the farther from 0.5, the larger the difference.

Let's return to our example:

> AMeasure(zombies[,2], zombies[,3])
[1] 0.33555

[See the Appendix for the implementation of the function AMeasure().]

In our case, the A measure is 0.34, indicating that rocket launchers (T1) performed worse than fruit cake (T2), because 0.34 is less than 0.5. We can interpret the value as follows:  34% of the time, rocket launchers will work better than fruit cake. Equivalently, 66% of the time, fruit cake will work better than rocket launchers. (Anyone who has ever tried fruit cake can understand its true destructive power.) We can deduce that you have twice the changes of surviving should you forgo the military equipment and instead put Aunt Gertrude's culinary creation to good use. 

Conclusion

There are many statistical techniques to help you choose your zombie-fighting strategies, but the Vargha-Delaney A measure is simple, robust, and easy to interpret.

Note: My coauthors and I used this measure in an upcoming Empirical Software Engineering article.


Appendix: R code for Varga-Delaney

Using Equations (13) and (14) in their paper, we get R code as follows:

##########################################################################
##########################################################################
#
# Computes the Vargha-Delaney A measure for two populations a and b.
#
# Equation numbers below refer to the paper:
# @article{vargha2000critique,
#  title={A critique and improvement of the CL common language effect size
#               statistics of McGraw and Wong},
#  author={Vargha, A. and Delaney, H.D.},
#  journal={Journal of Educational and Behavioral Statistics},
#  volume={25},
#  number={2},
#  pages={101--132},
#  year={2000},
#  publisher={Sage Publications}
# }
# 
# a: a vector of real numbers
# b: a vector of real numbers 
# Returns: A real number between 0 and 1 
AMeasure <- function(a,b){

    # Compute the rank sum (Eqn 13)
    r = rank(c(a,b))
    r1 = sum(r[seq_along(a)])

    # Compute the measure (Eqn 14) 
    m = length(a)
    n = length(b)
    A = (r1/m - (m+1)/2)/n

    A
}

Monday, 23 July 2012

Removing spaces in filenames


Ever have a bunch of files with spaces in their names, and need to get rid of the spaces?

$ ls -1
My data.csv
Surfing USA - Beach boys.mp3
Who let the dogs out.mp3
Dear American Express.doc
thesis version2-oct2008 final revisions3 steves comments.tex

Some scripts and commands don't handle spaces nicely, without painful escaping of each and every space and each and every file name. You could use mv to rename each file individually, but I'd rather shove ghost peppers in my eyeballs, especially when there's an easy one-liner to do it for me.

The Easy Way

On the eight day, God created the rename command:
$ rename 's/ //g' ./*
The rename command takes a regular expression and a list of files, and renames each file according to the regular expression. See the man page for complete details, but the regular expression syntax is similar to Perl's. In our simple case, we search for any spaces ("/ /") and replacing them with nothing ("//").

The Hard Way

Another approach is to use bash to loop through each file, use sed to find spaces and replace them with nothing, and then use the mv command to finish it all up.
$ for file in *; do mv "$file" `echo $file | sed -e 's/  */_/g' -e 's/_-_/-/g' -e 's/[()]//g'`; done
Although much more involved and a pain to read, it achieves the same effect as the rename command. If the rename command is not available to you but bash is, then here's your ticket.