Wednesday 25 July 2012

Topic stability diagrams in R


In this post I will show you how to create topic stability diagrams in R. If you already know what a topic model is and want to get right to the good stuff, skip ahead to the "Topic Stability Diagrams in R" section below.

Introduction to Topic Models

Topic models, such as latent Dirichlet allocation (LDA), are a fun and powerful way to analyze huge messes of unstructured data. I've used topic models throughout my Ph.D. career and have followed the related research closely for many moons.

Unstructured data? Topic models to the rescue!

Briefly, topic models automatically infer the underlying topics within an unlabeled, unstructured collection of text documents. For example, topic models could automatically tell me that Wikipedia has topics about "films", "war", "computer systems", "mathematics", and "diseases", amongst many others. More importantly, topic models will also tell me which articles contain which topics. For example, say I wanted to find all articles on "mathematics". Topic models will tell me just that! All I have to do is select all documents that contain that particular topic, according to the topic model. Topic models do this without any help from humans: no labeling of the original Wikipedia articles, and no predefined topic lists or taxonomies of any kind. Instead, they operate directly on the statistical word counts of the raw text, and use machine learning techniques (read: complicated mathematics) to just figure it out on their own. Topic models can be applied to any collection of documents in the world, and they are fast, fast, fast.
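
If you want to try a topic model yourself before reading on, here is a minimal sketch in R. To be clear, this is not the tooling I used for the experiments below; it uses the tm and topicmodels packages on a tiny made-up corpus, just to show the moving parts:

library(tm)            # builds the document-term matrix
library(topicmodels)   # provides the LDA implementation

# A toy "collection of documents"; in practice, each element is one real document
docs = c("troops marched to war under army command",
         "the army deployed troops near the front",
         "glaze the pottery before firing the kiln",
         "the potter shaped wet clay on the wheel")

dtm = DocumentTermMatrix(Corpus(VectorSource(docs)))

# Fit an LDA topic model with 2 topics using Gibbs sampling
lda = LDA(dtm, k=2, method="Gibbs",
          control=list(alpha=0.02, delta=0.02, iter=1000))

terms(lda, 5)    # the top 5 words of each topic
topics(lda)      # the most likely topic of each document

The terms() output is the human-readable face of each topic (its top words), and topics() is the document-to-topic mapping I described above.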

Before using topic models on a collection of documents, you need to specify a few parameters:
  • K: the number of topics you want
  • a, b: the smoothing (Dirichlet) hyperparameters for the underlying mathematics
  • I: number of sampling iterations (again, for the underlying mathematics)
There are an infinite number of values that each parameter can take, so how do you know if the ones you've chosen are any good? This has been a subject of much debate and research. In some cases, topic models are used as part of a larger process, so you can tell which parameters are better because they make the whole process better. For example, if you used topic models as a way to predict bugs in software (as we did in a recent research article), then you can try a bunch of different parameter values and see which ones result in the most bugs predicted. But sometimes, you want to find topics just for topics' sake (i.e., just to explore and organize the documents), so you must find another way to determine if your chosen values are good or not.
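
As a rough sketch of that "try a bunch of parameter values" idea (assuming dtm is a DocumentTermMatrix built from your real corpus, not the toy one above), you could sweep K and record a score for each fit. Here I use the model's log-likelihood as a stand-in score, which is only a crude proxy; in our bug-prediction work, the score was the end-to-end prediction performance instead:

library(topicmodels)

for (k in c(15, 30, 45, 60)){
    lda.k = LDA(dtm, k=k, method="Gibbs",
                control=list(alpha=0.02, delta=0.02, iter=1000))
    # Stand-in score: higher log-likelihood = better fit to the data
    cat(sprintf("K = %d: log-likelihood = %.1f\n", k, as.numeric(logLik(lda.k))))
}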

Topic Stability

One way to determine how good your parameter choices are is to determine how stable the resulting topics are, compared to other parameter choices. (What if I had chosen 44 topics, instead of 45? Would everything be completely different, or mostly the same?) The strategy for determining stability is simple. First, run the topic model twice, each time with different parameters. Then, see how different the topics are.

But how can you determine how different two topics are? Are "war" and "army" similar? What about "war" and "government"? "War" and "pottery"? Luckily, since topic models only think of topics as a bunch of numbers (instead of words, like humans do), we have a robust and unbiased way to determine how similar two topics are, called the Kullback-Leibler (KL) divergence:

KL(p || q) = Σ_w p(w) log( p(w) / q(w) )

where p and q are the word probability distributions of the two topics, and the sum runs over every word w in the vocabulary. [See our Science of Computer Programming article for a full description of the notation.]

For example, the "war" and "army" topics will have a low divergence, whereas "war" and "pottery" will have a high divergence. KL divergence is a simple and popular way to compare any two probability distributions, and works especially nicely with topics.
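
For intuition, here is the divergence written as a few lines of R. This is only a sketch: the Appendix below uses flexmix's KLdiv instead, and the epsilon here is my own guard against zero probabilities:

# KL divergence of topic p from topic q, where p and q are numeric vectors
# over the same vocabulary that each sum to 1. Lower values = more similar.
kl.divergence = function(p, q, eps=1e-12){
    p = p + eps   # guard against zero probabilities, which would give Inf/NaN
    q = q + eps
    sum(p * log(p / q))
}

A topic compared against itself has a divergence of (essentially) zero, which is exactly what the sanity check below relies on.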

To determine how similar two sets of topics are, as we need to do here, we can calculate the KL divergence between all the topics in set A against all the topics in set B. If each topic in set A has a low divergence to exactly one topic in set B, and a high divergence to all other topics in set B, then we can be confident that the sets of topics are indeed similar. The farther we are from this ideal, the less similar the topics in the sets.

Topic Stability Diagrams in R

Let's look at a real-world example. JHotDraw is a simple open-source graphics framework whose source code I've studied using topic models several times in the past. I set the parameters of the topic model to K=45, a=b=0.02, and I=10,000. The topic model gave me topics like "bezier paths", "XML elements", and "input streams". Indeed, these are all important concepts in the source code. After all, JHotDraw allows Bezier paths to be drawn, it represents drawing objects internally using XML, and it can read and write various file formats. 

This is all well and good, but now I wanted to find out how stable these topics were, given the parameter choices I made. First, as a sanity check, I ran the KL-divergence methodology on the original JHotDraw topics against themselves, just to see what would happen. In theory, the divergences should be very low (i.e., the similarities very high), since we are comparing the topics against themselves. I visualize the results of my topic similarity methodology using a topic stability diagram, motivated by Steyvers and Griffiths:


JHotDraw topics against themselves.
[See the Appendix for the R code.]
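
Concretely, the self-comparison boils down to the following, using the functions from the Appendix (j45.hall is the fitted JHotDraw topic model; the variable names are mine):

# Compare the K=45 JHotDraw topics against themselves
s.45.45   = computeTopicStabilityMatrix(j45.hall, j45.hall)
s.45.45.s = reorderMatrix(s.45.45)

image(1:(nrow(s.45.45.s)+1), 1:(ncol(s.45.45.s)+1),
      s.45.45.s, col=gray.colors(100, start=0, end=1),
      xlab="Topic IDs (run 2)", ylab="Topic IDs (run 1)")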

This diagram shows a heatmap of the similarity matrix of the topics on the x-axis against the topics on the y-axis. Each dot in the heatmap visually shows the KL divergence between two topics. A darker dot at x=i and y=j means that topics i and j are more similar (i.e., low divergence), while a lighter dot means they are less similar (high divergence). Since in this case the topics we are comparing are exactly the same, we see a nice diagonal of black dots, and lighter dots everywhere else. This tells us that each topic on the x-axis has exactly one matching partner topic on the y-axis. Excellent. Just as expected.
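
If you prefer numbers to eyeballing, one quick check of the "exactly one partner" property is to verify that each row's lowest divergence lands on the diagonal of the reordered matrix (s.45.45.s from the snippet above):

# For each topic (row), find the column with the lowest divergence
best.match = apply(s.45.45.s, 1, which.min)

# TRUE if every topic's best match sits on the diagonal
all(best.match == 1:nrow(s.45.45.s))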

Let's try another experiment, where we compare the topics from JHotDraw with the topics from a completely different data source, say the source code of jEdit, a software system that has nothing to do with JHotDraw. I'll create a topic model on jEdit using the exact same parameters, but we shouldn't expect the topics to be similar to JHotDraw's because the two systems are in completely different domains:

JHotDraw topics against jEdit topics.

Indeed, this diagram shows a lack of a strong diagonal line, and is mostly a muddled mix of grey squares. This indicates that the topics in JHotDraw are not very similar to the topics in jEdit, which is exactly what we expect. Another good sign.

Now, let's get down to business: how stable are my topics in JHotDraw? Let's see what happens when I change one of the parameters, say K, to a higher and lower value:

JHotDraw topics with (a) K=30 and (b) K=60, against the original K=45 topics.
In Subfigure (a), I reduced the number of topics from 45 to 30. The new topics (all 30 of them) are shown on the x-axis, and the old topics are shown on the y-axis. First, notice that the heatmap is no longer square, because the number of topics on the two axes differ. But more importantly, notice the strong diagonal line, which indicates that the topics are still similar, despite the different value of K! The same is true in Subfigure (b), where I increased the number of topics to 60. This somewhat surprising result means that the topics in JHotDraw are stable with respect to the exact value of K, and that I needn't worry too much. (I'll spare you the details of performing a similar analysis for the other three parameters: the topics are stable there, too.)

Conclusion

Topic stability diagrams are a quick and easy way to visually determine the stability of your topics. Use topic stability diagrams to find out how sensitive your topics are to the parameters of the topic model, or to impress your wife with interesting-looking heatmaps.

Appendix: R Code

# First, we need to compute the similarity matrix between the two sets of topics.
# j30.hall and j45.hall are topic model instances, with two key data structures:
#   $topics.p: a K by |V| matrix containing the word probabilities for each
#              topic, where |V| is the length of the vocabulary
#   $vocab:    a |V| by 1 matrix of character arrays, representing the vocabulary

s.30.45   = computeTopicStabilityMatrix(j30.hall, j45.hall)
s.30.45.s = reorderMatrix(s.30.45)

# Use the image function as our main workhorse to draw the heatmap
# (low divergences map to dark cells, high divergences to light cells)
image(1:(nrow(s.30.45.s)+1), 1:(ncol(s.30.45.s)+1), 
       s.30.45.s, col=gray.colors(100,start=0,end=1),
       xlab="Topic IDs (run 2)", ylab="Topic IDs (run 1)", 
       cex.axis=1.8, cex.lab=2.2)


#############################################################
#############################################################
# Given two topic model instances s1, s2, return the similarity matrix between all
# pairs of topics. Builds the matrix iteratively
computeTopicStabilityMatrix = function(s1, s2){
    sims = matrix(nrow=s1$K, ncol=s2$K)
    for (i in 1:s1$K){
        cat(sprintf("%d of %d\n", i, s1$K))
        for (j in 1:s2$K){
            sims[i,j] = compareTopicsFromDifferentModels(s1, s2, i, j)
        }
    }    
    sims
}   

#############################################################
#############################################################
# Given two topic model instances, a, b, (with possibly different vocabularies and 
# word orders), computes the topic similarity between topic ak in a and bk in b
# Note: deals with varying vocabs between the topic models
compareTopicsFromDifferentModels = function(a, b, ak, bk){
    a.newT = as.vector(a$topics.p[ak,])
    names(a.newT) = as.character(a$vocab[,])
    b.newT = as.vector(b$topics.p[bk,])
    names(b.newT) = as.character(b$vocab[,])

    # First, need to sort the vocabularies
    a.newT = a.newT[sort(names(a.newT), index.return=T)$ix]
    b.newT = b.newT[sort(names(b.newT), index.return=T)$ix]

    # Keep only the words that appear in both vocabularies
    ix = which(! (names(a.newT) %in% names(b.newT)))
    if (length(ix)>0){
        a.newT = a.newT[-ix]
    }
    ix = which(! (names(b.newT) %in% names(a.newT)))
    if (length(ix)>0){
        b.newT = b.newT[-ix]
    }

    sim  = computeTopicSimilarity(as.numeric(a.newT),as.numeric(b.newT))
    sim
}

#############################################################
#############################################################
# Given two topics a and b (as word probability vectors), compute the KL divergence
# between them. Lower values mean the topics are more similar. Requires flexmix.
computeTopicSimilarity = function(a, b){
    library(flexmix)
    kl = KLdiv(cbind(a,b), overlap=F)
    kl[2,1]
}

#############################################################
#############################################################
# Reorders the similarity matrix (rows and columns) so that the strongest similarities
# (i.e., lowest divergences) end up on the diagonal (only used for visual ease)
reorderMatrix = function(s2){
    s = s2 
    d = min(nrow(s),ncol(s))
    mins = apply(s, 1, min)
    ix = sort(mins, index.return=T)$ix
    s = s[ix,]

    used = vector(length=d)   # column positions already locked onto the diagonal
    # For each row, greedily swap its lowest-divergence unused column onto the diagonal
    for (i in 1:d){
        idx = sort(s[i,], index.return=T)$ix
        j = 1        
        while ((idx[j] %in% used)){
            j = j+1
        }
        used[i] = i
        s[,c(i,idx[j])] = s[,c(idx[j],i)]
    }
    s
}


