Thursday, 26 July 2012

A Linux one-liner to count the unique words in a set of files

Say you have a bunch of text files and you want to find all the unique words in them, along with how often each word appears. Using the brilliance of Linux commands, this can be achieved in a few strokes of the finger:

$ cat *.txt | tr " " "\n" | sort | uniq -c

Use tr to put each word (separated by spaces here, though you can make the splitting more sophisticated if you need to) onto its own line. sort then sorts all these words. Finally, uniq gets rid of duplicate lines, and its -c flag adds a count to each. I'll run this on a small set of bug reports from the Eclipse Bugzilla repository:

$ cat * | tr " " "\n" | sort | uniq -c
      1 ^^^^
      6 able
     12 about
      1 aboutdialog
      1 abovebackground
      1 absolute
     18 abstract
      2 abstractannotationprocessormanager
      4 abstractcompletiontest
      1 abstractdebugeventhandler
    [...]
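If your input separates words with tabs or runs of spaces, splitting on a single space leaves stray tokens. One way to make the splitting more robust (a sketch, using tr's -s flag to squeeze any run of whitespace into a single newline; the printf input is just a stand-in for your files):

```shell
# Split on any run of whitespace (spaces, tabs, newlines),
# not just single spaces, then count as before.
printf 'foo  bar\tbaz\nfoo bar\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c
```

Without -s, the double space between "foo" and "bar" would produce an empty line that gets counted as its own "word".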

The words are displayed in alphabetical order. (Yes, the word "^^^^" actually appeared somewhere in the bug reports.) To display them by usage in decreasing order instead, run it all through sort one more time:

$ cat * | tr " " "\n" | sort | uniq -c | sort -gr
   1118 java
   1006 eclipse
   1005 at
    956 org
    825 the
    546 internal
    361 to
    356 jdt
    320 ui
    318 in
   [...]

The -g option sorts by numeric value instead of alphabetically (-n works too, and is more portable), and the -r option reverses the order. Voilà!

18 comments:

  1. Thanks for this one-liner! Using the tr command is key here.

  2. And tr is SO much faster than using a "while read line" loop

  3. Thank you! I was a bit boggled to realize there wasn't a utility for this.
