grep to the rescue, again
Like most other areas of my life, this problem can be easily solved with a Linux one-liner centered around grep:
cat *.tex | grep -wo "[A-Z]\{2,10\}" | sort | uniq -c | sort -gr
Let's take a look at the pipeline:
- The cat *.tex outputs all my LaTeX to standard output.
- The grep -wo "[A-Z]\{2,10\}" matches whole words (the -w flag) that consist of between 2 and 10 uppercase letters. The -o flag returns only the match, not the entire line.
- The first sort sorts the acronyms, which is useful for the next step.
- The uniq gets rid of duplicates, but retains a counter because of the -c flag.
- Finally, the second sort sorts the entries numerically (-g) and reverses the results (-r).
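To see the stages working together without a thesis at hand, here is a small sketch on made-up text. The count_acronyms helper name and the sample sentences are mine, purely for illustration; the pattern is written as [A-Z]\{2,10\} to express 2 to 10 uppercase letters, and LC_ALL=C is an extra precaution so that [A-Z] means ASCII uppercase only.

```shell
# Hypothetical helper: count all-caps words read from standard input.
count_acronyms() {
    # LC_ALL=C keeps [A-Z] restricted to ASCII uppercase letters.
    LC_ALL=C grep -wo "[A-Z]\{2,10\}" | sort | uniq -c | sort -gr
}

# Made-up sample text standing in for the .tex files.
printf '%s\n' \
    'We compare LDA and LSI for IR tasks.' \
    'LDA outperforms LSI on this corpus.' \
    'An LDA model is trained once.' | count_acronyms
```

This should list LDA with a count of 3, LSI with 2, and IR with 1. "We" and "An" are skipped because they contain only one uppercase letter.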
Here's the output on my thesis:
292 IR
241 LDA
166 LSI
125 VSM
87 TCP
80 EM
35 APFD
34 SUT
29 HSD
22 TOPIC
22 II
18 MALLET
16 LOC
14 PS
14 OR
14 IDE
12 CALLG
11 ICA
10 RNDM
10 MAP
10 KL
10 CS
...
Note that this command works with any text file; it is not specific to LaTeX. Just change the cat command.
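As a rough sketch of that idea, the pipeline can be wrapped in a small function that takes the files as arguments. The acronyms_in name and the file names in the usage comments are placeholders, not anything from my setup.

```shell
# Hypothetical helper: count acronyms in whatever files are passed in.
acronyms_in() {
    # Only the cat stage changes; the rest of the pipeline stays the same.
    cat "$@" | LC_ALL=C grep -wo "[A-Z]\{2,10\}" | sort | uniq -c | sort -gr
}

# Usage (file names are placeholders):
#   acronyms_in *.md
#   acronyms_in notes.txt README
```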
Hi Stephen,
Two questions.
1- There are some acronyms like "QoS" (Quality of Service) that combine uppercase and lowercase letters. How can I change your command to detect such acronyms as well?
2- What are the numbers on the left-hand side of the acronyms?
PS: it seems that some words written in uppercase are included in the list.
Cheers,
Homayoon
I changed the command to
cat *.tex | grep -wo "[A-Zo-]\+\{2,5\}" | sort | uniq -c | sort -gr
Cheers
Homayoun,
Thanks for your reply.
Your command doesn't quite work. It only allows the lowercase letter to be 'o', does not require any following characters to be uppercase, and in fact does not require the first character to be uppercase either. Hence, 'Qoo' would match, and so would 'oN'. To allow any lowercase characters in the middle of the word (while still requiring uppercase characters at least as the first and last character of the word), you can try:
cat *.tex | grep -wo "[A-Z]\+[a-z]*[A-Z]\+" | sort | uniq -c | sort -gr
The numbers on the left-hand side indicate the number of times the acronym appears in the text. To remove the numbers, remove the -c flag of the uniq command (and hence you will no longer need the second sort command).
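Here is a quick sketch of the mixed-case pattern without the counts, run on a made-up sentence. The list_acronyms name is just for illustration, and sort -u is used as shorthand for sort | uniq.

```shell
# Hypothetical helper: list each acronym once, without counts.
list_acronyms() {
    # Uppercase at both ends, optional lowercase letters in the middle.
    LC_ALL=C grep -wo "[A-Z]\+[a-z]*[A-Z]\+" | sort -u
}

# Made-up sample sentence.
printf '%s\n' 'QoS and TCP matter; Qoo and oN do not, but QoS does.' | list_acronyms
```

This should print QoS and TCP, one per line. 'Qoo' is rejected because it does not end in an uppercase letter, and 'oN' because it does not start with one.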