Wednesday 24 October 2012

CountNGrams - Using Apache Hadoop to Count N-Grams

I just put another tool on GitHub.

The tool, called CountNGramsuses Apache Hadoop to count the number of occurrences of n-grams in a given set of text files. In case you aren't familiar, an n-gram is a list of n words, which are adjacent to each other in a text file. For example, if the text file contained the text "one two three", there would be three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three"). Knowing which n-grams are in a set of files, such as all the bug reports of Eclipse, and their frequency over time, can shed light onto the major development trends in the project. 
I used the Apache Hadoop framework in order to easily support the analysis of big data: millions of files and hundreds of gigabytes or more. Hadoop takes care of distributing the workload across all available machines in your cluster, making the analysis fastfastfast and the implementation easyeasyeasy. Hadoop is truly awesome; learn all about it elsewhere.

I haven't published any papers that use CountNGrams, but one is in the works.
Check out the project for more details and to make your own changes. 

6 comments:

  1. Managing a business data is not an easy thing, it is very complex process to handle the corporate information both Hadoop and cognos doing this in a easy manner with help of business software suite, thanks for sharing this useful post….
    Regards,
    cognos tm1 Training in Chennai|cognos Certification|cognos Training in Chennai

    ReplyDelete
  2. A table is the basic unit of data storage in an oracle database. The table of a database hold all of the user accesible data. Table data is stored in rows and columns. But what is all about the clusters and how to handle it using oracle database system? Expecting a right answer from you. By the way you are maintaining a great blog. Thanks for sharing this in here.
    Oracle Training in Chennai | Oracle Course in Chennai | Oracle Training Center in Chennai

    ReplyDelete
  3. Maharashtra Police Patil Recruitment 2016

    Prefect explanation., Very Impressive and helpful information, Thanks to author for sharing.........

    ReplyDelete
  4. I read this article. I think You have put a lot of effort to create this article. I appreciate your work.
    Thank you much more for sharing with us...!
    Reactjs Training in Chennai |
    Best Reactjs Training Institute in Chennai |
    Reactjs course in Chennai

    ReplyDelete