Doofus Software: CountNGrams - Using Apache Hadoop to Count N-Grams

Wednesday, 24 October 2012

I just put another tool on GitHub.

The tool, called CountNGrams, uses Apache Hadoop to count the occurrences of n-grams in a given set of text files. In case you aren't familiar, an n-gram is a sequence of n adjacent words in a text. For example, if a text file contained the text "one two three", there would be three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three"). Knowing which n-grams appear in a set of files, such as all the bug reports of Eclipse, and how their frequencies change over time, can shed light on the major development trends in the project.
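To make the definition concrete, here is a minimal sketch in plain Python that reproduces the "one two three" example. It is only an illustration, not code from CountNGrams: the function name `ngrams` is made up for this post, and splitting on whitespace is an assumption about tokenization.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as underscore-joined strings.

    Assumption: words are separated by whitespace. A real tool may
    tokenize differently (lowercasing, stripping punctuation, etc.).
    """
    words = text.split()
    # Slide a window of n words across the text, one word at a time.
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "one two three"
for n in (1, 2, 3):
    print(n, ngrams(text, n))
# 1 ['one', 'two', 'three']
# 2 ['one_two', 'two_three']
# 3 ['one_two_three']
```

If n is larger than the number of words, the window never fits and the result is an empty list.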

I used the Apache Hadoop framework to easily support the analysis of big data: millions of files and hundreds of gigabytes or more. Hadoop takes care of distributing the workload across all available machines in your cluster, making the analysis fastfastfast and the implementation easyeasyeasy. Hadoop is truly awesome; learn all about it elsewhere.
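For a rough feel of why counting n-grams suits Hadoop, the sketch below imitates the map, shuffle, and reduce steps in plain Python on a single machine. It does not use the Hadoop API and is not the CountNGrams implementation; the names `map_phase`, `shuffle`, `reduce_phase`, and `count_ngrams` are hypothetical, and whitespace tokenization is assumed. On a real cluster, the framework would run the map step on many files in parallel and handle the grouping between the steps itself.

```python
from collections import defaultdict


def map_phase(document, n):
    """Emit an (n-gram, 1) pair for every n-gram in one document."""
    words = document.split()  # assumption: whitespace tokenization
    for i in range(len(words) - n + 1):
        yield "_".join(words[i:i + n]), 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the emitted counts for each n-gram."""
    return {key: sum(values) for key, values in grouped.items()}


def count_ngrams(documents, n):
    """Run the three steps in sequence over a list of document strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc, n))
    return reduce_phase(shuffle(pairs))


docs = ["one two three", "two three four"]
counts = count_ngrams(docs, 2)
print(counts)
# {'one_two': 1, 'two_three': 2, 'three_four': 1}
```

Because each document is mapped independently and each n-gram is summed independently, both steps can be spread across machines without coordination, which is the part Hadoop takes off your hands.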

I haven't published any papers that use CountNGrams, but one is in the works.

Check out the project for more details and to make your own changes.
