Here is another Hadoop Word Count tutorial. Why, you ask? It is a learning exercise for me, and I am writing it out so that I can refer to it in the future. Also, rather than just copying the example that ships with the Hadoop installation, I will try to fix some shortcomings of the word count program. Before I do that, let’s write a stock-standard one.
For this walkthrough, if you want to call it that, I have Hadoop running in a single-node setup on Ubuntu 11.10. My preferred IDE is NetBeans.
Here it goes.
Create a project
First of all, create a Java project in NetBeans and call it HadoopWordCountTutorial. I also like to use proper package names, so my class HadoopWordCountTutorial is in the package com.thereforesystems.hadoop.

Add Libraries
The next thing we need to do is add some libraries. Here is a list of libraries required to compile our Hadoop project.

These jars can be found in your Hadoop folders. An easy way to find where things are is the locate command. For example, to locate hadoop-core-0.20.2-cdh3u4.jar, run this command in a terminal.
locate hadoop-core-0.20.2-cdh3u4.jar
On my machine the file is located in
/usr/lib/hadoop-0.20/
Once we have added the required libraries, we are all set to write some code.
Writing code
Hadoop is a framework which provides us the plumbing to write MapReduce operations (this is such an understatement). Here is a good tutorial on MapReduce. If you are not familiar with MapReduce, I suggest that you read it before continuing with this tutorial.
There are two operations we will write. One is the mapper and the other is the reducer. Our objective is to count words in one or many files and write the results to an output location. We will start with our mapper.
Mapper
A mapper in Hadoop is implemented by extending the Mapper class found in org.apache.hadoop.mapreduce. This class defines a map method, which we override with our own logic. Here is the code for our class.
```java
// Needs: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper
public static class WordCountMapper
        extends Mapper<Object /* KEYIN */, Text /* VALUEIN */, Text /* KEYOUT */, IntWritable /* VALUEOUT */> {

    // Reused for every token so we do not allocate a new Text per word.
    private final Text word = new Text();
    private static final IntWritable numberOne = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the incoming text on whitespace and emit (word, 1) for each token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, numberOne);
        }
    }
}
```
Let’s look at the map method. It tokenizes the text passed in; what gets passed in is handled by Hadoop. Keep in mind that the text of the entire file may not be passed to a single mapper, and this is a good thing. If the file were many gigabytes in size, Hadoop would take care of splitting it into blocks and would spin off a number of mappers to handle the chunks.
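To see what a single map call does without starting a Hadoop job, the tokenizing loop can be mimicked in plain Java. This is only a sketch: the class MapSketch and the method mapLine are hypothetical names that are not part of the tutorial project, and a list of pairs stands in for context.write. It assumes the default text input format, where each map call receives one line of the file.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch; not part of the Hadoop job itself.
class MapSketch {

    // Mirrors the body of WordCountMapper.map(): one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // The list takes the place of context.write(word, numberOne).
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each call to mapLine stands in for one invocation of map() on one line.
        for (Map.Entry<String, Integer> pair : mapLine("to be or not to be")) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Running main prints one word and a 1 per token. Nothing is summed at this stage; repeated words simply appear as repeated pairs, and Hadoop groups them by key before the reducer sees them.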
Reducer
The reducer is implemented in a class which extends Reducer. Here is the code for Reducer.
```java
// Needs: java.io.IOException, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
public static class WordCountReducer
        extends Reducer<Text /* KEYIN */, IntWritable /* VALUEIN */, Text /* KEYOUT */, IntWritable /* VALUEOUT */> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up every count that was emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
The method of interest here is reduce(), which receives an Iterable of IntWritable objects for a key. In our example a key is a word. For example, the word could be “Imagine”, which occurs many times in our file. After the mappers are done, the reducer will be called for the key “Imagine” with values such as [1, 1, 1, 1], one for each occurrence, because our mapper always emits 1. Within our reduce method we sum the values for each key and write the total out. The writing itself is handled by the Context for us.
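The grouping that happens between the mapper and the reducer is easy to lose sight of, so here is a rough plain-Java imitation of it. The class ReduceSketch and its methods are hypothetical and exist only for illustration; a sorted map stands in for Hadoop's shuffle, and every grouped value is 1 because that is all our mapper emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the Hadoop job itself.
class ReduceSketch {

    // Stand-in for the shuffle: collect each emitted 1 under its word, keys sorted.
    static Map<String, List<Integer>> groupByWord(List<String> emittedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : emittedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Mirrors WordCountReducer.reduce(): sum the values that arrived for one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("to", "be", "or", "not", "to", "be");
        // Print each word, the values grouped under it, and the reduced total.
        for (Map.Entry<String, List<Integer>> entry : groupByWord(emitted).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue()
                    + "\t" + reduce(entry.getValue()));
        }
    }
}
```

The reduce method here has the same shape as the real one: it never sees the whole data set, only the values belonging to a single key.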
Main method
The main method is where it all gets tied together. Let’s look at it.
```java
public static void main(String[] args)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration config = new Configuration();
    // Strip generic Hadoop options; what remains is <input-directory> <output-directory>.
    String[] otherArgs = new GenericOptionsParser(config, args).getRemainingArgs();

    Job job = new Job(config, "Word Count Tutorial");
    job.setJarByClass(HadoopWordCountTutorial.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // Types of the keys and values the job writes out.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    // Block until the job finishes; exit with 0 on success and 1 on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```
The first thing we do is create an instance of the Configuration object, which gives us the default configuration for our installation. Next we parse the arguments passed in. For the purpose of this example these are input-directory and output-directory. Note that Hadoop will create the output directory for us, and it must not already exist.
We then create an instance of the Job object by passing in the configuration instance and a name for our job. The next three lines tell Hadoop about our jar file, the mapper it should use and the reducer it should use for the job.
After this we call setOutputKeyClass and setOutputValueClass on the job instance. This tells Hadoop the types of the keys and values the job emits: Text for the words and IntWritable for the counts.
Finally we set the locations for input directory and output directory.
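Since the job expects exactly two arguments and an output directory that does not exist yet, a small check before submitting can save a failed run. The following is a sketch under a loud assumption: both paths live on the local filesystem, as they do in my single-node run. The class JobArgsSketch and the method validate are hypothetical names, and for paths on HDFS the check would have to go through Hadoop's FileSystem API instead of java.nio.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-alone sketch; assumes local filesystem paths, not HDFS.
class JobArgsSketch {

    // Returns null when the arguments look usable, otherwise a message describing the problem.
    static String validate(String[] args) {
        if (args.length != 2) {
            return "Usage: HadoopWordCountTutorial <input-directory> <output-directory>";
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        if (!Files.exists(input)) {
            return "Input path does not exist: " + input;
        }
        // Hadoop creates the output directory itself and refuses to overwrite one.
        if (Files.exists(output)) {
            return "Output directory already exists: " + output;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(2);
        }
        System.out.println("Arguments look fine; the job can be submitted.");
    }
}
```

In the real main method the same test would be applied to otherArgs, after GenericOptionsParser has removed any generic Hadoop options.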
Running the job
We are all set to run this job. I executed this job by pointing it to a directory which contains only one file. This file is lyrics for Imagine by John Lennon.
On my machine I executed the job with this command.
java -jar /home/deepak/NetBeansProjects/HadoopWordCountTutorial/dist/HadoopWordCountTutorial.jar /home/deepak/temp/HadoopWordCountTutorial/input /home/deepak/temp/HadoopWordCountTutorial/output
After the job is run, the output shows me how many times a particular word occurred in the file. Here is partial output.

What is wrong with this output? Take a look at the partial output above and you will notice that “A” has been counted as 1 and “a” as 2. Two changes resolve this. First, we can tell our StringTokenizer to treat punctuation characters as delimiters, so that they are not counted as part of a word.
```java
// Delimiters: whitespace (space, tab, newline, carriage return, form feed) plus punctuation.
StringTokenizer tokenizer = new StringTokenizer(value.toString(),
        " \t\n\r\f,.:;?[]'(),~!@#%^&*()_");
```
Also, when we call word.set we can call the toLowerCase method on the token. This makes all our keys lowercase and provides the expected output.
```java
word.set(tokenizer.nextToken().toLowerCase());
```
Here is the output after making two minor changes. We now have the count for “a” as 3. This is what we expected.
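The effect of both changes can be checked without running the job. The sketch below applies the same delimiter string and the same toLowerCase call to a made-up sentence; the class TokenizeSketch, the constant DELIMITERS and the method normalizedTokens are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-alone sketch; not part of the Hadoop job itself.
class TokenizeSketch {

    // Whitespace plus the punctuation we want to discard between words.
    static final String DELIMITERS = " \t\n\r\f,.:;?[]'(),~!@#%^&*()_";

    // Splits a line on the delimiters and lowercases every token, as the fixed mapper does.
    static List<String> normalizedTokens(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "A" and "a" now produce the same key, and the punctuation is gone.
        System.out.println(normalizedTokens("A cat, a dog; A bird."));
    }
}
```

One side effect worth knowing about: because the apostrophe is in the delimiter set, a contraction such as "don't" is split into the two tokens "don" and "t". Whether that is acceptable depends on what you want to count.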

Conclusion
This concludes the post. I hope you learned a thing or two here. These days I am spending more and more time with Hadoop and most importantly I am enjoying my time with it. Stay tuned for more ramblings as I make my way through this massive framework.





