Sunday, 1 April 2012

Write your first Hadoop C++ program (WordCount)

Step 1: Write the following program using any editor, then save it.
You can use the vi editor: type vi WordCount.cpp at your command prompt, press i to enter insert mode, type the program, then press Esc and type :wq to save and quit.
// WordCount.cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>   // for std::vector used in the mapper

#include <stdint.h>  // <--- to prevent uint64_t errors!

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

using namespace std;

class WordCountMapper : public HadoopPipes::Mapper {
public:
  // constructor: does nothing
  WordCountMapper( HadoopPipes::TaskContext& context ) {
  }

  // map function: receives a line, outputs (word,"1")
  // to reducer.
  void map( HadoopPipes::MapContext& context ) {
    //--- get line of text ---
    string line = context.getInputValue();

    //--- split it into words ---
    vector< string > words =
      HadoopUtils::splitString( line, " " );

    //--- emit each word tuple (word, "1" ) ---
    for ( unsigned int i=0; i < words.size(); i++ ) {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  // constructor: does nothing
  WordCountReducer(HadoopPipes::TaskContext& context) {
  }

  // reduce function
  void reduce( HadoopPipes::ReduceContext& context ) {
    int count = 0;

    //--- sum the counts of all tuples that share this key ---
    while ( context.nextValue() ) {
      count += HadoopUtils::toInt( context.getInputValue() );
    }

    //--- emit (word, count) ---
    context.emit(context.getInputKey(), HadoopUtils::toString( count ));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}
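
If you want to sanity-check the split-and-count logic before involving Hadoop at all, here is a minimal standalone sketch of the same idea in plain C++. It needs no Hadoop headers; splitOnSpaces is a hypothetical stand-in for HadoopUtils::splitString (note that operator>> drops empty tokens, which may differ slightly from the real splitString):

// standalone_check.cpp -- plain C++ sketch of the map/reduce logic above
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// hypothetical stand-in for HadoopUtils::splitString( line, " " )
vector< string > splitOnSpaces( const string& line ) {
  vector< string > words;
  istringstream in( line );
  string word;
  while ( in >> word ) {   // >> skips runs of whitespace
    words.push_back( word );
  }
  return words;
}

int main() {
  string line = "the quick brown fox jumps over the lazy fox";

  //--- "map": emit (word, 1); "reduce": sum the 1s per key ---
  vector< string > words = splitOnSpaces( line );
  map< string, int > counts;
  for ( unsigned int i = 0; i < words.size(); i++ ) {
    counts[ words[i] ] += 1;
  }

  //--- print (word, count) pairs, like the job's part-00000 ---
  for ( map< string, int >::const_iterator it = counts.begin();
        it != counts.end(); ++it ) {
    cout << it->first << "\t" << it->second << endl;
  }
  return 0;
}

Compile and run it with plain g++ (g++ standalone_check.cpp -o standalone_check && ./standalone_check); each distinct word prints once with its count, which is exactly what the Hadoop job should produce at scale.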
Step 2:
Compile the program using g++. The following commands compile and link it (adjust the include and library paths to match your Hadoop installation and architecture):

g++ -I/opt/hadoop/c++/Linux-amd64-64/include -c WordCount.cpp
g++ WordCount.o -o WordCount -L/opt/hadoop/c++/Linux-amd64-64/lib -lnsl -lpthread -lhadooppipes -lhadooputils

This creates the WordCount binary in your current folder.
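
The comments below show that 32-bit vs 64-bit path mismatches are a common stumbling block, so it is worth confirming your binary's architecture up front; the standard file utility does this:

file WordCount

On a 64-bit machine it should report an ELF 64-bit executable, in which case the Linux-amd64-64 include and lib paths above are the right ones (use Linux-i386-32 on a 32-bit machine).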

Step 3:
Now put this binary into HDFS using the following commands:

hadoop fs -mkdir /user/test
hadoop fs -put WordCount /user/test/
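
To confirm the binary landed where expected, you can list the directory:

hadoop fs -ls /user/test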

Step 4:
Now make an input file for your test using the following commands:
vi input.txt
Press i, write a few lines, then press Esc and type :wq.
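
For example (purely as an illustration), input.txt could contain a few repeated words so the final counts are easy to verify by eye:

hello hadoop
hello pipes
hadoop pipes pipes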

Step 5:
Copy your input file into HDFS using the following command:
hadoop fs -put input.txt /user/test/

Step 6:
Now run your program using the following command:

hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -program /user/test/WordCount -input /user/test/input.txt -output /user/test/output

The two -D options tell Pipes to use Hadoop's built-in Java record reader and writer, so the C++ code only has to deal with plain lines of text. Once the job runs successfully, you can see your output in the output directory we gave at run time, /user/test/output.
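
You can first list the output directory to see the part files the job produced:

hadoop fs -ls /user/test/output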

To see the output, type the following command:

hadoop fs -text /user/test/output/part-00000

Any queries are welcome.
Good luck :)

18 comments:

  1. My Hadoop is working fine with Java, but when I run any C++ example, including the one above, I get the following errors:
    wordcount.cpp:8:27: error: hadoop/Pipes.hh: No such file or directory
    wordcount.cpp:9:37: error: hadoop/TemplateFactory.hh: No such file or directory
    wordcount.cpp:10:33: error: hadoop/StringUtils.hh: No such file or directory
    wordcount.cpp:14: error: ‘HadoopPipes’ has not been declared
    wordcount.cpp:14: error: expected ‘{’ before ‘Mapper’
    wordcount.cpp:14: error: invalid type in declaration before ‘{’ token
    wordcount.cpp:14: warning: extended initializer lists only available with -std=c++0x or -std=gnu++0x
    wordcount.cpp:15: error: expected primary-expression before ‘public’
    wordcount.cpp:15: error: expected ‘}’ before ‘public’
    wordcount.cpp:15: error: expected ‘,’ or ‘;’ before ‘public’
    wordcount.cpp:22: error: variable or field ‘map’ declared void
    wordcount.cpp:22: error: ‘HadoopPipes’ has not been declared
    wordcount.cpp:22: error: ‘context’ was not declared in this scope



    Please help me. I have installed Hadoop many times.

    Replies
    1. Hi Gul,
      The errors show that you are not including the proper path. Here is a command; please try it after changing the paths to match your own files, and let me know exactly what command you are using to compile your program:

      rm -rf *.o *~ wordcount

      g++ wordcount.cpp -I /usr/include/libxml2 -I /usr/src/hadoop-0.20.1+133/c++/Linux-i386-32/include -I /usr/src/hadoop-0.20/src/c++/librecordio/ -L/usr/src/hadoop-0.20.1+133/c++/Linux-i386-32/lib -lhadooppipes -lhadooputils -lpthread -lxml2 -o wordcount

  2. I am getting the same errors as in the first comment (hadoop/Pipes.hh: No such file or directory, and so on) even after verifying the path.

  3. Replies
    1. Hi, I think the problem is that these header files actually live under hadoop/src/c++/pipes/api/hadoop/ (for example, TemplateFactory.hh), not at hadoop/TemplateFactory.hh, so you need to adjust your include path accordingly.

  4. Hi Rakhi,

    I used the following command from your Step 2 to generate the binary and got an error:
    hduser@localhost:/usr/local/hadoop$ g++ wordcount.o -o wordcount -L/opt/usr/local/hadoop/c++/Linux-i386-32/lib -lnsl -lpthread -lhadooppipes -lhadooputils
    /usr/bin/ld: cannot find -lhadooppipes
    /usr/bin/ld: cannot find -lhadooputils
    collect2: ld returned 1 exit status

    Replies
    1. Sorry for the late reply. Try using the commands below once and let me know:
      g++ -I/opt/hadoop/c++/Linux-i386-32/include -c wordcount.cpp
      g++ wordcount.o -o wordcount -L/opt/hadoop/c++/Linux-i386-32/lib -lnsl -lpthread -lhadooppipes -lhadooputils

    2. Also, please check that libhadooppipes.a and libhadooputils.a are present in the library path /opt/hadoop/c++/Linux-i386-32/lib.

    3. I'm getting that same error. I'm using 64-bit Linux and the .a files are in the correct directory.

    4. I am also getting the same issue on a 64-bit machine.

  5. Hi, thanks for the nice tutorial. I am on Mac OS X. I am trying to compile the code using the following command:
    g++ -I/usr/lib/c++/v1 -c wordcount.cpp

    I chose that path because my C++ header files are inside the /usr/lib/c++/v1 directory.

    But I am getting the error:

    fatal error: 'hadoop/Pipes.hh' file not found

    so what should I do to make it work? Please let me know. Thanks in advance.

    Replies
    1. Hi,
      1. Check whether you have a hadoop directory inside the path /usr/lib/c++/v1. If not, create it and put the Hadoop Pipes headers there, then try again and please let me know.
      2. Please check the command once:
      g++ -I/opt/hadoop/c++/Linux-amd64-64/include -c wordcount.cpp
      In this, -I is a capital i.

  6. Hi All,

    Hope you have typed the command correctly:
    g++ -I/opt/hadoop/c++/Linux-amd64-64/include -c wordcount.cpp
    In this, -I is a capital i.

  7. Hi,
    I am getting an error when running the WordCount in Hadoop. I also tried the Java word count, which worked.

    Please help.

    13/11/28 10:58:52 INFO mapred.LocalJobRunner: Map task executor complete.
    13/11/28 10:58:52 WARN output.FileOutputCommitter: Output Path is null in cleanupJob()
    13/11/28 10:58:52 WARN mapred.LocalJobRunner: job_local1620942588_0001
    java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
    Caused by: java.lang.NullPointerException
    at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:104)
    at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:69)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
    13/11/28 10:58:52 INFO mapreduce.Job: Job job_local1620942588_0001 running in uber mode : false
    13/11/28 10:58:52 INFO mapreduce.Job: map 0% reduce 0%
    13/11/28 10:58:52 INFO mapreduce.Job: Job job_local1620942588_0001 failed with state FAILED due to: NA
    13/11/28 10:58:52 INFO mapreduce.Job: Counters: 0
    Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:264)
    at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:503)
    at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:518)

    Replies
    1. Please let me know which version of Hadoop to use. I am using 2.2.0.

    2. Hi, I have the same problem with Hadoop 2.2.0. It is not a problem with your code; it occurs before the executable runs. Can you post your configuration files?

    3. I have the same problem on running the wordcount. But I still don't know how to solve it... Do you have any ideas?
      Hadoop 2.2.0

    4. Hi everybody! Did anyone solve the problem?
