How to Compile Hadoop Code

Before going ahead with this article, I want you to unlearn all the Big Data and Hadoop that you have learnt in the past, and just look at Hadoop as a Java project. A Java Maven project, to be more specific.

So, I guess you landed here because:

  • You wish to understand the internals of Hadoop
  • You want to make changes to the Hadoop ecosystem
  • You want to design components across Hadoop

There is so much you can do once you understand the internals, but let's keep the scope of this blog to building the Hadoop code locally. Importing the Hadoop code into Eclipse will be covered later.

So, what is Hadoop? For me, just a simple Java project. Well, not that simple, but it is just like any other Java Maven project. So, Maven clean -> compile -> install is all you need to do.
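In command form, that cycle looks like this (just a sketch; the exact invocation I used appears later in the article):

$ mvn clean install    # 'install' runs compile and package along the way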

Our steps are:

  1. Install the required tools and dependencies
  2. Download the source code
  3. Build it

Install the required dependencies

You will need:

  1. A Linux machine. I have used Ubuntu 14.04, but any other Linux distribution should do. Even a Raspberry Pi 🙂
  2. Oracle Java. It is required because Hadoop uses the sun.* classes.
    1. Install the PPA.
      1. sudo add-apt-repository ppa:webupd8team/java
    2. Update and install.
      1. sudo apt update; sudo apt install oracle-java8-installer
    3. Check your Java version. It should be Java 8.
  3. You will need to hook up the JDK's tools.jar separately, as some of the YARN projects need it.
    1. Fix the tools.jar path. The code expects it at this location:
      ${java.home}/../lib/tools.jar
    2. My tools.jar came from the oracle-java8 install. To make this work, I created a symlink as below.
      1. Command :
        1. $ ln -s /usr/lib/jvm/java-8-oracle/lib/tools.jar /usr/lib/tools.jar
          abhay@mean-machine:/opt$ ls -ltr /usr/lib/tools.jar 
          lrwxrwxrwx 1 root root 40 Mar 9 21:26 /usr/lib/tools.jar -> /usr/lib/jvm/java-8-oracle/lib/tools.jar
          abhay@mean-machine:/opt$
  4. Maven : Required to compile the code and generate the artifacts.
    1. $ sudo apt-get install maven
  5. protoc : Required for compiling the Protocol Buffers (*.proto) definitions.
    1. $ sudo apt-get install protobuf-compiler
    2. $ protoc --version
      libprotoc 2.6.1
      $

That's it, you are done with the dependencies. A quick sanity check follows below.
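Here is a minimal check for everything installed above (the version strings are from my machine; yours may differ slightly):

$ java -version                 # should report a 1.8.x build
$ ls -l /usr/lib/tools.jar      # should resolve to the Oracle JDK's tools.jar
$ mvn -version                  # any reasonably recent Maven 3.x is fine
$ protoc --version              # mine printed: libprotoc 2.6.1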

Download the source code

There are a couple of choices here. You can download it directly from:

  • The Hadoop git repository
  • A vendor like Cloudera or Hortonworks

I prefer to download the code from Cloudera; their releases are supposed to be stable.

Here is the link : http://archive.cloudera.com/cdh5/cdh/5/hadoop-latest.tar.gz

It provides you the compiled deliverables, as well as the source code that was used to create those artifacts.
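For example, fetching and unpacking it under /opt looks like this (a sketch; the exact directory name depends on which CDH release "latest" points to at the time):

$ cd /opt
$ wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-latest.tar.gz
$ tar -xzf hadoop-latest.tar.gz
$ ls    # expect a hadoop-2.6.0-cdh5.x.x style directory, with a src/ inside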

Some tweaks you need to make compilation go smoothly!

Once you extract the tarball from the above step, you will get a “src” directory. Go into it and compile using Maven. Compiling the code directly might fail; it failed for me, at least. Here are the changes I had to make to get it working (a tip on tracking these edits as a patch follows the list).

  1. Java version
    1. Error was :
      1. Detected JDK Version: 1.8.0-121 is not in the allowed range [1.7.0,1.7.1000}].
    2. Resolution was :
      1. Update the java version in src/pom.xml
      2. $ git diff pom.xml
        diff --git a/pom.xml b/pom.xml
        index 22f932d..4204891 100644
        --- a/pom.xml
        +++ b/pom.xml
        @@ -105,8 +105,8 @@ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xs
           <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
           <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        -  <javaVersion>1.7</javaVersion>
        -  <targetJavaVersion>1.7</targetJavaVersion>
        +  <javaVersion>1.8</javaVersion>
        +  <targetJavaVersion>1.8</targetJavaVersion>
        
  2. Tools.jar
    1. This jar is required for packages like com.sun.javadoc.*, which are used in the hadoop-annotations code. So below is the change you need to make in src/hadoop-common-project/hadoop-annotations/pom.xml.
    2. $ git diff hadoop-common-project/hadoop-annotations/pom.xml
      diff --git a/hadoop-common-project/hadoop-annotations/pom.xml b/hadoop-common-project/hadoop-annotations/pom.xml
      index b6996b0..5d5e43d 100644
      --- a/hadoop-common-project/hadoop-annotations/pom.xml
      +++ b/hadoop-common-project/hadoop-annotations/pom.xml
      @@ -40,15 +40,15 @@
         <profile>
      -    <id>jdk1.7</id>
      +    <id>jdk1.8</id>
           <activation>
      -      <jdk>1.7</jdk>
      +      <jdk>1.8</jdk>
           </activation>
           <dependencies>
             <dependency>
               <groupId>jdk.tools</groupId>
               <artifactId>jdk.tools</artifactId>
      -        <version>1.7</version>
      +        <version>1.8</version>
               <scope>system</scope>
               <systemPath>${java.home}/../lib/tools.jar</systemPath>
             </dependency>
           </dependencies>
         </profile>
      
    3. Also, ensure your ${java.home}/../lib/tools.jar path is accessible; the symlink created earlier takes care of this.
  3. protobuf version
    • My protoc version was 2.6.1, while CDH was expecting 2.5.1, so I had to update the version in hadoop-project/pom.xml.
    • $ git diff hadoop-project/pom.xml
      diff --git a/hadoop-project/pom.xml b/hadoop-project/pom.xml
      index 9e00060..0e0bcb3 100644
      --- a/hadoop-project/pom.xml
      +++ b/hadoop-project/pom.xml
      @@ -71,7 +71,7 @@
      
           <protoc.path>${env.HADOOP_PROTOC_CDH5_PATH}</protoc.path>
      -    <protobuf.version>${cdh.protobuf.version}</protobuf.version>
      +    <protobuf.version>2.6.1</protobuf.version>
           <protobuf.path>${cdh.protobuf.path}</protobuf.path>
           <zookeeper.version>3.4.6</zookeeper.version>
      
  4. With code version hadoop-2.6.0-cdh5.10.0.tar.gz, I saw a compilation failure caused by Java 1.8, something like below:
    • [ERROR] /home/abhay/MyHome/WorkArea/CodeHome/Hadoop/CDH/hadoop-2.6.0-cdh5.10.0/
      src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java:[101,10] 
      error: unreported exception Throwable; must be caught or declared to be thrown
    • This is because of a change in generic type inference between Java 1.7 and 1.8: JDK 1.8 infers the thrown type of instantiateException as a plain Throwable, which must then be caught or declared. Below is a quick patch that adds helper methods with tighter type bounds, so the compiler infers the right exception type.
    • $ git diff hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java
      diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java
      index ada0669..1086ff0 100644
      --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java
      +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ipc/RPCUtil.java
      @@ -70,6 +70,21 @@ public static YarnException getRemoteException(String message) {
           }
         }
      
      +  private static <T extends YarnException> T instantiateYarnException(
      +      Class<? extends T> cls, RemoteException re) throws RemoteException {
      +    return instantiateException(cls, re);
      +  }
      +
      +  private static <T extends IOException> T instantiateIOException(
      +      Class<? extends T> cls, RemoteException re) throws RemoteException {
      +    return instantiateException(cls, re);
      +  }
      +
      +  private static <T extends RuntimeException> T instantiateRuntimeException(
      +      Class<? extends T> cls, RemoteException re) throws RemoteException {
      +    return instantiateException(cls, re);
      +  }
      +
         /**
          * Utility method that unwraps and returns appropriate exceptions.
          *
      @@ -94,17 +109,17 @@ public static Void unwrapAndThrowException(ServiceException se)
             // Assume this to be a new exception type added to YARN. This isn't
             // absolutely correct since the RPC layer could add an exception as
             // well.
      -      throw instantiateException(YarnException.class, re);
      +      throw instantiateYarnException(YarnException.class, re);
           }
           if (YarnException.class.isAssignableFrom(realClass)) {
      -      throw instantiateException(
      +      throw instantiateYarnException(
               realClass.asSubclass(YarnException.class), re);
           } else if (IOException.class.isAssignableFrom(realClass)) {
      -      throw instantiateException(realClass.asSubclass(IOException.class),
      +      throw instantiateIOException(realClass.asSubclass(IOException.class),
               re);
           } else if (RuntimeException.class.isAssignableFrom(realClass)) {
      -      throw instantiateException(
      +      throw instantiateRuntimeException(
               realClass.asSubclass(RuntimeException.class), re);
           } else {
             throw re;
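A practical note on the diffs above: the extracted tarball may not be a git repository out of the box, yet I have shown the tweaks as git diffs. One easy way to get the same workflow (assuming git is installed; the patch file name is just an example):

$ cd src
$ git init -q && git add -A && git commit -q -m "pristine CDH source"
# ...make the pom.xml and RPCUtil.java edits described above...
$ git diff > ../local-build-fixes.patch   # keeps the tweaks handy for the next source drop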

Compile it!

Now that you have applied all the patches, you can go ahead and compile. I would recommend skipping the tests for now, as they would increase the build time. We need to compile before importing into an IDE, because there are some Protocol Buffers (*.proto) files that need to be compiled into Java sources first; here is more info on protobuf: https://en.wikipedia.org/wiki/Protocol_Buffers. Now for the judgement time: if everything goes well, you will see a BUILD SUCCESS message.

$ cd src
$ mvn -DskipTests clean install

Maven will download and install all the project dependencies into its local repository, so be patient during the first compilation.
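A related tip: after the first successful full build, you rarely need to rebuild the entire tree. Maven can rebuild a single module plus whatever it depends on; a sketch, with hadoop-common picked purely for illustration:

$ mvn install -DskipTests -pl hadoop-common-project/hadoop-common -am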

On my machine (Core i5 vPro, 16 GB RAM), it took around 4 minutes to compile without tests.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:08 min
[INFO] Finished at: 2017-03-13T22:15:53+05:30
[INFO] Final Memory: 271M/1535M
[INFO] ------------------------------------------------------------------------

Voila!
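If you also want a full distribution tarball, rather than just the jars in your local Maven repository, the dist profile described in Hadoop's BUILDING.txt should do it; something like:

$ mvn package -Pdist -DskipTests -Dtar
# the bundle should land under hadoop-dist/target/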

Simple, ain’t it? If you hit any trouble while following the steps, do leave a comment, and I will respond with a solution.

Most of these steps should work on any Debian-based system. They can even work on a Raspberry Pi, with the corresponding debs available for it. The opportunities are endless!

Thanks for reading through! In the next article, we will see how to import the code into an IDE.

Thanks and regards,

Abhay Dandekar
