Building Hadoop Job JAR with Maven

Recently we started using Maven more and more here at work. So it wasn’t long before I needed to create a Hadoop job jar to run on our cluster. I find Maven’s documentation confusing, mainly due to the lack of good examples of what fields can actually contain. I found a ton of examples via Google of creating executable jars, all of which like to unpack third party dependencies, which to say the least makes me feel really dirty. After enough scavenging I managed to put together what I feel is a reasonably simple and workable solution.

If you want to build a custom jar in Maven, stop looking at the jar plugin and move straight on to the assembly plugin. Let’s start by defining the assembly XML file. I put this in src/main/assembly as job.xml. I’m not sure whether that location is standard, but I saw that someone else had used this directory before and that was good enough for me. The file I created looks like this:

<assembly
	xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
	<id>job</id>
	<formats>
		<format>jar</format>
	</formats>
	<includeBaseDirectory>false</includeBaseDirectory>
	<dependencySets>
		<dependencySet>
			<unpack>false</unpack>
			<scope>runtime</scope>
			<outputDirectory>lib</outputDirectory>
			<excludes>
				<exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
			</excludes>
		</dependencySet>
		<dependencySet>
			<unpack>false</unpack>
			<scope>system</scope>
			<outputDirectory>lib</outputDirectory>
			<excludes>
				<exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
			</excludes>
		</dependencySet>
	</dependencySets>
	<fileSets>
		<fileSet>
			<directory>${basedir}/target/classes</directory>
			<outputDirectory>/</outputDirectory>
			<excludes>
				<exclude>*.jar</exclude>
			</excludes>
		</fileSet>
	</fileSets>
</assembly>

The first child XML node we see is the “id” node. Its value gets appended to the end of the filename of the output jar that Maven creates. The “format” we want here is obviously a jar file. Setting “includeBaseDirectory” to false keeps Maven from wrapping everything in a top-level directory, so the contents sit at the root of the jar.

The heart of this is the dependency sets and file sets. Straightaway you’ll see that I set “unpack” to false. This is so we don’t start unpacking all of our third-party jars. Instead we put them in a lib directory using the “outputDirectory” node. The first dependencySet covers dependencies with a scope of “runtime”. The curious part for me was that I have some dependencies with “scope” system because they are internal and I haven’t uploaded them to our in-house Maven repository. These were not picked up by the first dependency set, which is why I created the second one. The other important thing here is that Maven will include the default package jar it creates for your project in the dependency set. To me this seems really counter-intuitive, a “chicken before the egg” problem. So I added the project “groupId:artifactId” to the exclusion list of both dependency sets. Lastly, I added a fileSet to pack in all of my classes. I think this could be done using a filtered unpack on only our project jar, but I felt the fileSet made it easier to see what is going on.
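As an aside, the descriptor format also has a “useProjectArtifact” flag on dependencySet which, as far as I can tell, does the same job as the exclusion above. Here is a minimal sketch of the runtime set written that way. I am assuming your version of the assembly plugin supports the element, and I have not checked it against every release, so treat it as a starting point rather than a drop-in replacement:

```xml
<dependencySet>
	<!-- keep third-party jars intact and place them under lib/ -->
	<unpack>false</unpack>
	<scope>runtime</scope>
	<outputDirectory>lib</outputDirectory>
	<!-- assumption: false tells the plugin not to add the project's own jar to this set -->
	<useProjectArtifact>false</useProjectArtifact>
</dependencySet>
```

If this works for you, the “excludes” block in that set should no longer be needed; the same change would apply to the “system” set.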

Once the assembly XML file is in place, we need to add a section to the build plugins in our pom.xml.

<plugin>
	<artifactId>maven-assembly-plugin</artifactId>
	<configuration>
		<finalName>${project.name}-${project.version}</finalName>
		<appendAssemblyId>true</appendAssemblyId>
		<descriptors>
			<descriptor>src/main/assembly/job.xml</descriptor>
		</descriptors>
	</configuration>
</plugin>

This is pretty easy to figure out. We’re using the Maven assembly plugin and pointing it to the descriptor file we just created. Now let’s create our job jar:

mvn assembly:assembly

Assuming all goes well you should now have two jar files in your target directory. One will end with “-job” and that’s the one you want to use for running Hadoop jobs.
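If you would rather not invoke the plugin by hand, it can also be bound to the package phase so that a plain “mvn package” produces the job jar. Below is a sketch of the same plugin section with an execution added. The execution id “make-job-jar” is just a name I made up, and I am assuming a plugin version that provides the “single” goal (newer releases have, as far as I know, dropped assembly:assembly in favour of it):

```xml
<plugin>
	<artifactId>maven-assembly-plugin</artifactId>
	<configuration>
		<descriptors>
			<!-- path to the descriptor described earlier in this post -->
			<descriptor>src/main/assembly/job.xml</descriptor>
		</descriptors>
	</configuration>
	<executions>
		<execution>
			<!-- hypothetical id; any unique name will do -->
			<id>make-job-jar</id>
			<!-- run the assembly whenever the project is packaged -->
			<phase>package</phase>
			<goals>
				<goal>single</goal>
			</goals>
		</execution>
	</executions>
</plugin>
```

With this in place the “-job” jar should show up in target next to the regular jar after every package build.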

This should be generic enough to work for most people. It puts all of your third-party dependencies in the lib subdirectory, leaving out your project jar and, as long as Hadoop is declared with “provided” scope in your pom, Hadoop itself.
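Keeping Hadoop out of the job jar depends on how it is declared in the project pom. Here is a sketch of such a declaration. The “hadoop-client” artifact and the “hadoop.version” property are my assumptions for illustration, so substitute whatever Hadoop artifact and version your cluster actually runs:

```xml
<dependency>
	<groupId>org.apache.hadoop</groupId>
	<!-- assumption: use the Hadoop artifact that matches your cluster -->
	<artifactId>hadoop-client</artifactId>
	<!-- hypothetical property; define it in <properties> or use a literal version -->
	<version>${hadoop.version}</version>
	<!-- provided: available at compile time, supplied by the cluster at run time -->
	<scope>provided</scope>
</dependency>
```

Because a “runtime” dependency set does not pick up “provided” dependencies, Hadoop stays out of the lib directory without any explicit exclude.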


5 thoughts on “Building Hadoop Job JAR with Maven”

  1. Pingback: Inductive Bias » Building a Hadoop Job Jar with Maven

  2. Hi, good stuff!

    I just tried this myself and it worked like a charm. One thing is that I set the Hadoop dependency in the project pom.xml with “scope=provided” and it takes care of excluding it in the assembly. This means you can (and should, or you get a warning!) safely drop the “excludes” section in the “runtime” dependency set.

    Cheers,
    Lars

    • If you’re working on a scheduler you probably just want to build a regular jar (not the job jar) package and then copy it to every node under $HADOOP_HOME/lib/. What I’m calling a “job jar” (for lack of a better term) is meant for MapReduce jobs only, not extensions to Hadoop itself.
