How to submit a job to a remote JobTracker

One of the first things you start wondering after your very first Hadoop jobs is: what should I do to submit a job to a remote JobTracker? Otherwise you’re restricted to running your jobs as the same user and on the same machine where your JobTracker lives.

So here are a few hints on how to do it.

The JobTracker should not be listening only for localhost connections. To change that, we’ll update the mapred-site.xml file like this:


<property>
<name>mapred.job.tracker</name>
<value>192.168.1.1:54311</value>
</property>

After that, the JobTracker at this address starts accepting connections on that port.

You obviously also have to update all other references to localhost, for example the ones in the masters and slaves files.

The next thing to update is the core-site.xml file:


<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.1:54310</value>
</property>

This way the Hadoop Distributed File System is not listening only on localhost either.

After these small changes we must tell our clients to submit the job to the desired JobTracker. To do that we have to:


import org.apache.hadoop.conf.Configuration;
// Point the client-side configuration at the remote JobTracker
Configuration config = new Configuration();
config.set("mapred.job.tracker", "192.168.1.1:54311");

Then our job is ready to be submitted to the desired JobTracker, but there are still two other little things we have to take care of.

First of all, we have to be aware that all the classes related to the job must be properly shipped to the tracker; to do that we can, for example, package the job as a jar and run it from the console or with the hadoop command, as in the sketch below.
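
As a minimal sketch (the driver class, job name and jar name are just hypothetical placeholders), the usual way to make sure those classes travel with the job is to have the driver call setJarByClass, package everything into a jar, and launch it with something like hadoop jar my-job.jar:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver class; the jar built from it is what gets shipped to the cluster.
public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    config.set("mapred.job.tracker", "192.168.1.1:54311");

    Job job = new Job(config, "remote-submission-example");
    // Hadoop locates the jar containing this class and submits it with the job,
    // so the mapper and reducer classes can be resolved on the remote TaskTrackers.
    job.setJarByClass(MyJobDriver.class);
    // ... set mapper, reducer, input and output paths here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}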

And finally we only have to make sure that we have access to the right HDFS location. In order to get that up and working we have to:
Tell our job to connect to the right HDFS location:


config.set("fs.default.name", "hdfs://192.168.1.5:54310");

Be sure that the user we’re using to run the job exists and has the right permissions to access the desired location. In order to achieve this we should set the following properties on the cluster (the first typically in mapred-site.xml, the second in hdfs-site.xml):


<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/tmp</value>
</property>


<property>
<name>dfs.permissions.supergroup</name>
<value>hadoop</value>
</property>
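
As a quick sanity check (just a sketch; the HDFS path below is a hypothetical example), once the user and permissions are in place, listing the job’s input directory from the client machine should work before submitting anything:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessCheck {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    config.set("fs.default.name", "hdfs://192.168.1.1:54310");

    // If the remote user exists and has the right permissions this listing
    // succeeds; a permissions problem will surface here as an exception.
    FileSystem fs = FileSystem.get(config);
    for (FileStatus status : fs.listStatus(new Path("/user/someuser/input"))) {
      System.out.println(status.getPath());
    }
  }
}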

And run jobs from “anywhere”! ;-)

