Pere Urbon-Bayes

Software craftsman and Internet inhabitant


How to submit a job to a remote JobTracker

One of the first things you start thinking about after your very first Hadoop jobs is: what should I do to submit a job to a remote JobTracker? Otherwise you’re restricted to running your jobs as the same user and on the same machine where your JobTracker runs.

So here are a few hints on how to do it.

The JobTracker should not listen only for localhost connections. To change that, we’ll update the mapred-site.xml file, like this:

<property>
<name>mapred.job.tracker</name>
<value> </value>
</property>

After that, the JobTracker at this location starts accepting connections on this port.
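A quick way to confirm the change from a client machine is to try opening a TCP connection to the address configured in mapred.job.tracker. The following is only a minimal sketch: the function name `can_connect` is made up for illustration, and it only tells you that something is listening on that port, not that it is really a JobTracker.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    host and port are whatever you configured in mapred.job.tracker;
    nothing here is specific to Hadoop.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

From the remote machine you would call something like `can_connect("jobtracker.example.com", 8021)`, where both the host name and the port are hypothetical placeholders for your own values. A result of False usually means the JobTracker is still bound to localhost or that a firewall sits in between.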

You obviously have to update all references to localhost as well, for example the ones in the masters or slaves files.

The next thing to update is the core-site.xml file with


Continue reading →

Hadoop Benchmarking - TeraSort and others

Another important kind of benchmark looks at how the cluster behaves while a wave of MapReduce jobs is running. For that, people usually rely on internal use cases or on the traditional TeraSort benchmark, highly inspired by one of the performance tests proposed by Google in its original paper.

Let’s talk a bit about the TeraSort benchmark.

Sort benchmarking

As one of the “hard” points of MapReduce, sorting is used as a model for benchmarking operations. Although most people use enormous amounts of data for this benchmark, it’s important to know you can adapt it to your needs.

Some things this benchmark can help you set up are:

  • The computer resources for your TaskTrackers, like RAM and cores.
  • Sorting (io.sort.mb) and JVM parameters, etc.
  • The scheduler configuration.
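Adapting the benchmark to your needs mostly means choosing how much data to sort. The teragen utility takes a number of rows rather than a size in bytes, and each row it writes is 100 bytes long, so a small helper can do the conversion. This is just a sketch: the name `teragen_rows` and the sizes below are illustrative choices, and sizes are taken as decimal units (1 GB = 10^9 bytes).

```python
TERAGEN_ROW_BYTES = 100  # teragen writes fixed-size rows of 100 bytes


def teragen_rows(target_bytes: int) -> int:
    """Number of whole 100-byte rows that fit in target_bytes of input data."""
    if target_bytes < 0:
        raise ValueError("target_bytes must not be negative")
    return target_bytes // TERAGEN_ROW_BYTES


GB = 10**9  # decimal gigabyte, an assumption of this sketch

# Example data volumes for a small, a medium and a full-size run
for size_gb in (1, 100, 1000):
    print(f"{size_gb} GB -> {teragen_rows(size_gb * GB)} rows")
```

The classic one terabyte run corresponds to 10,000,000,000 rows; a test cluster can start with a far smaller number and scale up once the configuration looks reasonable.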

The phases to run this are:

  • Generate the data using the teragen utility (only

Continue reading →

Benchmarking a Hadoop Cluster

HDFS benchmarking

In order to benchmark HDFS we’re mainly interested in the performance of read and write operations, and to achieve that Apache Hadoop gives us the TestDFSIO tool. Later on we’re also going to talk about doing something more than just an I/O benchmark, running things like TeraSort/PI/etc.


This is a tool provided by Apache to test HDFS performance. You will find the source under $HADOOP_HOME/src/test/org/apache/hadoop/fs/ if you are interested in checking exactly what it does.

Basically this test gives you I/O performance information by creating files and reading from them using a 1:1 mapping, so one map per file. This way we’re able to see the performance for different file sizes, numbers of files, etc.
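To get a feeling for the numbers such a run produces, here is a small sketch of how per-file measurements can be folded into two summary figures: an overall throughput (total megabytes over total task time) and an average I/O rate (mean of the per-file rates). As far as I know this matches how TestDFSIO aggregates its results, but treat that as an assumption; the `FileResult` type and the `summarize` function are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileResult:
    """Measurement for one file, i.e. for one map task (hypothetical type)."""
    size_mb: float   # megabytes written or read
    seconds: float   # time that map task spent on I/O


def summarize(results: List[FileResult]) -> Dict[str, float]:
    """Aggregate per-file measurements into two summary rates (MB/s)."""
    if not results:
        raise ValueError("need at least one result")
    total_mb = sum(r.size_mb for r in results)
    total_seconds = sum(r.seconds for r in results)
    per_file_rates = [r.size_mb / r.seconds for r in results]
    return {
        # Total data over the summed task times
        "throughput_mb_s": total_mb / total_seconds,
        # Plain mean of each file's own rate
        "avg_io_rate_mb_s": sum(per_file_rates) / len(per_file_rates),
    }


# Two files of 100 MB each, one written twice as fast as the other
print(summarize([FileResult(100, 10), FileResult(100, 20)]))
```

The two figures diverge when some maps are much slower than others, which is a useful hint that a few nodes or disks are dragging the cluster down.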

After each execution, either write or read, a report is created under /benchmarks/TestDFSIO/, keep in mind after each

Continue reading →