Hadoop Benchmarking - TeraSort and others

Another important kind of benchmark measures the behaviour of the cluster during a MapReduce wave execution. For that, people usually rely either on internal use cases or on the traditional TeraSort benchmark, highly inspired by one of the performance tests proposed by Google in its original MapReduce paper.

Let's talk a bit about the TeraSort benchmark.

Sort benchmarking

As one of the “hard” points of MapReduce, sorting is often used as a model for benchmarking operations. Although most people use enormous amounts of data for this benchmark, it's important to know you can adapt it to your needs.

This benchmark can help you tune several aspects of your cluster setup.

The phases to run it are data generation, the sort itself, and validation of the output. So let's see how to run each one.

First, data generation; the argument is the number of 100-byte rows to produce, in this case 1,000,000:

$ hadoop jar hadoop-examples.jar teragen 1000000 /data/hadoop/terasort-input
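Since TeraGen's first argument is a row count and each row is a 100-byte record, it is easy to size a run for your cluster before launching it. A quick sketch of the arithmetic:

```shell
# Each TeraGen row is a 100-byte record, so rows -> bytes is a multiplication.
ROWS=1000000
BYTES=$((ROWS * 100))

# 1,000,000 rows is therefore about 100 MB -- small, but enough for a first run.
echo "teragen ${ROWS} rows => $((BYTES / 1000000)) MB"
# -> teragen 1000000 rows => 100 MB
```

Scale `ROWS` up until the job is large enough to keep every node in the cluster busy.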

Then, how to actually run the sort job:

$ hadoop jar hadoop-examples.jar terasort /data/hadoop/terasort-input /data/hadoop/terasort-output

And last, how to validate the previous operation:

$ hadoop jar hadoop-examples.jar teravalidate /data/hadoop/terasort-output /data/hadoop/terasort-validate

But in this case the jobs do not provide us with any benchmark-friendly output, so we must gather our own, for example using the job history API provided by Hadoop.

So, to see how the sort job performed:

$ hadoop job -history all /data/hadoop/terasort-output

In this case the output is quite large, but the interesting variables are all the FileSystemCounters and some of the Job Counters. You can also gather memory usage and the time spent in the different parts of the system.
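Since the full history dump is so long, it helps to filter it down to the counters you care about. A minimal sketch, using a made-up sample of counter lines (the real `hadoop job -history all` output is far longer and its exact layout varies by Hadoop version):

```shell
# Hypothetical sample of counter lines from a job history dump; the real
# output is much longer, but it contains rows naming each counter group.
cat > /tmp/terasort-history.txt <<'EOF'
FileSystemCounters  HDFS_BYTES_READ     100,000,000
FileSystemCounters  HDFS_BYTES_WRITTEN  100,000,000
Job Counters        Launched map tasks  16
EOF

# Keep only the FileSystemCounters rows; pipe the real dump through the same grep.
grep 'FileSystemCounters' /tmp/terasort-history.txt
```

The same pattern works for any counter group you want to track across runs.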


NNBench

The master piece of HDFS is the NameNode, which basically takes care of maintaining and providing a uniform view of the filesystem, dealing with tasks such as keeping the file organization, locating files, etc. With this benchmark we can load test the NameNode configuration.

An example of how to run this benchmark is:

$ hadoop jar hadoop-test.jar nnbench -operation create_write \
-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
-replicationFactorPerFile 3 -readFileAfterOpen true \
-baseDir /benchmarks/NNBench-$(hostname -s)

This creates a set of 1000 files using 12 maps and 6 reducers. There are other options that one can check using the -help flag.

The output for that test looks like:

hdfs.NNBench: -------------- NNBench -------------- :
hdfs.NNBench: Version: NameNode Benchmark 0.4
hdfs.NNBench: Date & time: 2013-11-29 16:41:52,417
hdfs.NNBench: Test Operation: create_write
hdfs.NNBench: Start time: 2013-11-29 16:40:22,49
hdfs.NNBench: Maps to run: 12
hdfs.NNBench: Reduces to run: 6
hdfs.NNBench: Block Size (bytes): 1
hdfs.NNBench: Bytes to write: 0
hdfs.NNBench: Bytes per checksum: 1
hdfs.NNBench: Number of files: 1000
hdfs.NNBench: Replication factor: 3
hdfs.NNBench: Successful file operations: 0
hdfs.NNBench: # maps that missed the barrier: 0
hdfs.NNBench: # exceptions: 0
hdfs.NNBench: TPS: Create/Write/Close: 0
hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 0.0
hdfs.NNBench: Avg Lat (ms): Create/Write: Infinity
hdfs.NNBench: Avg Lat (ms): Close: NaN
hdfs.NNBench: RAW DATA: AL Total #1: 66511
hdfs.NNBench: RAW DATA: AL Total #2: 0
hdfs.NNBench: RAW DATA: TPS Total (ms): 0
hdfs.NNBench: RAW DATA: Longest Map Time (ms): 0.0
hdfs.NNBench: RAW DATA: Late maps: 0
hdfs.NNBench: RAW DATA: # of exceptions: 0

Here we can see lots of interesting data that will help us understand, and iterate on, the NameNode performance where possible.
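create_write is only one of the operations NNBench supports; open_read, rename, and delete exercise other NameNode code paths. A sketch that loops over all four, written as a dry run that just prints each command (remove the leading `echo` to actually execute them):

```shell
# Dry run: print one nnbench invocation per operation. Remove `echo` to execute.
for OP in create_write open_read rename delete; do
  echo hadoop jar hadoop-test.jar nnbench -operation "$OP" \
    -maps 12 -reduces 6 -numberOfFiles 1000 \
    -baseDir "/benchmarks/NNBench-$(hostname -s)"
done
```

Running the operations in this order matters: open_read, rename, and delete need the files that create_write produced.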


MRBench

With the TeraSort benchmark we used a very large job in order to see how it performed, but sometimes it is also interesting to see how small jobs are doing. In that case, MRBench is our tool.

An example of how to run it is:

$ hadoop jar hadoop-test.jar mrbench -numRuns 50

This loops the same test over 50 rounds. There are other options that one can check using the -help flag.

It produces output like this:

DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 19815

Here you see the average time per job and the number of mappers and reducers used; keep in mind you can change these.
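If you run MRBench periodically, extracting that average time makes it easy to track small-job latency over time. A minimal sketch, parsing a captured copy of the two-line table above with awk:

```shell
# Hypothetical capture of the MRBench summary table shown above.
cat > /tmp/mrbench.txt <<'EOF'
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 19815
EOF

# The average job time is the last field of the data row.
awk 'NR == 2 { print $NF }' /tmp/mrbench.txt   # -> 19815
```

Feed the number into whatever you use for trending (a spreadsheet, Graphite, etc.) and rerun after each configuration change.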

Other options and links

There are other options that mostly use the same strategy for benchmarking.

As with the previous article, this recap would not have been possible without the articles written by the awesome Hadoop community, and especially Michael Noll.


