Hadoop Benchmarking - TeraSort and others
Another important kind of benchmark measures the behaviour of a MapReduce wave as it executes on your cluster. For that, people usually rely on internal use cases or on the traditional TeraSort benchmark, heavily inspired by one of the performance tests proposed by Google in its original MapReduce paper.
Let's talk a bit about the TeraSort benchmark
Since sorting is one of the "hard" parts of MapReduce, it is used as a model for benchmarking operations. Although most people use enormous amounts of data for this benchmark, it's important to know that you can adapt it to your needs.
Some things this benchmark can help you tune are:
- The compute resources for your TaskTrackers, like RAM and cores.
- Sorting (io.sort.mb) and JVM (mapred.child.java.opts) parameters, etc.
- The scheduler configuration.
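For instance, the sort buffer and the child-JVM heap can be set in mapred-site.xml; a hypothetical starting point (the values here are illustrative, not recommendations — tune them against your own TeraSort runs):

```xml
<!-- mapred-site.xml fragment: illustrative values only -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note that io.sort.mb must fit comfortably inside the heap given by mapred.child.java.opts, or map tasks will fail with out-of-memory errors.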
The phases to run this are:
- Generating the data with the teragen utility (only when needed).
- Running the sorting operation (TeraSort).
- Validating the output (TeraValidate).
So let's see how to run it:
Data generation, in this case 1,000,000 rows (100 bytes each) to be sorted.
$ hadoop jar hadoop-examples.jar teragen 1000000 /data/hadoop/terasort-input
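Since each row teragen writes is 100 bytes, you can translate a target data size into the row count the utility expects. A quick sketch, using a hypothetical 1 TB target:

```shell
# teragen rows are 100 bytes each, so rows = target size / 100.
target_bytes=$((1000 * 1000 * 1000 * 1000))  # hypothetical 1 TB target
rows=$((target_bytes / 100))
echo "$rows"  # pass this count as the first argument to teragen
```

This is how the canonical "terabyte sort" arrives at its 10,000,000,000-row input.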
How to actually run the sort job:
$ hadoop jar hadoop-examples.jar terasort /data/hadoop/terasort-input /data/hadoop/terasort-output
And finally, how to validate the output of the previous step:
$ hadoop jar hadoop-examples.jar teravalidate /data/hadoop/terasort-output /data/hadoop/terasort-validate
But in this case the jobs don't give us any benchmark-friendly output directly, so we must obtain our own, for example through the job history facility provided by Hadoop.
So, to see how the sort job performed:
$ hadoop job -history all /data/hadoop/terasort-output
In this case the output is rather large, but the interesting variables are all the FileSystemCounters and some of the Job Counters. You can also extract memory usage and the time spent in different parts of the system.
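If you save that history dump to a file, a simple grep is enough to pull the byte counters out for comparing runs. A minimal sketch over a hypothetical excerpt of such a dump (the counter names follow the FileSystemCounters group; the values and layout here are made up for illustration):

```shell
# Hypothetical excerpt of a saved history dump.
cat > terasort-history.txt <<'EOF'
FileSystemCounters
  HDFS_BYTES_READ=100000000
  FILE_BYTES_WRITTEN=229900000
  HDFS_BYTES_WRITTEN=100000000
Job Counters
  Launched map tasks=4
  Launched reduce tasks=2
EOF

# Keep only the byte counters, one per line, ready to diff between runs.
grep -E 'BYTES_(READ|WRITTEN)' terasort-history.txt
```

Saving one such filtered file per configuration change makes it easy to see how much spill (FILE_BYTES_WRITTEN) each tuning iteration removes.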
The master component of HDFS is the NameNode, which basically takes care of maintaining and providing a uniform view of the filesystem, dealing with tasks such as keeping the file organization, locating files, etc. With the NNBench benchmark we can load test the NameNode configuration.
An example on how to run this benchmark is:
$ hadoop jar hadoop-test.jar nnbench -operation create_write \
-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
-replicationFactorPerFile 3 -readFileAfterOpen true
This creates a set of 1000 files using 12 maps and 6 reducers. There are other options that one can check with the -help flag.
The output for that test looks like:
hdfs.NNBench: -------------- NNBench -------------- :
hdfs.NNBench: Version: NameNode Benchmark 0.4
hdfs.NNBench: Date & time: 2013-11-29 16:41:52,417
hdfs.NNBench: Test Operation: create_write
hdfs.NNBench: Start time: 2013-11-29 16:40:22,49
hdfs.NNBench: Maps to run: 12
hdfs.NNBench: Reduces to run: 6
hdfs.NNBench: Block Size (bytes): 1
hdfs.NNBench: Bytes to write: 0
hdfs.NNBench: Bytes per checksum: 1
hdfs.NNBench: Number of files: 1000
hdfs.NNBench: Replication factor: 3
hdfs.NNBench: Successful file operations: 0
hdfs.NNBench: # maps that missed the barrier: 0
hdfs.NNBench: # exceptions: 0
hdfs.NNBench: TPS: Create/Write/Close: 0
hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 0.0
hdfs.NNBench: Avg Lat (ms): Create/Write: Infinity
hdfs.NNBench: Avg Lat (ms): Close: NaN
hdfs.NNBench: RAW DATA: AL Total #1: 66511
hdfs.NNBench: RAW DATA: AL Total #2: 0
hdfs.NNBench: RAW DATA: TPS Total (ms): 0
hdfs.NNBench: RAW DATA: Longest Map Time (ms): 0.0
hdfs.NNBench: RAW DATA: Late maps: 0
hdfs.NNBench: RAW DATA: # of exceptions: 0
Here we can see lots of interesting data that will help us understand, and iterate to improve, the NameNode performance where possible.
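The TPS figure is, roughly, successful operations divided by elapsed time, so it is easy to sanity-check by hand. A small sketch with hypothetical numbers (1000 successful create/write/close operations over a 90-second run; the real NNBench computation may differ in detail):

```shell
ops=1000          # hypothetical successful Create/Write/Close operations
elapsed_ms=90000  # hypothetical run duration in milliseconds
# TPS = operations / elapsed seconds
awk -v ops="$ops" -v ms="$elapsed_ms" 'BEGIN { printf "TPS: %.2f\n", ops / (ms / 1000) }'
# prints: TPS: 11.11
```

In the sample output above TPS is 0 (and the latencies are Infinity/NaN) simply because no file operation succeeded, which is itself a useful signal to investigate.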
With the TeraSort benchmark we used a very large job to see how the cluster performs, but sometimes it is also interesting to see how small jobs are doing. In that case MRBench is our tool.
An example on how to run that is:
$ hadoop jar hadoop-test.jar mrbench -numRuns 50
This loops the same small test over 50 rounds. There are other options that one can check with the -help flag.
It provides an expected output like this:
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 19815
Here you see the average time per job, plus the number of mappers and reducers used; keep in mind you can change these.
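Since the reported figure is an average per run, it also lets you budget how long a larger sweep will take. A trivial sketch using the numbers above (assuming runs execute back to back):

```shell
num_runs=50   # rounds passed via -numRuns
avg_ms=19815  # average job time reported by MRBench
total_s=$(( num_runs * avg_ms / 1000 ))
echo "~${total_s}s total"  # rough wall-clock budget for the whole sweep
```

At ~20 seconds per small job, 50 rounds means roughly a quarter of an hour; a healthy chunk of that per-job time is scheduling and task-startup overhead, which is exactly what MRBench is exercising.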
Other options and links
There are other options, like:
- https://github.com/intel-hadoop/HiBench : Intel's benchmarking toolset, which also includes some data mining and other use-case examples. Worth trying if it fits a target use case of yours.
- http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/pi/package-summary.html : Another example commonly used to benchmark Hadoop is the PI computation; however, one must take into account that it is focused purely on computation and involves almost no I/O.
- https://www.vmware.com/files/pdf/VMW-Hadoop-Performance-vSphere5.pdf : An interesting white paper on how to deal with Hadoop performance in a VMware environment, which mostly uses the same benchmarking strategy and additionally takes some VM configuration options into account.
As with the previous article, this recap would not have been possible without the articles written by the awesome Hadoop community, and especially Michael Noll.