Benchmarking a Hadoop Cluster

HDFS benchmarking

To benchmark HDFS we are mainly interested in the performance of read and write operations, and to achieve that Apache Hadoop gives us the TestDFSIO tool. Later on we're also going to talk about doing something more than just an I/O benchmark, running things like TeraSort, Pi, etc.

TestDFSIO

This is a tool provided by Apache to test HDFS performance. You will find the source under $HADOOP_HOME/src/test/org/apache/hadoop/fs/TestDFSIO.java if you are interested in checking what exactly it does.

Basically this test gives you I/O performance information by creating files and then reading them back, using a 1:1 mapping, so one map task per file. This way we're able to see the performance for different file sizes, numbers of files, etc.

After each execution, whether write or read, a report is created under /benchmarks/TestDFSIO/; keep in mind that this data is overwritten by the next run (it can also be removed explicitly with TestDFSIO's -clean switch).
Write ops
You can run the write benchmark with:

$ hadoop jar hadoop-*test.jar TestDFSIO -write -nrFiles 10 -fileSize 100

creating 10 files of 100 MB each, producing an output like:
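As a quick sanity check, the total data the job should move is simply -nrFiles times -fileSize; a tiny shell sketch using the flag values from the command above:

```shell
# Sanity check: total MB a TestDFSIO write run should process.
NR_FILES=10    # value passed to -nrFiles
FILE_SIZE=100  # value passed to -fileSize (MB per file)

TOTAL_MB=$((NR_FILES * FILE_SIZE))
echo "Total MBytes processed should be: $TOTAL_MB"
# prints: Total MBytes processed should be: 1000
```

This matches the "Total MBytes processed: 1000" line in the report below.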

13/11/29 12:43:42 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/11/29 12:43:42 INFO fs.TestDFSIO: Date & time: Fri Nov 29 12:43:42 CET 2013
13/11/29 12:43:42 INFO fs.TestDFSIO: Number of files: 10
13/11/29 12:43:42 INFO fs.TestDFSIO: Total MBytes processed: 1000
13/11/29 12:43:42 INFO fs.TestDFSIO: Throughput mb/sec: 87.27526618956188
13/11/29 12:43:42 INFO fs.TestDFSIO: Average IO rate mb/sec: 89.21245574951172
13/11/29 12:43:42 INFO fs.TestDFSIO: IO rate std deviation: 13.1585896405699
13/11/29 12:43:42 INFO fs.TestDFSIO: Test exec time sec: 28.658
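If you run the benchmark repeatedly, it's handy to pull the headline numbers out of the log automatically. A small grep/awk sketch over lines like the ones above (the sample lines are embedded here for illustration; in practice you would pipe in TestDFSIO's output or its local results file):

```shell
# Extract throughput and average IO rate from TestDFSIO log lines.
# Sample lines embedded for illustration only.
LOG='13/11/29 12:43:42 INFO fs.TestDFSIO: Throughput mb/sec: 87.27526618956188
13/11/29 12:43:42 INFO fs.TestDFSIO: Average IO rate mb/sec: 89.21245574951172'

# Split on ": " (colon-space) so the timestamp colons are left alone;
# the metric value is then the last field.
THROUGHPUT=$(printf '%s\n' "$LOG" | awk -F': ' '/Throughput/ {print $NF}')
AVG_RATE=$(printf '%s\n' "$LOG" | awk -F': ' '/Average IO rate/ {print $NF}')

echo "throughput=$THROUGHPUT avg_rate=$AVG_RATE"
```

The same pattern works for any of the report lines, since they all share the "label: value" layout.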

Read ops

To run the read benchmark, simply replace -write with -read, using a command like:

$ hadoop jar hadoop-*test.jar TestDFSIO -read -nrFiles 10 -fileSize 100

Note that prior to the read benchmark, the files must first have been created with the write benchmark.

A detailed output of the read operations is:

13/11/29 15:08:30 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/11/29 15:08:30 INFO fs.TestDFSIO: Date & time: Fri Nov 29 15:08:30 CET 2013
13/11/29 15:08:30 INFO fs.TestDFSIO: Number of files: 10
13/11/29 15:08:30 INFO fs.TestDFSIO: Total MBytes processed: 1000
13/11/29 15:08:30 INFO fs.TestDFSIO: Throughput mb/sec: 157.0598397989634
13/11/29 15:08:30 INFO fs.TestDFSIO: Average IO rate mb/sec: 157.7977752685547
13/11/29 15:08:30 INFO fs.TestDFSIO: IO rate std deviation: 11.239874400753356
13/11/29 15:08:30 INFO fs.TestDFSIO: Test exec time sec: 27.571

As we can see, both operations give us a set of very interesting values, so let's see what they actually mean:

- Throughput mb/sec: the total data processed divided by the combined running time of all map tasks, i.e. an aggregate figure for the whole job.
- Average IO rate mb/sec: the arithmetic mean of the individual per-file (per-map) rates.
- IO rate std deviation: how much the per-file rates vary; a high value usually points to slow or overloaded nodes or disks.
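To make the difference between Throughput and Average IO rate concrete, here is a toy awk calculation with two hypothetical map tasks (the sizes and times are made up): Throughput is total MB over total task seconds, while Average IO rate is the mean of the per-task rates, which is why the two numbers in the reports above differ slightly.

```shell
# Toy illustration (made-up numbers): two map tasks, 100 MB each,
# one finishing in 10 s (10 MB/s) and one in 12.5 s (8 MB/s).
awk 'BEGIN {
  size1 = 100; time1 = 10.0
  size2 = 100; time2 = 12.5

  # Aggregate throughput: total data / total task time
  throughput = (size1 + size2) / (time1 + time2)

  # Average IO rate: mean of the individual task rates
  avg_rate = ((size1 / time1) + (size2 / time2)) / 2

  printf "throughput=%.4f avg_rate=%.4f\n", throughput, avg_rate
}'
# prints: throughput=8.8889 avg_rate=9.0000
```

The average rate is always pulled toward the faster tasks, while the aggregate throughput is dominated by the slowest ones, which is exactly why the standard deviation is worth watching.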

Two other interesting things, not measured here, are what happens when there are more file chunks than map slots available to process them (so we're in a concurrent scenario), and the effect of the HDFS configuration and replication factor: you will get a completely different set of results when there is more network usage involved.

We've seen how to get a test of our HDFS I/O performance up and running. To keep this from getting too long while staying informative, I plan to split it into separate editions; the next one will be about doing a more complete benchmark, including jobs like Sort, Pi, etc. that exercise the usual operations in Hadoop.

I must say this recap would have been impossible without the nice articles from Michael Noll and others; I'm just a newbie on the topic.

 
