The "Queries Per Second Benchmark" allows you to get a quick overview of the throughput and latency characteristics of grpc.
To build the benchmark type
$ ./gradlew :grpc-benchmarks:installDist
from the grpc-java directory.
You can now find the client and the server executables in benchmarks/build/install/grpc-benchmarks/bin
.
The C++
counterpart can be found at https://github.com/grpc/grpc/tree/master/test/cpp/qps
The netty benchmark directory contains the standard benchmarks used to assess the performance of GRPC. Since these benchmarks run on localhost over loopback the performance of the underlying network is considerably different to real networks and their behavior. To address this issue we recommend the use of a network emulator to make loopback behave more like a real network. To this end the benchmark code looks for a loopback interface with 'benchmark' in its name and attempts to use the address bound to that interface when creating the client and server. If it cannot find such an interface it will print a warning and continue with the default localhost address.
To attempt to standardize benchmark behavior across machines we attempt to emulate a 10gbit ethernet interface with a packet delay of 0.1ms.
On Linux we can use netem to shape the traffic appropriately.
# Remove all traffic shaping from loopback
sudo tc qdisc del dev lo root
# Add a priority traffic class to the root of loopback
sudo tc qdisc add dev lo root handle 1: prio
# Add a qdisc under the new class with the appropriate shaping
sudo tc qdisc add dev lo parent 1:1 handle 2: netem delay 0.1ms rate 10gbit
# Add a filter which selects the new qdisc class for traffic to 127.127.127.127
sudo tc filter add dev lo parent 1:0 protocol ip prio 1 u32 match ip dst 127.127.127.127 flowid 2:1
# Create an interface alias call 'lo:benchmark' that maps 127.127.127.127 to loopback
sudo ip addr add dev lo 127.127.127.127/32 label lo:benchmark
to remove this configuration
sudo tc qdisc del dev lo root
sudo ip addr del dev lo 127.127.127.127/32 label lo:benchmark
Contributions are welcome!
The QPS client comes with the option --save_histogram=FILE
, if set it serializes the histogram to FILE
which can then be used with a plotter to visualize the latency distribution. The histogram is stored in the file format of HdrHistogram. That way it can be plotted very easily using a browser based tool like https://hdrhistogram.github.io/HdrHistogram/plotFiles.html. Simply upload the generated file and it will generate a beautiful graph for you. It also allows you to plot two or more histograms on the same surface in order two easily compare latency distributions.
When running a benchmark it's often useful to adjust some JVM options to improve performance and to gain some insights into what's happening. Passing JVM options to the QPS server and client is as easy as setting the JAVA_OPTS
environment variables. Below are some options that I find very useful:
-Xms
gives a lower bound on the heap to allocate and-Xmx
gives an upper bound. If your program uses more than what's specified in-Xmx
the JVM will exit with anOutOfMemoryError
. When setting those always setXms
andXmx
to the same value. The reason for this is that the young and old generation are sized according to the total available heap space. So if the total heap gets resized, they will also have to be resized and this will then trigger a full GC.-verbose:gc
prints some basic information about garbage collection. It will log to stdout whenever a GC happend and will tell you about the kind of GC, pause time and memory compaction.-XX:+PrintGCDetails
prints out very detailed GC and heap usage information before the program terminates.-XX:-HeapDumpOnOutOfMemoryError
and-XX:HeapDumpPath=path
when you are pushing the JVM hard it sometimes happens that it will crash due to the lack of available heap space. This option will allow you to dive into the details of why it happened. The heap dump can be viewed with e.g. the Eclipse Memory Analyzer.-XX:+PrintCompilation
will give you a detailed overview of what gets compiled, when it gets compiled, by which HotSpot compiler it gets compiled and such. It's a lot of output. I usually just redirect it to file and look at it withless
andgrep
.-XX:+PrintInlining
will give you a detailed overview of what gets inlined and why some methods didn't get inlined. The output is very verbose and like-XX:+PrintCompilation
and useful to look at after some major changes or when a drop in performance occurs.- It sometimes happens that a benchmark just doesn't make any progress, that is no bytes are transferred over the network, there is hardly any CPU utilization and low memory usage but the benchmark is still running. In that case it's useful to get a thread dump and see what's going on. HotSpot ships with a tool called
jps
andjstack
.jps
tells you the process id of all running JVMs on the machine, which you can then pass tojstack
and it will print a thread dump of this JVM. - Taking a heap dump of a running JVM is similarly straightforward. First get the process id with
jps
and then usejmap
to take the heap dump. You will almost always want to run it with-dump:live
in order to only dump live objects. If possible, try to size the heap of your JVM (-Xmx
) as small as possible in order to also keep the heap dump small. Large heap dumps are very painful and slow to analyze.
Newer JVMs come with a built-in profiler called Java Flight Recorder
. It's an excellent profiler and it can be used to start a recording directly on the command line, from within Java Mission Control
or
with jcmd.
A good introduction on how it works and how to use it are http://hirt.se/blog/?p=364 and http://hirt.se/blog/?p=370.