Wednesday, May 22, 2013

Java GC tuning for High Frequency Trading apps

I have been interested in Java performance for some time. This time we will look not so much at Java code performance as at the performance of the garbage collection process. I was inspired by a lecture at the Warsaw JUG about HotSpot in low-latency Java. The lecturer, Wojciech Kudla, prepared a small application that simulates an HFT system from the GC point of view. All the source code, together with the great GC Viewer plugin for JVisualVM, is available on GitHub: heaptrasher.

In a nutshell, heaptrasher is an application that generates a lot of garbage for the GC to operate on. In addition, it contains code that produces a histogram of latency statistics. These statistics can be collected in two different ways; the important difference is that the first one (array) allocates a lot of memory in the Old Generation space and the second (direct) does not. As I was to find out later, this has a considerable impact on GC performance.
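I am not reproducing heaptrasher's code here, but to make the setup concrete, below is a minimal sketch of the general idea (the class name, constants and allocation sizes are mine, not taken from the project): each "message" allocates some short-lived garbage, its processing time is recorded in nanoseconds, and latency percentiles are printed at the end.

import java.util.Arrays;

public class GcPressureSketch {

    private static final int MESSAGES = 1_000_000;
    private static long sink; // keeps the JIT from optimising the allocations away

    public static void main(String[] args) {
        long[] latencies = new long[MESSAGES];
        for (int i = 0; i < MESSAGES; i++) {
            long start = System.nanoTime();
            sink += processMessage();              // allocate garbage, measure the cost
            latencies[i] = System.nanoTime() - start;
        }
        Arrays.sort(latencies);
        for (double p : new double[]{50.0, 90.0, 99.0, 99.9, 99.99}) {
            int idx = (int) Math.min(MESSAGES - 1, Math.round(MESSAGES * p / 100.0));
            System.out.printf("%.4f%% of message latency is less than: %d nanos%n", p, latencies[idx]);
        }
    }

    // Simulated message processing: a burst of short-lived allocations.
    private static long processMessage() {
        long sum = 0;
        for (int i = 0; i < 32; i++) {
            byte[] chunk = new byte[256];
            sum += chunk.length;
        }
        return sum;
    }
}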

All charts below come from JVisualVM with the mentioned GC Viewer and VisualGC plugins.

Test platform

  • CPU: i7-3612QM
  • RAM: 8GB
  • OS: Windows 7 Ultimate
  • JDK: 1.7.0_21

A common feature of all the tests is the use of the parallel collector for Young Generation collections. The choice of collector for the Old Generation doesn't matter, because in this scenario an Old Generation collection never happens.
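Whether the Old Generation really stays untouched can be verified without any extra tooling. A small utility of my own (not part of heaptrasher) that can be called at the end of a run, using the standard GarbageCollectorMXBean API; with -XX:+UseParallelGC the young collector reports itself as "PS Scavenge" and the old one as "PS MarkSweep", and the old generation count should stay at zero:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public final class GcCounters {
    // Prints how many collections each collector has performed and their total time.
    public static void dump() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}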

First test

Description

4GB heap, array as statistics data holder

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main array

Garbage Collector pauses duration

Garbage Collections times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 4888 nanos
 99,9000% of message latency is less than: 11241 nanos
 99,9900% of message latency is less than: 19549 nanos
 99,9990% of message latency is less than: 75263 nanos
 99,9999% of message latency is less than: 3480174 nanos
 

Summary

6,5 ms pauses? That isn't low latency :(. The graph is also very irregular. 99,9% of messages were processed in under 11241 nanos.
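The pause times above come from the GC Viewer plugin, but the same data can also be captured to a file with the standard HotSpot logging flags, for example (the log file name is just an example):

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:gc.log Main array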

Second test

Description

4GB heap, direct statistics data collection

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:+UseParallelGC Main direct

Garbage Collector pauses duration

Garbage Collections times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 2933 nanos
 99,0000% of message latency is less than: 4887 nanos
 99,9000% of message latency is less than: 11240 nanos
 99,9900% of message latency is less than: 19060 nanos
 99,9990% of message latency is less than: 82104 nanos
 99,9999% of message latency is less than: 762892 nanos
 

Summary

Hmmm, much better, isn't it? It appears that the Old Generation is the major enemy of low-latency apps... The graph is as irregular as in the first case. The change did not affect the processing time of 99,9% of messages.
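I am not quoting heaptrasher's implementation here; the sketch below only illustrates the kind of difference between the two modes. The array-style holder keeps every sample alive for the whole run, so a large block of data (and, if the samples are objects, the samples themselves) ends up promoted to the Old Generation, while the direct-style holder keeps only a small fixed-size histogram, leaving the young collections with almost no long-lived data to deal with. All sizes below are chosen purely for illustration.

// "array" style: every sample is retained for the whole run.
class ArrayStyleStats {
    private final Long[] samples = new Long[10_000_000]; // large, long-lived, promoted to Old Generation
    private int count;
    void record(long nanos) { samples[count++] = nanos; } // autoboxing: one object per sample
}

// "direct" style: only a tiny, fixed-size histogram stays alive.
class DirectStyleStats {
    private final long[] buckets = new long[64];
    void record(long nanos) {
        buckets[63 - Long.numberOfLeadingZeros(Math.max(1L, nanos))]++; // power-of-two bucket
    }
}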

Third test

Description

4GB heap, direct statistics data collection, 3GB NewSize

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:+UseParallelGC Main direct

Garbage Collector pauses duration

Garbage Collections times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 4888 nanos
 99,9000% of message latency is less than: 11241 nanos
 99,9900% of message latency is less than: 20038 nanos
 99,9990% of message latency is less than: 69887 nanos
 99,9999% of message latency is less than: 652930 nanos
 

Summary

In general the change did not affect the duration of a single collection, but the total number of collections is more than two times lower. It is interesting that increasing the Young Generation space does not affect the duration of a single collection. This is because the app generates mostly short-lived objects, which means the GC has a relatively small graph of live objects to traverse and copy. Once again, 99,9% of messages are processed with the same latency.
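A rough back-of-the-envelope model of why this works (the standard copying-collector argument, not something from the talk):

 collection frequency ≈ allocation rate / eden size (bigger eden, fewer collections)
 pause per collection ≈ live bytes to copy / copying throughput (independent of eden size)

Because almost everything this app allocates dies young, the live set copied on each collection stays small no matter how large the eden is, so growing NewSize to 3GB mainly reduces how often collections happen.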

Fourth test

Description

4GB heap, direct statistics data collection, 3GB NewSize, number_of_gc_threads=cpu_core_number-1

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs Main direct

Garbage Collector pauses duration

Garbage Collections times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 5375 nanos
 99,9000% of message latency is less than: 11729 nanos
 99,9900% of message latency is less than: 21015 nanos
 99,9990% of message latency is less than: 70865 nanos
 99,9999% of message latency is less than: 612855 nanos
 

Summary

There isn't much change in GC pause times, although I must admit the chart is more regular. For the first time in a while there is also a change in the 99,9% message processing time.
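The number of GC threads the JVM would have chosen by default on this machine can be checked before overriding it, for example on Windows (assuming java is on the PATH):

java -XX:+PrintFlagsFinal -version | findstr ParallelGCThreads

On CPUs with up to eight hardware threads the parallel collector defaults to one GC thread per hardware thread, so -XX:ParallelGCThreads=7 presumably leaves one free for the application thread.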

Fifth test

Description

4GB heap, direct statistics data collection, 3GB NewSize, number_of_gc_threads=cpu_core_number-1 with ParGCCardsPerStrideChunk

Command line

java -XX:+UnlockDiagnosticVMOptions -Xmx4g -Xms4g -XX:MaxNewSize=3g -XX:NewSize=3g -XX:ParallelGCThreads=7 -XX:+UseGCTaskAffinity -XX:+BindGCTaskThreadsToCPUs -XX:ParGCCardsPerStrideChunk=32768 Main direct

Garbage Collector pauses duration

Garbage Collections times

Message latency

 50,0000% of message latency is less than: 1466 nanos
 75,0000% of message latency is less than: 2443 nanos
 90,0000% of message latency is less than: 3421 nanos
 99,0000% of message latency is less than: 5375 nanos
 99,9000% of message latency is less than: 10752 nanos
 99,9900% of message latency is less than: 21503 nanos
 99,9990% of message latency is less than: 76729 nanos
 99,9999% of message latency is less than: 629960 nanos
 

Summary

It looks almost the same as last time, but the 99,9% message processing time is significantly better.
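For context: ParGCCardsPerStrideChunk controls how many card-table cards a single GC worker thread claims at a time while scanning the old generation for references into the young one; a larger stride means fewer, bigger chunks of scanning work per thread. It is a diagnostic flag, which is why every command line here starts with -XX:+UnlockDiagnosticVMOptions. Its default can be checked the same way as before (Windows syntax again):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version | findstr ParGCCardsPerStrideChunk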

Summary

During the presentation the lecturer used a Linux-based platform, and there the individual switches seemed to have a bigger impact on collection times. A short summary of my results:

Test   | Collections | Total collection time | Avg. collection time | 99,90% latency less than | 99,99% latency less than
first  | 126         | 706,342 ms            | 5,60 ms              | 11241 nanos              | 19549 nanos
second | 126         | 92,002 ms             | 0,73 ms              | 11240 nanos              | 19560 nanos
fourth | 57          | 53,738 ms             | 0,94 ms              | 11729 nanos              | 21015 nanos
fifth  | 57          | 46,508 ms             | 0,81 ms              | 10752 nanos              | 21503 nanos

Warsaw JUG meeting


Thursday, May 9, 2013

What's next?

I have stopped working with Scala. It is a really powerful, amazing and flexible language, but right now I cannot find a use for it in my everyday work. That doesn't mean it was wasted time; I think learning it broadened my horizons significantly. What's next? For some time I have been interested in performance tuning in Java. It just so happens that there was a talk about this at a Warszawa JUG meeting recently. The speaker had prepared a program that simulates the load encountered in High Frequency Trading applications and showed how performance can be affected by various JVM parameters. The application code is available on GitHub. In the near future I'm going to try it out.
In addition, next week I'm going to Kraków for the GeeCON conference, and afterwards I'll describe my impressions of it here...