ApacheCon - Java Monitoring and Troubleshooting Tools in Action

Presented by Bill Au, of the Platform Infrastructure group at CNET

Bill is going to show us how to monitor and troubleshoot Java apps: thread and heap dumps, hung or slow apps, OutOfMemoryErrors, and JVM crashes. All of the tools he is going to show us are free, either open source or free to download.

NOTE: This was a very interesting session. I was especially impressed by his demonstrations of some tools I had not seen before: HP’s HPjmeter, an open source tool called Samurai, and a Perl script that he wrote. All of them look very helpful, and I want to try them all out (not that I have ever written any apps that have performance issues - but I’m sure I can help someone else look at theirs!)

Monitoring

Within JDK 5 (Sun’s), there is a java.lang.management package that has quite a few MXBeans (e.g. MemoryMXBean, RuntimeMXBean). These beans give you most of the information you might need to monitor the JVM. You can see sample code for how to use them in $JAVA_HOME/demo/management.
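
To give a feel for these beans, here is a minimal sketch. It is not taken from the talk or from the demo directory. It reads heap usage from MemoryMXBean, uptime from RuntimeMXBean, and accumulated garbage collection time from the GarbageCollectorMXBeans. The class and method names (JvmSnapshot, summary, totalGcMillis) are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.RuntimeMXBean;

// Hypothetical helper class; the names are not from the talk.
class JvmSnapshot {

    /** Total milliseconds spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Builds a one-line summary of JVM name, uptime, heap usage, and GC time. */
    static String summary() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return "jvm=" + runtime.getVmName()
                + " uptimeMs=" + runtime.getUptime()
                + " heapUsedBytes=" + heap.getUsed()
                + " heapMaxBytes=" + heap.getMax() // -1 when the maximum is undefined
                + " gcMs=" + totalGcMillis();
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Comparing gcMs against uptimeMs gives a rough idea of how much of the JVM's life has gone to garbage collection.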

Management tools

jinfo - for getting runtime info
jmap - for getting heap info and taking heap dumps
jstack - for thread dumps

The tools above can be run against a running JVM or a core dump file if the JVM crashed.

jstat - can be run against a running JVM from the command line to get monitoring information.
jconsole - a GUI for those not comfortable with the console. You must open a JMX port on the running JVM (in JDK 6 there is attach-on-demand functionality). jconsole also supports plugins; the example in $JAVA_HOME/demo/management/JTop is a very useful one.
Garbage collection logging - enable this because it is handy: -Xloggc: (followed by a log file name). It will give you the timestamp of each GC event and the heap size before and after.

Thread and Heap Dumps

kill -3 (or kill -SIGQUIT) will give you a thread dump
ThreadMXBean or jstack or jconsole can also give you thread dumps.

Bill recommends taking one dump, waiting five seconds, taking another, and repeating this several times. Once you have several dumps, you can compare where each thread is in each one to see which threads are stuck waiting on locks.
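
The "several dumps a few seconds apart" routine can also be done from inside the JVM. The sketch below is an assumption about how you might do it with ThreadMXBean, not code from the talk. It uses dumpAllThreads, which needs Java 6 or later. The class name, the dump count, and the output format are invented for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class; the names are not from the talk.
class PeriodicThreadDump {

    /** Returns one text dump of all live threads: name, state, and stack frames. */
    static String dumpOnce() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and locked-synchronizer details,
        // which not every JVM supports.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        int dumps = 3;               // hypothetical count
        long intervalMillis = 5000L; // five seconds between dumps
        for (int i = 0; i < dumps; i++) {
            System.out.println("=== dump " + (i + 1) + " ===");
            System.out.print(dumpOnce());
            if (i < dumps - 1) {
                Thread.sleep(intervalMillis);
            }
        }
    }
}
```

A thread that shows the same stack frames in every dump is a good candidate for being blocked or stuck.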

If you turn on -XX:+HeapDumpOnOutOfMemoryError, then when you get an OOM the JVM writes a heap dump to a file, so that you can debug where the memory was being used at the time of the error. jmap and jconsole also allow you to take heap dumps. hprof can also give you information about the heap, but it has significant overhead, so you don’t want to use it in production.
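
A heap dump can also be triggered from code. This was not shown in the talk; the sketch below assumes a HotSpot-based JVM on Java 7 or later, where the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean is available. The class name HeapDumper and the default file name are made up.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; works only on HotSpot-based JVMs.
class HeapDumper {

    /**
     * Writes a heap dump of the current JVM to the given path.
     * The file must not already exist; recent JDKs also expect an .hprof suffix.
     */
    static void dumpHeap(String path, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(path, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "app-heap.hprof"; // hypothetical default
        dumpHeap(path, true); // true: dump only objects that are still reachable
        System.out.println("wrote " + new File(path).length() + " bytes to " + path);
    }
}
```

The resulting file can be opened with the same heap analysis tools you would use on a dump taken with jmap.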

Hung or Slow App - Debugging

Look in the garbage collection logs - how long is the application pausing for garbage collection? (Five seconds is a long time.) Also look at how much overall time the JVM is spending garbage collecting rather than running your code. You can tune this by modifying the heap size - of course, a larger heap takes longer to garbage collect. Make one change at a time so that you can see what actually worked.
HPjmeter.jar - a free tool from HP (this is not Apache JMeter). This is a tool for monitoring HP’s JVM, but it can also analyze garbage collection logs from most JVMs. It will give you a chart of how much time is spent in garbage collection, and how many GC events, as well as average duration, etc.
One comment was made that if you are using large heap sizes (8gb-16gb), you need to make sure to use the appropriate (larger) page size. I’ll have to Google this to find out more.
Beware of over-optimizing - your app will change over time. Bill suggests trying to find a good heap size, and perhaps using concurrent garbage collection rather than full garbage collection, even though this has higher overhead.
Deadlock - if you encounter a deadlock, take a thread dump (see above), which also runs the deadlock detector.
Looping threads - you may get certain threads stuck in a long-running loop. To find them, monitor the CPU time of each thread using ThreadMXBean or jconsole with JTop (see above for more details).
Blocked threads - there is an open source GUI tool called “Samurai” that analyzes thread dumps for you. It can also understand consecutive thread dumps and display the information in an easy-to-read format.
Bill also wrote a Perl script that gives you an overview of a thread dump. Both of these tools are located in the same folder as his slides (link included at bottom of this post). His Perl script gives a very nice overview, including how many threads are locked, and where they are locked. Running it against several thread dumps that are several seconds apart gives you a good look at what’s going on.
Stuck threads - you may not have deadlocks, but you can still get stuck threads. Typical causes include network I/O without a timeout set. You can use the thread dump techniques to analyze this the same as you would if you had deadlocked threads.
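
The deadlock and looping-thread checks above can both be approximated with ThreadMXBean. This is a sketch under my own assumptions, not Bill's code. It uses findMonitorDeadlockedThreads, which is available since JDK 5 and only sees deadlocks on synchronized monitors. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class; the names are not from the talk.
class ThreadChecks {

    /** Returns the IDs of threads deadlocked on monitors, or an empty array. */
    static long[] deadlockedThreadIds() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findMonitorDeadlockedThreads(); // null when there is no deadlock
        return ids == null ? new long[0] : ids;
    }

    /** Returns "name cpuMs=N" for the thread that has used the most CPU, or null. */
    static String busiestThread() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return null; // this JVM cannot report per-thread CPU time
        }
        long bestNanos = -1;
        String bestName = null;
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id);   // -1 if the thread has died
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread has died
            if (info != null && nanos > bestNanos) {
                bestNanos = nanos;
                bestName = info.getThreadName();
            }
        }
        return bestName == null ? null : bestName + " cpuMs=" + bestNanos / 1000000L;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
        System.out.println("busiest thread: " + busiestThread());
    }
}
```

To spot a looping thread, call busiestThread a few seconds apart: a thread whose CPU time keeps climbing while the app appears hung is the one to look at in the thread dump.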

OutOfMemoryError - heap

Some common causes:

Your heap may be too small, so you could increase the maximum heap size (-Xmx<size>)
Excessive use of finalizers (analyze with MemoryMXBean, jmap, jconsole)
Logic errors in array allocation code - look for places that allocate larger arrays than you need
Memory leak - take a heap dump (see above for helpful tools) and see if you are holding on to things that you didn’t think you were holding on to.
jhat - Java Heap Analysis Tool - this is a very helpful tool for analyzing a heap dump file that you have taken. When you start it, you can browse http://localhost:7000 and walk through the heap, seeing where everything is being referenced, and see where you’re holding on to objects that you shouldn’t be
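
For the finalizer cause listed above, MemoryMXBean reports roughly how many objects are waiting to be finalized. The sketch below is an illustration rather than anything from the talk, and the class and method names are invented. Note that getObjectPendingFinalizationCount is deprecated in recent JDKs, so expect a compiler warning there.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class; the names are not from the talk.
class HeapCheck {

    /** Approximate number of objects whose finalizers have not run yet. */
    static int pendingFinalization() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getObjectPendingFinalizationCount();
    }

    /** Fraction of the maximum heap currently in use, or -1.0 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM does not define a maximum
        if (max <= 0) {
            return -1.0;
        }
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        System.out.println("pending finalization: " + pendingFinalization());
        System.out.println("heap used fraction: " + heapUsedFraction());
    }
}
```

A pending count that keeps growing while the heap fills up suggests the finalizer thread cannot keep up, which is one of the causes listed above.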

OutOfMemoryError - permgen

Some common causes:

If you hot-reload your webapp, your server creates a new classloader. If somewhere in your code you accidentally held on to the old classloader (or possibly to a class it loaded?), those classes can’t be garbage collected, and you eventually run out of perm gen space.
If you do have a leak like just described, look for threads that you started running in the background that may be holding on to old classes / classloaders.
You can use jhat again - look for references to a class loader. If you find what’s holding on to the classloader, you’ve found the leak.

OutOfMemoryError - too many threads

You can run out of memory for thread stacks if you have too many threads. Each Java thread has a native thread and a stack. You can either lower the maximum number of threads or decrease the maximum stack size per thread (-Xss) - but be careful - with too small a stack you may run into StackOverflowError.
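
Both remedies can also be applied in application code. The sketch below is my own illustration, not something shown in the session. It caps the number of worker threads with a fixed-size pool and asks for a smaller stack per thread through the Thread constructor. The class name, the pool size of 4, and the 256 KB stack size are hypothetical values.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class; the names and sizes are not from the talk.
class BoundedWorkers {

    /** Creates a pool capped at maxThreads, each thread requesting the given stack size. */
    static ExecutorService newPool(int maxThreads, long stackSizeBytes) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = task ->
                // The stack size is only a hint; the JVM may round it up or ignore it.
                new Thread(null, task, "worker-" + counter.incrementAndGet(), stackSizeBytes);
        return Executors.newFixedThreadPool(maxThreads, factory);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = newPool(4, 256 * 1024);
        for (int i = 0; i < 10; i++) {
            int job = i;
            pool.execute(() ->
                    System.out.println("job " + job + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown(); // stop accepting new work, let queued jobs finish
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Extra jobs wait in the pool's queue instead of each getting a new thread, so the number of native threads and stacks stays bounded.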

OutOfMemoryError - native memory

This happens when your system actually runs out of memory, and you can’t allocate any more to the JVM. Look for other processes that are using too much memory.
You may also have a leak in JNI or a native method. The stack trace in the OOM will tell you what native thread caused this.
There have also been leaks in older JVM versions. You will get an error log with the JVM crash. Obviously the fix for this is to update to the latest JVM. You can also look into hs_err_<pid>.log to see where it’s happening and possibly find a workaround. Or you can check Sun’s bug database (http://bugs.sun.com/), post to Sun’s Java forum, and / or open a bug. Bill says the Sun developers are very helpful.
One workaround he has had to use in the past was running the JVM in client mode rather than server mode (this only works on 32-bit systems - on 64-bit, the JVM always runs in server mode, regardless of what you tell it to do).

You can download Bill’s slides here: http://people.apache.org/~billa/apacheconus2008/