[ Team LiB ] |
2.3 Method CallsMost profiling tools provide a profile of method calls, showing where the bottlenecks in your code are and helping you decide where to target your efforts. By showing which methods and lines take the most time, a good profiling tool can quickly pinpoint bottlenecks. Most method profilers work by sampling the call stack at regular intervals and recording the methods on the stack.[7] This regular snapshot identifies the method currently being executed (the method at the top of the stack) and all the methods below, to the program's entry point. By accumulating the number of hits on each method, the resulting profile usually identifies where the program is spending most of its time. This profiling technique assumes that the sampled methods are representative, i.e., if 10% of stacks sampled show method foo( ) at the top of the stack, then the assumption is that method foo( ) takes 10% of the running time. However, this is a sampling technique , so it is not foolproof: methods can be missed altogether or have their weighting misrecorded if some of their execution calls are missed. But usually only the shortest tests are skewed. Any reasonably long test (i.e., seconds rather than milliseconds) normally gives correct results.
The JDK comes with a minimal profiler, obtained by running a program using the java executable with the -Xrunhprof option (-prof before JDK 1.2, -Xprof with HotSpot). This option produces a profile data file called java.hprof.txt (java.prof before 1.2). The filename can be specified by using the modified option -Xrunhprof: file=<filename> (-prof:<filename> before 1.2). 2.3.1 Profiling MethodologyWhen using a method profiler, the most useful technique is to target the top five to ten methods and choose the quickest to fix. The reason for this is that once you make one change, the profile tends to be different the next time, sometimes markedly so. This way, you can get the quickest speedup for a given effort. However, it is also important to consider what you are changing so you know what your results are. If you select a method that is taking 10% of the execution time and then halve the time that method takes, you speed up your application by 5%. On the other hand, targeting a method that takes up only 1% of execution time gives you a maximum of only 1% speedup to the application, no matter how much effort you put in. Similarly, if you have a method that takes 10% of the time but is called a huge number of times, with each individual method call being quite short, you are less likely to speed up the method. On the other hand, if you can eliminate some significant fraction of the calling methods (the methods that call the method that takes 10% of the time), you might gain speed that way. Let's look at the profile output from a short program that repeatedly converts some numbers to strings and inserts them into a hash table: package tuning.profile; import java.util.*; public class ProfileTest { public static void main(String[ ] args) { //Repeat the loop this many times int repeat = 2000; //Two arrays of numbers, eight doubles and ten longs double[ ] ds = {Double.MAX_VALUE, -3.14e-200D, Double.NEGATIVE_INFINITY, 567.89023D, 123e199D, -0.000456D, -1.234D, 1e55D}; long[ ] ls = {2283911683699007717L, -8007630872066909262L, 4536503365853551745L, 548519563869L, 45L, Long.MAX_VALUE, 1L, -9999L, 7661314123L, 0L}; //Initializations long time; StringBuffer s = new StringBuffer( ); Hashtable h = new Hashtable( ); System.out.println("Starting test"); time = System.currentTimeMillis( ); //Repeatedly add all the numbers to a stringbuffer //and also put them into a hash table for (int i = repeat; i > 0; i--) { s.setLength(0); for (int j = ds.length-1; j >= 0; j--) { s.append(ds[j]); h.put(new Double(ds[j]), Boolean.TRUE); } for (int j = ls.length-1; j >= 0; j--) { s.append(ls[j]); h.put(new Long(ls[j]), Boolean.FALSE); } } time = System.currentTimeMillis( ) - time; System.out.println(" The test took " + time + " milliseconds"); } } The relevant output from running this program with the JDK 1.2 method profiling option follows. (See Section 2.3.2 later in this chapter for a detailed explanation of the 1.2 profiling option and its output.) CPU SAMPLES BEGIN (total = 15813) Wed Jan 12 11:26:47 2000 rank self accum count trace method 1 54.79% 54.79% 8664 204 java/lang/FloatingDecimal.dtoa 2 11.67% 66.46% 1846 215 java/lang/Double.equals 3 10.18% 76.64% 1609 214 java/lang/FloatingDecimal.dtoa 4 3.10% 79.74% 490 151 java/lang/FloatingDecimal.dtoa 5 2.90% 82.63% 458 150 java/lang/FloatingDecimal.<init> 6 2.11% 84.74% 333 213 java/lang/FloatingDecimal.<init> 7 1.23% 85.97% 194 216 java/lang/Double.doubleToLongBits 8 0.97% 86.94% 154 134 sun/io/CharToByteConverter.convertAny 9 0.94% 87.88% 148 218 java/lang/FloatingDecimal.<init> 10 0.82% 88.69% 129 198 java/lang/Double.toString 11 0.78% 89.47% 123 200 java/lang/Double.hashCode 12 0.70% 90.17% 110 221 java/lang/FloatingDecimal.dtoa 13 0.66% 90.83% 105 155 java/lang/FloatingDecimal.multPow52 14 0.62% 91.45% 98 220 java/lang/Double.equals 15 0.52% 91.97% 83 157 java/lang/FloatingDecimal.big5pow 16 0.46% 92.44% 73 158 java/lang/FloatingDecimal.constructPow52 17 0.46% 92.89% 72 133 java/io/OutputStreamWriter.write In this example, I extracted only the top few lines from the profile summary table. The methods are ranked according to the percentage of time they take. Note that the trace does not identify actual method signatures, only method names. The top three methods take, respectively, 54.79%, 11.67%, and 10.18% of the time taken to run the full program.[8]
The fourth method in the list takes 3.10% of the time, so clearly you need look no further than the top three methods to optimize the program. The methods ranked first, third, and fourth are the same method, possibly called in different ways. Obtaining the traces for these three entries from the relevant section of the profile output (trace 204 for the first entry, and traces 215 and 151 for the second and fourth entries), you get: TRACE 204: java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method) java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method) java/lang/Double.toString(Double.java:Compiled method) java/lang/String.valueOf(String.java:Compiled method) TRACE 214: java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method) TRACE 151: java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method) java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method) java/lang/Double.toString(Double.java:132) java/lang/String.valueOf(String.java:2065) In fact, both traces 204 and 151 are the same stack, but trace 151 provides line numbers for two of the methods. Trace 214 is a truncated entry, and is probably the same stack as the other two (these differences highlight the fact that the JDK profiler sometimes loses information). All three entries refer to the same stack: an inferred call from the StringBuffer to append a double, which calls String.valueOf( ), which calls Double.toString( ), which in turn creates a FloatingDecimal object. (<init> is the standard way to write a constructor call; <clinit> is the standard way to show a class initializer being executed. These are also the actual names for constructors and static initializers in the class file.) FloatingDecimal is private to the java.lang package, which handles most of the logic involved in converting floating-point numbers. FloatingDecimal.dtoa( ) is the method called by the FloatingDecimal constructor that converts the binary floating-point representation of a number into its various parts of digits before the decimal point, after the decimal point, and the exponent. FloatingDecimal stores the digits of the floating-point number as an array of chars when the FloatingDecimal is created; no strings are created until the floating-point number is converted to a string. Since this stack includes a call to a constructor, it is worth checking the object-creation profile to see whether you are generating an excessive number of objects; object creation is expensive, and a method that generates many new objects is often a performance bottleneck. (I show the object-creation profile and how to generate it later in this chapter under Section 2.4.) The object-creation profile shows that a large number of extra objects are being created, including a large number of FDBigInt objects that are created by the new FloatingDecimal objects. Clearly, FloatingDecimal.dtoa( ) is the primary method to optimize in this case. Almost any improvement in this one method translates directly to a similar improvement in the overall program. However, normally only Sun can modify this method, and even if you want to modify it, it is long and complicated and takes an excessive amount of time to optimize unless you are already familiar with both floating-point binary representation and converting that representation to a string format. Normally when tuning, the first alternative to optimizing FloatingDecimal.dtoa( ) is to examine the other significant bottleneck method, Double.equals( ) , which was second in the summary. Even though this entry takes up only 11.67% compared to over 68% for the FloatingDecimal.dtoa( ) method, it may be an easier optimization target. But note that while a small 10% improvement in the FloatingDecimal.dtoa( ) method translates into a 6% improvement for the program as a whole, the Double.equals( ) method needs to be speeded up to be more than twice as fast to get a similar 6% improvement for the full program. The trace corresponding to this second entry in the summary example turns out to be another truncated trace, but the example shows the same method in 14th position, and the trace for that entry identifies the Double.equals( ) call as coming from the Hashtable.put( ) call. Unfortunately for tuning purposes, the Double.equals( ) method itself is already quite fast and cannot be optimized further. When methods cannot be directly optimized, the next best choice is to reduce the number of times they are called or even avoiding the methods altogether. (In fact, eliminating method calls is actually the better tuning choice, but it is often considerably more difficult to achieve and so is not a first-choice tactic for optimization.) The object-creation profile and the method profile together point to the FloatingDecimal class as being a huge bottleneck, so avoiding this class is the obvious tuning tactic here. In Chapter 5, I employ this technique, avoiding the default call through the FloatingDecimal class for the case of converting floating-point numbers to Strings, and I obtain an order-of-magnitude improvement. Basically, the strategy is to create a more efficient routine to run the equivalent conversion functionality and then replace the calls to the underperforming FloatingDecimal methods with calls to more efficient optimized methods.
The best way to avoid the Double.equals( ) method is to replace the hash table with another implementation that stores double primitive data types directly rather than requiring the doubles to be wrapped in a Double object. This allows the = = operator to make the comparison in the put( ) method, thus completely avoiding the Double.equals( ) call. This is another standard tuning tactic, replacing a data structure with one more appropriate and faster for the task. 2.3.2 Java 2 cpu=samples Profile OutputThe default profile output gained from executing with -Xrunhprof in Java 2 is not useful for method profiling. The default output generates object-creation statistics from the heap as the dump (output) occurs. By default, the dump occurs when the application terminates; you can modify the dump time by typing Ctrl-\ on Solaris and other Unix systems, or Ctrl-Break on Windows. To get a useful method profile, you need to modify the profiler options to specify method profiling. A typical call to achieve this is: % java -Xrunhprof:cpu=samples,thread=y <classname>
(Note that in a Windows command-line prompt, you need to surround the options with double quotes because the equals sign is considered a meta character.) The full list of options available with -Xrunhprof can be viewed using the -Xrunhprof:help option. The profiling option in JDKs 1.2/1.3/1.4 can be pretty flaky. Several of the options can cause the runtime to crash (core dump). The output is a large file because huge amounts of trace data are written rather than summarized. Since the profile option is essentially a Sun engineering tool, it is pretty rough, especially since Sun has a separate (not free) profile tool for its engineers. Another tool that Sun provides to analyze the output of the profiler is the Heap Analysis Tool (search http://www.java.sun.com for "HAT"). But this tool analyzes only the object-creation statistics output gained with the default profile output, so it is not that useful for method profiling (see Section 2.4 for slightly more about this tool). Nevertheless, I expect the free profiling option to stabilize and be more useful in future versions. The output when run with the options already listed (cpu=samples, thread=y) already results in fairly usable information. This profiling mode operates by periodically sampling the stack. Each unique stack trace provides a TRACE entry in the second section of the file, describing the method calls on the stack for that trace. Multiple identical samples are not listed; instead, the number of their "hits" is summarized in the third section of the file. The profile output file in this mode has three sections:
Section 3 is the place to start when analyzing this profile output. It consists of a table with six fields:
The profiler in this mode (cpu=samples) suffices when you have no better alternative. It does have an effect on real measured times, slowing down operations by variable amounts even within one application run. But it normally indicates major bottlenecks, although sometimes a little extra work is necessary to sort out multiple identical method-name references. Using the alternative cpu=times mode, the profile output gives a different view of application execution. In this mode, the method times are measured from method entry to method exit, including the time spent in all other calls the method makes. This profile of an application provides a tree-like view of where the application is spending its time. Some developers are more comfortable with this mode for profiling the application, but I find that it does not directly identify bottlenecks in the code. 2.3.3 HotSpot and 1.3 -Xprof Profile OutputHotSpot does not support the standard Java 2 profiler detailed in the previous section; it supports a separate profiler using the -Xprof option. JDK 1.3 supports the HotSpot profiler as well as the standard Java 2 profiler. The HotSpot profiler has no further options available to modify its behavior; it works by sampling the stack every 10 milliseconds. The output, printed to standard output, consists of a number of sections. Each section lists entries in order of the number of ticks counted while the method was executed. The various sections include methods executing in interpreted and compiled modes, and VM runtime costs as well:
The entries at the top of Section 3 are the methods that probably need tuning. Any method listed near the top of Section 2 should have been targeted by the HotSpot optimizer and may be listed lower down in Section 3. Such methods may still need to be optimized, but it is more likely that the methods at the top of Section 3 need optimizing. The ticks for the two sections are the same, so you can easily compare the time taken up by the top methods in the different sections and decide which to target. 2.3.4 JDK 1.1.x -prof and Java 2 cpu=old Profile OutputThe JDK 1.1.x method-profiling output, obtained by running with the -prof option, is quite different from the normal 1.2 output. This output format is supported in Java 2, using the cpu=old variation of the -Xrunhprof option. This output file consists of four sections:
This first line of the example is followed by lines consisting of three fields: first, the class name prefixed by the array dimension if the line refers to the array data; next, the number of instances of that class (or array class); and last, the total amount of space used by all the instances, in bytes. So the example reports that there are 219 HashtableEntry instances that comprise (collectively) a total of 3504 bytes,[9] and three HashtableEntry arrays having 1103 array indexes (which amounts to 4412 bytes total, as each entry is a 4-byte object handle).
Sections 3 and 4 give snapshots of the object table memory and can be used in an interesting way: to run a garbage collection just before termination of your application. That leaves in the object table all the objects that are rooted[10] by the system and by your application (from static variables). If this snapshot shows significantly more objects than you expect, you may be referencing more objects than you realized.
The first section of the profile output is the most useful. It consists of multiple lines, each specifying a method and its caller, together with the total cumulative time spent in that method and the total number of times it was called from that caller. The first line of this section specifies the four fields in the profile table in this section: count, callee, caller, and time. They are detailed here:
This first section is normally sorted into count order. However, for this profiler, the time spent in methods tends to be more useful. Because the times in the time field include the total time that the callee method was anywhere on the stack, interpreting the output of complex programs can be difficult without processing the table to subtract subcall times. This format is different from the 1.2 output with cpu=samples specified, and is similar to a 1.2 profile with cpu=times specified. The lines in the profile output are unique for each callee-caller pair, but any one callee method and any one caller method can (and normally do) appear in multiple lines. This is because any particular method can call many other methods, and so the method registers as the caller for multiple callee-caller pairs. Any particular method can also be called by many other methods, and so the method registers as the callee for multiple callee-caller pairs. The methods are written out using the internal Java syntax listed in Table 2-1.
There are free viewers, including source code, for viewing this file:
The biggest drawback to the 1.1 profile output is that threads are not shown at all. This means that it is possible to get time values for method calls that are longer than the total time spent in running the application, since all the call times from multiple threads are added together. It also means that you cannot determine from which thread a particular method call was made. Nevertheless, after re-sorting the section on the time field rather than the count field, the profile data is useful enough to suffice as a method profiler when you have no better alternative. One problem I've encountered is the limited size of the list of methods that can be held by the internal profiler. Technically, this limitation is 10,001 entries in the profile table, and there is presumably one entry per method. There are four methods that help you avoid the limitation by profiling only a small section of your code: sun.misc.VM.suspendJavaMonitor( ) sun.misc.VM.resumeJavaMonitor( ) sun.misc.VM.resetJavaMonitor( ) sun.misc.VM.writeJavaMonitorReport( ) These methods also allow you some control over which parts of your application are profiled and when to dump the results. 2.3.5 Java 2 -Xhprof OptionThe -Xhprof option seems to be simply an alternative name for the -Xrunhprof option. I believe that originally it was called -Xhprof, and then the option was left in although all subsequent documentation used -Xrunhprof. 2.3.6 Java 2 -Xaprof option-Xaprof appears to be a simple allocation profiler. It prints the number and total size of instances allocated per class, including array classes, accumulating instances across all threads and creation points. In fact, it seems to be very similar to the tool I describe in the next section. Like other VM profiling tools, it is unfortunately not 100% stable (for example, it core dumps with my 1.4 Windows VM). Nevertheless, it is useful when it works, and it was introduced with 1.3. Using this profiler to monitor the tuning.profile.ProfileTest class used in the example from the "Profiling Methodology" section results in the following output: Allocation profile (sizes in bytes, cutoff = 0 bytes): _ _ _ _ _ _Size_ _Instances_ _Average_ _Class_ _ _ _ _ _ _ _ _ _ _ _ _ 13491592 186025 73 [I 5634592 86602 65 [C 2496352 156022 16 java.lang.FDBigInt 875112 36463 24 java.lang.String 768000 16000 48 java.lang.FloatingDecimal 320000 20000 16 java.lang.Long 256000 16000 16 java.lang.Double 29832 14 2131 [B 14256 594 24 java.util.Hashtable$Entry 8960 6 1493 [S 8112 25 324 [Ljava.util.Hashtable$Entry; 2448 102 24 java.lang.StringBuffer 2312 3 771 [Ljava.lang.FDBigInt; 1600 24 67 [Ljava.lang.Object; 584 9 65 [Ljava.util.HashMap$Entry; 528 22 24 java.util.Locale 440 11 40 java.util.Hashtable 432 9 48 java.util.HashMap 392 4 98 [D 376 3 125 [J 320 7 46 [Ljava.lang.String; 256 4 64 java.lang.Thread ... 23916904 518123 46 --total-- The listing has been truncated, but a full listing is output down to objects and arrays with only one instance created. The primitive arrays are listed using the one-character labels from Table 2-1. The listing is fairly clear. All instances created at any time by the VM are included, whether they have been garbage-collected or not. The first column is the total size in bytes taken up by all the instances summed together; the second column provides the number of those instances created; and the third divides the first column by the second column to give an average size per object in bytes. The only disadvantage seems to be that you cannot take a snapshot. There seems to be no way of registering only those objects created between time 1 (e.g., after initialization) and time 2. Otherwise, this is another useful tool to add to your armory. |