16.2 Tuning JDBC
Although JDBC tuning follows the general tuning guideline of finding the bottlenecks before you tune, it is worth knowing from the outset what tuning techniques are available. You can structure your JDBC usage so that it is easier to tune without violating good design and coding practices. Two general principles are:
16.2.1 JDBC Drivers
The JDBC package provides interfaces that standardize the API for making database connections. However, the JDBC package does not provide concrete implementations of those interfaces (apart from the JDBC-ODBC bridge driver). Third-party implementations are necessary to make the actual connection to a particular database, and the database vendor usually provides them. These implementations are called JDBC drivers, and they fall into four categories, or types, which differ in implementation and performance:
16.2.1.1 Type 1: JDBC-bridging driver
Type 1 drivers provide a link to a bridging driver, most commonly ODBC, which can connect to any ODBC-compatible database (i.e., most databases). Because this type of driver works through a second bridging product, it is usually the slowest type and should be avoided if possible.

16.2.1.2 Type 2: Native API, part-Java driver
Type 2 drivers connect to a local native binary client. This is equivalent to using a native database client library, connecting to it via JNI, and wrapping that connection in classes implementing the JDBC APIs. This configuration should provide reasonable performance, assuming the driver itself has been through a round of performance tuning (most have by now). If the database is on the local machine and the driver has optimizations to take advantage of that configuration, a Type 2 driver could provide the fastest performance. The JNI connection adds overhead, and that overhead becomes relatively larger as JVMs get faster, but this type of driver is certainly worth evaluating if you have a choice of available drivers.

16.2.1.3 Type 3: Pure Java driver for database middleware
Type 3 drivers provide a connection to a middleware product, which in turn connects to a database. In some ways they are similar to Type 1 drivers. However, Type 1 ODBC drivers are generic: they are limited to using ODBC interfaces to databases, and that type of database interface does not usually provide optimal performance. Type 3 driver middleware, on the other hand, can use any interface to the database, including the fastest one. Type 3 drivers have bridging overhead similar to that of Type 1 drivers, but the middleware's optimization features and its position in the architecture allow it to take full advantage of any spare processing capacity available on the network. Type 3 drivers actually outperform Type 2 drivers in many configurations.

16.2.1.4 Type 4: Pure Java driver direct to a database
Type 4 drivers connect from Java directly to the database by using network communications (sockets) and the database's network application protocol. Because it bypasses every kind of intermediate bridging, this type of driver generally provides the fastest communication. However, if the database and the Java code run on the same machine and the database's native client library provides optimizations for that machine's configuration, a Type 2 driver could be faster. My advice is to use a Type 4 driver when possible, test the performance of Type 2 and Type 3 drivers if available, and stay away from Type 1 drivers. Type 3 drivers are rarer and tend to be third-party drivers, but this often means they were built by a company dedicated to building high-performance database drivers. Generally, Type 3 and Type 4 drivers provide the best overall performance, but every configuration is different (i.e., database type, usage, and distribution), so try to test the available alternatives. A list of current drivers is available at http://industry.java.sun.com/products/jdbc/drivers, but that page does not provide much information to help select the fastest driver. Try to use the latest version available, preferably with JDBC 3.0 or higher support. JDBC 3.0 drivers include connection pooling, distributed transaction support, RowSets, and prepared-statement pooling.
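Switching drivers normally requires changing only the driver class name and the JDBC URL, which makes comparative testing of driver types straightforward. Here is a minimal sketch; the driver class and URL below are placeholders for whatever your vendor supplies, not a real product:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class OpenConnection {
        public static Connection open()
                throws ClassNotFoundException, SQLException {
            //Register the driver. The class name is driver-specific;
            //"com.example.jdbc.Driver" is a placeholder
            Class.forName("com.example.jdbc.Driver");
            //The URL subprotocol and options are also driver-specific;
            //consult the driver documentation for the exact format
            return DriverManager.getConnection(
                "jdbc:example://dbhost:1234/mydb", "user", "password");
        }
    }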
16.2.2 Connection Pooling
Opening a connection to a database is costly, potentially requiring several network round trips, resources in both the client and the database, and the setup of several layers of transport and authentication. Once a connection is created, keeping it open is not expensive, even if it is idle. If you make many calls to a database, repeatedly opening and closing connections can kill the application's performance, so keeping connections to the database open is an important JDBC performance-tuning technique. However, if you simply open more connections and keep them all open, you quickly reach the database's limit for open connections and probably exceed the number of connections for optimal efficiency. To maintain an efficient number of open database connections, use a connection pool: an object pool consisting of Connection objects. Connection pools are directly supported starting in JDBC 2.0, although earlier drivers may have their own connection pools. Many third-party and open source connection-pool implementations are available, and creating one yourself is also fairly simple. Remember that Connection objects are not reentrant: only one client of the pool can use a Connection object at a time. The reference page at http://www.JavaPerformanceTuning.com/tips/jdbcconnpool.shtml provides helpful links for finding or building a connection-pool implementation appropriate to your application, together with a suggested order of preference for selecting among the implementations. If you want completely transparent connection pooling, you can even use a Connection wrapper that intercepts close( ) requests and deposits the Connection back into the pool, as in the following sketch.
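This bare-bones sketch shows only the interception; the ConnectionPool type here is hypothetical, and a real wrapper must implement and delegate every method of the Connection interface:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    //Hypothetical pool: hands out connections and takes them back
    interface ConnectionPool {
        void returnConnection(Connection c);
    }

    public class PooledConnectionWrapper /* implements Connection */ {
        private final Connection realConnection;
        private final ConnectionPool pool;

        public PooledConnectionWrapper(Connection real, ConnectionPool pool) {
            this.realConnection = real;
            this.pool = pool;
        }

        //Intercept close(): return the underlying connection to the
        //pool instead of closing the database connection
        public void close() throws SQLException {
            pool.returnConnection(realConnection);
        }

        //All other methods simply delegate, for example:
        public Statement createStatement() throws SQLException {
            return realConnection.createStatement();
        }
    }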
The connection pool itself provides further tuning possibilities. The size of the pool is a tuning parameter: the ideal pool size is just large enough to service requests without forcing requests for connections to wait a long time. If the pool size is flexible (i.e., connections are created when the pool has no more available connections and destroyed when they are idle for too long), then the goal is to tune the connection-pool size to minimize the creation and destruction of database connections. Timeouts and initial, minimum, and maximum pool sizes all provide parameters for optimizing your application's use of the connection pool. The overall strategy for using pooled connections is to obtain and release a pooled connection within each method that requires it, if the request is short. However, do not release the connection only to use it again almost immediately in another method; instead, hold the connection until it is not immediately needed. Once a connection is obtained from the pool, it is essentially an exclusively locked shared resource, and keeping it locked for the minimum amount of time ensures maximum scalability for the application. You are generally better off obtaining and releasing connections too frequently, and accepting the performance overhead of that strategy, than holding onto connections for too long. For optimal use of connection pools, you might want to consider using several pools, with each pool supporting a particular connection feature. For example, some databases can optimize read-only operations, which the Connection class supports through the setReadOnly(boolean) method. Since some proportion of JDBC operations are inevitably read-only, you may want to have a read-only connection pool as well as a general connection pool.
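For instance, the connections in the read-only pool can be marked once, up front, when the pool creates them; the readOnlyPool object in this sketch is hypothetical:

    //Mark each connection read-only when the read-only pool creates it;
    //the database can then avoid caching new pages, journaling, and
    //write locks for work done on these connections
    Connection conn = DriverManager.getConnection(url, user, password);
    conn.setReadOnly(true);
    readOnlyPool.add(conn);    //Hypothetical pool of read-only connections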
16.2.3 Optimizing SQL
Developers use SQL to interact with databases through JDBC. SQL is the normal choice because it provides a standard, well-understood mechanism for database interaction; drivers can support other syntaxes, such as the database's stored-procedure syntax, but ANSI SQL must be supported. The SQL your application issues through JDBC can be tuned for significant gains in overall performance, so tuning SQL is critical for tuning JDBC. Three broad areas to consider when tuning SQL are set-based processing (Section 16.2.4), database server-side processing (Section 16.2.5), and minimizing the transferred data (Section 16.2.6). These three categories are not completely separate. They overlap slightly and sometimes produce conflicting optimizations that need to be balanced. We'll look at each category in turn.

16.2.4 Set-Based Processing
When you create a Java application that uses JDBC, you don't tend to think about which procedures operate on the same tables. Instead, you probably think about which data each method and object needs, and which data needs to be updated in the database, on a method-by-method and object-by-object basis. The resulting JDBC code tends to use row-by-row operations, with many JDBC SQL calls inefficiently handling only one row of a table at a time. SQL, however, is designed to process multiple rows in a table simultaneously, and relational-database vendors ensure that operations applied to multiple rows in one table execute as efficiently as possible. To take advantage of this, combine SQL statements. Don't query for one row at a time from a table; try to get all the rows you will need, as in the sketch that follows. This may require restructuring objects in your application, but can result in significantly fewer application-to-database round trips. Even more usefully, SQL consists of more than SELECT and INSERT. Instead of using the database purely as a data store, you can construct sophisticated SQL statements that combine queries, processing, and updates without bringing lots of temporary data back to the Java application for local processing. Combining multiple operations is a good way to take advantage of the efficiencies available in relational databases; we discuss batching operations and stored procedures in later sections.
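Here is a minimal sketch of the row-at-a-time versus set-based difference, assuming a connection and an int[] ids of keys to fetch (the customer table and its columns are invented for the example):

    //Inefficient: one query, and potentially one round trip, per id
    PreparedStatement ps = connection.prepareStatement(
        "SELECT name FROM customer WHERE id = ?");
    for (int i = 0; i < ids.length; i++) {
        ps.setInt(1, ids[i]);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) { /* process one row */ }
        rs.close();
    }
    ps.close();

    //Set-based: one statement fetches all the needed rows together
    StringBuffer sql = new StringBuffer("SELECT name FROM customer WHERE id IN (");
    for (int i = 0; i < ids.length; i++) {
        sql.append(ids[i]);
        if (i < ids.length - 1)
            sql.append(',');
    }
    sql.append(')');
    Statement s = connection.createStatement();
    ResultSet rs = s.executeQuery(sql.toString());
    while (rs.next()) { /* process all the rows in one pass */ }
    rs.close();
    s.close();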
16.2.5 Database Server-Side Processing
On the database side of the JDBC communication, the database needs to parse the received SQL instructions, execute the accesses and updates required to satisfy the SQL, and return any requested data. The more work the server has to do, the longer the operation takes. The performance-tuning target is to minimize the server's work without disproportionately burdening the Java application. Some techniques for reducing database work are relatively simple: avoid doing work that doesn't need to be done. This rule seems obvious, but I've seen unnecessary calls to uppercase data (using upper( )) too many times. Examine your SQL and the table structure to decide whether the SQL adds unnecessary database-side work. Are those uppercase/count/etc. operations really required? Could you more efficiently uppercase the data on insertion? Sort the data differently? Avoid the count operation by maintaining a separate counter in a dedicated counter table? Do you really need to access the same table multiple times? Can you change existing rows of data rather than adding or deleting rows? Each time you avoid changing the table by moving, deleting, or adding rows, you reduce the database workload slightly. If you can construct the application to use existing dummy rows instead of inserting rows, the SQL runs faster. Some databases can optimize read-only operations. If your connection only reads, set it to read-only with the Connection.setReadOnly(true) method: if the database knows a connection is read-only, it does not need to cache new pages, maintain journal entries, or acquire write locks on the data. Some operations are much more expensive than others. Multiway joins invariably degrade performance, and performance gets worse as data volume increases, so try to work with as few tables as possible within each SQL clause. Queries of large tables that do not use an index require far more processing than queries that do use an index.

Two further checks can help you assess the efficiency of SQL statements. First, the Connection class includes a nativeSQL( ) method that converts the SQL you provide into the system's native SQL, giving the driver an opportunity to convert standard SQL into SQL optimized for the particular database. This method rarely returns anything other than the SQL string you pass to it, but it's always worth checking. Second, many databases support the EXPLAIN SQL command: preface a statement with the keyword EXPLAIN, and the database produces an explanation of how it will execute the SQL, including whether an index is used. As the Red Hat database manual points out, executing

    EXPLAIN SELECT * FROM foo

in PostgreSQL on a table with no index produces the plan

    Seq Scan on foo  (cost=0.00..2.28 rows=128 width=4)

telling you that the query executes as a sequential scan, produces 128 rows, and costs 2.28 disk-page fetches. Querying the table again with an added index and an equijoin condition

    EXPLAIN SELECT * FROM foo WHERE i = 4

produces the plan

    Index Scan using fi on foo  (cost=0.00..0.42 rows=1 width=4)

telling you that the query executes with an index, produces 1 row, and costs 0.42 disk-page fetches. Quite a dramatic difference.

16.2.6 Minimizing Transferred Data
Transferring and converting data represents a significant proportion of the cost of many JDBC operations. Parameters that minimize transfer costs are discussed in Section 16.2.9, and techniques to minimize data-conversion costs are discussed in Section 16.2.11. However, rather than just minimizing those costs, try to avoid transferring the data at all. Efficient SQL lets you minimize the data that is transferred. The most glaring example of transferring too much data is the ubiquitous use of the SELECT * query:

    SELECT * FROM foo ...

Using the * character in the SELECT clause asks the database to return all fields in the rows. I recently sampled some applications and found that out of thousands of requests to various databases, fewer than 10 needed all the fields in the rows queried. I don't find the extensive use of SELECT * surprising, though. During development, it is much easier to use the SELECT * query, especially since the required fields can change; the fields are then accessed from the ResultSet row by field index. Later, converting queries to request only the required fields means changing both the SQL and the indexes of the accessed fields in the ResultSet rows, a bug-prone change in all but the simplest JDBC applications. It can be difficult to estimate the cost of selecting too many fields without actually comparing the two queries in the context of the application, i.e., comparing

    SELECT * FROM foo ...

and

    SELECT field1,field2 FROM foo ...

Several factors come into play: there is no conversion overhead from the unused fields (in most drivers), and the transfer overhead depends on the size of the unused fields, the number of transferred rows, and the transfer batch size. There is no doubt that the second, more precise SELECT is more efficient, but whether the effort of changing the queries is worthwhile for your application and query is a judgment call. In highly scaled systems, with many queries going to the database, every field counts.
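When you do make the change, converting the SELECT and the field accesses together keeps the code correct; a small sketch:

    //Instead of SELECT * and access by column name:
    //  ResultSet rs = s.executeQuery("SELECT * FROM foo");
    //  String f1 = rs.getString("field1");
    //request only the needed fields and access them by index
    Statement s = connection.createStatement();
    ResultSet rs = s.executeQuery("SELECT field1,field2 FROM foo");
    while (rs.next()) {
        String f1 = rs.getString(1);    //field1
        String f2 = rs.getString(2);    //field2
        //process the row
    }
    rs.close();
    s.close();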
A second type of inappropriate transfer, requesting too many rows, is less frequently a problem, since it is easier during development to specify the required rows precisely by using the appropriate clauses. One common technique that does result in transferring too many rows is sorting rows with an ORDER BY clause and then using only the top few rows. For example, suppose you want to see the top five URLs hit on your web site. You might sort with SELECT url,hits FROM url_hits ORDER BY hits DESC, as in the following code:

    String sqlQuery = "SELECT url,hits FROM url_hits ORDER BY hits DESC";
    Statement statement = connection.createStatement();
    ResultSet resultSet = statement.executeQuery(sqlQuery);
    String[] urls = new String[5];
    int[] hits = new int[5];
    //Assume we always have at least 5 urls for simplicity
    for (int i = 0; i < 5 && resultSet.next(); i++) {
        urls[i] = resultSet.getString(1);
        hits[i] = resultSet.getInt(2);
    }

This results in the transfer of all the table's rows (or actually, the batch size; see Section 16.2.9). At the expense of a more costly SQL statement, you can specify just the rows you want:

    SELECT r.url, r.hits FROM url_hits r
    WHERE 5 >= (SELECT COUNT(*) FROM url_hits rr WHERE rr.hits >= r.hits);

This SQL returns only five rows, with the top five URLs and the number of hits for each. It scans through the url_hits table and, for every row, goes through the whole url_hits table again in an inner loop to see how many URLs have hits higher than or equal to those of the current row of the outer scan. This example is probably more useful as an illustration of how SQL can produce precisely the result you want than as practical performance tuning, both because of the overhead such a query adds to the database server and because setting the batch size to 5 for the original query would effectively eliminate the transfer of extra rows. I saw a more realistic example of transferring too many rows in an application that had been taken over by different developers. Rather than create new queries in the JDBC section of the application, the new developers simply checked data items and discarded those that didn't apply to the new set of criteria. Effectively, they made a SELECT * query and then executed a WHERE clause in Java, iterating through the ResultSet and collecting the rows that satisfied the new criteria. Finally, accessing fields by name (e.g., with resultSet.getString("field1")) is inefficient. The driver needs to map the name to the column number, which at best requires one extra round trip for the column mapping and at worst can result in multiple round trips and significant amounts of extra data transferred from the database. (For example, MySQL and mSQL retrieve all rows from the database before accessing a value by name.) Use the column number to access fields, even though doing so may require maintaining extra information about table structures in the application.

16.2.7 Caching
Caching data in the client is a highly effective technique for speeding up applications that use JDBC. JDBC carries the overhead of network communication, data conversion, and server-side processing, so deciding to cache JDBC data is almost a no-brainer. The questions to ask are which data to cache, and how to cache it. Deciding which data to cache is straightforward: cache small read-only tables and tables that are updated infrequently. Even medium-sized tables are worth considering, though you'll need to test how much space they take up in their final cached form.
Large tables and rapidly updated or frequently changing tables are clearly not good candidates, though selected rows from large tables may be, if they rarely change and are used more often than other rows. In some cases a cache-replacement policy, such as least-recently-used, can slowly replace the cached data so that frequently required data from large tables settles into the cache. How to cache the data is a more complex question. Bearing in mind that any processing through JDBC produces overhead, the best place to cache JDBC-obtained data is after you have processed it, i.e., after it is extracted from the ResultSet and converted into objects or object data. This is an especially good option if you are in the design or early implementation stage of your project, when you can most effectively build in object caching. If you are at a late stage of development, you may still be able to modify your application to cache objects, but that depends on the existing design and on how the data is used and where it ends up. The second option is to cache at the ResultSet layer. You cannot use a ResultSet object as the cache object itself, because it uses JDBC resources that need to be closed. However, the techniques discussed and used in the first section of this chapter show how to implement wrappers for most kinds of JDBC objects, including ResultSets. You can even create a mostly transparent cache with wrappers: just add a Hashtable to the ConnectionWrapper that maps SQL query strings to StatementWrappers and ResultSetWrappers. The ResultSetWrapper itself should wrap the original ResultSet accessed from the database and simply iterate through all the rows, obtaining the data to hold locally. All fields can be accessed with the ResultSet.getString( ) method and converted as required when accessed with other methods. For example, you might read all the data by looping through the ResultSet, collecting each row as an array of String fields:

    public static Vector createCache(ResultSet resultSet) throws SQLException {
        //We need to know how many columns there are
        ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
        int numColumns = resultSetMetaData.getColumnCount();
        Vector rowsVector = new Vector();
        //Could use an Object[] and getObject() later in the loop, at the
        //cost of more space taken up and more overhead during this
        //create-cache routine
        String[] oneRow = null;
        //Read through the rows one at a time, getting all the fields
        //as separate strings, and add them all into the vector
        while (resultSet.next()) {
            oneRow = new String[numColumns];
            for (int i = 0; i < numColumns; i++)
                oneRow[i] = resultSet.getString(i + 1);
            rowsVector.addElement(oneRow);
        }
        resultSet.close();
        return rowsVector;
    }

Bear in mind that you do not need to implement a generic framework to handle every possible case. Only the queries that lead to cached data need handling, and only those data types that will be accessed need their get( ) methods implemented.
For example, if you access only Strings and ints, here is a straightforward implementation of a ResultSet over the cached rows:

    public class ReadOnlyCachedResultSet implements ResultSet {
        Vector cacheOfRows;
        int currentIndex;

        public ReadOnlyCachedResultSet(Vector rowsCache) {
            cacheOfRows = rowsCache;
            currentIndex = -1;
        }
        public int getInt(int columnIndex) throws SQLException {
            String[] row = (String[]) cacheOfRows.elementAt(currentIndex);
            return Integer.parseInt(row[columnIndex - 1]);
        }
        public String getString(int columnIndex) throws SQLException {
            String[] row = (String[]) cacheOfRows.elementAt(currentIndex);
            return row[columnIndex - 1];
        }
        public boolean next() throws SQLException {
            return ++currentIndex < cacheOfRows.size();
        }
        //... All other methods can be implemented as unsupported, e.g.,
        //public Array getArray(String colName) throws SQLException {
        //    throw new UnsupportedOperationException();
        //}
    }

The Statement wrapper returns this ResultSet when the application executes the relevant SQL query for the cached data. This framework is relatively easy to implement and understand, and it can be slipped in over the actual JDBC driver, as shown in the first section of this chapter. It has one drawback: the data is vastly expanded, since it is held as multiple Strings, one per field. With a little extra work you can pack all the data into byte arrays and cache it in that form, so even the data size is not an overwhelming problem. Unfortunately, there is a more serious drawback: this simple mechanism does not support parameterized queries. It is possible to support parameterized queries using similar but considerably more sophisticated wrappers, but the effort is beyond what I can present here. Finally, an in-memory database product is another reasonable option that can provide relatively transparent database caching, though with a higher overhead. Several commercial versions are available, as is an open source product called hsqldb (available from http://hsqldb.sourceforge.net/). The memory overhead for such products is small, and they work just like a normal relational database, with their own JDBC driver, so you can switch between databases by switching drivers for queries or by redirecting queries with Connection wrappers. Use such products to replicate the tables you want to cache and keep in memory. An in-memory relational database opens up another interesting performance-tuning option: you can update cached rows in exactly the same way as you would in the real database. Because the update mechanism is SQL, you can log the SQL statements locally to disk as you execute them on the in-memory database, a process called journaling. The SQL statements can then be applied to the actual database separately and asynchronously, to synchronize it with the in-memory database. Since SQL is standardized, the same SQL can easily serve both databases.

16.2.8 Prepared Statements
"Use PreparedStatements." You'll hear this refrain again and again in JDBC performance-tuning tips. Fair enough, but if everyone figures you should use PreparedStatements, what is Statement for? Is it just a redundant interface? When a database executes an SQL statement, it performs two separate tasks. First, it searches through the data and indexes to determine which data items satisfy the SQL statement; this is the task you explicitly direct the database to perform. Behind the scenes, though, the database has a second task: to work out exactly how to execute the SQL.
The database needs to parse and validate the SQL, determine which indexes (if any) to use, decide the order in which to scan and join the tables involved, and choose among the possible access paths to the data.
These and other tasks combine to form the query plan; when the database "prepares" a statement, it creates a query plan. Now the difference between Statement and PreparedStatement may be clearer: a Statement prepares and executes the query plan each time, while a PreparedStatement prepares the query plan once and then reuses it. Preparing a statement is also referred to as precompiling the statement. If that were the whole story, then PreparedStatement would always be the statement of choice and you would avoid Statement objects completely. But it's not quite the whole story. Statement has optimizations that the database can apply, mainly because the database knows that a Statement plan is executed immediately and thrown away. So the database handles Statement queries differently from PreparedStatements: Statement queries can be prepared and executed in one swoop, using the state of the database at the time, without allocating resources to keep the plan around. PreparedStatements, on the other hand, need database resources to store and maintain the query plan and to ensure that the query plan is not invalidated by changes to the database; depending on how detailed the query plan is, some types of changes to the database require the plan to be updated or re-created. Creating and executing a Statement once is faster than creating and executing a PreparedStatement once. But a PreparedStatement takes its one-off hit in the preparation stage, after which any single PreparedStatement execution should be faster than a single Statement execution. Now the options are clearer, though not completely clear-cut. If your application has an initialization phase during which you can create and execute your PreparedStatements, then all subsequent executions of those PreparedStatements provide the fastest query possible. (Note that I say to create and execute the PreparedStatement in the initialization phase: although logically the PreparedStatement should be prepared when the object is created, the API allows the driver to choose whether to prepare the statement at object-creation time or at first execution.) If you cannot prepare your statements during an initialization phase, the choice is more difficult: choose a Statement or PreparedStatement object depending on which provides the fastest overall cumulative time for the repeated queries. For example, in Java Programming with Oracle JDBC (O'Reilly), Donald Bales looks at the cost of table inserts using Oracle drivers and finds that a statement needs to be repeated 65 times before the cumulative time of PreparedStatements is faster than the cumulative time of repeated Statements. Inserts have a different overhead from access queries, and different drivers and databases have different overhead, so test your configuration to see where the crossover happens if you need to be precise. In general, for any particular SQL statement, a Statement is faster if the SQL will execute only once or a few times, while a PreparedStatement is faster once the statement executes often enough to amortize its preparation cost.
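Finding the crossover point for your own driver and database needs only a simple harness. A minimal sketch (the table and query are invented; a careful test would also allow for JVM warm-up and repeat the comparison several times):

    import java.sql.*;

    public class CrossoverTest {
        //Compare cumulative time of n Statement executions against
        //one PreparedStatement preparation plus n executions
        public static void compare(Connection conn, int n) throws SQLException {
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                Statement s = conn.createStatement();
                ResultSet rs = s.executeQuery("SELECT f1 FROM t1 WHERE f3=4");
                while (rs.next()) { /* drain the rows */ }
                rs.close();
                s.close();
            }
            long statementTime = System.currentTimeMillis() - t0;

            t0 = System.currentTimeMillis();
            PreparedStatement ps =
                conn.prepareStatement("SELECT f1 FROM t1 WHERE f3=?");
            for (int i = 0; i < n; i++) {
                ps.setInt(1, 4);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) { /* drain the rows */ }
                rs.close();
            }
            ps.close();
            long preparedTime = System.currentTimeMillis() - t0;

            System.out.println(n + " executions: Statement " + statementTime
                + "ms, PreparedStatement " + preparedTime + "ms");
        }
    }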
Bear in mind two things when using PreparedStatements. First, the SQL needs to be identical for the query plan to be reusable. So

    SELECT f1,f2 FROM t1 WHERE f3=4

and

    SELECT f1,f2 FROM t1 WHERE f3=5

are two different statements, and making the first query into a PreparedStatement doesn't help the second query run any faster. Fortunately, PreparedStatements support parameterized queries, so you can repeat the same query with different parameters. To get the performance benefit of PreparedStatements, make it a habit to use parameterized statements. This code executes the previous two queries efficiently by using one parameterized statement:

    String query = "SELECT f1,f2 FROM t1 WHERE f3=?";
    PreparedStatement ps = conn.prepareStatement(query);
    ps.setInt(1, 4);                   //First query with parameter set to 4
    ResultSet rs = ps.executeQuery();  //Execute query
    ...                                //Read and process rows
    rs.close();
    ps.setInt(1, 5);                   //Second query with parameter set to 5
    rs = ps.executeQuery();            //Re-execute query
    ...                                //Read and process rows
    rs.close();
    ps.close();

The second thing to bear in mind is that a PreparedStatement is associated with a particular connection. Although it may be possible for the driver and database to ensure that any PreparedStatement using a particular SQL statement also uses the same query plan, you cannot count on it. You therefore need to use the same PreparedStatement object for repeated executions of an SQL statement to ensure that the SQL is executed with its precompiled plan, and that requires using the same Connection. This is especially important when using pooled connections. With pooled connections, you either hold a separate pool of prepared statements or implement wrappers, as defined in the first section of this chapter, that hold a cache of SQL statements with their corresponding Connection and PreparedStatement objects and return them as needed. Try to keep the appropriate connections available for the methods needing to reuse the PreparedStatement; this again points to separate connection pools. JDBC 3.0 supports pooled PreparedStatements that can be used independently of their originating Connection objects. The pooling happens completely under the covers, controlled by the driver. Indeed, a simplistic approach that assumes the JDBC driver and database automatically understand that the same SQL should use the same query plan, without worrying about which Connection object and which PreparedStatement are used, can actually work with JDBC 3.0. Unfortunately, the JDBC 3.0 interface does not specify that any particular SQL statement must be cached, and the only parameter available is the maxStatements property of the JDBC 3.0 connection pool, which specifies only how many statements should be kept open. Moving to JDBC 3.0 might solve your particular PreparedStatement reuse issues, but there isn't sufficient explicit support to determine which PreparedStatements are automatically cached, so if you are a defensive programmer, you may wish to retain control of your pools even with JDBC 3.0.
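As an illustration of the wrapper approach to statement reuse, here is a bare-bones sketch (the class name is illustrative) that maps SQL strings to PreparedStatements for a single connection; a production version would also bound the cache and close the statements when the connection is really closed:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.Hashtable;

    //Caches one PreparedStatement per SQL string for one Connection, so
    //repeated executions of the same SQL reuse the same prepared plan
    public class StatementCachingConnection {
        private final Connection conn;
        private final Hashtable statements = new Hashtable();

        public StatementCachingConnection(Connection conn) {
            this.conn = conn;
        }

        public PreparedStatement prepare(String sql) throws SQLException {
            PreparedStatement ps = (PreparedStatement) statements.get(sql);
            if (ps == null) {
                ps = conn.prepareStatement(sql);
                statements.put(sql, ps);
            }
            return ps;
        }
    }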
Finally, a couple of warnings. First, the JDBC API does not specify how PreparedStatements or Statements should be implemented. The performance differences described here between Statements and PreparedStatements are recommendations based on what I've seen, read about, and expect from documented features of various databases, but they are not guaranteed: always check the driver and database documentation, and test for the expected performance improvements. Second, the SQL you execute is a String. If you dynamically create the SQL Strings you execute by concatenating various items, you add all the String concatenation and conversion overhead discussed in Chapter 5. This overhead can be costly, especially for repeated or looped executions of SQL statements, so try to avoid it. The parameterized SQL statements supported by PreparedStatements help you avoid creating extra strings for repeated SQL statements.

16.2.9 Batching
Relational databases are designed to operate optimally with SQL, which is a set-processing language. Consequently, relational-database performance is optimized when you combine accesses of multiple rows of a table and combine multiple updates to a table. Furthermore, operating on one table at a time is normally faster than interleaving operations between multiple tables. This combining of operations is called batching. You can take advantage of batching in JDBC in two ways, on access and on update. These two types of operations are batched very differently, so we'll look at them separately. You may need to check how any particular driver supports batching. Some have restrictions, such as supporting update batching with only one type of statement (Oracle batches updates only with PreparedStatements) or one mode of ResultSet (DB2 doesn't batch when used with scrollable ResultSets).

16.2.9.1 Batching accesses by optimizing row fetching
JDBC includes parameters that support batching accesses from the database by specifying how many rows are fetched in one transfer: the amount of data (or, more specifically, the number of rows) read by one call to the database is configurable. Transferring data efficiently matters because network communication time affects performance more than any other factor. For example, if you intend to read 10 rows from a table and each row requires a separate round trip to the database, it takes 10 times longer to access the data than if the 10 rows are read in one network round trip. In practice, JDBC access is already batched, with a typical default number of rows between 10 and 100; for example, Oracle uses a default of 10 and DB2 a default of 32. If you access no more than the default number of rows, access is batched automatically, so for most applications adjusting the access batch size is fine-tuning rather than critical. If you retrieve only a few rows, the default fetch size may be too large, but there is not usually a large cost to an over-large fetch size, as long as the data volume of the default fetch does not cause the network to fragment the data packets; typical table rows are not large enough to cause fragmenting. Nevertheless, reducing the fetch size to be in line with the number of rows you actually read should improve performance slightly, by reducing overhead at both ends of the communication (in the JDBC layer and at the database-access layer). If you regularly access more than the default fetch size, you can reduce the number of network trips by increasing the fetch size. This can make a big difference, depending on how much data you retrieve. If you regularly get 33 rows and the fetch size is 32, you incur an extra network call for the 33rd row. If you get 10,000 rows, a 10-row fetch size requires 1,000 network trips; a 32-row fetch size reduces that to 313 network trips, about a third; and a fetch size of 512 requires just 20 network trips.
Depending on how the data is processed in the application, this change can alleviate a significant bottleneck. The tradeoff for increasing the fetch size is increased memory use: all fetched data has to be held in the JDBC layer on the client, and this can add up excessively if a large default batch size applies to every request. The fetched data is held in the ResultSet object generated by executing a query. If the fetch size is 10, accessing the first 10 records simply iterates internally through the ResultSet data held on the client; the 11th access causes a call to the database for another 10 records, and so on for each group of 10 records. You can set the fetch size in several ways, depending on how widely you want the change to apply: at the Connection, Statement, or ResultSet level. In all cases, however, you rely on the database driver supporting the capability to change the fetch size; the driver can ignore the change. To set the fetch size at the Connection level, use one of the Connection-creation methods that accept a Properties object:

    DriverManager.getConnection(String url, Properties props)
    Driver.connect(String url, Properties props)

The actual keyword that sets the fetch size in the Properties object is driver-dependent. For example, DB2 uses "block size", while Oracle uses "defaultRowPrefetch":

    Properties p = new java.util.Properties();
    p.put("block size", "512");           //DB2
    p.put("defaultRowPrefetch", "512");   //Oracle
    Connection c = DriverManager.getConnection("jdbc:dbname:path", p);

In addition, some drivers (for example, Oracle's) give Connection objects dedicated methods to set the fetch size:

    public static void setOracleFetchSize(Connection c, int size) {
        ((OracleConnection) c).setDefaultRowPrefetch(size);
    }

The fetch size can also be set at the Statement or ResultSet level, using their setFetchSize( ) methods:

    Statement.setFetchSize(int size)
    ResultSet.setFetchSize(int size)

These setFetchSize( ) methods also depend on the driver supporting changes to the fetch size: the API states that the driver can ignore the request.

16.2.9.2 Batching updates
Batching updates simply means sending multiple update statements in one transaction and one call to the database. JDBC supports this capability with the addBatch( ) and executeBatch( ) methods. The technique is simple, as illustrated here; remember to explicitly commit the batched transaction:

    boolean autocommit = connection.getAutoCommit();
    connection.setAutoCommit(false);
    Statement s = connection.createStatement();
    s.addBatch("INSERT INTO SOCCER VALUES('Pele    ', '        ', 'Brazil   ')");
    s.addBatch("INSERT INTO SOCCER VALUES('Charlton', 'Bobby   ', 'England  ')");
    s.addBatch("INSERT INTO SOCCER VALUES('Maradona', 'Diego   ', 'Argentina')");
    s.addBatch("INSERT INTO SOCCER VALUES('Cruyff  ', 'Johan   ', 'Holland  ')");
    int[] results = s.executeBatch();
    //Check the results: each statement should have inserted one row (1)
    //or succeeded with an unknown row count (-2, SUCCESS_NO_INFO)
    if ((results.length != 4) ||
        ((results[0] != 1) && (results[0] != -2)) ||
        ((results[1] != 1) && (results[1] != -2)) ||
        ((results[2] != 1) && (results[2] != -2)) ||
        ((results[3] != 1) && (results[3] != -2))) {
        throw new java.sql.BatchUpdateException("Something failed.", results);
    }
    connection.commit();
    connection.setAutoCommit(autocommit);

And similarly, using a PreparedStatement:
"); ps.setString(1, 'Pele '); ps.setString(2, ' '); ps.setString(3, 'Brazil '); ps.addBatch( ); ps.setString(1, 'Charlton'); ps.setString(2, 'Bobby '); ps.setString(3, 'England '); ps.addBatch( ); ... int[ ] results = ps.executeBatch( ); //Check the results if ((results.length != 4) || ( (results[0] != 1) && (results[0] != -2) ) || ( (results[1] != 1) && (results[1] != -2) )|| ( (results[2] != 1) && (results[2] != -2) )|| ( (results[3] != 1) && (results[3] != -2) ) ) { throw new java.sql.BatchUpdateException("Something failed."); } connection.commit( ); connection.setAutoCommit(autocommit); It is easiest to take explicit control when batching updates (i.e., to explicitly collect the data and statements, combine them into SQL calls, and call whichever execute method is most appropriate, normally executeBatch( )). executeBatch( ) is the only standard method for executing batches, but some drivers have additional proprietary methods that optimize batches in different ways, such as with the Oracle sendbatch( ) method. If this is not feasible because of the complexity of changing the existing implementation, you can use an underlying layer to batch the updates. One approach is to use RowSets, available with JDBC 3.0, which can be operated in a disconnected manner. Reading from the RowSet, you can collect the update statements and execute them together. A more transparent mechanism would use wrappers, constructed in a similar way to the wrappers covered in the first section of this chapter. The wrappers would collect statements and automatically execute them together according to an application-specific triggering mechanism. (For example, Oracle provides this mechanism in its driver as an alternative batching model.) 16.2.10 Data StructuresDatabases are designed to hold certain types of data more efficiently than others. The basic data types, such as numeric and string, normally have fixed sizes and often a fixed range of formats. These restrictions are mainly for performance; they enable the database to be efficient in access and update. Dealing with fixed-size data having a known format means that you don't need to search for terminators within the data, and knowing the format reduces the number of branch points necessary to handle many formats. Databases are optimized for these basic data types and formats. The closer you can come to holding data in your application in the same format required by the database, the more easily you can achieve good JDBC performance. However, this consideration usually compromises your application. Efficiently structured relational-database data is rarely compatible with good object-oriented design. There are other data-structure considerations. The data you currently use is frequently updated and accessed in the database. Other types of data, such as historic or archived data, are not accessed frequently by the database. However, if you use the same set of tables for both types of data, then the database has no choice but to pull both sets of data into its cache and search through both types of data. So separating tables of frequently and infrequently used data allows the database to optimize its handling of frequently used data. It also reduces the possibility that you will need to search through result sets for data items you need. 16.2.11 Minimizing Data-Conversion OverheadThe data in the database is stored in a different format than the data held in your Java application. Inevitably, this format discrepancy necessitates data conversion. 
A few techniques are available to minimize the conversion overhead. Use the type-correct get( ) method rather than getObject( ). The ResultSet object has a large number of get( ) methods, each of which typically accesses the underlying data item and converts it to the required Java data type or object type. The closer the underlying data is to the resulting Java type, the more efficient the conversion; a get( ) method that produces an Object typically costs more than one that converts to a primitive Java data type (e.g., int, float, etc.). Strings, as usual, need special consideration. Strings are usually held in relational databases as single-byte character data (ASCII) and get converted to two-byte Unicode Strings in Java. You can reduce the conversion cost by storing String and char data as two-byte Unicode characters in the database, but bear in mind that doing so makes the database grow, in the worst case doubling its size. Finally, if you use data types that are not primitive data types, use the most appropriate type available from the database. For example, almost every database has a date type; store your dates in that date type, not in the more generic varchar type, and avoid BLOB and CLOB types whenever possible. Note, though, that relying on database-specific data types, while good for performance, can compromise portability.

16.2.12 Metadata
Metadata is information about how data is structured in the database, such as the size of a column or its numeric precision. There are several performance considerations in dealing with metadata. First, metadata rarely changes. In fact, unless your application is specifically designed to deal with the possibility of metadata changing, you can assume that the metadata for a particular table, and metadata about database features, are constant for the lifetime of a particular JVM. Metadata can be expensive to obtain from the database compared with most other types of queries; thus, if you use metadata, obtain it once and cache it. Many DatabaseMetaData methods are quite slow, and executing them repeatedly creates a bottleneck. Methods to avoid calling repeatedly include getBestRowIdentifier( ), getColumns( ), getCrossReference( ), getExportedKeys( ), getImportedKeys( ), getPrimaryKeys( ), getTables( ), and getVersionColumns( ). Avoid using null arguments in metadata queries: a null argument is one fewer criterion to restrict the search, which makes the search more intensive. ResultSetMetaData is more efficient than DatabaseMetaData, so prefer the former. For example, if you want column information about a particular table, you are better off getting the ResultSetMetaData object from a query, even a dummy query, on the table, rather than using the more generic DatabaseMetaData.getColumns( ) method. Avoiding implicit metadata queries can be difficult. For example, Don Bales points out that an Oracle SELECT statement (not a prepared statement) makes two round trips to the database: the first for metadata and the second for data. He suggests using the OracleStatement.defineColumnType( ) method to predefine the SELECT statement, thus providing the JDBC driver with the column metadata it needs; the query can then avoid the first database trip.
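Here is a small sketch of the dummy-query technique just mentioned: a query that can return no rows still carries the column metadata for the table:

    //The WHERE clause guarantees an empty result, but the
    //ResultSetMetaData still describes the table's columns, avoiding
    //the much slower DatabaseMetaData.getColumns() call
    Statement s = connection.createStatement();
    ResultSet rs = s.executeQuery("SELECT * FROM foo WHERE 1 = 0");
    ResultSetMetaData md = rs.getMetaData();
    for (int i = 1; i <= md.getColumnCount(); i++)
        System.out.println(md.getColumnName(i) + " " + md.getColumnTypeName(i));
    rs.close();
    s.close();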
Apart from optimizing metadata queries themselves, metadata can help you optimize standard queries. For example, the DatabaseMetaData.getIndexInfo( ) method identifies which indexes exist for a table; with this information, you can shape your queries to use indexes where possible. Another useful example is the DatabaseMetaData.getBestRowIdentifier( ) method, which identifies the optimal set of columns to use in a WHERE clause for updating data. The columns returned by this query can include pseudocolumns, which are not available from the getColumns( ) method. Pseudocolumns can give the database a pointer to an internal database location, allowing it to optimize operations that include them.

16.2.13 Handling Exceptions
Catching and handling database warnings and exceptions is important. At the very least, record and analyze them for handling in future versions of your application, if you cannot handle them directly. JDBC exceptions often indicate incorrectly configured resources, which, if ignored, can cause significant inefficiencies. In addition, failing to handle JDBC exceptions correctly can leave resources in use but idle, creating a resource leak that inevitably degrades performance over time. Note that JDBC exceptions often enclose other underlying exceptions, so don't forget to chain through them.
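Chaining through the exceptions is mechanical; a minimal sketch:

    import java.sql.SQLException;
    import java.sql.SQLWarning;

    public class JdbcErrorLogger {
        //Walk the chain of SQLExceptions; each may carry further
        //exceptions from the same failed operation
        public static void logChain(SQLException e) {
            while (e != null) {
                System.err.println("SQLState: " + e.getSQLState()
                    + " Code: " + e.getErrorCode()
                    + " Message: " + e.getMessage());
                e = e.getNextException();
            }
        }

        //Warnings chain the same way, via getNextWarning()
        public static void logWarnings(SQLWarning w) {
            while (w != null) {
                System.err.println("Warning: " + w.getMessage());
                w = w.getNextWarning();
            }
        }
    }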
16.2.14 Stored Procedures
Stored procedures are defined in the database and can be executed from a remote call, much like SQL. Stored procedures have some execution overhead, but they are precompiled and executed entirely on the database server; they are thus similar to PreparedStatements, but without the preparation overhead, though with a procedure-call overhead of their own. Stored procedures are more efficient than PreparedStatements and Statements when the procedure is fairly complex, equivalent to many complex SQL statements. They are also relatively efficient when they eliminate intermediate round trips between the client and the database, for example when an operation would otherwise consist of several separate calls to the database but a stored procedure requires only one. On the other hand, the procedure-call overhead makes a stored procedure inefficient compared with any single simple SQL call, so replacing each SQL call in your JDBC application with an equivalent stored-procedure call, one for one, would probably make performance worse. Another consideration is the relative processing availability of the database server and the database client. In some cases the database server is underutilized compared with the database client, especially when the client is middleware such as an application server; here, stored procedures can move some processing to the database server, making better use of the available CPU power on both machines. Generally, a stored procedure can improve performance if it replaces a group of several SQL statements, a sequence of calls that would otherwise require multiple round trips between the client and the database, or client-side processing that is better done on the server. Don't use a stored procedure to replace simple SQL calls. Stored procedures are best used to avoid moving data back and forth across the network, and they are not an option if full database portability must be maintained, as they differ for each database. Stored procedures are executed from JDBC using CallableStatements. Because stored procedures are not standardized, use the syntax specific to your database. Here's a simple example of calling a stored procedure that takes a parameter:

    String storedProc = "{call doBlah(?)}";
    CallableStatement cstmt = connection.prepareCall(storedProc);
    cstmt.setString(1, "hello");
    if (false == cstmt.execute())
        throw new Exception("No ResultSet Returned");
    ResultSet rs = cstmt.getResultSet();
    ...
    rs.close();
    cstmt.close();

16.2.15 Transaction Optimization
The very best performance-tuning advice is to avoid doing what doesn't need to be done. Transactions are a lot of work for the database: it must maintain all sorts of resources to ensure that the ACID properties (Atomicity, Consistency, Isolation, and Durability) apply to each transaction, irrespective of any simultaneous operations and of whether the transaction ends with a commit or a rollback. If you can avoid forcing the database to do some of that work, performance improves.

16.2.15.1 Auto-commit mode
The first way to minimize transaction costs is to combine multiple operations into one transaction. By default, a JDBC connection is in auto-commit mode, which means that every operation sent to the database is automatically its own transaction, as if every Statement.execute( ) method were preceded by a BEGIN TRANSACTION command and followed by a COMMIT command. Turning off auto-commit and making your transactions explicit requires more work, but pays off if you can combine transactions (see "Batching" earlier in this chapter), especially as the system scales. On heavily loaded systems, transaction overhead can be significant, and the lower that overhead is, the more the system can scale. Turn off auto-commit using

    Connection.setAutoCommit(false);

The Connection.getAutoCommit( ) method reports the current auto-commit mode. With auto-commit off, you need the two transaction-completion methods, Connection.commit( ) and Connection.rollback( ). Taking manual control of transactions doesn't mean you should extend transactions for long periods to catch lots of operations in one transaction. Transactions should be as short as possible: leaving a transaction open keeps locks on rows, which affects other transactions and reduces scalability. If you have several operations that can be executed sequentially, combine them into one transaction, but otherwise keep transactions short. Combining operations into one transaction may require extra conditional logic in the SQL statements, and possibly even temporary tables, but it is still more efficient, because the database can obtain all the required locks and release them in one step. Multiple transactions mean more communication overhead, more lock-and-release phases, and a greater likelihood of conflict with other sessions.
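A common pattern, sketched below with an invented account table, is to disable auto-commit, group the operations, and guarantee that the transaction is closed one way or the other:

    boolean oldAutoCommit = connection.getAutoCommit();
    connection.setAutoCommit(false);
    try {
        //Several operations combined into one transaction
        Statement s = connection.createStatement();
        s.executeUpdate("UPDATE account SET balance = balance - 10 WHERE id = 1");
        s.executeUpdate("UPDATE account SET balance = balance + 10 WHERE id = 2");
        s.close();
        connection.commit();                      //Make the changes permanent
    } catch (SQLException e) {
        connection.rollback();                    //Undo everything on failure
        throw e;
    } finally {
        connection.setAutoCommit(oldAutoCommit);  //Restore the previous mode
    }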
16.2.15.2 Transaction levels
Transactions are defined as all-or-nothing operations: everything in a transaction occurs as if no other operations were happening in the database during the transaction. That is roughly what the ACID properties mean and, as already noted, it gives the database a large amount of work. Most databases therefore provide other transaction levels. These are not real transactions, as they don't fully satisfy the ACID properties, but they are useful for many types of operations because they provide better performance with acceptable transactional behavior. Transaction levels can be set using the Connection.setTransactionIsolation( ) method, though not all levels defined in the JDBC Connection interface are supported by all databases. The levels defined in JDBC, in increasing order of isolation, are TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ, and TRANSACTION_SERIALIZABLE.
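For example, here is a hedged sketch of lowering the isolation level on a connection that reads only static reference tables; since support for any given level is database-dependent, check it first:

    DatabaseMetaData md = connection.getMetaData();
    if (md.supportsTransactionIsolationLevel(
            Connection.TRANSACTION_READ_UNCOMMITTED)) {
        //Safe only because these tables are never updated concurrently
        connection.setTransactionIsolation(
            Connection.TRANSACTION_READ_UNCOMMITTED);
    }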
One transaction level is rarely ideal throughout your application. Look at the read queries in the application, and consider for each whether the standard isolation problems (dirty reads, nonrepeatable reads, and phantom reads) adversely affect the query, given the data, the data-update pattern, and the query itself. Reads of static tables, or of tables that are updated only by the same transaction that reads them, can safely use the lowest transaction level. Basically, in transactions where no concurrent update is possible, you can use TRANSACTION_NONE or TRANSACTION_READ_UNCOMMITTED safely and efficiently. TRANSACTION_SERIALIZABLE is the truly safe option and should be used unless you determine that another level is acceptable for a particular transaction. Many drivers are delivered with a default transaction level other than TRANSACTION_SERIALIZABLE because they are faster that way, which means that a simplistic evaluation shows the driver (and database) at its best.

16.2.15.3 User-controlled transactions
Transactions controlled by the user are another significant JDBC bottleneck. Any transaction that requires a user action (such as clicking OK or Cancel) to close the transaction, i.e., to commit or abort it, is a problem waiting to happen. Inevitably, one or more users forget to terminate the activity or are forced to leave it unfinished. Resources are left open in the application and database, and any concurrent activity can easily conflict with the locked resources, decreasing performance throughout the system. Only single-user applications, or applications that are effectively single-user because users share none of the same resources, are immune to this problem. The main alternative to leaving a user in control of a JDBC transaction is to use optimistic transactions. Optimistic transactions collect the information for an update outside of a JDBC transaction and then use a mechanism to check that the update doesn't conflict with any other update processed in the meantime. Mechanisms for detecting optimistic conflicts include timestamps, change counters, and checking for differences from an expected state. For example, when the application gathers user-input data for the update, the data can be sent to the database as a batched set of SQL statements that includes timestamped safeguards ensuring that the data in the database is still the same data the client originally read. A successful transaction updates the rows (including the timestamps) to indicate newly changed data. If another user has made an intervening update that invalidates this user's changes, the timestamps will have changed, and the current transaction must be rolled back instead of committed. Usually no intermediate conflicting transaction has occurred, so most transactions succeed; when a transaction fails, the application displays the entered data along with the changes that caused the conflict, and the user alters and resubmits the data. Optimistic transactions are unsuitable for applications whose transactions frequently write data concurrently, since that pattern of activity produces frequently failing optimistic transactions and increases overall resource usage.
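A minimal sketch of the timestamp safeguard (the order_item table and its columns are invented): the UPDATE succeeds only if the row is unchanged since it was read, and the update count reveals any conflict:

    PreparedStatement ps = connection.prepareStatement(
        "UPDATE order_item SET quantity = ?, last_modified = ? " +
        "WHERE id = ? AND last_modified = ?");
    ps.setInt(1, newQuantity);
    ps.setTimestamp(2, new java.sql.Timestamp(System.currentTimeMillis()));
    ps.setInt(3, itemId);
    ps.setTimestamp(4, timestampWhenRead);  //Value read with the original data
    int updated = ps.executeUpdate();
    ps.close();
    if (updated == 0) {
        //Another user changed the row in the meantime: roll back, show
        //the user the conflicting changes, and let them resubmit
    }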
16.2.15.4 Savepoints and distributed transactions
JDBC 3.0 introduced savepoints, which let you mark a point within a transaction and roll back to that point instead of rolling back the entire transaction. This feature looks very nice, and I would be surprised if designers using JDBC aren't champing at the bit to use it. Unfortunately, savepoints carry significant overhead. I have not yet been able to test how savepoints affect performance, but I expect that they should be used sparingly or avoided altogether in performance-critical sections of code. If you do use them, be sure to release their resources with the Connection.releaseSavepoint(Savepoint) method. Some products support distributed transactions across multiple connections; the javax.sql package includes the XADataSource and XAConnection interfaces to support them. However, distributed transactions are significantly slower than normal transactions because of all the extra communication required to coordinate the connections. Avoid distributed transactions unless they are absolutely required.

16.2.16 Database Location
JDBC is a distributed communications protocol: the database can run on a different machine from the process initiating the SQL call, the JDBC client. This gives you a tuning option of where to run the database. Running the database on the same machine as the JDBC client means the JDBC call avoids the overhead of a network call; on the other hand, it also means all the processes run on the same machine, increasing that machine's load. The disk I/O impact is even more important: if the client process needs to access the same disks, or go through the same disk controller, as the database, I/O can be compromised drastically. Applications that are not too large or complex, that have simple database access and update requirements, and that do not use up most of the machine's processing capacity and memory are probably better off running the database on the same machine, but make sure the database has a dedicated disk controller with dedicated disks, so that database disk I/O does not contend with the application's I/O. Applications that are large or complex, such as application servers and other multiuser server systems, are usually better off running the database on a separate machine. Beware: testing both configurations at lower scales may show that collocating the database with the application performs better, but you can be sure that when the system is scaled up, the resource requirements of the application and the database will conflict severely, reducing system scalability.