11.6 The SQL Inside Your PL/SQL
Now that we've explored calling PL/SQL from SQL, let's turn the
tables and explore the use of SQL inside your PL/SQL code. SQL is
great at manipulating large sets of
data, but there are situations where you need to work with data at
the row level. PL/SQL, with its looping and cursor control
capabilities, allows the flexibility to work at the set level using
SQL or at the row level using cursors. However, many PL/SQL
programmers forgo the power of SQL and do everything at the row
level, even when it is unnecessary and time-consuming to do so.
As an analogy, imagine that you are working at a warehouse, and a
large shipment of parts arrives on the loading dock. Your job is to
separate the shipment by part type and distribute the pieces to
different areas of the warehouse. To make your job easier, the
warehouse owner has procured the best forklift money can buy. There
are two possible strategies to employ:
1. Pick up one box at a time, determine its type, and drive it to
   the appropriate destination.
2. Spend some time analyzing the situation, determine that every box
   on a pallet is of the same type, and drive entire pallets to the
   appropriate destination.
While this analogy might be overly simplistic, it does serve to
illustrate the difference between set operations and row operations.
Allowing the Oracle server to manipulate large sets in a single
operation can often yield a performance improvement of several orders
of magnitude over manipulating one row at a time, especially on
systems with multiple CPUs.
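To see the difference in code, consider a simple status change
applied to every open order. The following sketch (the orders table
and its columns are assumptions for illustration) shows the same work
done one row at a time and as a single set operation:

    -- Row-at-a-time: fetch each row, examine it, and update it
    -- individually from within a PL/SQL loop
    BEGIN
       FOR r IN (SELECT order_id, status FROM orders) LOOP
          IF r.status = 'OPEN' THEN
             UPDATE orders
                SET status = 'SHIPPED'
              WHERE order_id = r.order_id;
          END IF;
       END LOOP;
    END;
    /

    -- Set-based: one statement hands the entire job to the server
    UPDATE orders
       SET status = 'SHIPPED'
     WHERE status = 'OPEN';

The single UPDATE gives the server the whole workload at once, so it
can optimize (and, on a multi-CPU system, parallelize) the operation.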
When a procedural language is used for database access (whether it is
PL/SQL, C with OCI calls, or Java using JDBC), there is a tendency to
employ strategy #1. Perhaps programmers are accustomed to coding at a
low level of granularity when using a procedural language, and this
habit spills over into their data access logic. This situation is
especially prevalent in systems that need to process and load large
amounts of data from external files, such as data warehouse load
utilities.
Imagine that you are charged with building an infrastructure to
accept files from multiple OLTP systems, perform various data
cleaning operations, and aggregate the data into a data warehouse.
Using PL/SQL (or C, Java, C++, COBOL, etc.), you could build
functionality that (see the sketch after this list):
1. Opens a given file.
2. Reads a line, verifies/cleans the data, and updates the
   appropriate row of the appropriate fact table in the data
   warehouse.
3. Repeats step 2 until the file is exhausted.
4. Closes the file.
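A minimal sketch of this row-at-a-time approach appears below, using
the UTL_FILE package; the LOAD_DIR directory object, the file name,
and the parse_and_update routine are hypothetical stand-ins:

    DECLARE
       v_file  UTL_FILE.FILE_TYPE;
       v_line  VARCHAR2(32767);
    BEGIN
       v_file := UTL_FILE.FOPEN('LOAD_DIR', 'feed.dat', 'R');  -- step 1
       LOOP
          BEGIN
             UTL_FILE.GET_LINE(v_file, v_line);                -- step 2
          EXCEPTION
             WHEN NO_DATA_FOUND THEN
                EXIT;                             -- step 3: file exhausted
          END;
          -- step 2 continued: verify/clean the line and update the
          -- appropriate fact row (hypothetical routine)
          parse_and_update(v_line);
       END LOOP;
       UTL_FILE.FCLOSE(v_file);                                -- step 4
    END;
    /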
While this approach might work for small files, it is not uncommon
for large warehouses to receive feeds containing hundreds of
thousands or millions of items. Even if your code is extremely
efficient, processing a million-line file would take several hours.
Here's an alternate strategy that employs the power of the Oracle
server to make quick work of large data feeds (steps 4 and 5 are
sketched in the examples that follow):
1. Create a staging table for each unique data feed file format.
2. At the start of the load process, truncate the staging tables.
3. Use SQL*Loader with the direct path option to quickly load the
   data file into the appropriate staging table.
4. Update all rows of the staging table to clean, verify, and
   transform the data, marking rows as invalid if they fail
   verification. Perform the operation in parallel if possible.
5. Update the appropriate fact table using a subquery against the
   staging table. Again, perform the operation in parallel if
   possible.
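Step 4, for example, might boil down to a single set-based UPDATE
along these lines (the sales_staging table, its columns, and the
degree of parallelism are assumptions for illustration; parallel DML
must also be enabled for the session):

    ALTER SESSION ENABLE PARALLEL DML;

    -- Step 4: clean and verify every staging row in one operation
    UPDATE /*+ PARALLEL(stg, 4) */ sales_staging stg
       SET stg.valid_flag = CASE
                               WHEN stg.sale_amount >= 0
                                    AND stg.sale_date IS NOT NULL
                               THEN 'Y'
                               ELSE 'N'
                            END;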
In order for this strategy to succeed, you need adequate disk space
and sufficiently large rollback and temporary tablespaces. With those
resources in place and properly constructed SQL statements, however,
this strategy can yield a 10X improvement over the row-at-a-time
approach.
So what role should PL/SQL play in such a scenario? In this case,
PL/SQL would be an excellent vehicle for executing steps 4 and 5 of
the previous list. Although the stored procedures might contain only
a single UPDATE statement, the SQL is likely to be complex and may
contain optimizer hints and other advanced features. It is therefore
advisable to isolate the SQL from the rest of the application so that
it can be independently monitored and tuned, as in the sketch below.
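For instance, step 5 might live in a procedure of its own. The
following sketch (reusing the hypothetical staging and fact tables
from the earlier example) keeps the hint-laden UPDATE in one easily
monitored place:

    CREATE OR REPLACE PROCEDURE update_sales_fact IS
    BEGIN
       -- Step 5: roll verified staging rows into the fact table
       -- using a subquery against the staging table
       UPDATE /*+ PARALLEL(f, 4) */ sales_fact f
          SET f.sale_amount = f.sale_amount +
                 (SELECT SUM(stg.sale_amount)
                    FROM sales_staging stg
                   WHERE stg.valid_flag = 'Y'
                     AND stg.product_id = f.product_id
                     AND stg.sale_date  = f.sale_date)
        WHERE EXISTS (SELECT 1
                        FROM sales_staging stg
                       WHERE stg.valid_flag = 'Y'
                         AND stg.product_id = f.product_id
                         AND stg.sale_date  = f.sale_date);
    END update_sales_fact;
    /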
In general, when dealing with complex logic involving large data
sets, it is advantageous to think in terms of data sets rather than
programming steps. In other words, ask yourself where your data is,
where it needs to go, and what must happen to it along the way,
rather than thinking about what must happen to each individual piece
of data to satisfy the business requirements. If you
follow this strategy, you will find yourself writing substantial,
efficient SQL statements that employ PL/SQL where appropriate, rather
than writing complex PL/SQL routines that employ SQL when needed. In
doing so, you will be providing the server with the opportunity to
split large workloads into multiple pieces that run in parallel,
which can greatly improve performance.