Data-Munging Example: An Inter-Database Transfer

Data-Munging Example: An Inter-Database Transfer

People often think of SQL*Loader as being the answer to all their data loading needs. But the reality is that running SQL*Loader might be the last step in a data load, not the only step. You might need to perform a number of additional steps to get data in a state fit for use by SQL*Loader. And sometimes you might not need SQL*Loader at all; often, Perl DBI works equally well as the last process stage for finally inserting data into a database. This is particularly true if you've used Perl exclusively to get to that last stage. Why add another process to manage, even one as good as SQL*Loader? Let's keep that data load as simple as possible.

We'll discuss the respective roles of Perl DBI and SQL*Loader later on in more detail. For now though, we'll introduce data munging conceptually by providing a very basic source-to-sink example in a single munge operation. This example emphasizes Perl DBI's ability to munge data across from one database type to another within a single Perl script. In this example, MySQL is the source and Oracle is the sink. We're going to munge the data from one datatype (MySQL) into another (Oracle), plus do a little date format munging on the side.

The MySQL Source

We'll assume in this example that the data you are loading into an Oracle database comes from a MySQL database. You can find out more about MySQL at the following sites:

http://www.mysql.com

http://sourceforge.net/projects/mysql

You might also like to check out Jochen Wiedmann's DBD::mysql driver; this driver is the interface that allows Perl programs to connect to MySQL databases via Perl DBI:

http://www.cpan.org/authors/id/JWIED

Assuming that a MySQL test database has already been created, let's go ahead and create the source data and prepare to transfer it to our Oracle database, orcl. These are the steps we followed:

We entered MySQL, and switched to the test database:

$ mysql --user=irish --password=lion
...
mysql> use test;
Database changed

We then created a new clone EMP table, emp_store:

mysql> create table emp_store (
    ->   empno numeric(4) not null,
    ->   ename varchar(10),
    ->   job varchar(9),
    ->   mgr numeric(4),
    ->   hiredate date,
    ->   sal numeric(7,2),
    ->   comm numeric(7,2),
    ->   deptno numeric(2) );
Query OK, 0 rows affected (0.02 sec)
  
mysql> describe emp_store;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| empno    | decimal(4,0) |      |     | 0       |       |
| ename    | varchar(10)  | YES  |     | NULL    |       |
| job      | varchar(9)   | YES  |     | NULL    |       |
| mgr      | decimal(4,0) | YES  |     | NULL    |       |
| hiredate | date         | YES  |     | NULL    |       |
| sal      | decimal(7,2) | YES  |     | NULL    |       |
| comm     | decimal(7,2) | YES  |     | NULL    |       |
| deptno   | decimal(2,0) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+
8 rows in set (0.00 sec)

Three test rows were inserted into our MySQL table, using the MySQL default date format of YYYY-MM-DD. (This is going to be the extra thing we'll have to munge, later, to smoothly transfer data from one database type to another.):

mysql> insert into emp_store
    -> values (1001, 'Groucho', 'Professor', 1, 
    ->         '2001-01-01', 100, 10, 10);
Query OK, 1 row affected (0.00 sec)
  
mysql> insert into emp_store
    -> values (1002, 'Chico', 'Minister', 2, 
    ->         '2001-01-02', 200, 20, 20);
Query OK, 1 row affected (0.00 sec)
  
mysql> insert into emp_store
    -> values (1003, 'Harpo', 'Stowaway', 3, 
->             '2001-01-03', 300, 30, 30);
Query OK, 1 row affected (0.00 sec)
  
mysql> select * from emp_store;
+------+---------+----------+-----+-----------+-------+------+--------+
| empno| ename   | job      | mgr | hiredate  | sal   | comm | deptno |
+------+---------+----------+-----+-----------+-------+------+--------+
|  1001| Groucho | Professor|   1 | 2001-01-01| 100.00| 10.00|     10 |
|  1002| Chico   | Minister |   2 | 2001-01-02| 200.00| 20.00|     20 |
|  1003| Harpo   | Stowaway |   3 | 2001-01-03| 300.00| 30.00|     30 |
+------+---------+----------+-----+-----------+-------+------+--------+
3 rows in set (0.00 sec)

Finally, we quit out of MySQL:
```
mysql> quit
Bye
```

The Oracle Sink

We'd like to transfer these three rows across to the EMP table under Oracle's orcl database. We'll do this via the munge script in Example D-1.

Inter-database transfers into Oracle — mySQLtoOracle.pl

#!perl -w
  
use strict;
use DBI;
  
# Step 1: Establish a MySQL source database handle, and
# an Oracle sink database handle.  Notice we can connect to two
# different databases, and database types, at the same time,
# in one Perl script.  Code Simplicities 'R' Us! :-)
  
my $mysql_dbh = DBI->connect('DBI:mysql:database=test;host=localhost', 
                             'irish', 'lion')
                   or die "Couldn't connect to MySQL database: " . 
                          DBI->errstr;
  
my $oracle_dbh = DBI->connect('DBI:Oracle:orcl', 'scott', 'tiger',
                              { RaiseError=>1, AutoCommit=>0 } );
  
# Step 2: Prepare and execute the selection statement taking
# data from our MySQL source.  Bind the columns, for efficiency.
  
my $select_sql = qq { SELECT empno,    ename,
                             job,      mgr,
                             hiredate, sal,
                             comm,     deptno
                        FROM emp_store };
  
my $mysql_sth = $mysql_dbh->prepare( $select_sql )
                   or die "Couldn't prepare selection statement: " . 
                      $mysql_dbh->errstr;
$mysql_sth->execute;
  
# Create the munge bind variables
my ($empno, $ename, $job, $mgr, $hiredate, $sal, $comm, $deptno );
$mysql_sth->bind_columns( \$empno,    \$ename,
                          \$job,      \$mgr,
                          \$hiredate, \$sal,
                          \$comm,     \$deptno );
# Step 3: Prepare our Oracle insert statement.
  
my $insert_sql = 
   qq{ INSERT 
         INTO emp ( empno,    ename,
                    job,      mgr,
                    hiredate, sal,
                    comm,     deptno )
           VALUES ( ?, ?,
                    ?, ?,
                    to_date( ? , 'YYYY-MM-DD'),  ?,
                    ?, ?) };
  
my $oracle_sth = $oracle_dbh->prepare( $insert_sql );
  
# Step 4: Select from MySQL and fill bound array, before populating
# Oracle EMP table.
  
while ($mysql_sth->fetch) {
  
  $oracle_sth->bind_param(1, $empno);
  $oracle_sth->bind_param(2, $ename);
  $oracle_sth->bind_param(3, $job);
  $oracle_sth->bind_param(4, $mgr);
  $oracle_sth->bind_param(5, $hiredate);
  $oracle_sth->bind_param(6, $sal);
  $oracle_sth->bind_param(7, $comm);
  $oracle_sth->bind_param(8, $deptno);
  
  # Insert!
  $oracle_sth->execute;
}
  
# Step 5: Clean up, commit the transaction, and finish.
  
$oracle_dbh->commit(  );
  
$mysql_dbh->disconnect(  );
$oracle_dbh->disconnect(  );

Let's see what's going on in this code:

We create our two database handles, one to draw data from the MySQL source and the other to pour the munged data into the Oracle sink.
We prepare the main selection statement to draw information from the source. This will fill our first known data structure.
Next, we prepare the matching Oracle INSERT statement, using our second data structure, which will push the data into the sink. Notice the to_date( ) function for munging the hiredate column. As we're mixing Perl and Oracle, we're unconcerned as to who does the munging, as long as the job gets done.
Note that there are several other ways we could have performed this date column munge operation in Perl. For instance, the following code could have been adapted to produce an "Oracle-friendly" date string that could be inserted directly into the database:
```
@date_array = reverse split /-/, '2001-01-02';
$date_array[1] = 
   ('JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
    'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC')[$date_array[1]--1];
$oracle_insert_date = join '-', @date_array;
print 'oracle_insert_date: >', $oracle_insert_date, "\n";
```
This code snippet would produce:
```
Oracle_insert_date: >02-JAN-2001<
```
However, we're not zealots. The munge problem Perl is helping us overcome here is the transformation of MySQL data into Oracle data. Because it's easier to let the Oracle database engine do the extra date column munge work in this particular case, that's the route we'll choose here. (Note that the join and split functions are mentioned in Appendix C; for more on these functions, try perldoc -f join and perldoc -f split. You can also try perldoc -f reverse for an explanation of this other built-in Perl function.)
Once everything's set, we begin the munge. As each row is drawn from the MySQL source, we pump it straight down into the Oracle sink, using the $oracle_sth->execute statement and the to_date( ) data transformation.
When the task is finished, we clean up and shut down the munge.

Running the script itself is straightforward:

$ perl mySQLtoOracle.pl

We can then check the orcl database. Notice that our earlier to_date( ) operation has given us the dates in the more usual Oracle-style DD-MON-YY format:

$ sqlplus scott/tiger@orcl
  
ORCL> select * from emp where empno < 2000;
  
    EMPNO ENAME      JOB       MGR HIREDATE     SAL   COMM    DEPTNO
--------- ---------- --------- --- --------- ------ ------ ---------
     1001 Groucho    Professor   1 01-JAN-01    100     10        10
     1002 Chico      Minister    2 02-JAN-01    200     20        20
     1003 Harpo      Stowaway    3 03-JAN-01    300     30        30
  
3 rows selected.
  
ORCL>

That concludes our simple example -- but wait a minute! There seems to be precious little in the way of actual data transformation going on except for the date munge. The main transformation here was from MySQL data to Oracle data, and the fact is that this transformation is extraordinarily simple to do in Perl. Nevertheless, the reality is that very few other languages could have managed this transformation so trivially, in so few lines of code.

We can add onto this simple example by layering on additional data-munging operations, depending on specific processing requirements. For instance, we could pull information from other databases to get hold of department descriptions, drag in other personnel information from remote HR databases, aggregate the salaries, substitute some of the data to match agreed-upon business rules, and so on. And all of this is easily done in Perl. For many more data-munging examples, refer to the more detailed sources mentioned at the beginning of this chapter.

The use of Perl for data munging gives us something else in addition to the excellent resources of Perl DBI. We also get the ability to use the 200-plus built-in operators, such as split, join, and reverse, binary-capable functions such as read, regular expressions (covered in Appendix C), and the 2000-plus object-oriented Perl modules available from www.cpan.org (or www.activestate.com). We can make use of all of these resources, in conjunction with Perl DBI, to carry out a wide range of the most difficult data-munging operations. In the rest of this appendix, we'll summarize what we consider to be the best of these 2000-plus data-munging modules. Whether you're regularly filling data warehouses with difficult-to-extrapolate aggregated data, managing the ever-increasing complexity of XML information transfer, or just moving small pieces of fiddly DBA data from one place to another, Perl is a comprehensive one-stop shop full of data-munging functionality.