7.1 Considerations When Importing Data

Most database systems available today have their own proprietary import mechanisms. Some of these, such as Microsoft's Data Transformation Services (DTS) or the Bulk Copy Program (BCP), are implemented as external programs in which you specify the format (usually ASCII) of the files to be imported and the table into which the data should be inserted. Though DTS is more user-friendly because of its graphical user interface, its performance is generally not as good as that of BCP. Another popular way to import data is to use the BULK INSERT statement, an extension of the SQL dialect that is accessible from the console or other interactive interfaces. Using BULK INSERT, you specify the file to load from and the table in the database that you want to receive the data. The BULK INSERT statement is useful for quick imports of ASCII files.

There are two major problem areas that you will encounter when importing data: error handling and performance. You can request that a load be aborted if an error occurs during the load process, or you can allow a certain number of errors to occur before the import operation is aborted. By default, SQL Server allows 10 errors before aborting a load. Often, though, you will want more flexibility in how the load process treats bad records. Rather than just having bad records rejected, you may want them to be loaded anyway and flagged as errors. That way, you can more easily go in later and correct the problems.

With respect to performance, SQL tables are usually associated with a number of indices, foreign-key references, and other constraints that are designed to improve access efficiency and to preserve data integrity. However, when rows are loaded one by one, updating indices and checking constraints for each row can place quite a burden on the server. As a query language, SQL is optimized more for querying than for inserting. If you can manage it, and you are loading a large amount of data, it's often beneficial to drop indices and constraints prior to the load and recreate them afterwards. Index maintenance and constraint validation are thereby batched into one large operation instead of many small ones.

7.1.1 Bulk Importing

When you insert rows one by one, the normal operation of your database server is to update the indices on the table after each insert. Bulk importing, through the use of the BULK INSERT statement or the BCP utility, inserts rows of data much faster than conventional methods, because indexing is temporarily suspended. You can further improve the performance of an import operation by using the TABLOCK hint to request table locks. The server then locks the table exclusively, inserts all the new rows in one step, and rebuilds the table's indices at the end of the load. The performance improvement from doing this can be significant, and the result is usually a very fast load.
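As a rough illustration of these points, the following sketch loads an ASCII file with BULK INSERT, allowing up to 50 bad records and requesting an exclusive table lock, and drops and recreates a nonclustered index around the load. The table, index, and file names are hypothetical, and the exact options available depend on your version of SQL Server:

   -- Drop the index before the load so that index maintenance is not
   -- performed row by row (index and table names are hypothetical)
   DROP INDEX YearlyBalances.ix_YearlyBalances_Year

   -- Load the ASCII file, tolerating up to 50 bad records and
   -- holding a table lock for the duration of the load
   BULK INSERT YearlyBalances
   FROM 'c:\import\balances.txt'
   WITH (
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      MAXERRORS = 50,
      TABLOCK
   )

   -- Recreate the index in one operation after the load completes
   CREATE INDEX ix_YearlyBalances_Year ON YearlyBalances(Year)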
In general, bulk importing is useful in an environment where the database does not need to be available 100% of the time. In such cases, we recommend that you use table locking to improve performance. To do a bulk import with the table locked, you need a time window during which you can be sure that no users will need to access the table or tables that you are loading. In a 24 x 7 environment, where you can't afford the downtime or risk other inconveniences from bulk inserts, you have to be innovative and use more sophisticated techniques. Alternatively, you can settle for the lower performance provided by tools that perform row-level importing.

7.1.2 Normalized Data

Data normalization is an important subject with respect to relational databases. Normalization refers to the technique of finding functional dependencies between attributes and of resolving those dependencies so that data is stored in an efficient and useful manner. Efficient and useful for the database, that is. Normalized data is seldom efficient from a human viewpoint. SQL, however, has been specifically designed to work best with normalized data. Normalized tables have no repeating groups of data, nor do they have attributes that are dependent on one another. Redundancy and other anomalies in data are avoided. For example, look at the following comparative balance-sheet report covering the three years from 2000 to 2002:

   CompanyId   Company    Category      2000    2001     2002
   ----------- ---------- ----------- ------- ------- --------
   12445       ABC, Inc   Assets       36,755  37,472   38,973
   12445       ABC, Inc   Liability    22,252  22,924   24,476
   12445       ABC, Inc   Equity       14,230  14,548   14,497

The data in this report is presented in a non-normalized form. The CompanyId and Company columns are dependent on each other (each company ID number represents a specific company name), and there are three balance values in each row. Although this report is easy for a human to read, if you stored the data this way in your database, adding an additional year would be an expensive and quite complicated task. You would need to add and initialize an additional column for each year that you were adding, and that task would require each and every row to be read, expanded, and written back again. Not only does that present a problem with respect to I/O, but expanding a row to a larger size can lead to storage inefficiencies. Storing data, such as a company name, more than once is a waste of space.

7.1.2.1 Eliminate repeating columns

Restructuring the table into a more computer-friendly form makes processing much easier for SQL Server. The following table can store the data shown in the previous report, but in a more efficient manner:

   CREATE TABLE BalanceSheets(
      CompanyId INTEGER,
      Company CHAR(200),
      Category CHAR(20),
      Year INTEGER,
      Dollars DECIMAL(12,2)
   )

This table design results in one row for each company, category, and year combination. Adding more years, or more categories, is simply a matter of inserting new rows; existing rows do not need to be touched. The data for the report, as stored in the BalanceSheets table, looks like this:

   CompanyId   Company   Category   Year   Dollars
   ----------- --------- ---------- ------ ---------
   12445       ABC, Inc  Assets     2000   36,755
   12445       ABC, Inc  Liability  2000   22,252
   12445       ABC, Inc  Equity     2000   14,230
   12445       ABC, Inc  Assets     2001   37,472
   12445       ABC, Inc  Liability  2001   22,924
   12445       ABC, Inc  Equity     2001   14,548
   12445       ABC, Inc  Assets     2002   38,973
   12445       ABC, Inc  Liability  2002   24,476
   12445       ABC, Inc  Equity     2002   14,497
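For instance, adding figures for 2003 is just a matter of inserting three more rows, along the lines of the following sketch (the 2003 dollar amounts shown are hypothetical):

   -- Add balance-sheet data for a new year by inserting new rows;
   -- no existing rows or columns need to change (values are made up)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Assets', 2003, 40102.00)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Liability', 2003, 25033.00)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Equity', 2003, 15069.00)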
Good as it is, this table is still not normalized. The Company column is dependent on the CompanyId column (or vice versa, if you prefer). For any given company ID, the associated company name is always the same. You can gain efficiencies in both storage space and manageability by further normalizing this table design to factor out the company name.

7.1.2.2 Factor out dependent columns

The last step in normalizing the balance-sheet data is to factor out the company name so that it is not repeated in each row of data. You can gain some further storage efficiencies by representing the category as a code rather than as a text string. The final design for the table holding the balance-sheet data looks like this:

   CREATE TABLE BalanceSheets2 (
      CompanyId INTEGER,
      CategoryId CHAR(1),
      Year INTEGER,
      Dollars DECIMAL(12,2)
   )

The BalanceSheets2 table shown here can be referred to as a detail table, because it contains the detailed balance-sheet data. As you can see, some supporting master tables are needed to resolve company and category ID numbers into human-readable text strings. Those tables can be defined as follows:

   CREATE TABLE CompanyMaster(
      CompanyId INTEGER,
      CompanyName CHAR(200),
      PRIMARY KEY(CompanyId)
   )

   CREATE TABLE CategoryMaster(
      CategoryId CHAR(1),
      CategoryName CHAR(20),
      PRIMARY KEY(CategoryId)
   )

The data in the detail table looks like this:

   CompanyId   CategoryId  Year   Dollars
   ----------- ----------- ------ ---------
   12445       A           2000   36,755
   12445       L           2000   22,252
   12445       E           2000   14,230
   12445       A           2001   37,472
   12445       L           2001   22,924
   12445       E           2001   14,548
   12445       A           2002   38,973
   12445       L           2002   24,476
   12445       E           2002   14,497

The data is now normalized. There are no more functional dependencies or repeating groups of data. Removing any additional data columns would result in a loss of information from the system. Note that, while it is normalized, the readability of the data is reduced. A simple query of the BalanceSheets2 detail table produces results that are not nearly as readable as those shown earlier on the original balance-sheet report. However, the normalized data can easily be manipulated using SQL, and simple joins can be used to associate company and category names with their respective ID and code values. For example:

   SELECT b.CompanyId, o.CompanyName, a.CategoryName, b.Year, b.Dollars
   FROM BalanceSheets2 b
   JOIN CompanyMaster o ON b.CompanyId=o.CompanyId
   JOIN CategoryMaster a ON b.CategoryId=a.CategoryId

This query returns one row for each company, category, and year combination. Using the Pivot table technique shown later in this chapter, you can write a SQL query that will return the data in the same form used in the report that you saw at the beginning of this section.

7.1.2.3 Reduced memory requirements

A normalized data structure is not only efficient for the database engine, it also saves memory and disk space. The first table, named BalanceSheets, uses approximately 240 bytes per row (CompanyId 4, Company 200, Category 20, Year 4, Dollars 12). The second table, named BalanceSheets2, is normalized and requires only 21 bytes per row (CompanyId 4, CategoryId 1, Year 4, Dollars 12). In our case, the 21 bytes required for each normalized row is only about 9% of the space required by the non-normalized data. Think of the impact of normalization on memory pages[1] and the resulting I/O needed to query the table. Assume that your server stores 8KB of data per memory page (the default for SQL Server). In the case of the non-normalized table, each page can store about 33 records (leaving some space for overhead).
If each balance sheet requires 3 records (for assets, liabilities, and equities), then each page will store 11 yearly balance sheets. When the data is normalized, however, each memory page can hold more than 350 records (again, leaving some space for overhead), which is equal to about 115 yearly balance sheets.
If you have 10,000 balance sheets and you run a query to generate a statistical survey of that data, using the non-normalized approach will result in about 900 page reads. Using the normalized approach will result in only around 86 reads for the data, plus a couple more to pull in the small amount of data from the CompanyMaster and CategoryMaster tables. This is a significant performance improvement in terms of I/O, and it is a result of efficiently structuring the data.
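To make the arithmetic explicit, the following sketch computes the rough page counts under the assumptions above (3 rows per balance sheet, about 33 non-normalized rows and about 350 normalized rows per 8KB page):

   -- Back-of-the-envelope page-count estimate for 10,000 balance sheets
   SELECT
      (10000 * 3) / 33  AS NonNormalizedPages,  -- 909, roughly the 900 reads cited above
      (10000 * 3) / 350 AS NormalizedPages      -- 85, roughly the 86 reads cited above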
7.1.3 Master/Detail Framework

Most of today's database solutions use a master/detail framework. The concept is a natural one for relational systems, since it comes directly from table normalization. The whole idea is to take columns with values that are repeated many times and replace them with a reference to a record in another table where each distinct value is stored only once. Such a table is referred to as a master table. The table containing the references is referred to as a detail table.

The master/detail table structure provides several advantages when it is implemented in a relational database. One advantage is the space that you save when you replace long character strings that are repeated in numerous rows with a short reference to a master record. Another important advantage, this time related to manageability, is the ability to change the values in the master table easily. It's a lot easier and faster to change a company name if it is stored only once in a master table than if it is stored many times in a detail table.

7.1.4 Duplicates

A duplicate is a row that is repeated at least twice with the same values in the same columns. You will encounter duplicates often when importing data. This is especially true for imports involving a mix of both old and new data. Duplicates are often a product of inconsistencies, or of bad data conversions, and are sometimes the only way of detecting that something went wrong during the import.

There are two types of duplicates: full and partial. Full duplicates are rows that have the same values for all columns. Partial duplicates are rows with duplication only in a specific subset of columns. A duplicate has an occurrence count that represents the number of times it occurs in a given table. You might, for example, have the same row duplicated 5 times. In that case, the duplicate count would be 5. Sometimes this occurrence count is referred to as a degree of duplication. All nonduplicated rows have an occurrence count of 1.

The process of eliminating duplicate values is called reduction, and the mechanism for it is built into standard SQL as the DISTINCT clause of the SELECT statement. A SELECT DISTINCT eliminates duplicate results from the result set. If two rows have the same column values, that set of values will appear only once in the final result.

7.1.4.1 Counting duplicates

The duplicate occurrence count can be useful when business rules specify, or limit, the number of occurrences of a row or of a given value. A typical, and simple, example of such a business rule is the general accounting principle that each credit must have a corresponding debit. For each item your company buys, there should be two accounting entries; however, they still reflect only one event.[2] Each event has its own ID. Consequently, each ID is associated with two entries: a debit and a credit. It is very easy to test whether your accounting records comply with the dual-entry rule or not. You just have to check whether any row has an occurrence count other than 2. For example:
   SELECT EventId
   FROM Balances
   GROUP BY EventId
   HAVING COUNT(*)<>2

As you move on to more complex cases, the code obviously gets more complicated, but the principle remains the same. You can use the vocabulary defined in this section to define duplication requirements for a table. In the accounting example, a table recording book entries must have an occurrence count of 2 on the column identifying events.
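As a further illustration, here is one way you might look for full duplicates in the BalanceSheets table shown earlier and report their occurrence counts. It is only a sketch of the technique: group on every column and keep the groups that occur more than once.

   -- Find full duplicates in BalanceSheets (all columns identical)
   -- and report how many times each duplicated row occurs
   SELECT CompanyId, Company, Category, Year, Dollars,
      COUNT(*) AS OccurrenceCount
   FROM BalanceSheets
   GROUP BY CompanyId, Company, Category, Year, Dollars
   HAVING COUNT(*) > 1

Grouping on only a subset of the columns would report partial duplicates instead.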