7.1 Considerations When Importing Data

Most database systems available today have their own proprietary import mechanisms. Some of these, such as Microsoft's Data Transformation Services (DTS) or the Bulk Copy Program (BCP), are implemented as external programs in which you specify the format (usually ASCII) of the files to be imported and the table into which the data should be inserted. Though DTS is more user-friendly because of its graphical user interface, its performance is generally not as good as that of BCP. Another popular way to import data is to use the BULK INSERT statement, an extension of the SQL dialect that is accessible from the console or other interactive interfaces. Using BULK INSERT, you specify the file to load from and the table in the database that you want to receive the data. The BULK INSERT statement is useful for quick imports of ASCII files.

There are two major problem areas that you will encounter when importing data: error handling and performance. You can request that a load be aborted if an error occurs during the load process, or you can allow a certain number of errors to occur before the import operation is aborted. By default, SQL Server allows 10 errors before aborting a load. Often, though, you will want more flexibility in how the load process treats bad records. Rather than just having bad records rejected, you may want them to be loaded anyway and flagged as errors. That way, you can more easily go in later and correct the problems.

With respect to performance, SQL tables are usually associated with a number of indices, foreign-key references, and other constraints that are designed to improve access efficiency and to preserve data integrity. However, when rows are loaded one by one, updating indices and checking constraints for each row can place quite a burden on the server. As a query language, SQL is optimized more for querying than for inserting. If you can manage it, and you are loading a large amount of data, it's often beneficial to drop indices and constraints prior to the load and recreate them afterwards. Index maintenance and constraint validation are thereby batched into one large operation instead of many small ones.

7.1.1 Bulk Importing

When you insert rows one by one, the normal operation of your database server is to update the indices on the table after each insert. Bulk importing, through the use of the BULK INSERT statement or the BCP utility, inserts rows of data much faster than conventional methods, because indexing is temporarily suspended. You can further improve the performance of an import operation by using the TABLOCK hint to request table locks. The server then locks the table exclusively, inserts all the new rows in one step, and rebuilds the table's indices at the end of the load. The performance improvement from doing this can be significant, and the result is usually a very fast load.
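As a rough illustration of these points, the following sketch loads an ASCII file with BULK INSERT, allowing up to 50 bad records and requesting an exclusive table lock, and drops and recreates a nonclustered index around the load. The table, index, and file names are hypothetical, and the exact options available depend on your version of SQL Server:

   -- Drop the index before the load so that index maintenance is not
   -- performed row by row (index and table names are hypothetical)
   DROP INDEX YearlyBalances.ix_YearlyBalances_Year

   -- Load the ASCII file, tolerating up to 50 bad records and
   -- holding a table lock for the duration of the load
   BULK INSERT YearlyBalances
   FROM 'c:\import\balances.txt'
   WITH (
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      MAXERRORS = 50,
      TABLOCK
   )

   -- Recreate the index in one operation after the load completes
   CREATE INDEX ix_YearlyBalances_Year ON YearlyBalances(Year)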
In general, bulk importing is useful in an environment where the database does not need to be available 100% of the time. In such cases, we recommend that you use table locking to improve performance. To do a bulk import with the table locked, you need a time window during which you can be sure that no users will need to access the table or tables that you are loading. In a 24 x 7 environment, where you can't afford the downtime or risk other inconveniences from bulk inserts, you have to be innovative and use more sophisticated techniques. Alternatively, you can settle for the lower performance provided by tools that perform row-level importing.

7.1.2 Normalized Data

Data normalization is an important subject with respect to relational databases. Normalization refers to the technique of finding functional dependencies between attributes and of resolving those dependencies so that data is stored in an efficient and useful manner. Efficient and useful for the database, that is. Normalized data is seldom efficient from a human viewpoint. SQL, however, has been specifically designed to work best with normalized data. Normalized tables have no repeating groups of data, nor do they have attributes that are dependent on one another. Redundancy and other anomalies in data are avoided. For example, look at the following comparative balance-sheet report covering the three years from 2000 to 2002:

   CompanyId   Company    Category      2000    2001     2002
   ----------- ---------- ----------- ------- ------- --------
   12445       ABC, Inc   Assets       36,755  37,472   38,973
   12445       ABC, Inc   Liability    22,252  22,924   24,476
   12445       ABC, Inc   Equity       14,230  14,548   14,497

The data in this report is presented in a non-normalized form. The CompanyId and Company columns are dependent on each other (each company ID number represents a specific company name), and there are three balance values in each row. Although this report is easy for a human to read, if you stored the data this way in your database, adding an additional year would be an expensive and quite complicated task. You would need to add and initialize an additional column for each year that you were adding, and that task would require each and every row to be read, expanded, and written back again. Not only does that present a problem with respect to I/O, but expanding a row to a larger size can lead to storage inefficiencies. Storing data, such as a company name, more than once is a waste of space.

7.1.2.1 Eliminate repeating columns

Restructuring the table into a more computer-friendly form makes processing much easier for SQL Server. The following table can store the data shown in the previous report, but in a more efficient manner:

   CREATE TABLE BalanceSheets(
      CompanyId INTEGER,
      Company CHAR(200),
      Category CHAR(20),
      Year INTEGER,
      Dollars DECIMAL(12,2)
   )

This table design results in one row for each company, category, and year combination. Adding more years, or more categories, is simply a matter of inserting new rows; existing rows do not need to be touched. The data for the report, as stored in the BalanceSheets table, looks like this:

   CompanyId   Company   Category   Year   Dollars
   ----------- --------- ---------- ------ ---------
   12445       ABC, Inc  Assets     2000   36,755
   12445       ABC, Inc  Liability  2000   22,252
   12445       ABC, Inc  Equity     2000   14,230
   12445       ABC, Inc  Assets     2001   37,472
   12445       ABC, Inc  Liability  2001   22,924
   12445       ABC, Inc  Equity     2001   14,548
   12445       ABC, Inc  Assets     2002   38,973
   12445       ABC, Inc  Liability  2002   24,476
   12445       ABC, Inc  Equity     2002   14,497
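For instance, adding figures for 2003 is just a matter of inserting three more rows, along the lines of the following sketch (the 2003 dollar amounts shown are hypothetical):

   -- Add balance-sheet data for a new year by inserting new rows;
   -- no existing rows or columns need to change (values are made up)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Assets', 2003, 40102.00)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Liability', 2003, 25033.00)
   INSERT INTO BalanceSheets (CompanyId, Company, Category, Year, Dollars)
      VALUES (12445, 'ABC, Inc', 'Equity', 2003, 15069.00)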
Good as it is, this table is still not normalized. The Company column is dependent on the CompanyId column (or vice versa, if you prefer). For any given company ID, the associated company name is always the same. You can gain efficiencies in both storage space and manageability by further normalizing this table design to factor out the company name.

7.1.2.2 Factor out dependent columns

The last step in normalizing the balance-sheet data is to factor out the company name so that it is not repeated in each row of data. You can gain some further storage efficiencies by representing the category as a code rather than as a text string. The final design for the table holding the balance-sheet data looks like this:

   CREATE TABLE BalanceSheets2 (
      CompanyId INTEGER,
      CategoryId CHAR(1),
      Year INTEGER,
      Dollars DECIMAL(12,2)
   )

The BalanceSheets2 table shown here can be referred to as a detail table, because it contains the detailed balance-sheet data. As you can see, some supporting master tables are needed to resolve company and category ID numbers into human-readable text strings. Those tables can be defined as follows:

   CREATE TABLE CompanyMaster(
      CompanyId INTEGER,
      CompanyName CHAR(200),
      PRIMARY KEY(CompanyId)
   )

   CREATE TABLE CategoryMaster(
      CategoryId CHAR(1),
      CategoryName CHAR(20),
      PRIMARY KEY(CategoryId)
   )

The data in the detail table looks like this:

   CompanyId   CategoryId  Year   Dollars
   ----------- ----------- ------ ---------
   12445       A           2000   36,755
   12445       L           2000   22,252
   12445       E           2000   14,230
   12445       A           2001   37,472
   12445       L           2001   22,924
   12445       E           2001   14,548
   12445       A           2002   38,973
   12445       L           2002   24,476
   12445       E           2002   14,497

The data is now normalized. There are no more functional dependencies or repeating groups of data. Removing any additional data columns would result in a loss of information from the system. Note that, while it is normalized, the readability of the data is reduced. A simple query of the BalanceSheets2 detail table produces results that are not nearly as readable as those shown earlier on the original balance-sheet report. However, the normalized data can easily be manipulated using SQL, and simple joins can be used to associate company and category names with their respective ID and code values. For example:

   SELECT b.CompanyId, o.CompanyName, a.CategoryName, b.Year, b.Dollars
   FROM BalanceSheets2 b
   JOIN CompanyMaster o ON b.CompanyId=o.CompanyId
   JOIN CategoryMaster a ON b.CategoryId=a.CategoryId

This query returns one row for each company, category, and year combination. Using the Pivot table technique shown later in this chapter, you can write a SQL query that will return the data in the same form used in the report that you saw at the beginning of this section.

7.1.2.3 Reduced memory requirements

A normalized data structure is not only efficient for the database engine, it also saves memory and disk space. The first table, named BalanceSheets, uses approximately 240 bytes per row (CompanyId 4, Company 200, Category 20, Year 4, Dollars 12). The second table, named BalanceSheets2, is normalized and requires only 21 bytes per row (CompanyId 4, CategoryId 1, Year 4, Dollars 12). In our case, the 21 bytes required for each normalized row is only about 9% of the space required by the non-normalized data. Think of the impact of normalization on memory pages[1] and the resulting I/O needed to query the table. Assume that your server stores 8KB of data per memory page (the default for SQL Server). In the case of the non-normalized table, each page can store about 33 records (leaving some space for overhead).
If each balance sheet requires 3 records (for assets, liabilities, and equities), then each page will store 11 yearly balance sheets. When the data is normalized, however, each memory page can hold more than 350 records (again, leaving some space for overhead), which is equal to about 115 yearly balance sheets.
If you have 10,000 balance sheets and you run a query to generate a statistical survey of that data, using the non-normalized approach will result in about 900 page reads. Using the normalized approach will result in only around 86 reads for the data, plus a couple more to pull in the small amount of data from the CompanyMaster and CategoryMaster tables. This is a significant performance improvement in terms of I/O, and it is a result of efficiently structuring the data.
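To make the arithmetic explicit, the following sketch computes the rough page counts under the assumptions above (3 rows per balance sheet, about 33 non-normalized rows and about 350 normalized rows per 8KB page):

   -- Back-of-the-envelope page-count estimate for 10,000 balance sheets
   SELECT
      (10000 * 3) / 33  AS NonNormalizedPages,  -- 909, roughly the 900 reads cited above
      (10000 * 3) / 350 AS NormalizedPages      -- 85, roughly the 86 reads cited above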
7.1.3 Master/Detail Framework

Most of today's database solutions use a master/detail framework. The concept is a natural one for relational systems, since it comes directly from table normalization. The whole idea is to take columns with values that are repeated many times and replace them with a reference to a record in another table where each distinct value is stored only once. Such a table is referred to as a master table. The table containing the references is referred to as a detail table.

The master/detail table structure provides several advantages when it is implemented in a relational database. One advantage is the space that you save when you replace long character strings that are repeated in numerous rows with a short reference to a master record. Another important advantage, this time related to manageability, is the ability to change the values in the master table easily. It's a lot easier and faster to change a company name if it is stored only once in a master table than if it is stored many times in a detail table.

7.1.4 Duplicates

A duplicate is a row that is repeated at least twice with the same values in the same columns. You will encounter duplicates often when importing data. This is especially true for imports involving a mix of both old and new data. Duplicates are often a product of inconsistencies, or of bad data conversions, and are sometimes the only way of detecting that something went wrong during the import.

There are two types of duplicates: full and partial. Full duplicates are rows that have the same values for all columns. Partial duplicates are rows with duplication only in a specific subset of columns. A duplicate has an occurrence count that represents the number of times it occurs in a given table. You might, for example, have the same row duplicated 5 times. In that case, the duplicate count would be 5. Sometimes this occurrence count is referred to as a degree of duplication. All nonduplicated rows have an occurrence count of 1.

The process of eliminating duplicate values is called reduction, and the mechanism for it is built into standard SQL as the DISTINCT clause of the SELECT statement. A SELECT DISTINCT eliminates duplicate results from the result set. If two rows have the same column values, that set of values will appear only once in the final result.

7.1.4.1 Counting duplicates

The duplicate occurrence count can be useful when business rules specify, or limit, the number of occurrences of a row or of a given value. A typical, and simple, example of such a business rule is the general accounting principle that each credit must have a corresponding debit. For each item your company buys, there should be two accounting entries; however, they still reflect only one event.[2] Each event has its own ID. Consequently, each ID is associated with two entries: a debit and a credit. It is very easy to test whether your accounting records comply with the dual-entry rule or not. You just have to check whether any row has an occurrence count other than 2. For example:
   SELECT EventId
   FROM Balances
   GROUP BY EventId
   HAVING COUNT(*)<>2

As you move on to more complex cases, the code obviously gets more complicated, but the principle remains the same. You can use the vocabulary defined in this section to define duplication requirements for a table. In the accounting example, a table recording book entries must have an occurrence count of 2 on the column identifying events.
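As a further illustration, here is one way you might look for full duplicates in the BalanceSheets table shown earlier and report their occurrence counts. It is only a sketch of the technique: group on every column and keep the groups that occur more than once.

   -- Find full duplicates in BalanceSheets (all columns identical)
   -- and report how many times each duplicated row occurs
   SELECT CompanyId, Company, Category, Year, Dollars,
      COUNT(*) AS OccurrenceCount
   FROM BalanceSheets
   GROUP BY CompanyId, Company, Category, Year, Dollars
   HAVING COUNT(*) > 1

Grouping on only a subset of the columns would report partial duplicates instead.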