Team LiB   Previous Section   Next Section

6.8 Minimizing Audit-Log Space Consumption

6.8.1 Problem

You are going to implement an audit log, but you need a way to minimize space consumption so that the resulting log uses disk space as efficiently as possible. It's still important to be able to construct an accurate snapshot of the base table as it appeared at any give point in time.

6.8.2 Solution

The log recipe shown earlier in this chapter implemented a log that contained a complete snapshot of a row for every operation performed on that row. While simple to implement and to query, the result is a log that contains a great deal of redundant information where update and delete operations are concerned. To minimize the space used by a row-level log, you can design it such that a column value is recorded only when it is changed. Thus, a change to one column in a row does not cause an entirely new copy of that row to be written to the log. Instead, only the before and after versions of the modified column are recorded in the log. Similarly, with deletes, only the delete action needs to be recorded.

For purposes of this recipe, assume that the Stock table to be audited can be created using the following statement:

CREATE TABLE Stock(
   ProductId CHAR(40),
   Qty INTEGER,
   Price INTEGER,
   PRIMARY KEY(ProductId)
)

Given this table, the solution requires that a log table is created and that triggers are created to record changes to the data in the Stock table's audit log.

6.8.2.1 Step 1: Create the audit table

Create an audit table that stores the audit data for the base table. The following statement creates a table named StockLog that can be used to log changes made to the table named Stock:

CREATE TABLE StockLog(
   ProductId CHAR(40) NOT NULL,
   Qty INTEGER NULL,
   Price INTEGER NULL,
   EventTime DATETIME NOT NULL,
   DeleteFlag CHAR(1) NULL,
   LogId INTEGER IDENTITY
)

As you can see, the audit table has a similar structure to the Stock table's, with the following exceptions:

  • One additional column is included to store the timestamp (the date and time of the event).

  • One column is included to represent DELETE events.

  • LogId column is added to uniquely identify events and support sequencing order between them.

  • All columns, except for the timestamp and primary-key columns, are explicitly allowed to be null.

6.8.2.2 Step 2: Implement the audit triggers

This set of triggers records all insert, update, and delete activity in the StockLog table, but the approach is slightly different from the previous audit-log recipe in this chapter. This time, column values are not recorded for deletes, and only modified columns are recorded with update log entries. Pay special attention to the CASE statements in the updateStock trigger as you read the code.

CREATE TRIGGER insertStock
ON Stock
FOR INSERT
AS
BEGIN
   INSERT INTO StockLog 
      (ProductId, Qty, Price, DeleteFlag, EventTime)
      SELECT i.ProductId, i.Qty, i.Price, null, CURRENT_TIMESTAMP
      FROM inserted I
END
GO

CREATE TRIGGER deleteStock
ON Stock
FOR DELETE
AS 
BEGIN
   INSERT INTO StockLog
      (ProductId, Qty, Price, DeleteFlag, EventTime)
      SELECT d.ProductId, null, null, 'D', CURRENT_TIMESTAMP 
      FROM deleted d
END
GO

CREATE TRIGGER updateStock
ON Stock
FOR UPDATE
AS 
BEGIN
      INSERT INTO stockLog
      (ProductId, Qty, Price, DeleteFlag, EventTime)
      SELECT d.productId, 
        (CASE WHEN d.Qty!=i.Qty THEN i.Qty ELSE NULL END),
        (CASE WHEN d.Price!=i.Price THEN i.Price ELSE NULL END),
        null, CURRENT_TIMESTAMP 
      FROM deleted d,inserted I
      WHERE d.ProductId=i.ProductId
  
END
GO

6.8.3 Discussion

Space can often become an issue if all details are to be stored in a log. This technique demonstrates how to store the information with minimum space consumption.

The concept underlying the solution shown here is that column values are only recorded in the log table when they change. An INSERT operation is the only operation where the log record will contain a complete copy of a row. For UPDATE operations, the log will only contain the new values for the columns that were changed and the value of the row's primary key. All columns not changed by an UPDATE statement are left set to NULL in the log record to avoid storing data redundantly. In this way, space consumption is minimized. Deletes are a bit different. Each log record has a delete flag in its descriptor. When a row is deleted, the delete flag is set in the corresponding log's record.

Consider, for example, the following DML statements:

INSERT INTO Stock(ProductID, Qty, Price)
   VALUES ('Bananas', 10, 112)

INSERT INTO Stock(ProductID, Qty, Price)
   VALUES ('Apples', 20, 223)

UPDATE Stock
   SET Qty=25
   WHERE ProductID='Apples'

UPDATE Stock
   SET Qty=30
   WHERE ProductId='Apples'

DELETE FROM Stock
   WHERE ProductId='Bananas'

The audit-log entries generated by these statements would be as follows:

SELECT LogId, ProductId, Qty, Price, DeleteFlag, EventTime
FROM  stockLog

GO

LogId   ProductId  Qty   Price  DeleteFlag EventTime                
------  ---------- ----- ------ ---------- ------------------------ 
1       Bananas    10    112    NULL       2000-06-27 23:01:25.943  
2       Apples     20    223    NULL       2000-06-27 23:01:25.950  
3       Apples     25    NULL   NULL       2000-06-27 23:01:25.970  
4       Apples     30    NULL   NULL       2000-06-27 23:01:25.970  
5       Bananas    NULL  NULL   D          2000-06-27 23:01:25.993  

The first two DML statements were INSERT statements, and, consequently, the log entries each contain a complete copy of the data being inserted. The next two DML statements, however, were UPDATE statements. Both of these updated only one column, and the log entries reflect that fact by setting the other columns to NULL. The final log entry represents a deletion of the Bananas record. The D in the DeleteFlag column indicates that the record was deleted. No other information need be recorded; consequently, both the Qty and Price columns have been set to NULL. Most database systems optimize the storage of NULLS such that the amount of space used for that purpose is negligible.

6.8.3.1 Snapshot generation

Generating a snapshot of the Stock table is more difficult when the log records changes to a single column than it is when each log record contains a complete copy of the row being changed. In the case of the audit log described in this recipe, you can use the following SELECTstatement to generate a snapshot of the base table as of a specific time. The @TIME variable in the example represents the "as of" time used for the snapshot generation. For testing purposes, you can replace it with CURRENT_TIMESTAMP to return the current snapshot.

SELECT t.ProductId,
   Qty=(SELECT TOP 1 tQ.Qty FROM StockLog tQ
      WHERE tQ.Qty IS NOT NULL 
         AND tQ.ProductId=t.ProductId 
         AND tQ.EventTime <= @TIME 
      ORDER BY tQ.LogId DESC),     
   Price=(SELECT TOP 1 tP.Price FROM StockLog tP
      WHERE tP.Price IS NOT NULL 
         AND tP.ProductId=t.ProductId
         AND tP.EventTime <= @TIME 
      ORDER BY tP.LogId DESC)
FROM StockLog t 
WHERE  t.EventTime<=@TIME
GROUP BY t.ProductId
HAVING NOT EXISTS(SELECT * FROM StockLog 
   WHERE productID=t.productID 
   AND logID=MAX(t.logID) AND DeleteFlag='D')

This SELECT statement contains a number of subqueries: one for each column in the Stock table to retrieve the most recent values for those columns and one at the end to eliminate all deleted rows from the report. Each subquery retrieves the most recent value of a column with respect to the date specified by the @TIME variable. The one exception is that the primary key column does not require a subquery because we are assuming that the primary key does not change.

To get the most recent value for a column, we execute a subquery. The subquery ignores rows where the column is NULL. The subquery sorts the rows in descending order by LogId and returns the topmost value, which is the most recent value of the column. The subquery respects the time you pass (@TIME) and will ignore values set later than that time. The following example shows the subquery used to return the most recent value for column Qty:

SELECT TOP 1 tQ.Qty FROM StockLog tQ
WHERE tQ.Qty IS NOT NULL 
   AND tQ.ProductId=t.ProductId 
   AND tQ.EventTime <= @TIME 
ORDER BY tQ.LogId DESC 

This same subquery is used for each column in the table, except for the columns in the record descriptor.

The WHERE clause in the main query limits the scope of the selection to the rows that were inserted before or on the date and at the time represented by @TIME. The ProductId column is then used to group the result set. Grouping is done because we are interested only in one instance (a snapshot) of each row in the original table.

6.8.3.2 Deleted records

The HAVING clause in the query eliminates Stock records that have been deleted, so that they don't appear in the snapshot. The HAVING clause looks like this:

HAVING NOT EXISTS(SELECT * FROM StockLog 
   WHERE productID=t.productID 
   AND logID=MAX(t.logID) AND DeleteFlag='D')

The subquery in the HAVING clause checks to see if the most recent log entry for a Stock record represents a delete. If it does, then that Stock record is not returned as part of the snapshot.

6.8.3.3 Dealing with NULL values

If the Stock table includes some columns that are allowed to be NULL, the solution shown in this recipe will not work properly. That's because the query to generate a snapshot has no way to know whether a NULL value in a particular column represents a NULL column value or if it just means that there has been no change for that column.

One possible way to deal with NULL values is to use a second column as a flag to indicate whether the first is a null. This flag column can be a BIT type if the database implementation allows it, or it can be a character type such as a CHAR(1). A BIT type allows for TRUE and FALSE values. The StockLog table is extended with the columns representing the NULL values:

CREATE TABLE StockLog(
   ProductId char(40) NOT NULL,
   Qty INTEGER NULL,
   Price INTEGER NULL,
   IsEmptyPrice BIT,
   IsEmptyQty BIT,
   EventTime DATETIME NOT NULL,
   DeleteFlag CHAR(1) NULL,
   LogId INTEGER IDENTITY 
)

Assuming that you implemented a flag column named IsEmptyQty and that you used 1 to represent a NULL value, and 0 otherwise, the subquery in the snapshot query would need to use a WHERE clause, such as the one shown here:

SELECT t.ProductId,
   Qty=(SELECT TOP 1 tQ.Qty FROM StockLog tQ
      WHERE tQ.IsEmptyQty=0
         AND tQ.ProductId=t.ProductId 
         AND tQ.EventTime <= @TIME
      ORDER BY tQ.LogId DESC ),
   Price=(SELECT TOP 1 tP.Price FROM StockLog tP
      WHERE tP.IsEmptyPrice=0
         AND tP.ProductId=t.ProductId 
         AND tP.EventTime <= @TIME
      ORDER BY tp.LogId DESC)
FROM StockLog t 
WHERE  EventTime<=@TIME
GROUP BY t.ProductId
HAVING NOT EXISTS(SELECT * FROM StockLog 
   WHERE productID=t.productID 
   AND logID=MAX(t.logID) AND DeleteFlag='D')

This solution requires only minimal changes to the SELECT statement used to generate snapshots. It also doesn't consume much space in an implementation where the DBMS supports the BIT datatype. Sybase and MS SQL Server both support BIT types. The storage of such BIT types is optimized by combining all the BIT columns in a row together into as few bytes as possible. If there are 8 bit columns, they will consume just 1 byte per row.

    Team LiB   Previous Section   Next Section