Suppose you have a large collection of compact discs and you want to create a database to track them. The first step is to determine what data you are going to store. One good way to start is to think about why you want to store the data in the first place. In our case, we will most likely want to look up CDs by artist, title, and song. Since we want to look up those items, we know they must be included in the database. In addition, it is often useful to simply list items that should be tracked. One possible list might include: CD title, record label, band name, song title. As a starting point, we will store the data shown in Table 7-1.
Band name | CD title | Record label | Songs
---|---|---|---
Stevie Wonder | Talking Book | Motown | You Are the Sunshine of My Life, Maybe Your Baby, Superstition, ...
Miles Davis Quintet | Miles Smiles | Columbia | Orbits, Circle, ...
Wayne Shorter | Speak No Evil | Blue Note | Witch Hunt, Fee-Fi-Fo-Fum
Herbie Hancock | Headhunters | Columbia | Chameleon, Watermelon Man, ...
Herbie Hancock | Maiden Voyage | Blue Note | Maiden Voyage
For brevity's sake, we have left out most of the songs. At first glance, this table seems to meet our needs since we are storing all the data we need. Upon closer inspection, however, we find several problems. Take the band named Herbie Hancock, for example. The name appears twice in the "Band name" column: once for each CD. This repetition is a problem for several reasons. First, when entering data in the database, we end up typing the same name over and over. Second, and more important, if any of the data changes, we have to update it in multiple places. For example, what if "Herbie" were misspelled? We would have to correct the data in each of the two rows. The same problem would occur if the name Herbie Hancock changed in the future (as when Jefferson Airplane became Jefferson Starship). As we add more Herbie Hancock CDs to our collection, we add to the effort required to maintain data consistency.
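The update problem can be made concrete with a minimal Python sketch. It assumes the rows of Table 7-1 are held as a list of dictionaries; the key names, the `rename_band` helper, and the replacement name are hypothetical and chosen only for illustration.

```python
# Rows copied from Table 7-1 (song lists omitted). Key names are illustrative.
cds = [
    {"band_name": "Stevie Wonder", "cd_title": "Talking Book", "record_label": "Motown"},
    {"band_name": "Miles Davis Quintet", "cd_title": "Miles Smiles", "record_label": "Columbia"},
    {"band_name": "Wayne Shorter", "cd_title": "Speak No Evil", "record_label": "Blue Note"},
    {"band_name": "Herbie Hancock", "cd_title": "Headhunters", "record_label": "Columbia"},
    {"band_name": "Herbie Hancock", "cd_title": "Maiden Voyage", "record_label": "Blue Note"},
]


def rename_band(rows, old_name, new_name):
    """Rename a band in every row that mentions it; return how many rows changed."""
    changed = 0
    for row in rows:
        if row["band_name"] == old_name:
            row["band_name"] = new_name
            changed += 1
    return changed


# Hypothetical rename, purely to show the cost: one fact, several rows to touch.
rows_touched = rename_band(cds, "Herbie Hancock", "H. Hancock")
print(rows_touched)
```

With the five rows of Table 7-1, a single change to the band name touches two rows, and that count grows with every additional Herbie Hancock CD. Missing even one row would leave the data inconsistent.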
Another problem with the single CD table lies in the way it stores songs. We are storing them in the CD table as a list of songs in a single column. We will run into all sorts of problems if we want to use this data meaningfully. Imagine having to enter and maintain that list. And what if we want to store the length of the songs as well? What if we want to perform a search by song title? It quickly becomes clear that storing the songs in this fashion is undesirable.
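To see why a list of songs in one column is awkward to search, consider the following sketch. It assumes each CD's songs are stored as one comma-separated string, abbreviated exactly as in Table 7-1; the key names and both helper functions are hypothetical.

```python
# Song lists as they appear in Table 7-1, abbreviated just as the table is.
cds = [
    {"cd_title": "Talking Book",
     "songs": "You Are the Sunshine of My Life, Maybe Your Baby, Superstition"},
    {"cd_title": "Miles Smiles", "songs": "Orbits, Circle"},
    {"cd_title": "Speak No Evil", "songs": "Witch Hunt, Fee-Fi-Fo-Fum"},
    {"cd_title": "Headhunters", "songs": "Chameleon, Watermelon Man"},
    {"cd_title": "Maiden Voyage", "songs": "Maiden Voyage"},
]


def naive_search(rows, text):
    """Substring match against the whole songs column."""
    return [row["cd_title"] for row in rows if text in row["songs"]]


def search_by_song(rows, title):
    """Split the column back into individual titles, then compare exactly."""
    return [
        row["cd_title"]
        for row in rows
        if title in [song.strip() for song in row["songs"].split(",")]
    ]


# "Man" is not a song title, yet the substring search finds "Watermelon Man".
false_hit = naive_search(cds, "Man")
exact_hit = search_by_song(cds, "Superstition")
no_hit = search_by_song(cds, "Man")
```

The exact search works only by taking the string apart on every lookup, and it would break as soon as a song title contained a comma. There is also no natural place in this layout to record a per-song value such as length.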
This is where database design comes into play. One of the main purposes of database design is to eliminate redundancy from the database. To accomplish this task, we use a technique called normalization. Before we discuss normalization, let's start with some fundamental relational database concepts: entities, attributes, and data models.
An entity is a thing or object of importance about which data must be captured. Not all "things" are entities, only those things about which you need to capture information. Information about an entity is captured in the form of attributes and/or relationships. If something is a candidate for being an entity and it has no attributes or relationships, it is not really an entity. A database entity appears in a data model as a box with a title that is the name of the entity.
An attribute describes information about an entity that must be captured. Each entity has zero or more attributes that describe it, and each attribute describes exactly one entity. Each entity instance (row in the table) has exactly one value, possibly NULL, for each of its attributes. An attribute value can be numeric, a character string, a date, a time, or some other basic data value type. In the first step of designing a database, logical data modeling, we do not worry about how the attributes will be stored.
Our example database refers to a number of things: CD titles, band names, songs, and record labels. Which of these are entities and which are attributes?
A data model is a diagram of our database design. It documents and communicates how the database is structured.
Notice that we capture several pieces of data (CD title, band name, etc.) about each CD, and we absolutely cannot describe a CD without those items. CD is therefore one of those things we want to capture data about and is likely to be an entity. To start a data model, we will diagram CD as an entity. Figure 7-1 shows our sole entity as a data model.
By common entity naming conventions, an entity name must be singular. We therefore call the table where we store CDs CD and not CDs. We use this convention because each entity names an instance. For example, "San Francisco 49ers" is an instance of "Football Team," not "Football Teams."
At first glance, it appears that the rest of the data describes a CD, which would seem to indicate that these items are attributes of CD. Figure 7-2 adds them to the CD entity in Figure 7-1. In a data model, attributes appear as names listed in their entity's box.
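The same single-entity model can be sketched in code. The following is only an illustration, assuming the attributes are the four items listed earlier (CD title, band name, record label, songs); the field names are hypothetical, and the use of `Optional` mirrors the rule that each attribute holds exactly one value, possibly NULL. It says nothing about how a database would physically store the data.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CD:
    """The CD entity; each field is one attribute. Field names are illustrative."""
    cd_title: Optional[str] = None
    band_name: Optional[str] = None
    record_label: Optional[str] = None
    songs: Optional[str] = None  # still a single list-in-a-column value


# One entity instance, i.e. one row of Table 7-1 (song list abbreviated as in the table).
talking_book = CD(
    cd_title="Talking Book",
    band_name="Stevie Wonder",
    record_label="Motown",
    songs="You Are the Sunshine of My Life, Maybe Your Baby, Superstition",
)

# The entity's box in the diagram: its name followed by its attribute names.
attribute_names = [f.name for f in fields(CD)]
print(CD.__name__, attribute_names)
```

Note that the class name is singular, in line with the naming convention above, and that the band name and song list are still plain values inside CD, so the redundancy discussed earlier is still present.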
This diagram is simple, but we are not done yet. In fact, we have only just begun. Earlier, we noted that one of the main purposes of database design is to eliminate redundancy, using a technique called normalization. We have a nice diagram for our database, but we have not gotten rid of the redundancy as we set out to do. It is now time to normalize our database.