Normalization is a design technique that is
widely used as a guide in designing relational databases. Normalization is
essentially a two step process that puts data into tabular form by removing
repeating groups and then removes duplicated data from the relational tables.
Normalization theory is based on the concepts of
normal forms. A relational table is said to be a particular normal form
if it satisfied a certain set of constraints. There are currently five normal
forms that have been defined. In this section, we will cover the first three
normal forms
Basic Concepts
The goal of normalization is to create a set
of relational tables that are free of redundant data and that can be
consistently and correctly modified. This means that all tables in a relational
database should be in the third normal form (3NF). A relational table is in 3NF
if and only if all non-key columns are (a) mutually independent and (b) fully
dependent upon the primary key. Mutual independence means that no non-key
column is dependent upon any combination of the other columns. The first two
normal forms are intermediate steps to achieve the goal of having all tables in
3NF. In order to better understand the 2NF and higher forms, it is necessary to
understand the concepts of functional dependencies and lossless decomposition.
Functional Dependencies
The concept of functional dependencies is the
basis for the first three normal forms. A column, Y, of the relational table R
is said to be functionally dependent upon column X of R if and only if each
value of X in R is associated with precisely one value of Y at any given time.
X and Y may be composite. Saying that column Y is functionally dependent upon X
is the same as saying the values of column X identify the values of column Y.
If column X is a primary key, then all columns in the relational table R must
be functionally dependent upon X.
A short-hand notation for describing a
functional dependency is:
R.x —>; R.y
which can be read as in the relational table
named R, column x functionally determines (identifies) column y.
Full functional dependence applies to tables
with composite keys. Column Y in relational table R is fully functional on X of
R if it is functionally dependent on X and not functionally dependent upon any
subset of X. Full functional dependence means that when a primary key is
composite, made of two or more columns, then the other columns must be
identified by the entire key and not just some of the columns that make up the
key.
Overview
Simply stated, normalization is the process
of removing redundant data from relational tables by decomposing (splitting) a
relational table into smaller tables by projection. The goal is to have only
primary keys on the left hand side of a functional dependency. In order to be
correct, decomposition must be lossless. That is, the new tables can be
recombined by a natural join to recreate the original table without creating
any spurious or redundant data.
Sample Data
Data taken from Date [Date90] is used to
illustrate the process of normalization. A company obtains parts from a number
of suppliers. Each supplier is located in one city. A city can have more than
one supplier located there and each city has a status code associated with it.
Each supplier may provide many parts. The company creates a simple relational
table to store this information that can be expressed in relational notation
as:
FIRST (s#, status, city, p#, qty)
where
s#
|
supplier identifcation number (this is the
primary key)
|
status
|
status code assigned to city
|
city
|
name of city where supplier is located
|
p#
|
part number of part supplied
|
qty>
|
quantity of parts supplied to date
|
In order to uniquely associate quantity
supplied (qty) with part (p#) and supplier (s#), a composite primary key
composed of s# and p# is used.
First Normal Form
A relational table, by definition, is in
first normal form. All values of the columns are atomic. That is, they contain
no repeating values. Figure1 shows the table FIRST in 1NF.
Figure 1: Table in 1NF
Although the table FIRST is in 1NF it
contains redundant data. For example, information about the supplier's location
and the location's status have to be repeated for every part supplied.
Redundancy causes what are called update anomalies. Update anomalies are
problems that arise when information is inserted, deleted, or updated. For
example, the following anomalies could occur in FIRST:
INSERT. The fact that a certain supplier (s5)
is located in a particular city (Athens )
cannot be added until they supplied a part.
DELETE. If a row is deleted, then not only is
the information about quantity and part lost but also information about the
supplier.
UPDATE. If supplier s1 moved from London to New
York , then six rows would have to be updated with
this new information.
Second Normal Form
The definition of second normal form states
that only tables with composite primary keys can be in 1NF but not in 2NF.
A relational table is in second normal form
2NF if it is in 1NF and every non-key column is fully dependent upon the
primary key.
That is, every non-key column must be
dependent upon the entire primary key. FIRST is in 1NF but not in 2NF because
status and city are functionally dependent upon only on the column s# of the
composite key (s#, p#). This can be illustrated by listing the functional
dependencies in the table:
s#
|
—> city, status
|
city
|
—> status
|
(s#,p#)
|
—>qty
|
The process for transforming a 1NF table to
2NF is:
Identify any determinants other than the
composite key, and the columns they determine.
Create and name a new table for each
determinant and the unique columns it determines.
Move the determined columns from the original
table to the new table. The determinate becomes the primary key of the new
table.
Delete the columns you just moved from the
original table except for the determinate which will serve as a foreign key.
The original table may be renamed to maintain
semantic meaning.
To transform FIRST into 2NF we move the
columns s#, status, and city to a new table called SECOND. The column s#
becomes the primary key of this new table. The results are shown below in
Figure 2.
Tables in 2NF but not in 3NF still contain
modification anomalies. In the example of SECOND, they are:
INSERT. The fact that a particular city has a
certain status (Rome
has a status of 50) cannot be inserted until there is a supplier in the city.
DELETE. Deleting any row in SUPPLIER destroys
the status information about the city as well as the association between
supplier and city.
Third Normal Form
The third normal form requires that all
columns in a relational table are dependent only upon the primary key. A more
formal definition is:
A relational table is in third normal form
(3NF) if it is already in 2NF and every non-key column is non transitively
dependent upon its primary key. In other words, all nonkey attributes are
functionally dependent only upon the primary key.
Table PARTS is already in 3NF. The non-key
column, qty, is fully dependent upon the primary key (s#, p#). SUPPLIER is in 2NF
but not in 3NF because it contains a transitive dependency. A transitive
dependency is occurs when a non-key column that is a determinant of the primary
key is the determinate of other columns. The concept of a transitive dependency
can be illustrated by showing the functional dependencies in SUPPLIER:
SUPPLIER.s#
|
—>
SUPPLIER.status
|
SUPPLIER.s#
|
—>
SUPPLIER.city
|
SUPPLIER.city
|
—>
SUPPLIER.status
|
Note that SUPPLIER.status is determined both
by the primary key s# and the non-key column city. The process of transforming
a table into 3NF is:
- Identify
any determinants, other the primary key, and the columns they determine.
- Create
and name a new table for each determinant and the unique columns it
determines.
- Move
the determined columns from the original table to the new table. The
determinate becomes the primary key of the new table.
- Delete
the columns you just moved from the original table except for the
determinate which will serve as a foreign key.
- The
original table may be renamed to maintain semantic meaning.
The results of putting the original table
into 3NF has created three tables. These can be represented in
"psuedo-SQL" as:
PARTS (#s, p#, qty)
Primary Key (s#,#p)
Foreign Key (s#) references SUPPLIER_CITY.s#
Primary Key (s#,#p)
Foreign Key (s#) references SUPPLIER_CITY.s#
SUPPLIER_CITY(s#, city)
Primary Key (s#)
Foreign Key (city) references CITY_STATUS.city
Primary Key (s#)
Foreign Key (city) references CITY_STATUS.city
CITY_STATUS (city, status)
Primary Key (city)
Primary Key (city)
Advantages of
Third Normal Form
The advantage of having relational tables in
3NF is that it eliminates redundant data which in turn saves space and reduces
manipulation anomalies. For example, the improvements to our sample database
are:
INSERT. Facts about the status of a city, Rome has a status of 50,
can be added even though there is not supplier in that city. Likewise, facts
about new suppliers can be added even though they have not yet supplied parts.
DELETE. Information about parts supplied can
be deleted without destroying information about a supplier or a city. UPDATE.
Changing the location of a supplier or the status of a city requires modifying
only one row.
0 comments: