Understanding Database Normalization

Intro

Database Normalization is a concept & design technique we follow to efficiently organize data and design proper schema, to avoid redundant & data duplication in multiple tables. It also sets up relations between the tables, to ensure data dependency and room for future additions.

If tables are not properly normalized, then may eat up extra memory space and resources, while causing insertions, updates and deletions anomalies and complex queries.

Case

Suppose we have an LMS System for the college and no normalization design was followed while making it, then we may have many difficulties, to name a few.

student_id

student_name

email

program_name

program_abbr

710002

Suraj Peth

suraj@eample.com

Bachelor of science

B.Sc.

710003

Shion Y

saion@example.com

Bachelor of science,

B.Sc.,B.Ph.

710004

Cycil L

cycil@example.com

Bachelor of science

B.Sc

710005

Deo Bobby

Deo@example.com

Bachelor of Physics

B.Ph.

710006

Albert Doo

albert@example.com

Bachelor of Commerce

B.Com.

In the above design of the student table, we have a few issues

Firstly, row number 2 of student_id 710003 has grouped values for an attribute(column) of program_name.

Secondly, we have added program-related attributes like program_name, and program_abbr which were unrelated to the student entity(table). We could have created a separate program table or entity and used its primary key to reference it here in the student's table and the rest program related attributes can be stored in the program table.

Thirdly, there is the repetition of data like program_name values, as many students have to opt for programs from a selected list, we can have avoided this repetition through separate program tables and maintaining the reference of keys.

While Inserting new students' data, every time the data will be repeated for program_name and abbr. And if someone had earlier made a mistake in program_name during its creation, then we will have to update each row of the said program_name in the current design.

Normalizations are guidelines and a way, for database schema & structure to be laid down.

Normally, the Third Normal Form is enough, however, you can extend it as you deem fit.

First Normal Form(1NF)

First Normal Form sets below a few guidelines for designing database table schema;

  1. Define only the required and related attributes(columns) in the table.

  2. Should have single-valued attributes/columns.

  3. Attributes(columns) should be unique in names and the order doesn't matter.

  4. Should also contain a column denoting as primary keys

Taking forward the previous table we used as an example,

id

student_id

student_name

email

program_name

program_abbr

1

710002

Suraj Peth

suraj@eample.com

Bachelor of science

B.Sc.

2

710003

Shion Y

aion@example.com

Bachelor of science,

B.Sc.,B.Ph.

3

710004

Cycil L

cycil@example.com

Bachelor of science

B.Sc

4

710005

Deo Bobby

Deo@example.com

Bachelor of Physics

B.Ph.

5

710006

Albert Doo

albert@example.com

Bachelor of Commerce

B.Com.

Let's re-design the structure fulfilling all the guidelines.

Looking at the above table, we can see that a student with student_id 710003 has applied for 2 programs and it is inserted in comma-separated format. As we should have single-valued attributes as per the guidelines, let's restructure it and insert the second program as a new row.

The table now looks like,

id

student_id

student_name

email

program_name

program_abbr

1

710002

Suraj Peth

suraj@eample.com

Bachelor of science

B.Sc.

2

710003

Shion Y

aion@example.com

Bachelor of science

B.Sc.

3

710003

Shion Y

aion@example.com

Bachelor of Physics

B.Ph.

4

710004

Cycil L

cycil@example.com

Bachelor of science

B.Sc

5

710005

Deo Bobby

Deo@example.com

Bachelor of Physics

B.Ph.

6

710006

Albert Doo

albert@example.com

Bachelor of Commerce

B.Com.

By doing so, although a few values are getting repeated, values for the program_name column are now atomic for each record/row.

Using the First Normal Form, data redundancy increases, as there will be many columns with the same data in multiple rows but each row as a whole will be unique.

Second Normal Form(2NF)

The transformation of 1NF to 2NF involves the removal of partial dependencies. If a partial dependency exists, we remove the partially dependent attribute(s) from the relation by placing them in a new relation along with a copy of their determinant.

To further normalize and ensure non-repetition of data, we can break the above output table in First Normal Form into 3 different tables, as the single table above also contains information regarding program attributes that are not primary.

CREATE TABLE STUDENT(
    STUDENT_ID INT NOT NULL,
    STUDENT_NAME VARCHAR(100) NOT NULL,
    EMAIL VARCHAR(100),
    ACTIVE CHAR(1) NOT NULL DEFAULT('Y'),
    PRIMARY KEY(STUDENT_ID)
);

The above table will have the following recordset,

student_id

student_name

email

710002

Suraj Peth

suraj@eample.com

710003

Shion Y

aion@example.com

710003

Shion Y

aion@example.com

710004

Cycil L

cycil@example.com

710005

Deo Bobby

Deo@example.com

710006

Albert Doo

albert@example.com

CREATE TABLE PROGRAM(
    PROGRAM_ID INT NOT NULL,
    PROGRAM_NAME VARCHAR(500),
    PROGRAM_ABBR VARCHAR(50),
    ACTIVE CHAR(1) NOT NULL DEFAULT('Y'),
    PRIMARY KEY(PROGRAM_ID)
);

And the program table will have the following data,

program_id

program_name

program_abbr

5120036

Bachelor of science

B.Sc

5120042

Bachelor of Physics

B.Ph

5120040

Bachelor of Commerce

B.Com

And, finally, we will create another table for storing students and their applied Programs, as we allow students to enroll in multiple programs.

CREATE TABLE STUDENT_PROGRAM(
    STUDENT_ID INT NOT NULL,
    PROGRAM_ID INT NOT NULL,
    PRIMARY KEY(STUDENT_ID, PROGRAM_ID),
    FOREIGN KEY(STUDENT_ID) REFERENCES STUDENT(STUDENT_ID),
    FOREIGN KEY(PROGRAM_ID) REFERENCES PROGRAM(PROGRAM_ID)
);

student_id

program_id

710002

5120036

710003

5120036

710003

5120042

710004

5120036

710005

5120042

710006

5120040

Third Normal Form(3NF)

The rule for 3NF is that the table should be in Second Normal Form and we have to ensure that non-prime attributes should not be transitively dependent on the super key(primary key). It violates the rule of the third normal form.

Again taking forward the examples we used previously, let's add state and country-related attributes to the student table.

CREATE TABLE STUDENT(
    STUDENT_ID INT NOT NULL,
    STUDENT_NAME VARCHAR(100) NOT NULL,
    EMAIL VARCHAR(100),
    STATE VARCHAR(100),
    COUNTRY VARCHAR(155),
    ACTIVE CHAR(1) NOT NULL DEFAULT('Y'),
    PRIMARY KEY(STUDENT_ID)
);

student_id

student_name

email

state

country

710002

Suraj Peth

suraj@eample.com

Maharashtra

India

710003

Shion Y

aion@example.com

Karnataka

India

710003

Shion Y

aion@example.com

Karnataka

India

710004

Cycil L

cycil@example.com

Maharashtra

India

710005

Deo Bobby

Deo@example.com

Kerala

India

710006

Albert Doo

albert@example.com

Tamil Nadu

India

Here the State and Country in the above table are dependent on the student, so to normalize it as per Third Normalized Form, we can have another table storing the state & country list and referencing it with the id in the student's table. below is the normalized form,

CREATE TABLE STATE_COUNTRY(
    ID INT NOT NULL,
    STATE VARCHAR(100),
    COUNTRY VARCHAR (155),
    PRIMARY KEY(ID)
);

id

state

country

1

Maharashtra

India

2

Karnataka

India

3

Kerala

India

4

Tamil Nadu

India

CREATE TABLE STUDENT(
    STUDENT_ID INT NOT NULL,
    STUDENT_NAME VARCHAR(100) NOT NULL,
    EMAIL VARCHAR(100),
    STATE_ID INT NOT NULL,
    ACTIVE CHAR(1) NOT NULL DEFAULT('Y'),
    PRIMARY KEY(STUDENT_ID),
    FOREIGN KEY (STATE_ID) REFERENCES STATE_COUNTRY(ID)
);

student_id

student_name

email

state_id

710002

Suraj Peth

suraj@eample.com

1

710003

Shion Y

aion@example.com

2

710003

Shion Y

aion@example.com

2

710004

Cycil L

cycil@example.com

1

710005

Deo Bobby

Deo@example.com

3

710006

Albert Doo

albert@example.com

4

BCNF Normalization Form:

BCNF stands for Boyce-Codd Normal Form, which is an extension of 3NF. In BCNF, a table must not only contain only one type of data, have a unique identifier, and not have any transitive dependencies, but it must also not contain any repeating groups. This means that data in the table must not be repeated.

The benefit of BCNF is that it helps to improve data integrity. By eliminating repeating groups, it ensures that data is stored in the most efficient way possible. This helps to reduce data inconsistency and improve query performance.

Summary

  1. Database normalization has both pros and cons. The pros include improved data integrity, reduced data redundancy, and improve query performance. The cons include increased complexity, as normalizing a database can make queries look complex due to an increase in multiple tables. It can also lead to increased storage costs, as more data may need to be stored to maintain data integrity.

  2. As mentioned earlier in the start, Database normalization is an important process that helps to maintain the integrity of a database. It helps to reduce data redundancy, minimize data inconsistency, and improve query performance. It can be broken down into four major forms of normalization: 1NF, 2NF, 3NF, BCNF and more.

  3. There are also pros and cons to database normalization, and it is important to understand them before implementing a database normalization strategy. Also, note that the Over-normalization of database schema also leads to an increase in complexity.

  4. Denormalization is one of the last resorts for accelerating data retrieval in normalized databases when techniques like indexing are insufficient.

Finally, there are some tips and examples of database normalization that can be used to help achieve the best results.