Clustered Table and Heap Table in SQL Server

In this post, we will be exploring the meaning of clustered tables and heap tables along with their working and logical structures. I will also be sharing my understanding of them and their scenarios. This understanding helps us to make better queries and affects the performance of the queries. To understand the logical structuring or arrangement of a heap table or clustered table, we need to understand what a page is and what is meant by IAM (Index Allocation Map) page and IAM chains.

A Page is a basic unit of data storage in SQL Server and the data files are fully comprised of pages numbered consecutively from 1 to n. The read-write operations that we execute or perform are all made on these pages and they are of various types. An IAM page is a type of page that stores information about the extent (manage space) used by the table or index allocation. It tracks approximately 4GB worth of space in a single file. If a file is more than 4GB or has multiple data file entities, then we need multiple IAM pages to track all the space usage. As some entity requires multiple IAM pages to track their space usage, these IAM pages are linked together and are called IAM chains.

Heap Tables:

In a literal sense, it can be defined as an unordered and unstructured collection of objects placed on top of each other like a pile of books, where books are placed on top of each other without any ordering or anything. Similarly, heap tables are table that holds records or rows without any ordering or any structuring. Since it's just a pile of rows then to find any particular row or record in a table, requires us to perform a scan of rows till we find what we were looking for. This scanning of rows for finding a record in a heap table is known as TableScan Operation. Think of looking for a particular book in a pile of unordered and unstructured books placed on top of each other. And again do note that data order is unpredictable and not guaranteed to retain order. We can add non-clustered indexes to the heap table to speed up the data retrieval process for finding a row. Since the ordering of rows (underlying pages) is absent, heap tables are suitable for heavy data insertions or bulk records. It will just allocate a page and write the data in it and put the page with other pages.

Now with the understanding of the page and IAM page given above, it will be simpler to visualize and understand the logical structuring of the heap table. The heap table is comprised of IAM pages and these IAM pages that are used to manage space allocated to the heap also helps SQL Server to traverse through the heap. The Data Pages and rows within them are not in any specific order and these data pages are not linked to each other. The only connection between these data pages is their information recorded on IAM pages. All reads and writes must contact the IAM first to read all pages within a heap.

Due to the absence of innate ordering or structuring of the pages or data in the heap table, it is suitable for the scenario where heavy insertion activity is required. Each Individual row in the heap table is identified by row identifier (RID). For large unordered insertions, the heap is preferred as we don't need any particular ordering. Thus, insertions in the heap table are generally faster compared to the same insertion in the clustered table. And as discussed earlier, if a heap table does not have any unclustered indexes, then an entire table scan is required to find any row or record.

If we need the data in a particular sorted order frequently, then it is better to not use heap and then sort row sets on return with ORDER BY clause. A Heap table is also not advisable where data grouping or querying for data ranges is frequently made. If a large table without any non-clustered indexes is needed to be queried with or without any particular sorting, grouping, or order condition, in that case, it is suitable to not use a heap. For Frequent data updates, the heap table is not suitable.

Clustered Table

It is an ordered and structured collection of objects. The Clustered index provides underlying ordering and structuring of the data set in the table. The columns for the ordering can be defined at the time of table creation as Primary Key or by defining a clustered index on an existing heap table. To use the earlier example of a pile of books, this can be said as a set of ordered books arranged in either ascending or descending order based on the name of the book or author's name or publish date or combinations of it. As this is an ordered set of books, then finding any particular book requires us to perform seek operation and locate the required book we were finding.

In the clustered table the record is stored in a Binary Tree structure (B-Tree), wherein the anchor point or starting point is a root node and expands to intermediate nodes and leaf nodes till it covers the entire table for index. These nodes provide the pointers to the previous node and next node in a data set and the rows or records are only stored at the leaf nodes which are at the lowest level of the clustered index structure and are ordered by the clustered index columns. Due to this structuring and ordering of the data set, any insert, update or delete operation of the underlying ordering of the data is retained.

As mentioned earlier with its innate ordering, finding a particular row in an ordered table is to seek operations at the required multiple levels and locate the required rows. Every child node it traverses to reach the leaf node, reduces and filters out the rows which are not needed to be scanned or sought. This is why seek operation is only available in the clustered index table and not in the heap table, making the read or retrieval speed of data faster compared to the heap table. We can create clustered indexes on a single column or multiple columns together as a composite key. Each table can only have one clustered index as one order sorting criteria can be used.

Since the rows are ordered and sorted as per the criteria mentioned with the required columns during the creation of the clustered index, the clustered table is suitable for scenarios where heavy read activity is necessary. It stores the data rows sorted based on the clustered index key values.

For heavy bulk insertions, the clustered table is needed to maintain and retain the clustered index while doing insert, update or delete operations. This is because if the data rows which are to be inserted are not in sorted order then, sorting the data set will be done first, which adds up the extra cost. But this need sorting data is not needed while inserting the same data rows in the heap table. Since heap tables don't care about the ordering of the rows. It will only find the space within each page and then write on those pages. Let's try a simple example for analyzing the performance of both clustered tables and heap tables.

-- creates room_series table with clustered index on id column
drop table if exists room_series
create table room_series(
    id int identity(1,1) primary key,
    day int,
    start_time time(0),
    end_time time(0),
    time_id int,
    room_id int
);

with weeks (day) as (
    select 1
    union all
    select day + 1
    from weeks
    where day < 7
), five_min_series (start_time, end_time, time_id) as (
    select cast ('00:00:00' as time(0)) as start_time, cast ('00:05:00' as time(0)) as end_time, 1 as level
    union all 
    select
        cast(DATEADD(mi, 5, start_time) as time(0)) as start_time,
        cast(DATEADD(mi, 5, end_time) as time(0)) as end_time,
        time_id + 1
    from five_min_series t1
    where time_id < 288 -- there are 288 blocks of 5 minutes in 24 hrs.
), rooms (room_id) as (
    select 1
    union all
    select room_id + 1
    from rooms where room_id < 100
)
insert into room_series (day, start_time, end_time, time_id, room_id) 
select day, start_time, end_time, time_id, room_id from weeks, five_min_series, rooms
 option(maxrecursion 0); -- populating records in room_series table

drop table if exists clustered_room_series;
create table clustered_room_series(
    id int not null primary key, -- doesnot have identity(1,1) property
    day int,
    start_time time(0),
    end_time time(0),
    time_id int,
    room_id int
);

drop table if exists heap_room_series;
create table heap_room_series(
    id int not null,
    day int,
    start_time time(0),
    end_time time(0),
    time_id int,
    room_id int
);

For the illustration first, we created a room_series table that holds our data to simulate insertions. Then we created two tables above, one is a clustered table named clustered_room_series and the other begins a heap table named heap_room_series. The clustered table has a primary key on the id column, so it will create a clustered index on the clustered_room_series table. Now we will be performing inserts from a table named room_series where we have 201600 rows as data. It has clustered index on the id column. We will simulate insertions to see how both tables respond.

insert into clustered_room_series(id, day, start_time, end_time, time_id, room_id) select id, day, start_time, end_time, time_id, room_id from room_series;

insert into heap_room_series(id, day, start_time, end_time, time_id, room_id) select id, day, start_time, end_time, time_id, room_id from room_series;

-- Table 'clustered_room_series'. Scan count 0, logical reads 5998, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
-- Table 'room_series'. Scan count 1, logical reads 1132, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

-- (201600 rows affected)

-- (1 row affected)
-- Table 'heap_room_series'. Scan count 0, logical reads 202726, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
-- Table 'room_series'. Scan count 1, logical reads 1132, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

-- (201600 rows affected)

-- (1 row affected)

-- Completion time: 2022-03-24T05:09:40.6912041+05:30

execution plan for insertions when source table has clustered index.PNG

As we can see with the above image of the query plan for both insertions, it can be understood that the cost is the same. Since the room_series table, our source table is already clustered table, it doesn't need to spend any extra to sort the rows before insertion in the clustered_room_series table. There is a difference in logical reads for both insertions. Insertion in clustered table required 5998 logical reads and heap table required 202726 reads.

With a low logical reads count, we can say that clustered table works better than a heap table when the source data set is also properly indexed. Various combination of indexes gives various output. For the table with the clustered index, a new data row must be inserted into the correct location as per the column definition in an index. And in the heap table, the new data row itself can be inserted into any page on which space is available, or on a new page at the end of the chain.

Now let's simulate the above script again but this time the source table i.e. room_series will be a heap table. And we will be performing the insertions in both clustered and heap tables.

insert into clustered_room_series(id, day, start_time, end_time, time_id, room_id) select id, day, start_time, end_time, time_id, room_id from room_series;

insert into heap_room_series(id, day, start_time, end_time, time_id, room_id) select id, day, start_time, end_time, time_id, room_id from room_series;
--Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
--Table 'room_series'. Scan count 9, logical reads 1127, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
--Table 'clustered_room_series'. Scan count 0, logical reads 5998, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

--(201600 rows affected)

--(1 row affected)
--Table 'heap_room_series'. Scan count 0, logical reads 202726, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
--Table 'room_series'. Scan count 1, logical reads 1127, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

--(201600 rows affected)

execution plan for insertion when source table is heap table.PNG

With the involved statistics and image above. we can see one major difference and it's the change in the process flow. Since this time the source data, i.e. room_series table is not indexed in any way, the insertion operation for the clustered table in clustered_room_series will sort operation after table scan, and then only clustered index insert can take place.

While with the heap table, it does not matter the ordering of the row sets in any case. The operation is fairly straightforward for heap tables, doing table scans and then table inserts. The query cost is also different from the then first scenario where the source table was clustered table. In this second scenario, the query cost for insertion in the clustered table is about 65 % and in the heap table, it is about 35 %. The logical reads have been the same as the first scenario.

With both scenarios' we can say that with different variants we get different statistics to analyze. In the case of logical I/O, the heap table is at disadvantage compared to the clustered table. Space taken by each table index is also a factor to be considered. The statistics change as we make changes and test out with various modifications, for better performance with optimum cost.