Identify and Delete Duplicate records in SQL Server
How to get and delete duplicate rows / records in SQL Server
As a developer, you already know that any database table should not contain duplicate rows. But sometimes, we overlook these principles due to certain requirements or when we are importing data from different data-source like csv, databases, json etc. These scenarios lead to duplicate rows or records of data in a table.
There are many ways to get or identify duplicate rows and delete duplicate rows, we have to consider if the given table has a unique index or not. I do it below
When Table has Unique Index:
Get and Identify duplicate records or rows:
Lets create a table and queries for reference,
create table customers(
id int identity(1,1) primary key,
first_name varchar(40) not null,
last_name varchar(40),
email varchar(100) not null,
active bit not null default(1)
);
insert into customers(first_name, last_name, email) values
('Ken', 'Sanchez', 'ken@Sanchez.com'),
('Terri', 'Duffy', 'terri@duffy.com'),
('Roberto', 'Tamburello', 'rob@tam.com'),
('Rob', 'Walters', 'rob@walters.com'),
('Gail', 'Erickson', 'gail@eric.com'),
('Jossef', 'Erickson', 'jossef@eric.com'),
('Dylan', 'Miller', 'dylan@miller.com'),
('Diane', 'Margheim', 'diane@margheim.com'),
('Ken', 'Sanchez', 'ken@Sanchez.com'),
('Ken', 'Sanchez', 'ken@Sanchez.com');
select * from customers -- get's all rows from the customers table
select
first_name, last_name, email, count(*) as duplicate_count
from customers
group by first_name, last_name, email; -- return table with duplicate_count
As our customers table is ready, let's identify duplicate rows in the customers table, as we know it has duplicate rows and their count.
select
a.*
from customers a
join
(
select id, row_number() over(partition by first_name, last_name, email order by id) as ranking from customers
) b
on a.id = b.id
where b.ranking > 1;
The above query will return the below row set id first_name last_name email active 9 Ken Sanchez ken@Sanchez.com 1 10 Ken Sanchez ken@Sanchez.com 1
These rows are duplicate rows, baring the first occurrence of the same data in the table.
Delete duplicate records or rows:
delete
a
from customers a
join
(
select id, row_number() over (partition by first_name, last_name, email order by id) as ranking from customers
) b
on a.id = b.id
where b.ranking > 1
Here in the above query, with help of SQL Windows Function, we got the id and row_number (here referred to as ranking) from the customers table partitioned by the columns we want to group firstly, then we performed inner join with the customers table outside on condition of both the id but in where clause, using b.ranking > 1 we only got those records whose ranking is greater than one. Since the ranking will increase as many times as the records are duplicated in the table. In our case, it will keep the first record but will delete two duplicate records or rows.
When the Table does not have a Unique Index
When the table doesn't have any unique index or primary key, then we can do it using CTE. let's create a table and insert data and demonstrate it.
create table customers(
first_name varchar(40) not null,
last_name varchar(40),
email varchar(100),
active bit not null default(1)
);
insert into customers(first_name, last_name, email) values
('Ken', 'Sanchez', 'ken@Sanchez.com'),
('Terri', 'Duffy', 'terri@duffy.com'),
('Roberto', 'Tamburello', 'rob@tam.com'),
('Rob', 'Walters', 'rob@walters.com'),
('Gail', 'Erickson', 'gail@eric.com'),
('Jossef', 'Erickson', 'jossef@eric.com'),
('Dylan', 'Miller', 'dylan@miller.com'),
('Diane', 'Margheim', 'diane@margheim.com'),
('Ken', 'Sanchez', 'ken@Sanchez.com'),
('Ken', 'Sanchez', 'ken@Sanchez.com');
select * from customers; -- get all customers
Identify duplicate records or rows:
select
*, row_number() over(partition by first_name, last_name, email order by first_name) as row_num
from customers;
In the above query, we used the row_number() function on customers table by partitioning the required column and as we can see, the row_num column has a value of 1 except for the duplicate rows, whose row_num values are greater than 1. So, the query below will return us only duplicate rows
with customers_cte as (
select
*, row_number() over(partition by first_name, last_name, email order by first_name) as row_num
from customers
)
select * from customers_cte where row_num > 1; -- runts duplicate rows only
Delete duplicate records or rows:
To delete the duplicate rows, we will use the CTE and windows function similar to the above, we did for identifying the duplicate rows.
with customers_cte as (
select
*, row_number() over(partition by first_name, last_name, email order by first_name) as row_num
from customers
)
delete from customers_cte where row_num > 1;
There are many other ways to do a similar thing, I just jotted down the way I prefer to do it.