Compare and Find Row Changes in SQL Server with CHECKSUM(), BINARY_CHECKSUM(), and HASHBYTES()

In this post, we will look at how to detect and identify row changes in SQL Server with the CHECKSUM(), BINARY_CHECKSUM(), and HASHBYTES() functions. There are use cases where comparing or detecting row changes in a table is necessary, such as when a long transaction is taking place on a certain row set. We want to be sure that this row set hasn't been updated or changed by other sessions before we do the final commit. Of course, the 'rowversion' data type is the most suitable option for that use case, and we have covered it in this post with examples. Another use case is checking whether an incoming data set is already present in the table: if both data sets are the same, we want to ignore the incoming data, and if they differ, we may want to update the row (a conditional update). We will discuss how to use these functions to find row changes.

We will be exploring these TSQL functions and understanding the differences between them with examples.

CHECKSUM()

The CHECKSUM() function returns a checksum (hash) value computed over an expression or over a row. This hash value can be stored in an extra column for each row of the table and later used for comparison during update operations or for detecting row changes. With every update operation, the stored hash value must also be recomputed from the expression or column inputs.

CHECKSUM ( * | expression [,...n ] ) 
Two expression lists return the same value if the types and values of their corresponding attributes are the same and they are specified in the same order. However, the function does not accept values of these types:

  1. CURSOR

  2. IMAGE

  3. NTEXT

  4. TEXT

  5. XML

For matching data types, NULLs are considered equal for the equality operator check. The dash character is ignored for the NCHAR and NVARCHAR data types, and trailing white spaces are also ignored.
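These quirks can be probed with a few simple statements. The examples below are illustrative sketches of the behavior described above, assuming a default case-insensitive collation:

```sql
-- NULLs of the same data type are treated as equal,
-- so both calls should return the same value:
SELECT CHECKSUM(CAST(NULL AS INT)), CHECKSUM(CAST(NULL AS INT));

-- The dash character is ignored for NCHAR/NVARCHAR inputs,
-- so these two calls should also return the same value:
SELECT CHECKSUM(N'ab-c'), CHECKSUM(N'abc');
```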

As mentioned on the Microsoft SQL Server documentation page “we recommend the use of CHECKSUM only if your application can tolerate an occasional missed change. Otherwise, consider using HASHBYTES instead. With a specified MD5 hash algorithm, the probability that HASHBYTES will return the same result, for two different inputs, is much lower compared to CHECKSUM.”

select checksum('test          ');
select checksum('Test');

The above two checksum statements return the same hash value (assuming a case-insensitive installation of the instance), since trailing white spaces are ignored.
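As a sketch of the storage pattern described above, the checksum can be kept alongside the row, for example as a persisted computed column. The table and column names here are hypothetical:

```sql
-- Hypothetical table: the checksum over the tracked columns is kept
-- as a persisted computed column and maintained automatically.
CREATE TABLE dbo.Customer
(
    CustomerId INT IDENTITY PRIMARY KEY,
    FirstName  NVARCHAR(50)  NOT NULL,
    LastName   NVARCHAR(50)  NOT NULL,
    Email      NVARCHAR(100) NULL,
    RowCheck   AS CHECKSUM(FirstName, LastName, Email) PERSISTED
);

-- Later, an incoming row can be compared without checking each column:
-- SELECT 1 FROM dbo.Customer
-- WHERE CustomerId = @Id
--   AND RowCheck <> CHECKSUM(@FirstName, @LastName, @Email);
```

Because CHECKSUM() is deterministic, the computed column can be persisted and even indexed, which keeps the comparison cheap.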

BINARY_CHECKSUM()

Similar to CHECKSUM(), BINARY_CHECKSUM() returns a binary checksum value computed over a row or a list of expressions. It is more precise than CHECKSUM(), and hence its computational overhead is higher. Like CHECKSUM(), the BINARY_CHECKSUM() value can be stored in an extra column for each row of the table and later used for comparison during update operations or for detecting row changes.

BINARY_CHECKSUM ( * | expression [ ,...n ] )

The BINARY_CHECKSUM() computation returns the same value as long as the row isn't modified later. Unlike CHECKSUM(), collisions are less likely with BINARY_CHECKSUM(). Two lists return the same value if the corresponding attributes have the same data type and are equal when compared using the equals (=) operator, based on a more precise byte representation. However, the function does not accept values of these types:

  1. CURSOR

  2. IMAGE

  3. NTEXT

  4. TEXT

  5. XML

The Microsoft SQL Server Documentation page also mentions similar thoughts for BINARY_CHECKSUM "so to detect whether values have changed, we recommend the use of BINARY_CHECKSUM only if your application can tolerate an occasional missed change. Otherwise, consider using HASHBYTES instead. With a specified MD5 hash algorithm, the probability that HASHBYTES will return the same result, for two different inputs, is much lower than BINARY_CHECKSUM."

select BINARY_CHECKSUM('test');
select BINARY_CHECKSUM('Test');

Both of the above statements return different values. Do note that BINARY_CHECKSUM() is case-sensitive, and unlike CHECKSUM(), it does not ignore dash characters.
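Building on the comparison pattern above, a conditional update can skip rows whose incoming values produce the same binary checksum as the stored ones. The table and parameter names here are hypothetical:

```sql
-- Hypothetical table dbo.Product with an incoming row (@Id, @Name, @Price).
-- The row is updated only when the binary checksum of the stored values
-- differs from the checksum of the incoming values.
UPDATE dbo.Product
SET    Name  = @Name,
       Price = @Price
WHERE  ProductId = @Id
  AND  BINARY_CHECKSUM(Name, Price) <> BINARY_CHECKSUM(@Name, @Price);
```

When the checksums match, no row qualifies and the write (with its logging and trigger costs) is avoided entirely.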

HASHBYTES()

The HASHBYTES() function is primarily used for hashing values in cryptographic scenarios, but it can also be used for comparing or detecting changes in rows. It returns a hash value based on the algorithm specified as input.

HASHBYTES ( '<algorithm>', { @input | 'input' } )

<algorithm> ::= MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512

As HASHBYTES() is a more accurate hashing function than BINARY_CHECKSUM(), it also has a higher computational overhead than CHECKSUM() and BINARY_CHECKSUM(). As of SQL Server 2016, all of the algorithms are deprecated except SHA2_256 and SHA2_512. Unlike the checksum functions, its input must be of a character or binary type (varchar, nvarchar, or varbinary), and passing NULL simply returns NULL.

select HASHBYTES('SHA2_256', 'test');
select HASHBYTES('SHA2_256', 'Test');

It is also case-sensitive, like BINARY_CHECKSUM(). Despite its overhead cost, it is well suited for change detection. Its input was limited to 8,000 bytes in earlier versions; fortunately, in SQL Server 2016 and later, that limitation was removed. It is true that the more powerful the hashing algorithm, the more remote the chances of collisions for different inputs, but this also increases the computational cost and resource overhead.
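The same change-detection pattern can be sketched with HASHBYTES() over a row's columns. Since HASHBYTES() takes a single character or binary input, the columns are concatenated first; CONCAT_WS() (available in SQL Server 2017 and later) with a separator avoids accidental collisions between, say, ('ab', 'c') and ('a', 'bc'). The table name is hypothetical:

```sql
-- Hypothetical table dbo.Customer: compute a SHA2_256 hash over the
-- tracked columns, using a separator so adjacent values can't merge.
SELECT CustomerId,
       HASHBYTES('SHA2_256',
                 CONCAT_WS('|', FirstName, LastName, Email)) AS RowHash
FROM   dbo.Customer;

-- An incoming row matches only when its hash equals the stored one:
-- WHERE RowHash = HASHBYTES('SHA2_256',
--                           CONCAT_WS('|', @FirstName, @LastName, @Email));
```

Note that CONCAT_WS() treats NULL columns as empty strings, so '|x' and NULL|'x' hash the same; if that distinction matters, a NULL marker value should be substituted first.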

Summary

  1. We have to choose according to the requirements and the restrictions on data types.

  2. If you want cross-platform implementations, then it is better to use HASHBYTES().

  3. If the dash character or case sensitivity matters, go for BINARY_CHECKSUM(), as it also has fewer collisions than CHECKSUM().

  4. The HASHBYTES() algorithms have higher computational overhead, so use them only if necessary.

As mentioned above, the post on using the 'rowversion' data type to detect and find changes in rows can be found HERE.