Channel: SQL Server

Blog Post: Hashing for Indexes

Hashing functions

I'm going to talk about two ways to hash your data inside SQL Server and one way outside SQL Server.

CHECKSUM options

Let's talk about BINARY_CHECKSUM() and CHECKSUM(). These are both vulnerable to collisions, but they're nice and quick. What's the difference between these two? CHECKSUM is collation aware but BINARY_CHECKSUM is not. So strings which evaluate as equal (for example, lowercase 'a' and uppercase 'A' under a case-insensitive collation) produce equal hashes with CHECKSUM but different hashes with BINARY_CHECKSUM.

Either of these is good for sanity tests (in test frameworks) and for hash indexes. But I wouldn't count on such an index to enforce uniqueness: both functions return a 4-byte INT, so the frequency of hash collisions is too high.

HASHBYTES options

HASHBYTES is another good option. You've got all kinds of hash algorithms with this one function. You can specify any of the following hash algorithms: MD2, MD4, MD5, SHA, SHA1, SHA2_256, or SHA2_512. Even with the limitation that you can only hash strings up to 8000 bytes (a limit that applies through SQL Server 2014), I like HASHBYTES a lot. I don't use it for security, but I use it for unique hash indexes for database rows. I've described how to be careful when dealing with Unicode strings in Careful Hashing.

Here's an example. Consider

    CREATE TABLE Dim_Tweet
    (
        TweetId INT IDENTITY PRIMARY KEY,
        Tweet NVARCHAR(140) UNIQUE,
        Frequency INT
    )

It's got a pretty wide unique key on Tweet. I'd prefer something like

    CREATE TABLE Dim_Tweet
    (
        TweetId INT IDENTITY PRIMARY KEY,
        Tweet NVARCHAR(140),
        TweetHash AS HASHBYTES('SHA1', Tweet) PERSISTED UNIQUE,
        Frequency INT
    )

just for the narrower index: a SHA1 hash is 20 bytes, versus up to 280 bytes for the tweet itself.

But what about hash collisions? Lordy, lordy, this again? I'm firmly in the camp that the chance of an accidental collision is so small that the risk is negligible. I mean, the birthday paradox doesn't even come close to taking effect here. Just let me do some math: if you inserted a unique row into this table once a second for forty years, you'd still be way more likely to win the lottery than to encounter a collision.
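That back-of-the-envelope math can be sketched in a few lines of Python. This is only a rough check under stated assumptions, not part of the original post: the function name is made up for illustration, and it uses the standard birthday approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)), which assumes hash values behave like uniform random numbers and says nothing about deliberately crafted collisions.

```python
import math

def collision_probability(rows, hash_bits):
    """Approximate chance that at least two of `rows` distinct inputs
    share a hash value, assuming uniformly distributed hashes."""
    pairs = rows * (rows - 1) // 2            # number of distinct row pairs
    expected = pairs / 2 ** hash_bits         # expected colliding pairs
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected)

# One row per second for forty years (ignoring leap days).
ROWS = 40 * 365 * 24 * 60 * 60

# HASHBYTES('SHA1', ...) produces 160 bits.
sha1_p = collision_probability(ROWS, 160)

# CHECKSUM returns a 32-bit INT; 100,000 rows is an arbitrary example size.
checksum_p = collision_probability(100_000, 32)

print(f"rows inserted over forty years: {ROWS:,}")
print(f"SHA1 collision probability:     {sha1_p:.1e}")
print(f"CHECKSUM with 100,000 rows:     {checksum_p:.2f}")
```

Under these assumptions the SHA1 figure comes out around 5e-31, while a 32-bit checksum over just 100,000 rows already has roughly a 69% chance of at least one collision. That gap is why the wide hash can back a unique index and the 4-byte one can't.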
MD5 and SHA1 aren't exactly cutting-edge cryptography, but they're still good enough for what I need them to do (at least I'd choose them over the Hill Cipher).

Hashing Outside SQL Server

You're not storing your files in the database, right? That means that if you want to generate a checksum or hash for files, you're probably going to want to do that processing outside SQL Server. I've bookmarked a C# solution from Stack Overflow: How do I do a SHA1 File Checksum in C#? It's a good start.
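For comparison, here is a minimal sketch of the same idea in Python rather than C#, using only the standard library's hashlib module. The function name and chunk size are illustrative choices of mine, not anything taken from the linked answer.

```python
import hashlib

def sha1_file_checksum(path, chunk_size=64 * 1024):
    """Return the SHA1 digest of a file as a 40-character hex string."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file is never loaded
        # into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    import sys
    # Usage: python sha1sum.py file1 [file2 ...]
    for name in sys.argv[1:]:
        print(sha1_file_checksum(name), name)
```

If you want to keep the result next to a file path in a table, bytes.fromhex() turns the hex string back into the 20 raw bytes, which would fit a BINARY(20) column.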
