Over the weekend I read a paper titled Column-Stores For Wide and Sparse Data by Daniel J. Abadi. Below is a quick summary of the points made by Abadi, followed by some thoughts on how they apply to SQL Server's ColumnStore index.
Wide Tables & Column “Elimination”
I’ve already discussed (in a previous post) why ColumnStores are an ideal storage structure for wide tables (compared to RowStores) from the perspective of query performance.
ColumnStores only retrieve data (from storage) necessary to satisfy the query
This notion is inherent in the column-oriented architecture…data pages in column-oriented storage contain data from a single column only. Therefore, only the data pages containing the columns referenced by the query need to be pulled from storage. In row-oriented storage, data pages contain entire table rows, so the data pages holding all columns of the row have to be retrieved from storage. Then the subset of columns referenced by the query can be extracted (which has an additional cost…no free lunches here) and operated on. To put it another way…the I/O for column-oriented data isn't "faster", just more efficient.
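A quick way to see this difference for yourself in SQL Server is to compare logical reads with SET STATISTICS IO. The table and column names below are hypothetical — the point is just a query that touches two columns of a wide table:

```sql
-- dbo.WideFact is a hypothetical wide table with many columns.
SET STATISTICS IO ON;

-- Only c1 and c2 are referenced. With a columnstore index, only those
-- two columns' segments are read from storage; a rowstore scan reads
-- full-row pages and then discards the columns it doesn't need.
SELECT c1, SUM(c2) AS total
FROM dbo.WideFact
GROUP BY c1;
```

Compare the "logical reads" figure in the Messages tab with and without a columnstore index in place — the gap widens as the table gets wider.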
Sparse Data & Compression
When data is described as being sparse, it just means there are a lot of empty fields – where a field is the intersection of a column and a row. So a sparse column is one with many NULL values. The question, then, is why space and I/O should be wasted on "empty" values.
Note: There are a few scenarios where the fact that something didn't happen, or that a value does not exist, is of interest. But for most applications, the point of interest revolves around values that do exist and events that did occur.
A common way to handle sparse data in SQL Server is to use the SPARSE column property in the table definition. When this property is specified for a column, the data on disk is stored in a way that reduces the amount of storage space needed for NULL values. There are, of course, trade-offs that you should be aware of… so be sure to read the entire MSDN page before implementing.
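As a quick illustration (dbo.SparseDemo is a made-up table), the property is just a column-level keyword, and you can confirm it took effect via sys.columns:

```sql
-- SPARSE only pays off when most of the column's values are NULL;
-- MSDN documents per-data-type thresholds where savings kick in.
CREATE TABLE dbo.SparseDemo (
    id    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    notes VARCHAR(200) SPARSE NULL
);

-- Confirm which columns were declared SPARSE
SELECT name, is_sparse
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.SparseDemo');
```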
One of the trade-offs of using the SPARSE column property is that you can't put a ColumnStore index on a SPARSE column. Here's what it looks like if you try…
CREATE TABLE dbo.t1 (
    id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    c1 VARCHAR(20) NOT NULL,
    c2 VARCHAR(20) SPARSE NULL
);

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCS_IDX_dbo_t1_v1
ON dbo.t1 (id, c1, c2);
Msg 35309, Level 16, State 1, Line 1
CREATE INDEX statement failed because a columnstore index cannot be created on a sparse column. Consider creating a nonclustered columnstore index on a subset of columns that does not include any sparse columns.
I ran into this issue recently with a blended transaction table that will have a few columns sparse enough to warrant use of the SPARSE column property. Begrudgingly, I removed the SPARSE column properties from the table definition (removing the ColumnStore index was NOT an option). However, after reading this paper, I realize now that sparse data is not really a problem with ColumnStores.
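For what it's worth, here's a sketch of the adjusted definition I ended up with — the same table as above, with SPARSE dropped from c2 so the index creation succeeds:

```sql
-- Same table as the repro above, minus the SPARSE property,
-- so the nonclustered columnstore index can include every column.
CREATE TABLE dbo.t1 (
    id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    c1 VARCHAR(20) NOT NULL,
    c2 VARCHAR(20) NULL  -- SPARSE removed
);

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCS_IDX_dbo_t1_v1
ON dbo.t1 (id, c1, c2);
```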
According to Abadi…
Storing data from the same attribute domain together increases locality and thus data compression ratio (especially if the attribute is sorted). Bandwidth requirements are further reduced when transferring compressed data.
So if you think of NULL as just another distinct value in a column, then sparse columns (columns with lots of NULL values) should benefit from incredibly high compression ratios – long runs of identical NULLs are exactly what schemes like run-length encoding handle best. And high compression means more bang for your buck when it comes to I/O. And simpler compression algorithms mean less CPU cost when decompressing the data. Win-win!
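If you want to sanity-check this on your own data, one way (a sketch, assuming a table with a columnstore index such as dbo.t1 above) is to look at per-column segment sizes via sys.column_store_segments — a mostly-NULL column should show a much smaller on_disk_size than a dense column with the same row_count:

```sql
-- Per-column segment sizes for the columnstore index on dbo.t1.
-- Sparse (mostly-NULL) columns should compress into tiny segments.
SELECT s.column_id, s.segment_id, s.row_count, s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON s.hobt_id = p.hobt_id
WHERE p.object_id = OBJECT_ID('dbo.t1');
```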