For the vast majority of SQL Server professionals, it wasn’t until earlier this year, with the release of SQL Server 2012, that the concept of a “ColumnStore” (as opposed to a RowStore) popped up on the radar. And now, as part of the research for a current project, I’m trying to find details on Microsoft’s implementation of the ColumnStore index…but since this is such new technology, I’m finding the task a bit of a challenge. On top of that, during Ashvini Sharma and Allan Folting’s Optimizing Your BI Semantic Model for Performance and Scale presentation at TechEd North America 2012, it was mentioned that many of the compression details in the implementation of xVelocity (which is also used for the ColumnStore index in the database engine) are considered proprietary…so those details may never be released.
Thankfully, column-oriented database systems have been around for a while (since before the 1980s). So I’ve adjusted my research plan: I’m now going to learn how ColumnStore has been implemented in other systems and then circle back to see if I can deduce some of the details specific to Microsoft’s ColumnStore index.
To start, I recently read a paper titled Column-Stores vs. Row-Stores: How Different Are They Really? And to answer the title question in a word: “very!”
The purpose of the paper was 2-fold:
- evaluate the performance of various RowStore designs with the goal of matching the performance of ColumnStore
- assess/quantify the performance impact of various ColumnStore optimization techniques
Seeing as the overwhelming majority of my personal experience is with traditional RowStore systems, I found this first part much less interesting. However, for the sake of completeness, the RowStore designs used were:
- vertical partitioning: each column in a separate table along with the PrimaryKey (or SurrogateKey in the case of a composite PK)
- index-only: single table with a separate NonClustered index on each column. This is very similar in spirit to vertical partitioning, but avoids some of its overhead (e.g. repeating the key column in every table).
- materialized views: this involves simply creating an indexed view to satisfy each possible query. While this is the optimal case for a RowStore design, it is rarely feasible in terms of disk space and maintenance.
What I found much more interesting was the discussion of the common optimizations available for column-oriented storage: compression, late materialization, block iteration, and the invisible join. Below is a list with a few comments/considerations on each:
- Compression
  - Translates into less disk space and better I/O performance
  - RowStores can also be compressed, but ColumnStores typically achieve much higher compression ratios since each database page only contains data for a single column…which makes it much more likely to have similar values sitting next to each other, allowing more straightforward compression techniques such as run-length encoding
  - Comes at a cost: eventually, compressed data must be decompressed, which requires CPU cycles
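To see why single-column pages compress so well, here’s a toy run-length encoding sketch in Python (purely illustrative — this is not xVelocity’s actual algorithm, which as noted above is proprietary, and the sample data is made up):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive duplicates into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in pairs for _ in range(n)]

# A single-column page (especially if sorted) has long runs of repeats...
column_page = ["AL"] * 4 + ["GA"] * 3 + ["TX"] * 5
encoded = rle_encode(column_page)
print(encoded)  # [('AL', 4), ('GA', 3), ('TX', 5)] -- 12 values in 3 pairs
assert rle_decode(encoded) == column_page

# ...while a row-oriented page interleaves columns, so runs rarely form
# and RLE buys nothing.
row_page = ["AL", 1001, "GA", 1002, "TX", 1003]
print(rle_encode(row_page))  # every run has length 1
```

The CPU cost mentioned above shows up in `rle_decode`: the savings on disk and in the I/O path are paid back in cycles whenever the values must be expanded.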
- Late Materialization
  - In the most basic ColumnStore systems, data from each column is read from storage and reconstructed into a row before any operations (ex. filters, aggregations, etc.) take place…this process of reconstruction is referred to as materialization
  - Late materialization implies the query engine performs certain operations on the data first…before it is decompressed and reconstructed into a row. This is more efficient for a few reasons:
    - fewer rows may need to be reconstructed
    - more efficient use of cache structures since the data is still compressed
    - fixed-width columns (ex. INT) can be isolated from variable-width columns (ex. VARCHAR)…which is helpful because there are additional optimizations for fixed-width columns
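As a rough sketch of the first point (in Python, with made-up column names and data — no real engine works exactly like this), the difference between early and late materialization is whether you build every row before filtering, or filter one column first and reconstruct only the survivors:

```python
# Hypothetical column-wise storage for a 6-row fact table:
# three parallel arrays, one per column.
order_date_key = [20120201, 20120201, 20120115, 20120202, 20120115, 20120203]
customer_key   = [1, 2, 1, 3, 2, 1]
sales_amount   = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

def early_materialization(pred):
    """Stitch every row together first, then filter -- all 6 rows built."""
    rows = list(zip(order_date_key, customer_key, sales_amount))
    return [r for r in rows if pred(r[0])]

def late_materialization(pred):
    """Filter the single relevant column first; reconstruct only matches."""
    matching = [i for i, d in enumerate(order_date_key) if pred(d)]
    return [(order_date_key[i], customer_key[i], sales_amount[i])
            for i in matching]

in_feb = lambda d: 20120201 <= d <= 20120229
assert early_materialization(in_feb) == late_materialization(in_feb)
# Same 4-row result either way, but the late version only ever
# materialized those 4 rows (and could have filtered on encoded values).
```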
- Block Iteration
  - With RowStores, column values have to be extracted from each row before they can be worked on…this extraction process has a cost. With ColumnStores (combined with late materialization), there’s no extraction overhead because the column data is already separated from the rest of the row
  - Fixed-width columns can be processed in array structures, for which there are typically optimizations at the CPU/instruction-set level
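A minimal Python sketch of the contrast (illustrative only — the row tuples and column data are invented): a fixed-width column sitting in a contiguous array can be aggregated directly, while the row-at-a-time equivalent must first pull the field out of every row.

```python
from array import array

# Fixed-width column stored as a contiguous block of 4-byte ints --
# the kind of structure CPU-level optimizations work well on.
qty_column = array("i", [3, 1, 4, 1, 5, 9, 2, 6])

# Column-at-a-time: operate directly on the packed block of values.
total = sum(qty_column)

# Row-at-a-time equivalent: the qty field must be extracted from each
# row tuple before it can be used -- that per-row extraction is the cost
# block iteration avoids.
rows = [("some-customer", i, qty) for i, qty in enumerate(qty_column)]
total_from_rows = sum(row[2] for row in rows)

assert total == total_from_rows == 31
```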
- Invisible Join
  - Basically this involves converting joins into WHERE-clause predicates…which I’d almost argue is a query-writing hack/hint as opposed to a full-on optimization
  - See the example below for a very basic illustration…
The first is a pretty standard-looking star-schema query:
[sql]SELECT fs.*
FROM FactSales fs
INNER JOIN DimCustomer dc ON dc.CustomerKey = fs.CustomerKey
INNER JOIN DimDate dd ON dd.DateKey = fs.OrderDateKey
WHERE dd.CalendarYrMth = 201202 AND
      dc.Customer = 'Walmart'[/sql]
The second is the same query rewritten so that each dimension join becomes a WHERE-clause predicate:
[sql]SELECT fs.*
      ,CalendarDate = (SELECT CalendarDate FROM DimDate WHERE DateKey = fs.OrderDateKey)
FROM FactSales fs
WHERE fs.CustomerKey IN (SELECT CustomerKey FROM DimCustomer WHERE Customer = 'Walmart') AND
      fs.OrderDateKey IN (SELECT DateKey FROM DimDate WHERE CalendarYrMth = 201202)[/sql]
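Conceptually, the rewrite pays off because each dimension predicate boils down to a cheap membership test against a small set of qualifying surrogate keys, applied directly to the fact table’s key columns. A rough Python sketch of the idea (illustrative only — the table/column names mirror the queries above, and the data is made up):

```python
# Dimension tables as key -> attribute lookups (toy data).
dim_customer = {1: "Walmart", 2: "Target", 3: "Costco"}
dim_date = {20120201: 201202, 20120202: 201202, 20120115: 201201}

# Fact table stored column-wise: parallel key columns.
fact_customer_key   = [1, 2, 1, 3, 1]
fact_order_date_key = [20120201, 20120201, 20120115, 20120202, 20120202]

# Phase 1: apply each predicate to its dimension, keeping only the keys.
walmart_keys = {k for k, name in dim_customer.items() if name == "Walmart"}
feb_keys = {k for k, yrmth in dim_date.items() if yrmth == 201202}

# Phase 2: the "joins" become set-membership filters applied directly
# to the fact table's key columns -- no wide join rows are ever built.
matching_rows = [i for i in range(len(fact_customer_key))
                 if fact_customer_key[i] in walmart_keys
                 and fact_order_date_key[i] in feb_keys]
print(matching_rows)  # [0, 4]
```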
The rest of the paper walks through the setup and results of the ColumnStore vs RowStore performance comparisons (yes, ColumnStore “brings the ruckus” – it dominates the various RowStore designs, none of which come close to its performance…not much of a surprise here). It also discusses how the performance benefit of each individual optimization was measured…which I’ve briefly summarized below:
- Late-Materialization (3x improvement)
- Compression (2x improvement)
- Invisible Join (50% improvement)
- Block Iteration (negligible – but likely due to some issues they had measuring this one)
All in all, a very informative paper. Expect to see a future post focusing on the “Invisible Join” query-pattern with accompanying performance tests using SQL Server 2012 ColumnStore indexes.