On 1/14/2014, I presented Bus Matrix…the foundation of your Dimensional Data Model via the Free Training on the T’s offered by PragmaticWorks. If you weren’t able to attend the live session, you can watch the recording here.
This presentation touched on a lot of complex concepts (dimensional modeling, conformed dimensions, star schemas), and it was also my first time delivering it, so I was definitely concerned about whether I'd be able to deliver the key points in a clear, concise manner. Judging by the caliber of questions at the end of the session, I'd say everything was successfully received…now whether that was because of a smart, experienced audience, my awesome presentation and communication skills, or a combination of both…only time will tell.
Unfortunately, there wasn't enough time left to answer all the questions. Thankfully, the thoughtful folks at PragmaticWorks were kind enough to email me a list of all the questions so I could answer them in this blog post 😉
Q: When you conform dimensions, how do you know where the data comes from? Basically, if an employee exists in two systems with different IDs and you conform them into one record, how do you track this?
When I answered this question during the presentation, I think I missed the second part which completely changes the answer.
This question touches on one of the more common issues when conforming dimensions across multiple systems. In my experience, it is really rare for two completely separate source systems to use identical keys for the same ‘entity’ (e.g. employee). So in order to conform the data between these two systems, you have to match the entities (usually by a combination of attributes) and build (and maintain) a mapping of keys between the two systems.
Thomas Kejser (b | t) has a nice ‘roll-your-own’ approach laid out in the following two posts: here and here. The process of creating the initial mappings and maintaining them over time can be expedited via the use of Master Data Services (MDS) and Data Quality Services (DQS).
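To make the key-mapping idea concrete, here's a minimal sketch of what such a mapping table might look like. All table and column names here are illustrative assumptions, not something from the presentation or from Thomas's posts:

```sql
-- Hypothetical cross-system key map: one conformed surrogate key per
-- real-world employee, with one row per (source system, native key) pair.
CREATE TABLE dbo.MAP_EMPLOYEE_KEYS (
    ConformedEmployeeKey INT IDENTITY(1,1) PRIMARY KEY,
    SourceSystem         VARCHAR(20) NOT NULL,  -- e.g. 'HR' or 'PAYROLL'
    SourceEmployeeId     VARCHAR(50) NOT NULL,  -- native key in that system
    CONSTRAINT UQ_MAP_EMPLOYEE_KEYS UNIQUE (SourceSystem, SourceEmployeeId)
);

-- During the dimension load, each source row is translated to its
-- conformed key via the map, so DIM_EMPLOYEE ends up with a single
-- record per person regardless of how many systems they appear in.
SELECT m.ConformedEmployeeKey,
       e.FirstName,
       e.LastName
FROM   HR.dbo.Employee AS e
JOIN   dbo.MAP_EMPLOYEE_KEYS AS m
  ON   m.SourceSystem     = 'HR'
 AND   m.SourceEmployeeId = e.EmployeeId;
```

The hard part, of course, is populating the map in the first place: matching entities across systems by a combination of attributes, which is exactly where MDS and DQS can help.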
Q: Can you get into more details about an automated version of the Bus Matrix?
Please check out the posts by Alex Whittles (b | t)…here, here, and here…where he walks you through the design and then provides you with some scripts and sample reports to implement this solution in your own environment. By the way, if you follow SQL-related blogs, Alex's is a good one – he posts really creative solutions to a variety of BI problems.
Q: How are conformed dimensions used for drilling across to other fact tables?
The exact mechanism for drilling across from one fact table to another will ultimately be determined by the reporting tool. However, the capability hinges on the two fact tables sharing common dimension keys. So if you are looking at sales amounts from the FACT_SALES table for a list of products on a specific date and you want to know what the inventory quantity was for those products on that same date, then both FACT_SALES and FACT_INVENTORY need to have the same keys for product and date.
To break this down a bit more, consider the following queries:
```sql
/* SalesAmount By Product By Date */
SELECT
     DateKey
    ,ProductKey
    ,TotalSalesAmount = SUM(Sales_Amount)
FROM FACT_SALES
WHERE DateKey BETWEEN 20130101 AND 20131231
GROUP BY
     DateKey
    ,ProductKey
```
```sql
/* Inventory By Product By Date */
SELECT
     DateKey
    ,ProductKey
    ,QuantityOnHand = SUM(QuantityIn - QuantityOut)
FROM FACT_INVENTORY
WHERE DateKey BETWEEN 20130101 AND 20131231
GROUP BY
     DateKey
    ,ProductKey
```
The first one shows total sales amount by date and product for 2013. The second one shows QuantityOnHand by date and product for 2013. Because the dimensions are conformed, we can construct the following query, which joins the results of the two queries above in order to match the two measures side by side…
```sql
SELECT
     sales.DateKey
    ,sales.ProductKey
    ,sales.TotalSalesAmount
    ,inv.QuantityOnHand
FROM (
    /* SalesAmount By Product By Date */
    SELECT
         DateKey
        ,ProductKey
        ,TotalSalesAmount = SUM(Sales_Amount)
    FROM FACT_SALES
    WHERE DateKey BETWEEN 20130101 AND 20131231
    GROUP BY
         DateKey
        ,ProductKey
) AS sales
INNER JOIN (
    /* Inventory By Product By Date */
    SELECT
         DateKey
        ,ProductKey
        ,QuantityOnHand = SUM(QuantityIn - QuantityOut)
    FROM FACT_INVENTORY
    WHERE DateKey BETWEEN 20130101 AND 20131231
    GROUP BY
         DateKey
        ,ProductKey
) AS inv
    ON inv.DateKey = sales.DateKey
    AND inv.ProductKey = sales.ProductKey
```
I have another example using a cube in this blog post.
Q: Question…so the tables should be designed based on the questions we want to ask the DW? What I meant was, the Bus Architecture is based on the questions we are going to be asking the data warehouse, right?
This is an awesome question because it touches on some issues/ideologies that I've been working through for the past few years. Yes, you want to design the dimensional model based on the business: how the business users see the business, the questions they want to be able to ask and answer, etc. The typical approach is through a series of interviews with business users and subject matter experts. This is a really important process and will uncover potential issues before you paint yourself into a corner…for example, maybe the source system doesn't contain all the information needed to answer the questions the business users want to ask, which means changes to the source system need to take place. Eventually you end up with a dimensional model that answers the questions the business wants to ask and supports the metrics the business wants to track. Life is good.
But what happens down the road when the business evolves? Now the existing dimensional model doesn't answer the new questions the business needs to ask. Or maybe some of the source systems have been replaced coinciding with changes to business processes – now the new source system doesn't contain all the information needed to populate the existing dimensional model. At this point, the dimensional model needs to be altered or recreated entirely to meet the new needs of the business. This can be painful.
Fear not, for this is where architectures like Data Vault really shine. A data vault model is very flexible and easily adapts to these types of changes. I would argue that a data vault more closely models the underlying source systems than it models the business, which is why a dimensional model (via data marts and/or cubes) is still so important. It is in the area between the Data Vault DW and the data marts that the business rules are applied and the data is transformed into a dimensional model. In fact, there are quite a few tools that can help automate the process of moving the data between a Data Vault DW and a dimensional model (e.g. WhereScape RED). The benefit is that when the business does evolve (because it will), the dimensional model can be quickly redesigned and repopulated directly from the Data Vault DW instead of having to dip back into the source systems (which may not contain all of the historical information needed to regenerate the dimensional model).
Q: Can you have both conformed and unconformed dimensions?
Yes, but I can’t really think of a scenario where you would want to have an unconformed dimension.
If this was your question, please shoot me an email so that we can dig into this further – perhaps I’m not understanding what it is you are asking.
Q: Since a data warehouse is all about denormalized data, and all these years we have been using normalized data in an RDBMS, going forward which one should we consider for our day-to-day business: normalized or denormalized data?
You could use a denormalized model for everything, but I wouldn't recommend it. There are some very good reasons to keep using a normalized model for line-of-business databases that support a specific business process. Wikipedia has a pretty good list of the objectives of normalization.
And here’s an excerpt from Introduction to Data Normalization: A Database “Best” Practice:
The advantage of having a highly normalized data schema is that information is stored in one place and one place only, reducing the possibility of inconsistent data. Furthermore, highly-normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes results in similar solutions (at least from a data point of view). This generally makes it easier to map your objects to your data schema.
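To illustrate the trade-off, here's a minimal sketch contrasting a normalized operational design with the denormalized dimension it might feed. The table and column names are my own illustrative assumptions:

```sql
-- Normalized (OLTP): each fact is stored in one place only, so a
-- category rename is a single-row update.
CREATE TABLE dbo.Category (
    CategoryId   INT PRIMARY KEY,
    CategoryName VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Product (
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL,
    CategoryId  INT NOT NULL REFERENCES dbo.Category (CategoryId)
);

-- Denormalized (warehouse): the dimension flattens the hierarchy so
-- analytical queries need no extra joins; CategoryName is deliberately
-- repeated on every product row.
CREATE TABLE dbo.DIM_PRODUCT (
    ProductKey   INT PRIMARY KEY,
    ProductName  VARCHAR(50) NOT NULL,
    CategoryName VARCHAR(50) NOT NULL
);
```

The normalized pair protects update consistency for the line-of-business system; the flattened dimension trades that redundancy for simpler, faster analytical queries.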
Q: Combine them in one SSAS project? When we combine stars, do we combine them in one project?
Before I get into the answer, I'd like to define the development hierarchy of an SSAS project. An SSAS project represents a single SSAS database…an SSAS database can contain many SSAS cubes…and an SSAS cube can contain many "stars".
Based on the hierarchy defined above, I think the real question is: how do you determine when to combine stars in a single SSAS cube? Another way to phrase it: which fact tables (aka star schemas) should I include in the same cube?
The quick and dirty answer is to include multiple stars in a single cube when you know you need to analyze data across those stars…aka cross-process analysis.
The longer answer ultimately boils down to a question of whether to create one large cube with everything in it or create many small cubes specific to each business process. This is a question that has been debated for years. Below are a few links that outline the pros/cons of each approach:
- One Cube vs Multiple Cubes
- SSAS: One giant cube or many small ones?
- SSAS: One Database, many Cubes Vs. One Database per Cube
If you think of any additional questions, please feel free to reach out to me directly or in the comments and I’ll add them to this post.