Building from my last post, Performance Metrics: Average, Mean, Median, and Outliers, where I covered the topic of when to use Median instead of Average, in this post I’m going (to try) to show you how to calculate the Median of a set of values in T-SQL, MDX, and DAX.
Calculate Median in MDX
Let’s start with MDX because it’s the easiest (for once) since we can use the built in Median function!
The following query calculates the Internet Sales Amount by Customer State-Province. The calculated measure, [Measures].[Median Sales Amount], calculates the median sales amount based on the items Customer Geography hierarchy, but on the level below the current level displayed on rows. Since the level displayed on rows is [State-Province], the level used to calculate the median-based measure is [City].
[sql]
WITH
MEMBER [Measures].[Median Sales Amount] AS
MEDIAN (
[Customer].[Customer Geography].CurrentMember.Children
,[Measures].[Internet Sales Amount]
)
,FORMAT_STRING="$#,##0.00"
SELECT
NON EMPTY {
[Measures].[Internet Sales Amount]
,[Measures].[Median Sales Amount]
} ON 0,
NON EMPTY {
[Customer].[Customer Geography].[State-Province].AllMembers
} ON 1
FROM
[Adventure Works]
WHERE
[Date].[Calendar Year].&[2007]
[/sql]
…and here’s what the output looks like:
Calculate Median in T-SQL
This one is a bit more tricky than MDX because there isn’t a built-in MEDIAN function. Luckily, this isn’t the first time someone has tried to ride this pony, so I was able to steal find a nice post by Adam Machanic (b | t) who, in turn, stole found an even better way to attack the problem in Joe Celko’s Analytics and OLAP in SQL.
The following is a query that calculates the median sales amount by the age of the customer for 2007. The logic is the same as that used in Adam Machanic’s post…I’m simply using CTE’s to break up the pieces because I think the query is more readable that way…personal preference 🙂
[sql]
WITH
CTE_SalesOrdersByCustomerAge AS (
SELECT CustomerAge = 2007 – YEAR(dc.BirthDate)
,fis.SalesOrderNumber
,SalesAmount = SUM(SalesAmount)
FROM dbo.FactInternetSales fis
INNER JOIN dbo.DimCustomer dc
ON dc.CustomerKey = fis.CustomerKey
WHERE fis.OrderDateKey BETWEEN 20070101 AND 20071231
GROUP BY 2007 – YEAR(dc.BirthDate)
,fis.SalesOrderNumber
),
CTE_MedianSubSet AS (
SELECT rs1.CustomerAge
,rs1.SalesAmount
,ROW_NUMBER() OVER (
PARTITION BY rs1.CustomerAge
ORDER BY rs1.SalesAmount ASC, rs1.SalesOrderNumber ASC
) AS RowAsc
,ROW_NUMBER() OVER (
PARTITION BY rs1.CustomerAge
ORDER BY rs1.SalesAmount DESC, rs1.SalesOrderNumber DESC
) AS RowDesc
FROM CTE_SalesOrdersByCustomerAge AS rs1
)
SELECT x.CustomerAge
,AVG(x.SalesAmount)
FROM CTE_MedianSubSet x
WHERE x.RowAsc IN (x.RowDesc, x.RowDesc – 1, x.RowDesc + 1)
GROUP BY x.CustomerAge
ORDER BY x.CustomerAge;
[/sql]
The first CTE groups the sales amount by sales order (the grain of FactInternetSales is OrderLine) and age of the customer. The next CTE uses the ROW_NUMBER() window function to partition and sort the results from the first CTE up and down. Finally the base query filters for the row or rows (with an even number of items, the median is simply the average of the two middle values) to find the Median.
UPDATE: H/T to Stuart Quinn (t) for suggesting the use of the new SQL Server 2012 T-SQL window functions: PERCENTILE_DESC / PERCENTILE_CONT. Below is a quick rewrite of the query using the PERCENTILE_CONT (continuous):
[sql]
;WITH
CTE_SalesOrdersByCustomerAge AS (
SELECT CustomerAge = 2007 – YEAR(dc.BirthDate)
,fis.SalesOrderNumber
,SalesAmount = SUM(SalesAmount)
FROM dbo.FactInternetSales fis
INNER JOIN dbo.DimCustomer dc
ON dc.CustomerKey = fis.CustomerKey
WHERE fis.OrderDateKey BETWEEN 20070101 AND 20071231
GROUP BY 2007 – YEAR(dc.BirthDate)
,fis.SalesOrderNumber
)
SELECT DISTINCT CustomerAge
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SalesAmount)
OVER (PARTITION BY CustomerAge) AS MedianCont
FROM CTE_SalesOrdersByCustomerAge
ORDER BY CustomerAge
[/sql]
A quick test shows negligible performance difference (on a small dataset), but there’s no denying the code is much cleaner. These two windowing functions should prove quite useful in profiling data and building appropriate metrics in T-SQL.
Calculate Median in DAX
Unfortunately, this is the part where the post comes up short. Like T-SQL, there’s no built in Median function in DAX. And after spending a bit of time trying to reproduce the logic from the T-SQL version I gave up and started digging around on Google. This produced a few very detailed posts (that really started to make my brain hurt) by DAX-superheroes Marco Russo (b | t), Alberto Ferrari (b | t), and Javier Guillen (b | t). And while each was able to put something together, I’m afraid I wouldn’t classify any of them as elegant.
If you’re interested in the solutions by the DAX superheros above:
Russo: Median calculation in DAX
Ferrari & Guillen: Median of a measure
Guillen: Median calculation in PowerPivot / DAX
Guillen: Quartile, Percentile and Median in PowerPivot
I thought DAX was supposed to be easier than MDX??
UPDATE 20151208: H/T to Sourav Agasti for reminding me in the comments that DAX v2.0 (Excel 2016, Power BI, and SSAS 2016) includes a Median function out of the box. He’s also included a link to an alternative solution for DAX v1.0
3 replies on “How to Calculate the Median in T-SQL, MDX, and not DAX”
[…] just came across this blog post by Bill Anton where he discusses several approaches to calculated the Median of a given set in […]
LikeLike
Hello. You would probably need to update the blog as DAX 2016 supports percentile calculation. Also, I have a workaround for calculating percentiles(median is 50th percentile) in lower versions of DAX. You can find it in this SO post. http://stackoverflow.com/questions/30634488/calculating-percentile-in-dax-solved
LikeLike
Hi Sourav – very true…I will update the post and check out your solution on SO later this week. Thank you!
LikeLike