The Power Query engine was first introduced as an Excel plugin (back when it was referred to as “Data Explorer”)… then we saw it again in Power BI Desktop… then again earlier this year in SSAS Tabular… and now (or very soon) as a cloud service.
Quite frankly, I’ve been ignoring the whole Power Query story for a while, as I only viewed it as a tool for the Information-Worker/Savvy-Business-Analyst to build better Excel workbook reports. In my mind, IT-developers/BI-professionals would continue using more formal ETL tools (e.g. SSIS) for enterprise solutions. But then I saw the Power Query “get-data” interface added to SSAS Tabular, along with an interesting PoC by Kay Unkroth… and it gave me pause.
This is happening, isn’t it? Guess I better brush up. So I bought a book and attended some Power Query/M sessions at SQL Saturday Atlanta… and I’m now convinced that the Power Query engine is going to revolutionize the Business Analytics space in the coming years, because it brings a very crucial capability that’s been missing for years: “data exploration” for the masses!
Remember 5 years ago…
…when folks outside of the hyper-scale web startup demographic started talking about Hadoop and Big Data?
Traditional database conferences (read: PASS Summit) were being infiltrated by an army of open-source Java-coding mercenaries trying to convince you & me that relational databases were going away, and that we’d better come to their session and learn how to build a Hadoop-based recommendation engine, or get left behind in the dust.
Fortunately/unfortunately (depending on your perspective and incentives) it didn’t really go that way, did it? Admittedly, I bought into the hype for a few months (a year?) and rewrote my entire continuing-education plan around these new technologies. During that period I learned a few things…
- Hadoop/Hive is slow and never going to work as a backend for ad-hoc analytical queries that need to occur at “speed of thought” (really don’t like that term, but it flows better than “at a speed that doesn’t make me want to claw my eyes out of my skull”)
- Data Lake is a very appealing architecture (and strategy)
- schema on-demand is NOT a feature
The picture being painted for you was this:
- Put all the data in the (data) lake
- Unleash the invincible army of data analysts to start exploring all this data
- Discover opportunities/business-value
- Build (or incorporate it into) your traditional DW/BI solution
Invincible Invisible Armies
Who is this army of data analysts that understands Hadoop/Hive, let alone has the skills needed to analyze all this unstructured/semi-structured data sitting in the lake?
In my opinion, they didn’t exist. From what I could tell, there were only a handful of folks with the skills needed to be effective in this type of role and make this type of framework truly viable. Even now, while the number of skilled/enabled resources has grown and the tools have gotten a bit easier to use and deal with… I still think this number is not nearly large enough to support a massive surge in big data architectures such as the data lake.
But I think that will change rather quickly w/ Power Query.
With only a brief intro to Power Query and the M language, I’ve found that the ease with which it can be leveraged to wrangle unstructured/semi-structured data is nothing short of magic!
Having to interface w/ REST APIs, OData, XML/JSON files, and other non-traditional data sources is no longer a task to fear. Power Query is Microsoft’s answer to the Invisible Armies issue… as it solves the problem of data exploration and the lack of skilled resources:
- Connectivity to a wide range of data sources and interfaces
- Interactive and intuitive UX for cleaning & mashing up disparate data sources
- Built on extensible M-language
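To give you a feel for that last point, here’s a minimal M sketch of the kind of “REST API to clean table” wrangling I’m talking about. The endpoint URL and field names are made up for illustration; only the library functions (Web.Contents, Json.Document, Table.ExpandRecordColumn, etc.) are the real thing.

```
// Hypothetical example: pull a JSON array from a REST endpoint
// and shape it into a typed table.
let
    // Call the (made-up) endpoint and parse the JSON response into a list
    Source = Json.Document(Web.Contents("https://api.example.com/orders")),
    // Turn the list of records into a single-column table
    ToTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Expand the record column into real columns (field names are illustrative)
    Expanded = Table.ExpandRecordColumn(ToTable, "Column1", {"OrderId", "Customer", "Total"}),
    // Assign proper data types so downstream tools behave
    Typed = Table.TransformColumnTypes(Expanded, {{"OrderId", Int64.Type}, {"Total", Currency.Type}})
in
    Typed
```

The kicker is that you rarely have to write this by hand… the interactive UI generates these steps for you as you click around, and you can drop into the M code when you need something the UI doesn’t expose.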
There’s still a long road ahead for Power Query and the M language, but there’s no doubt in my mind that it’s going to shape the data wrangling landscape.