There has been a lot written in the past several years about the possible death of the traditional data warehouse as we know it. Having been involved with the rise (and potential fall) of such systems for the majority of my professional career, I find it interesting to explore some of the factors, technologies, and changing business models that are driving this fundamental shift. But first, it helps to put it in context as it relates to my career experiences in the space.
I began my career at MCI (later MCI WorldCom, WorldCom, MCI Inc., and finally acquired by Verizon in the mid-2000’s) in the MCI small business group, where my team was in charge of building and maintaining a lead generation system to support several outbound call centers across the US, targeting small and medium-sized businesses with corporate calling plans. One very important part of our landscape was our data warehouse, which we named “ATLAS”, short for “Advanced Telemarketing and Lead Analysis System”. This analytics platform provided the first entry point for our data strategists to begin segmenting customers along various dimensions before deciding to actually move forward with a strategic calling campaign. Given the timeframe of the late 1990’s, we obviously were not fortunate enough to have the vast number of choices for data management / data warehousing platforms that are so prevalent in today’s market.
After extensive research and due diligence, we decided on Sybase IQ as our data warehouse engine, a choice not easily made given the seemingly radical shift in design from what all of us were so familiar with in more traditional database technologies. The notion of a “columnar based relational database” seemed very foreign to most of our team, but I was open minded and persistent enough to convince us all that it made complete sense for our particular use case. Today, one would have a difficult time finding a data management technology that did not incorporate this concept in one way or another.
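To make the columnar idea concrete, here is a minimal sketch of the difference between row-oriented and column-oriented storage. The data and names are purely illustrative, not Sybase IQ internals:

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# Sample records: (customer name, state, annual revenue).
rows = [
    ("ACME Corp",    "TX", 1200.0),
    ("Blue Sky LLC", "CA",  860.0),
    ("Delta Foods",  "TX",  450.0),
]

# Row store: each record is kept together (good for OLTP point lookups).
row_store = list(rows)

# Column store: one contiguous list per attribute (good for analytics,
# since a scan touches only the columns a query actually needs).
col_store = {
    "name":    [r[0] for r in rows],
    "state":   [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}

# "SELECT SUM(revenue) WHERE state = 'TX'" reads just two of the three
# columns and never materializes whole rows:
tx_total = sum(
    rev for st, rev in zip(col_store["state"], col_store["revenue"])
    if st == "TX"
)
print(tx_total)  # 1650.0
```

For analytical queries that aggregate a few columns over many rows, this layout means far less data read per query, which is the property that made the design so compelling for our segmentation use case.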
My experience with IQ was actually the impetus that led me to leave MCI WorldCom in 1998 and join Sybase. I had experienced what I believed was a real “diamond in the rough” and I was eager to share it with the world! Fast forward more than a decade to the late 2000’s, a decade which saw a flurry of activity in specialized database technologies (think Hadoop), data warehouse appliances (think Greenplum, Netezza, Vertica), and massive increases in compute power, memory density, and storage capacity. During that timeframe, there was also another relatively unknown research project going on in the labs at SAP, one which would ultimately play a vital role in this notion of the dying traditional data warehouse. That project was what the world now knows as SAP HANA.
The first product shipped in late 2010, just months after my career story converged with SAP as a result of the acquisition of Sybase in May of the same year. The acquisition was a calculated move by SAP: the company knew of the imminent release of HANA and needed to widen its offerings in data management and associated technologies. Products such as Sybase Event Stream Processor (ESP) and Sybase IQ have played, and will continue to play, a very important role as complementary technologies that are being integrated ever more tightly with the SAP HANA platform.
Fast forward again to the present day, when we can again reflect on a game-changing innovation brought to market by SAP – this time the notion of a fully in-memory data platform suitable for both OLTP (on-line transactional processing) and OLAP (on-line analytical processing) workloads. It should also come as no surprise that Sybase’s original idea of commercializing a columnar-based relational database (Sybase IQ) is a fundamental design concept in SAP HANA as well!
It does not require a deep understanding of HANA to realize that turning SAP’s original ideas for the product into a reality relied heavily on many of the advances in hardware that had been taking place for the previous 20 years – in particular cheaper, higher-density memory and radical advances in parallel on-chip compute power and caches. However, I notice that many audiences find it difficult to understand why HANA is more than just a brute-force database that makes things run faster by keeping them in memory. In reality, it is far more than that! With credit to an ex-colleague, Henry Cook, I love to describe HANA as a “Massively Parallel, Hyperthreaded, Column Based, Dictionary Compressed, CPU Cache Aware, Vector Processing, General Purpose Processing, ACID compliant, Persistent, Data Temperature Sensitive, transactional, analytic, Relational, Predictive, Spatial, Graph, Planning, Text Processing, In Memory Database!” Each and every one of these notions contributes to the transformational platform that SAP has built. For obvious reasons, we shortened that for convenience to “SAP HANA Platform”, but we should never forget that there is a whole lot more to it than just “In-Memory.”
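One of the notions in that long description, dictionary compression, is worth a small illustration. The sketch below shows the general technique only; it is not HANA’s actual implementation, and the data is invented:

```python
# Toy sketch of dictionary compression for a string column.
states = ["TX", "CA", "TX", "NY", "CA", "TX"]

# Build a dictionary mapping each distinct value to a small integer code
# (Python dicts preserve insertion order, so codes are assigned 0, 1, 2...).
dictionary = {}
for s in states:
    dictionary.setdefault(s, len(dictionary))

# The column is stored as compact integer codes instead of strings.
codes = [dictionary[s] for s in states]   # [0, 1, 0, 2, 1, 0]

# A predicate like state = 'TX' is translated once into a code lookup,
# then evaluated as a plain integer comparison per value -- compact,
# cache-friendly, and amenable to vectorized (SIMD) execution on modern CPUs.
tx_code = dictionary["TX"]
match_count = sum(1 for c in codes if c == tx_code)
print(match_count)  # 3
```

The point is that several of the adjectives in the quote reinforce each other: dictionary compression shrinks columns so more data fits in memory and CPU caches, and the resulting fixed-width integer codes are exactly what vector processing units chew through fastest.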
Now that we have laid the foundation, let us return to our original question concerning traditional data warehousing. It is clear that advances in vendor solutions, as well as the sheer proliferation of new ways of tackling the requirements that data warehouses were built to address, have radically changed the game forever. In fact, we rarely hear the term “data warehouse” any longer; instead, terms like “big data” and “data lake” are much more commonplace these days. So the obvious question is: what really differentiates the two?
Wikipedia describes the two as follows:
“A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered as a core component of business intelligence environment. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.”
“Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.”
No doubt any number of interpretations and comparisons are possible between the two, but fundamentally, a data warehouse still plays the role of supporting and enabling the analysis of big data, at least according to Wikipedia’s definitions. There is no shortage of experts who agree with this simplified view, which implies that the two remain complementary in nature, rather than big data somehow being a replacement for data warehousing.
One key notion that stands out is simply the volume, variety, and velocity of data that organizations typically have at their disposal today compared to just a few years ago. In addition, the rate at which these classic three “V’s” have been accelerating far outpaces even the most sophisticated advancements in technology when applying Wikipedia’s definition that “DWs are central repositories of integrated data.” Though not entirely impossible, most organizations would struggle to maintain any single, consolidated, central repository. In fact, this is one of the realities that led to dissatisfaction with data warehouse principles in general – one DW was rarely enough! As business units became unhappy with what the corporate data warehouse could (or could not) provide for them, they logically started building separate data marts or even their own data warehouses. History tells us that this sort of data redundancy inevitably leads to massive increases in complexity and serious problems with manageability and consistency.
The SAP HANA Data Management Suite is the latest branding from SAP, encompassing the set of technologies and big data frameworks required to overcome the major challenges that more traditional data warehousing systems have faced. Whether the focus is real-time analytics, advanced data processing that leverages the powerful native engines in SAP HANA – graph, geospatial, text, predictive, time series, and streaming – or developing a holistic, secure, and agile data landscape, the SAP HANA Data Management Suite covers the requirements as an advanced platform for building intelligent, live applications.
One of the key innovations that truly separates the SAP solution from others is its capabilities around logical data models and virtualization, working in conjunction with the native power of the platform itself. This allows data to be queried wherever it resides, or high-value data to be ingested from various sources with smart data integration capabilities. To extend this further, in-database machine learning, integrated with popular SDKs and frameworks, provides real-time machine learning on the data where it resides for all users and applications.
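The virtualization idea can be sketched in miniature: a “logical view” that pushes the same predicate down to several sources and unions the results, so that consumers never care where the data physically lives. Everything below – the class names, the sources, and the query shape – is invented for illustration and is not an SAP HANA API:

```python
# Minimal, hypothetical sketch of data virtualization via a logical view.

class InMemorySource:
    """Stands in for any remote source (a database, a data lake, an API)."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def query(self, predicate):
        # A real virtualization layer would push the predicate down to the
        # remote system instead of filtering locally.
        return [r for r in self.rows if predicate(r)]

class LogicalView:
    """Presents many physical sources as one queryable model."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        results = []
        for src in self.sources:
            results.extend(src.query(predicate))
        return results

crm  = InMemorySource("crm",  [{"cust": "ACME",  "region": "EMEA"}])
lake = InMemorySource("lake", [{"cust": "Delta", "region": "AMER"},
                               {"cust": "Blue",  "region": "EMEA"}])

view = LogicalView([crm, lake])
emea = view.query(lambda r: r["region"] == "EMEA")
print([r["cust"] for r in emea])  # ['ACME', 'Blue']
```

The design point is that the caller queries one logical model; whether a given row comes from a warehouse table, a data lake file, or a remote system is an implementation detail hidden behind the view.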
For those who require complete freedom in cloud deployment options, the suite provides one of the only truly hybrid, multi-cloud data platforms available, avoiding vendor lock-in and integrating with other cloud-native services.
Another unique feature of the SAP HANA Data Management Suite in relation to big data solutions is that it allows the combination of refined big data with enterprise and corporate master data, to enable a trusted, unified view for advanced analytics across organizations.
So, in summary – to answer our original question about the future of traditional data warehouses – it is clear that companies now have much more to consider than a complex, monolithic, difficult-to-manage single-purpose system. Next-generation platforms like the SAP HANA Data Management Suite, built on a radical, ground-breaking idea that many thought could never be realized, are reshaping the discussion when it comes to the future of these types of projects!