Moving data—based on type, operational usefulness, frequency of access, and security requirements—to storage locations that best fit an organization’s needs is an ongoing business challenge, especially given the daily volume growth of corporate data.
This tiering of data not only helps balance overall SAP HANA database performance but, if performed effectively, it can also reduce cost and complexity throughout the enterprise. Our data tiering blog series has so far illuminated the SAP HANA options for both hot and warm data tiering, and today we turn our focus to the lowest temperature tier: cold data.
Cold data tiering refers to storing less frequently or sporadically accessed data on low-cost media, such as HDFS (Hadoop Distributed File System) and cloud storage options including Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure Data Lake Storage (ADLS), that are managed separately from the SAP HANA database but remain accessible at any time. Separating cold data from the SAP HANA database reduces the database footprint: tables or partitions are moved from SAP HANA to external storage, which offers mostly read-only data access and has its own high availability, disaster recovery, encryption, and administration functionality.
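As a rough illustration of the idea, the sketch below assigns a data set to a temperature tier based on how recently it was accessed. The thresholds (30 and 365 days), the function name, and the sample data set names are assumptions made for this example only. They are not SAP defaults, and real tiering decisions also weigh data type, operational usefulness, and security requirements.

```python
from datetime import date

# Hypothetical thresholds, chosen only for illustration.
HOT_MAX_DAYS = 30
WARM_MAX_DAYS = 365


def classify_tier(last_access: date, today: date) -> str:
    """Return 'hot', 'warm', or 'cold' based on days since last access."""
    idle_days = (today - last_access).days
    if idle_days <= HOT_MAX_DAYS:
        return "hot"
    if idle_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


# Hypothetical data sets with their last access dates.
today = date(2018, 12, 1)
samples = [
    ("open_orders", date(2018, 11, 25)),
    ("orders_2018_q1", date(2018, 4, 2)),
    ("orders_2012", date(2013, 1, 15)),
]
for name, last_access in samples:
    print(name, "->", classify_tier(last_access, today))
```

In practice the rule would be tuned per data set; the point is only that cold data is whatever falls outside the windows in which fast access is worth the memory cost.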
There are two approaches to access SAP HANA cold storage options: SAP Data Hub and the SAP HANA Spark Controller. Let’s take a closer look at both.
SAP HANA Cold Data Tiering with SAP Data Hub
Deployed in a Kubernetes cluster, the SAP Data Hub Distributed Runtime engine (also known as Vora) can persist cold data in disk-based, streaming tables. Technically, these streaming tables are viewed as virtual tables by SAP HANA. SAP HANA queries that involve virtual tables from the SAP Data Hub are executed via the VoraODBC adapter with SAP HANA Smart Data Access (SDA) as illustrated below.
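To make the virtual table idea more concrete, here is a minimal sketch that builds the kind of SQL statements involved. It relies on the general SAP HANA SDA form CREATE VIRTUAL TABLE ... AT "remote source"."database"."schema"."table" and assumes a remote source has already been created with the appropriate adapter. All object names (SALES, ORDERS, VT_ORDERS_COLD, VORA_SRC, cold, orders) are hypothetical, and the exact remote path, including whether the database part is the "<NULL>" placeholder, depends on the remote system.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_virtual_table_sql(local_schema, virtual_table, remote_source,
                             remote_schema, remote_table,
                             remote_database="<NULL>"):
    """Build a CREATE VIRTUAL TABLE statement for an existing remote source.

    The default remote_database value is an assumption; adjust it to
    whatever the remote source actually exposes.
    """
    local = f"{quote_ident(local_schema)}.{quote_ident(virtual_table)}"
    remote = ".".join(
        quote_ident(part)
        for part in (remote_source, remote_database, remote_schema, remote_table)
    )
    return f"CREATE VIRTUAL TABLE {local} AT {remote}"


def hot_plus_cold_query(schema, hot_table, virtual_table):
    """Build a query that reads hot rows and cold rows together.

    Assumes both tables have the same column layout.
    """
    hot = f"{quote_ident(schema)}.{quote_ident(hot_table)}"
    cold = f"{quote_ident(schema)}.{quote_ident(virtual_table)}"
    return f"SELECT * FROM {hot} UNION ALL SELECT * FROM {cold}"


# Hypothetical object names, for illustration only.
print(create_virtual_table_sql("SALES", "VT_ORDERS_COLD", "VORA_SRC", "cold", "orders"))
print(hot_plus_cold_query("SALES", "ORDERS", "VT_ORDERS_COLD"))
```

Once the virtual table exists, SAP HANA queries treat it like any other table, and SDA decides which parts of the query are pushed down to the remote engine.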
The Data Lifecycle Manager (DLM) tool of the SAP HANA Data Warehousing Foundation (DWF) software facilitates the bi-directional movement of data between SAP HANA in-memory (hot), SAP HANA Dynamic Tiering and Extension Nodes (warm), and the Vora streaming tables (cold). Our next blog will elaborate more on the features of the SAP DWF/DLM tool.
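The bi-directional movement described above can be pictured as a planning step that compares where each partition currently lives with where a rule says it should live. The sketch below is not the DLM rule engine or its API. The Partition class, the age thresholds, and the recall set (partitions a user wants back in memory) are all assumptions invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    year: int   # year of the data held in the partition
    tier: str   # current tier: 'hot', 'warm', or 'cold'


def target_tier(year, current_year, hot_years=2, warm_years=5):
    """Pick a tier from the partition's age in years (hypothetical rule)."""
    age = current_year - year
    if age < hot_years:
        return "hot"
    if age < warm_years:
        return "warm"
    return "cold"


def plan_moves(partitions, current_year, recall=frozenset()):
    """Return (name, from_tier, to_tier) for every partition that must move.

    Partitions named in `recall` are sent back to 'hot' regardless of age,
    which covers the cold-to-hot direction.
    """
    moves = []
    for p in partitions:
        wanted = "hot" if p.name in recall else target_tier(p.year, current_year)
        if wanted != p.tier:
            moves.append((p.name, p.tier, wanted))
    return moves


partitions = [
    Partition("P2018", 2018, "hot"),
    Partition("P2015", 2015, "hot"),
    Partition("P2010", 2010, "warm"),
    Partition("P2009", 2009, "cold"),
]
for move in plan_moves(partitions, current_year=2018, recall={"P2009"}):
    print(move)
```

Separating the planning step from the actual data movement keeps the rule easy to review before anything is relocated.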
SAP HANA cold data tiering delivers optimized integration by leveraging the SAP HANA Wire protocol for data movement and pushdowns between SAP HANA and Vora streaming tables. SAP HANA Wire supports a wide range of data types to better align with existing SAP HANA data types.
Finally, a SQL-export feature within SAP Vora is also available to copy data from a streaming table to an external cloud storage option.
SAP HANA Cold Data Tiering with HANA Spark Controller
A second option for giving SAP HANA in-memory access to cold data is the SAP HANA Spark Controller. Spark Controller is assembled, installed, and configured on a Hadoop distribution such as MapR, Cloudera Distribution Hadoop, SAP Cloud Platform Big Data Services, Hortonworks Data Platform, or Azure HDInsight. It is available within the SAP HANA platform and runs in a familiar Spark cluster environment to provide access to cold data stored in external HDFS and ADLS data files.
Spark Controller allows SAP HANA to access cold data through the SparkSQL SDA adapter. This adapter moderates query execution and data transfer by enabling SAP HANA to fetch data in a compressed columnar format. It also supports SAP HANA-specific query optimizations and secure communication.
By using SAP DWF/DLM, cold data that has been relocated to a Distributed File System (DFS) can be accessed by SAP HANA through the Spark Controller and hot/warm data can be relocated or aged to Hadoop and directly stored in a DFS.
So, Which Cold Tiering Option Should You Choose?
While both cold data tiering options allow for cold data to be combined with more frequently accessed corporate data in a way that is both simple and fast, SAP recommends the use of the SAP Data Hub over the Spark Controller.
Why? First, the Spark Controller depends on the external open source community for version updates, and those changes can affect how the Spark Controller is used. The SAP Data Hub, on the other hand, is fully integrated with and optimized for SAP HANA. More importantly, the Data Hub allows cold data to be accessed at the record level, compared to file-level-only access with the Spark Controller. SAP Data Hub allows cold data to be inserted, persisted, deleted, and updated, while the Spark Controller mostly supports read-only scenarios.
Having a distributed runtime in place also allows SAP Data Hub to read and analyze its cold data and/or to combine that data with other cold data stored in a third-party external data lake.
However, it’s important to note that neither the Spark Controller nor the Data Hub supports advanced SAP HANA data types. The DWF/DLM tool takes care of using supported data types when relocating data between SAP HANA and the Spark Controller or Data Hub. And, while these are considered the lowest-cost data tiering options available for SAP HANA, with the ability to deploy to flexible and scalable commodity hardware, database performance under either option can be affected by latency ranging from several seconds to a minute.
Spark Controller as a Service
The SAP HANA Spark Controller will be deployed as a managed service in SAP Cloud Platform Big Data Services (BDS) by the end of Q4FY18. Well-suited for storing large volumes of infrequently accessed cold data, BDS provides users with the ability to query that data, when needed, through SAP HANA SDA data virtualization capabilities via a secure and encrypted connection. And, with fully managed integration between the two products, SAP offers the simplicity of a unified Big Data solution from a single provider, sparing customers the complexities of managing Hadoop and scalable cloud data storage solutions from multiple vendors.
A Trusted Data Tiering Approach
By moving data to the temperature tier with the cost and performance characteristics best suited for that data, SAP provides a trusted data tiering approach for intelligent enterprises facing the challenges of balancing and managing ever-growing volumes of corporate data. With the bulk of organizational data persisting in external cold data storage, SAP HANA customers can use the SAP Data Hub or the SAP HANA Spark Controller to reach cold data quickly and simply and combine it in-memory for as-needed business analysis and interpretation.
Stay tuned for our next blog featuring the tiering options provided by the Data Lifecycle Manager (DLM) tool within the SAP HANA Data Warehousing Foundation.
Which cold data tiering options is your organization currently pursuing? We’d love to hear about your projects in the comments below.