Let’s review Hadoop and SAP Predictive Analytics features in details and how these solutions can be utilized in the scenario below.
In this scenario customer’s online interaction with the various products needs to be captured from several weblogs to build predictive models for product recommendation in the future to the right group of customers. To harness the power of Big Data, one is required to have an infrastructure that can manage and process huge volumes of structured and unstructured data in real-time, and can protect data privacy and security.
In the above scenario the weblog dataset can be stored in a Hadoop environment. So, what is Hadoop? Let’s have a look at this technology first.
Data storage: Hadoop |
Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment (clusters). It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop is becoming widely popular among the big data user community mainly because of:
Hadoop makes it possible to run applications on systems with thousands of nodes analyzing terabytes of data. It’s distributed file system facilitates rapid data transfer among nodes and allows the system to continue operating uninterrupted in the case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
Massive parallel processing:
Hadoop has been a game-changer in supporting the enormous processing needs of big data. A large data procedure which may take hours of processing time on a centralized relational database system, may only take few minutes when distributed across a large Hadoop cluster of commodity servers, as processing runs in parallel.
Distributive Design:
Hadoop is not a single product, it is an ecosystem. Same goes for Spark. Let’s cover them one by one.
Source: -Wikipedia
Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
Part of the core Hadoop project, YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
MapReduce Framework (MR)
MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
Hive
Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
Source: Hortonworks
Easy to use Automated Analytics
Using Hive ODBC driver you can connect SAP Predictive Analytics to Hadoop.
Apache Hive is the most popular SQL-on-Hadoop solution, including a command-line client. Hive server compiles SQL into scalable MapReduce jobs. Data scientists or expert business users having good SQL knowledge can build analytical data set in Hive metastore (service which stores the metadata for Hive tables and partitions in a relational database) by joining multiple Hive tables. Users can then connect to Hadoop system through Automated Analytics Hive Connector for building predictive models using the analytical datasets (already persisted in Hadoop).
DataDirect ODBC driver for Hive is prepackaged with the SAP Predictive Analytics installation and lets you connect to a Hive server without the need to install any other software. Hive technology allows you to use SQL statements on top of a Hadoop system.
For detailed information on how to set up your Hive connectivity, see the Hive section in the Connecting to your Database Management System on Windows or Connecting to your Database Management System on UNIX guide on the SAP Help Portal.
Create very wide datasets using Data Manager
The options below enable users to prepare datasets in Hadoop for predictive modelling.
An analytical dataset can be prepared by joining multiple Hadoop tables via Data Manager in Automated Analytics. Using Hive ODBC connector SAP Predictive Analytics can connect to Hive data store and you can build predictive models on Hive table dataset.
· To create an analytical dataset, open SAP Predictive Analytics application and click on the Data Manager section. Then select the Hive connection to connect to the Hadoop server.
· Select the Hive tables and join them with the keys to create the analytical dataset.
Improve operational performance of predictive modelling
1. Supports full end to end operational predictive life-cycle on Hadoop .
2. (Data preparation through Data Manager->model development through Modeler ->deployment through generating SPARK/Hive SQL, Java code).
3. Automated Analytics support – through HIVE and SPARK SQL ODBC drivers. No coding required.
4. Expert Analytics support through Hive connectivity.
5. Technical details:
• Support of Hive
• Supports data preparation & scoring directly in Hadoop (no data transfer).
With new release of SAP predictive analytics, model training can be performed on spark. .
For more details on this topic please stay tuned for the upcoming blog by my colleague Priti Mulchandani.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
23 | |
11 | |
10 | |
9 | |
8 | |
6 | |
6 | |
6 | |
5 | |
5 |