
“SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.”

Reference: http://help.sap.com/hana/Whats_New_SAP_HANA_Platform_Release_Notes_en.pdf (Section 2.4.2)
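
In concrete terms, the whole workflow boils down to three SQL statements. Here's a minimal sketch (the source name MY_IQ, the DSN my_iq_dsn, and the schema/table names are placeholders I've made up, not anything from SAP's docs; the exact four-part name in CREATE VIRTUAL TABLE varies by adapter):

-- Sketch only: register a remote database (here a hypothetical Sybase IQ box) as a remote source
CREATE REMOTE SOURCE MY_IQ ADAPTER "iqodbc" CONFIGURATION 'DSN=my_iq_dsn'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=iq_user;password=iq_pass';

-- Expose a remote table as a virtual table in a local schema
CREATE VIRTUAL TABLE "MYSCHEMA"."VT_SALES" AT "MY_IQ"."<NULL>"."dbo"."SALES";

-- Query it like any local HANA table; SDA pushes the work to the remote system
SELECT COUNT(*) FROM "MYSCHEMA"."VT_SALES";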


Databases currently supported by SAP HANA smart data access include:

  1. Teradata Database: version 13.0
  2. SAP Sybase IQ: version 15.4 ESD#3 and 16.0
  3. SAP Sybase Adaptive Server Enterprise: version 15.7 ESD#4
  4. Intel Distribution for Apache Hadoop: version 2.3 (this includes Apache Hadoop version 1.0.3 and Apache Hive 0.9.0)

Also Refer to:

SAP Note 1868209: Additional information about SPS06 and smart data access

SAP Note 1868702: Information about installing the drivers that SAP HANA smart data access supports

UPDATE (Dec 04 2013): As of SPS07, Hortonworks HDP 1.3 (when's HDP 2.0 coming?) appears to have been added to the official list, and remote caching of HADOOP sources has been added, which should hopefully speed up queries against those tables in HADOOP that aren't changing frequently.

http://www.saphana.com/servlet/JiveServlet/previewBody/4296-102-1-9005/HANA_SPS07_NEW_SDA.pdf

UPDATE (Jan 29 2014): SAP HANA Academy now has a great collection of videos on using Smart Data Access. Thanks SAP :smile:

e.g.

SAP HANA Academy | SAP HANA

SDA: HADOOP -- Configuring ODBC drivers | SAP HANA

SDA: HADOOP -- using the remote HADOOP data source | SAP HANA

Using Smart Data Access (SDA) with HADOOP seems to me a great way of balancing the strengths of both tools. Unfortunately, for real-time responsiveness HIVE SQL currently isn't the optimal tool in HADOOP [it's better used for batched SQL commands]. Cloudera's Impala, Hortonworks' Stinger initiative and MapR's Drill are all trying to address real-time reporting.

I've only tested Impala so far, but I've seen speed improvements of 10 to 100 times over standard HIVE SQL queries. With that in mind I thought it would be interesting to test them both in HANA using SDA.

Unfortunately I'm using Cloudera's open-source Apache Hadoop distribution (CDH), which isn't on SAP's approved list yet. However, since SDA uses ODBC, I've managed to get it working with a third-party ODBC driver from Progress|DataDirect: http://www.datadirect.com/products/odbc/index.html

NOTE: Since CDH is not currently on this list, I'm sure SAP will NOT recommend using it in a production environment. If you do get it working in a sandbox environment, though, why not help by adding your voice for it to be certified and added to the 'official' list.

With the disclaimers out of the way, this is how SDA works.

Remote Data Sources: 

Once you have your ODBC drivers installed properly, Remote Sources can be added for both HIVE and IMPALA.

Expanding the Remote Sources shows the tables that can be accessed by HANA.


NOTE: For me, expanding each node of the HIVE1 tree takes almost 20 seconds (perhaps it uses MapReduce?), while IMPALA1 nodes in the hierarchy expand quickly.

In the above screenshots you will notice that both HIVE1 & IMPALA1 share the same tables, as they use the same HADOOP metastore. Data is NOT replicated between HIVE tables and IMPALA tables. The metastore just points to the tables' file locations within the HADOOP ecosystem, whether stored as text files, HBASE tables or column-store PARQUET files (to list just a few).

There are some table types (file types) that can only be read by HIVE or IMPALA, but there is a large overlap, and this may converge over time.
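
To make the shared-metastore point concrete, here's a hedged Hive SQL sketch (the table name web_logs and its columns are made up): one CREATE statement registers a pointer to files in HDFS, and both engines then see the same table.

-- Hive SQL sketch: the table definition lives in the metastore; the data stays in HDFS files
CREATE EXTERNAL TABLE web_logs (
    log_time  STRING,
    user_id   STRING,
    url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/web_logs';

-- Impala reads the same metastore entry; after refreshing its metadata cache
-- (INVALIDATE METADATA; in impala-shell) the identical table is queryable there too.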

Virtual Tables:

Select Create virtual tables from your Remote Source, placing them in the schema of your choice.


NOTE: I've previously created a 'HADOOP' schema in HANA to store these virtual tables.

Once created, you can open the definition of the new virtual tables, just as with normal HANA tables.
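
If you'd rather script this than click through the UI, CREATE VIRTUAL TABLE does the same job. A sketch, assuming the web_logs table from earlier sits in Hive's default database (with the hiveodbc adapter the database part of the four-part name is <NULL>):

CREATE VIRTUAL TABLE "HADOOP"."VT_WEB_LOGS"     AT "HIVE1"."<NULL>"."default"."web_logs";
CREATE VIRTUAL TABLE "HADOOP"."VT_WEB_LOGS_IMP" AT "IMPALA1"."<NULL>"."default"."web_logs";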

Run some queries:

Simple HIVE query on my extremely small and low-powered HADOOP cluster (23 seconds):

NOTE: In the HADOOP system, you can see above that HIVE's MapReduce job is kicked off.

Simple IMPALA query on my extremely small and low-powered HADOOP cluster, reading the SAME table as HIVE (< 1 second):

NOTE: Impala does not use MAP/REDUCE
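
For reference, the two runs above are just the same query aimed at the two different remote sources via their virtual tables (names carried over from the sketch earlier, not my exact tables):

SELECT COUNT(*) FROM "HADOOP"."VT_WEB_LOGS";     -- routed to HIVE1: spawns a MapReduce job
SELECT COUNT(*) FROM "HADOOP"."VT_WEB_LOGS_IMP"; -- routed to IMPALA1: answered directly by the Impala daemons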

With Impala, the source table type may impact speeds as well, as these 2 simple examples demonstrate.

IMPALA HBASE table (40K records in 4 seconds):

IMPALA PARQUET column store (60 million records in 3 seconds):

HADOOP HBASE source tables are better for small writes and updates, but are slower at reporting.

HADOOP IMPALA PARQUET tables use column-store logic (similar to HANA column tables), which takes more effort to write to efficiently but is much faster at reads (assuming not all the fields in a row are returned; not that dissimilar to HANA column tables).

You can think of Parquet tables like the part of a HANA column table after a MERGE DELTA, whereas an HBASE table is more like the uncompressed part of a HANA column table PRIOR to a MERGE DELTA.
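
You can see the HANA half of that analogy directly in SQL. A quick sketch with a made-up table:

-- New inserts land in the write-optimised delta store (the 'HBASE-like' part)
CREATE COLUMN TABLE "MYSCHEMA"."T_DEMO" (id INTEGER, msg VARCHAR(100));
INSERT INTO "MYSCHEMA"."T_DEMO" VALUES (1, 'hello');

-- A delta merge compresses them into the read-optimised main store (the 'Parquet-like' part)
MERGE DELTA OF "MYSCHEMA"."T_DEMO";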

HADOOP tables are still stored on disk (using HDFS) rather than in memory; however, progress is being made on caching tables into memory on the nodes to improve query performance.

SQL for creating HADOOP Remote Sources:

Unfortunately HADOOP remote sources can't be configured manually through the UI yet. They do not appear in the drop-down:

Since the HADOOP adapter doesn't appear in the list, use the HANA SQL editor to create the HADOOP Remote Sources:

e.g.

DROP REMOTE SOURCE HIVE1 CASCADE;
DROP REMOTE SOURCE IMPALA1 CASCADE;

CREATE REMOTE SOURCE HIVE1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=hwp'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';

CREATE REMOTE SOURCE IMPALA1 ADAPTER "hiveodbc" CONFIGURATION 'DSN=iwp'
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=hive';
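
Once created, a quick sanity check I find handy (not part of the original steps) is to confirm the new sources show up in the SYS.REMOTE_SOURCES system view:

SELECT REMOTE_SOURCE_NAME, ADAPTER_NAME FROM "SYS"."REMOTE_SOURCES";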

CDH Driver Installation:

Unfortunately Cloudera doesn't yet provide ODBC drivers for SAP HANA.

I tried some of their other ODBC drivers (e.g. for MicroStrategy) without success.

Fortunately a third party, Progress|DataDirect, supplies ODBC drivers for HIVE and IMPALA running on CDH:

http://www.datadirect.com/products/odbc/index.html

Download their 15-day trial and follow their steps for installing it for HANA on Linux:

e.g.

wget http://www.datadirect.com/download/files/evals/connect64_odbc/712/PROGRESS_DATADIRECT_CONNECT64_ODBC...
gunzip PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar.Z
tar -xf PROGRESS_DATADIRECT_CONNECT64_ODBC_7.1.2_LINUX_64.tar
./unixmi.ksh

In the $HOME directory of your 'hdbadm' user you need to add ODBC settings.

Create 2 files:

  .customer.sh   which adds the location of your new driver to the library path

  .odbc.ini      which defines the ODBC DSN connections needed when creating a Remote Source

My 2 files appear as follows:

.customer.sh
-----------------
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib:/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib
export ODBCINI=$HOME/.odbc.ini

.odbc.ini
-----------
[ODBC Data Sources]
iwp=DataDirect 7.1 Impala Wire Protocol
hwp=DataDirect 7.1 Apache Hive Wire Protocol

[ODBC]
IANAAppCodePage=4
InstallDir=/opt/Progress/DataDirect/Connect64_for_ODBC_71
Trace=0
TraceFile=/tmp/odbctrace.out
TraceDll=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddtrc27.so

[iwp]
Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddimpala27.so
Description=DataDirect 7.1 Impala Wire Protocol
ArraySize=1024
Database=default
DefaultLongDataBuffLen=1024
DefaultOrderByLimit=-1
EnableDescribeParam=0
HostName=[Put the IP address of your main IMPALA node here]
LoginTimeout=30
LogonID=
Password=
PortNumber=21050
RemoveColumnQualifiers=0
StringDescribeType=-9
TransactionMode=0
UseCurrentSchema=0

[hwp]
Driver=/opt/Progress/DataDirect/Connect64_for_ODBC_71/lib/ddhive27.so
Description=DataDirect 7.1 Apache Hive Wire Protocol
ArraySize=16384
Database=default
DefaultLongDataBuffLen=1024
EnableDescribeParam=0
HostName=[Put the IP address of your HIVE gateway here]
LoginTimeout=30
LogonID=hive
MaxVarcharSize=2147483647
Password=
PortNumber=10000
RemoveColumnQualifiers=0
StringDescribeType=12
TransactionMode=0
UseCurrentSchema=0
WireProtocolVersion=0



UPDATE (Dec 04 2013): Here is an example of the new SPS07 remote caching syntax:

SELECT * FROM hive_activity_log WHERE incident_type = 'ERROR' AND plant = '001' WITH HINT (USE_REMOTE_CACHE);
