Connecting SAP DataServices to Hadoop Hive is not as simple as connecting to a relational database for example. In this post I want to share my experiences on how to connect DataServices (DS) to Hive.
The DS engine cannot connect to Hive directly. Instead you need to configure a Hive adapter from the DS management console which will actually manage the connection to Hive.
In the rest of this post I will assume the following setup:
The DS server will not be installed within the Hadoop cluster, but it will have access to it. Therefore Hadoop need to be installed and configured appropriately on the DS server. I wont’ go to much into detail for this step, because the given environment may be quite different from my tested environment.
Roughly, there are two approaches for installing Hadoop on the DS server:
Before you can start the wizard you need to enable a passwordless login from the Ambari server to the DS server using an SSH private key.
Later in the wizard process, you need to configure the DS server as a client host; this way all required Hadoop libraries, Hadoop client tools and the Ambari agent will be installed and configured on the DS server. Ambari will also monitor the availability of the DS server.
The DS jobserver needs some Hadoop environment settings. These settings mainly specify the Hadoop and Hive home directories on the DS server and some required Hadoop JAR files through CLASSPATH settings. DS provides a script that sources these environments, please check DS Reference Guide 4.2, section 12.1 Prerequisites.
Important: for Hortonworks HDP distributions, DS provides another script than the documented script hadoop_env.sh. For HDP 2.x distributions you should use the script hadoop_HDP2_env.sh instead (this script is only available from DS version 4.2.2 and later).
On the DS server you should be able to start hive and test the connection. For instance by running the HiveQL command show databases:
Finally, restart the DS jobserver so that it has the same environment settings as your current session. Also make sure that the hadoop_env.sh script will be started during the DS server startup process.
You may check the following documentation to set up the DS Hive adapter:
In my experience these two subjects usually will not work without problem. The tricky part here is to set CLASSPATH correctly for the adapter. This task is not well documented and depends on the Hadoop distribution and version. Therefore you might end in a series of try-and-error configurations:
These steps are described in more detail in the following sections. The Troubleshooting section in the blog Configuring Data Services and Hadoop on SCN may also help.
For the initial setup of the Hive adapter I used the CLASSPATH setting as described in Configuring the Data Services Hive Adapter on SCN. For instance, the initial configuration of the Hive adapter looked like this in my test environment:
In the DS designer I created a Hive datastore. The first error message I got when saving the datastore was:
The same error message should be visible in the Hive adapter error log:
The error message means that the Hive adapter tries to use the Java object org.apache.hadoop.hive.jdbc.HiveDriver but cannot find it. You will need to find the corresponding JAR file that contains this Java object and add the full path of this JAR file to the Hive adapter; then return to step 3.1
There will probably be more than one Java object missing in the initial CLASSPATH setting. Therefore you may very well end up in an iterative process of configuring and re-starting the adapter and testing the adapter by saving the datastore in DS designer.
How do I find the JAR file that contains the missing Java object?
In most cases all required Java objects are in JAR files that are already provided by either the DS installation or the Hadoop installation. They are usually located in one of these directories:
I have developed a small shell script that will search for a Java object in all JAR files in a given directory:
[ds@dsserver ~]$ cat find_object_in_jars.sh
#!/bin/sh
jar_dir=$1
object=$2
echo "Object $object found in these jars:"
for jar_file in $jar_dir/*.jar
do
object_found=`jar tf $jar_file | grep $object`
if [ -n "$object_found" ]
then
echo $jar_file
fi
done
exit 0
For example, the script above found the object org.apache.hadoop.hive.jdbc.HiveDriver in the file $HIVE_HOME//lib/hive-jdbc-0.12.0.2.0.10.0-1.jar. The full path of this file need to be added to the CLASSPATH setting of the Hive adapter:
Note: the script above is using the jar utility. You need to install a Java development package (such as Java Open SDK) if the jar utility is not available on your DS server.
DS need to connect to a hive server. The hive sever actually splits the HiveQL commands into MapReduce jobs, accesses the data on the cluster and returns the results to the caller. DS 4.2.2 and later versions are using the current hive server called HiveServer2 to connect to Hive. This version of the hive server is the default for most Hadoop clusters.
The older version of hive server, simply called HiveServer, is usually not started per default on current versions of Hadoop clusters. But DS version 4.2.1 only works with the old hive server version.
The old version of hive server can easily be started from the hive client tool, see HiveServer documentation.
The default port number for HiveServer is 10000. But because HiveServer2 might already be running and listening on this port you should define another port for HiveServer, let’s say 10001:
You can start the HiveServer on a node in the Hadoop cluster or on the DS server. I recommend to start it on the same node where HiveServer2 is already running – this is usually a management node within the Hadoop cluster.
It is also worth to test the HiveServer connection from the hive command line tool on the DS server. When calling the hive CLI tool without parameters it does not act as a client tool and does not connect to a hive server (instead it then acts as a kind of rich client). When calling the hive CLI tool with the host and port number parameters it will act as a client tool and connect to the hive server. For instance:
After upgrading either DS or Hadoop you might need to reconfigure the Hive connection.
Upgrading from DS 4.2.1 to 4.2.2 requires a switch from HiveServer to HiveServer2, see section 4 above.
Furthermore, newer releases of DS might use other Java objects. Then the CLASSPATH of the Hive adapter need to be adapted as described in section 3.
After upgrading Hadoop most often the names of jar files have changed because their names contain the version number. For example:
For instance, when upgrading HDP from version 2.0 to 2.1 the HDP upgrade process replaced some jar files with newer versions. The CLASSPATH setting of the Hive adapter need to be modified accordingly. I found this approach quite easy:
Search for the new file name in the relevant file system directory and modify the CLASSPATH of the Hive adapter accordingly. Save again…
Note: the HDP installation and upgrade process maintains symbolic links without version number in their names. The symbolic links point to the latest version of the jar file. It is of course a good idea to reference these symbolic links in the CLASSPATH setting of the Hive adapter. But unfortunately not all of the required jar files do have symbolic links.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
9 | |
7 | |
4 | |
4 | |
4 | |
3 | |
3 | |
3 | |
3 | |
3 |