SAP HANA provides numerous advanced multi-model analytical capabilities. One such seamlessly integrated core functionality is SAP HANA Graph. A series of step-by-step
tutorials explain in detail on how to get started with graph analytics on SAP HANA. In this article, we discuss different shortest path problems, walk you through a simple SAP HANA Graph program implementation, and evaluate the performance on SAP HANA Cloud.
SAP HANA, The Business Data Platform for All Applications (Image Source)
Graph analytics leverages graph structures to comprehend and assess relationships in a network. A recent market
report suggested that the global graph analytics market size is expected to grow to USD ~2.5 billion by 2024. Another
report listed it as one of the key topics in the 2020 Data Science Dictionary. With the advent of Big Data and IoT, need for an easy on-boarding of new data, uncovering insights, and understanding the relationships within the data objects is an ever-increasing requirement. From Google’s famous Page Rank algorithm and mobile navigation systems to LinkedIn’s recommendation engine, one can find numerous applications of graph algorithms and graph analytics in daily life.
One of the most fundamental graph problems is to find shortest paths. Algorithms to find shortest paths are applied on a graph, which is a collection of nodes and edges. An edge can be undirected, directed, and can carry some characteristics such as weight. If you consider an organization chart, you can think of a node as an employee, a directed edge will be an employee reporting to his/her manager, and an undirected edge will be the friendship relation of the employee with his/her colleagues.
What is the Shortest Path problem?
The Shortest Path problem is about finding a path between two points with the minimum cost. Common cost factors are travel distance and time. You must have used Google or Apple Maps at some point, which use efficient algorithms to tackle the shortest path problem to guide you through the fastest and cost-effective route to your destination. Another application is analyzing the fastest news or information spread in a social network such as Facebook or Twitter.
Types of Shortest Path Problems:
- Shortest Path One-to-One: This is the problem of finding the shortest path between two points. Example (i) represents such a path (shown in green) from start node A to target node D, i.e. path A→B→D.
- K-Shortest Path: This is the problem of finding more than one shortest path between two points. If K = 1, then it is same as the above variant. You can think of this problem as finding multiple fastest routes to your home and also look for a route which includes an ice-cream shop. Example (ii) shows the solution of K-Shortest Paths problem with K = 3 from point A to point D, namely paths A→B→D, A→C→D, and A→C→B→D.
- Shortest Path One-to-All: This problem aims at finding shortest paths to all points from the start point. Example (iii) shows the shortest paths from start node A to every other node in the graph. (i) One-To-One (ii) K-Paths (iii) One-To-All
The simplest shortest path calculation is hop-based, where a shortest path is the minimum number of edges needed to reach the target. More advanced variant is when you introduce costs to the paths, and hence the shortest path is the path with minimum cost to the target. So having covered some basics, we are now ready to jump to the performance investigation. Let us begin by introducing the data set we have used for the calculations.
Test Setup
Dataset
Pokec is the most popular social networking platform in Slovakia. The dataset comprises of ~1.6 million people considered as nodes, which have ~30million relationships between them considered as edges of the graph. An
article, as part of ArangoDb’s open-source performance benchmark series compares the performance of different algorithms using Pokec dataset. For calculations in this post, we also decided to use the Pokec social network data. The data is available in two flat files containing nodes and edges respectively, which can be easily imported as SAP HANA Tables.
Hardware
For the calculations, we provisioned an SAP HANA Cloud system with following specifications:
- Memory: 45GB
- CPU: 3 vCPUs
- Storage: 160 GB
Test Description
In SAP HANA, we use a database procedure to execute a shortest path function.
CREATE OR REPLACE PROCEDURE "POKEC"."GS_SP_LENGTH"(
IN "SOURCE" BIGINT, IN "TARGET" BIGINT, IN i_dir NVARCHAR(10),
OUT o_len BIGINT)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
GRAPH g = Graph("POKEC","GRAPH");
VERTEX v_s = Vertex(:g, :SOURCE);
VERTEX v_t = Vertex(:g, :TARGET);
WeightedPath<BIGINT> p = Shortest_Path(:g, :v_s, :v_t, :i_dir);
o_len = LENGTH(:p);
END;
We start by defining an SAP HANA GraphScript procedure, which is similar to an SAP HANA SQLScript procedure but needs to have GRAPH as a mandatory language identifier. Inside the BEGIN-END block, we have access to the high-level interface for graph analytics, and most of the analytics is performed here. Graph g is the graph object created by the imported vertex and edge tables of the data set.
So, let us focus on the Shortest Path function call. A simple function call takes a graph (g), source vertex (v_s), target vertex (v_t) and direction (i_dir) as parameters. And the result of the calculation is stored as a path p, which is basically a combination of nodes and edges.
WeightedPath<BIGINT> p = Shortest_Path(:g, :v_s, :v_t, :i_dir);
And that is all you need to run the simplest Shortest Path call! If you ever tried to write your own shortest path implementation using any of the conventional algorithms like Dijkstra or Bellman-Ford, you would know that it is quite a hassle to take care of multiple loops, conditional statements, etc. We take care of it all on our side and provide you with a simple function call which achieves one of the best performances.
More functional features
SAP HANA Graph provides you additional
built-in graph algorithms such as K-Shortest Paths and Shortest Path One-To-All to solve different shortest path problems. You also have the flexibility to use some additional features within a shortest path call, like declaring a weight function or choosing a direction parameter for traversal. With a weight function, you can assign weight to an edge, and also apply bounds on both hop distance and weighted distance, in order to filter for paths that do not satisfy the conditions. One prominent use case of such feature is supply chain optimization, where the usual objective is to plan routes with the minimum fuel consumption and using a weight function, you can penalize routes with heavy traffic.
Results
Single Call
Let us call this procedure. We randomly choose two points and provide them as input parameters for source and target, followed by direction, and the last placeholder is for the output of the procedure.
CALL POKEC.GS_SP_LENGTH(8414, 743803, 'OUTGOING', ?);
CALL POKEC.GS_SP_LENGTH(8414, 743803, 'INCOMING', ?);
CALL POKEC.GS_SP_LENGTH(8414, 743803, 'ANY', ?);
We notice that it takes merely 1ms, 2ms, and 2ms respectively, to find a path between two points for different direction parameters in a network of 30 million edges!
1000 Sequential calls
Next, let us see how the function performs for different hop distances. For this, we choose 1000 random pairs of source and target vertices and store them in a table. Afterwards, we use an SAP HANA SQLScript procedure to loop over these table entries and call the shortest path procedure sequentially. You can check out the complete code on our
github page.
The above bar chart summarises the result of this calculation. For different hop distances, we notice the average runtime is between 3ms-6ms (represented by yellow bars), the minimum runtime is between 2ms-4ms, and the maximum runtime is between 6ms-12ms (represented by lower and upper bar of the vertical line, respectively). You see these range of values for same hop distance is because some of the shortest path calculations encounter hub vertices i.e. a vertex with lots of edges. For instance, if you consider Twitter data, a famous celebrity like Rihanna has lots of followers, making her a hub vertex.
1000 Parallel Calls
Lastly, let us look at an even faster way to run 1000 calls. SAP HANA SQLScript provides an operator called
MAP_MERGE which replaces sequential-for loops with parallelization. Executing such a call only took ~180ms, i.e. 0.18ms per query if we compare bluntly. This performance is unparalleled.
Conclusion
In conclusion, in this blog post we understood the Shortest Path problem, walked over a simple SAP HANA Graph Shortest Path implementation and measured its performance on SAP HANA Cloud. For more detailed information on various graph algorithms that can be performed with a single line of code, check out the
SAP HANA Cloud Graph Documentation.
If we have aroused your interest in SAP HANA Graph, also check out the
Devtoberfest video series or our
github repository to learn about other graph examples. If you try out our examples, we would like to hear your thoughts in the comments section below. We will keep adding more examples to our github repository and add a comment here, so please click the 'Follow' button to be updated. In case you have questions about SAP HANA Graph or SAP HANA in general, you can ask them at
SAP Community page for SAP HANA.