Hi Folks,
I've been mulling over a series of documents for a while now and finally have the bandwidth to put some of them together.
I've decided to group several of them under the idea of "Thinking in HANA". As a massively-parallel, column-oriented, in-memory platform, HANA affords many opportunities to come up with new and creative solutions to seemingly well-understood problems. Moreover, while HANA fully supports SQL-92, a majority of analytical use cases require efficient and often complex data models that support demanding BI requirements, which lead to several challenges:
This first article will review point 4 above - how to model database set operators (UNION, EXCEPT, INTERSECT) though a straightforward Calculation View "pattern". I also intend on fleshing out some of the theoretical implications of "set operator" thinking in HANA. Also, while set operators such as MINUS and INTERSECT aren't commonly required for meeting analytical needs on HANA, they do have implications for other more frequent requirements. I'll elaborate on these in another article.
The assumption is that the readers have a basic understanding of SQL, Set Theory, and data modeling in HANA.
The following Venn Diagram gives a very simple example of how two sets might relate, having some elements in common but some unique to each set.
In the business world there are many cases where two data sets want to be compared - i.e. planned and actual sales figures, sales amounts and invoiced amounts, and many others.
In terms of comparing two sets of data, the following questions can be asked (and answered):
1) What are all the elements in both data sets?
2) Which elements are exclusive to SET 1?
3) Which elements are exclusive to SET 2?
4) Which elements are in both sets?
5) Which elements are in SET 1 (non-exclusive)?
6) Which elements are in SET 2 (non-exclusive)?
7) Which elements are in SET 1 or SET 2?
The questions above are answered below, with solutions in: 1) plain SQL, 2) 'alternate' SQL, and 3) Calculation View approach. The solutions are illustrated with their respective Venn diagrams and associated set operators.
'Alternate' SQL is SQL that reaches the same answer but uses the same logic flow as the Calculation View. It facilitates the "jump" from traditional SQL to HANA data modeling.
Please see the appendix of this document for the SQL to generate both data sets.
The first four questions posed above are answered (solved) below. The solutions leverage the basic set operations of UNION, MINUS, and INTERSECT. (Here no distinction is made between UNION and UNION ALL). The last three of the seven questions above are solved in a “compound” fashion, built with more than one of the basic set operators.
1. What are the elements in both data sets?
-- UNION (ALL)
SELECT *
FROM
(
SELECT ELEMENT
FROM SET_1
UNION ALL
SELECT ELEMENT
FROM SET_2
)
ORDER BY ELEMENT;
None
The Calculation View simply requires combining both data sets via UNION node. Despite the name, the UNION behaves as UNION ALL.
2. Which elements are exclusive to SET 1?
-- Complement of Set 2 with respect to Set 1
SELECT *
FROM
(
SELECT ELEMENT
FROM SET_1
MINUS -- EXCEPT also works
SELECT ELEMENT
FROM SET_2
)
ORDER BY ELEMENT;
-- Complement of Set 2 with respect to Set 1
-- UNION (ALL)
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) = 2
ORDER BY ELEMENT;
The Calculation View accomplishes the same logic as the Alternate SQL – adding constant values to each data set, aggregating, and then filtering for result records where aggregate value of FLAG column is equal to the constant for the desired data set.
3. Which elements are exclusive to SET 2?
-- Complement of Set 1 with respect to Set 2
SELECT *
FROM
(
SELECT ELEMENT
FROM SET_2
EXCEPT
SELECT ELEMENT
FROM SET_1
)
ORDER BY ELEMENT;
-- Complement of Set 1 with respect to Set 2
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) = 3
ORDER BY ELEMENT;
Note that the solution approach is the same as previously except with a filter for SET 2’s constant value rather than that of SET 1
4. Which elements are in both sets?
-- intersection of Set 1 and Set 2
SELECT *
FROM
(
SELECT ELEMENT
FROM SET_1
INTERSECT
SELECT ELEMENT
FROM SET_2
)
ORDER BY ELEMENT;
-- Intersection of Sets 1 and 2
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) = 5
ORDER BY ELEMENT;
Note that the solution approach is the same as previously except with a filter for FLAG = 5 to capture only those records that fall under both (i.e. 2+3 = 5, hence capturing only overlapping (intersecting) records).
Solutions for question 5-7 are solved by applying more than one basic set operator. As such, these solutions are referred to as “compound set operations”.
SELECT *
FROM
(
(
SELECT ELEMENT
FROM SET_1
EXCEPT
SELECT ELEMENT
FROM SET_2
)
UNION ALL
(
SELECT ELEMENT
FROM SET_1
INTERSECT
SELECT ELEMENT
FROM SET_2
)
)
ORDER BY ELEMENT;
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) IN (2, 5)
ORDER BY ELEMENT;
SELECT *
FROM
(
(
SELECT ELEMENT
FROM SET_2
EXCEPT
SELECT ELEMENT
FROM SET_1
)
UNION ALL
(
SELECT ELEMENT
FROM SET_1
INTERSECT
SELECT ELEMENT
FROM SET_2
)
)
ORDER BY ELEMENT;
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) IN (3, 5)
ORDER BY ELEMENT;
SELECT *
FROM
(
(
SELECT ELEMENT
FROM SET_1
EXCEPT
SELECT ELEMENT
FROM SET_2
)
UNION ALL
(
SELECT ELEMENT
FROM SET_2
EXCEPT
SELECT ELEMENT
FROM SET_1
)
)
ORDER BY ELEMENT;
SELECT ELEMENT
FROM
(
SELECT ELEMENT, 2 AS FLAG
FROM SET_1
UNION ALL
SELECT ELEMENT, 3 AS FLAG
FROM SET_2
)
GROUP BY ELEMENT
HAVING SUM(FLAG) != 5
ORDER BY ELEMENT;
A few points are worthy of additional discussion:
The astute reader will note that the solutions for problems (questions) 6 and 7 could easily be achieved by simply querying SET_1 or SET_2 by themselves. This is true. However, from a theoretical perspective, capturing duplicates would still require inspecting SET_1 and SET_2. According to the HANA SQL guide, EXCEPT/MINUS/INTERSECT can operate “as such”, or with DISTINCT keyword. Unfortunately, they still function as distinct without the DISTINCT keyword – making it so that solutions 6 or 7 can’t be properly achieved if duplicates are required for any reasons. This is a bit pedantic though – as duplicates almost never need to be reported (and they could still be captured through more complex SQL).
The number of possible “locations” of an element is 3 – i.e. in SET_1, SET_2, or both. An element can either exist, or not exist, in those three locations. This means that 8 outcomes are possible, calculated as 2^3, where 2 is the number of states (exists or not exists) and 3 is the number of “locations”. In the above examples, only 7 questions are addressed as the “empty set” is not a relevant question in the context of analytics for this discussion.
The eight possibilities can also be captured in the following table. Empty set is grayed out:
In SET 1 | In SET 2 | In both sets | |
1 | FALSE | FALSE | FALSE |
2 | FALSE | FALSE | TRUE |
3 | FALSE | TRUE | FALSE |
4 | FALSE | TRUE | TRUE |
5 | TRUE | FALSE | FALSE |
6 | TRUE | FALSE | TRUE |
7 | TRUE | TRUE | FALSE |
8 | TRUE | TRUE | TRUE |
The following discussion is for the sake of theoretical curiosity rather than an actual pragmatic solution. The problem is partially solved, left to readers smarter than myself to answer.
Consider the following Venn diagram, particularly the sections in green.
How would you go about finding the elements associated with the green sections above? Visually we can see that it’s A, G and D – but what if each set had many elements and we had to solve this situation with SQL?
First note that set operators like UNION and INTERSECTION correlate with the Boolean operations OR and AND. Moreover, databases have helpful operations like IN and BETWEEN which essentially reduce to combinations of ANDs and ORs. In solutions highlighted above in this paper, some “Boolean logic” was implemented in set operators (UNION, MINUS, etc), and some in the filter condition to meet the final output. The question is – can all of the logic be implemented in set operations, with a filter on a single value that identifies all possible combinations? In our diagram above – rather than building confusing SQL (or Calc Views) with various set operations, and filtering further with Boolean logic – could we simply identify a single value assigned to that combination, and filter on that value? The answer is yes. It’s not pragmatic, but is interesting to note that it’s possible.
Here’s a simplified version that shows the direction of the solution – although not yet fully implemented. Note that this simplest-case becomes very complex – highlighting unpractical it is – but also illustrating the “elegance” assigned to mathematics that’s often easy to miss.
Consider the following Venn diagram:
As noted previously, there are eight possible situations for any single element. In previous solutions we assigned “constant values” to each base set, combined them, and then filtered in order to arrive at a desired result. You may have noticed that the numbers used, 2 and 3, are prime numbers – which indicates the direction of a more robust solution.
Simply aggregating numbers, even when prime, does not guarantee a unique outcome. If three sets are available, a developer were to use the first three prime numbers: 2, 3, and 5 – it’s clear that aggregating 2 and 3 would equal 5 – causing problematic results (potentially) if a similar approach were used as previously in this paper.
The solution approach (unfinished) then is as follows:
Run the following SQL, and note how a filter on ID = 6 would give you the “complement” of the sets, as one example.
DROP TABLE SETS_WITH_FLAGS;
CREATE COLUMN TABLE SETS_WITH_FLAGS AS
(
SELECT ELEMENT, MAX(FLAG_1) AS FLAG_1, MAX(FLAG_2) AS FLAG_2, (MAX(FLAG_1)*MAX(FLAG_2)) AS M_FLAG
FROM
(
SELECT 'A' AS ELEMENT, 2 AS FLAG_1, 1 AS FLAG_2 FROM DUMMY UNION ALL
SELECT 'B' AS ELEMENT, 2 AS FLAG_1, 1 AS FLAG_2 FROM DUMMY UNION ALL
SELECT 'B' AS ELEMENT, 1 AS FLAG_1, 3 AS FLAG_2 FROM DUMMY UNION ALL
SELECT 'C' AS ELEMENT, 1 AS FLAG_1, 3 AS FLAG_2 FROM DUMMY
)
GROUP BY ELEMENT
);
SELECT T2.ELEMENT, T1.M_FLAG * T2.M_FLAG AS ID FROM SET_FLAG T1 CROSS JOIN SET_FLAG T2 ORDER BY T1.M_FLAG * T2.M_FLAG;
Following is the source data SQL used for the examples in this document.
DROP TABLE SET_1;
CREATE COLUMN TABLE SET_1
(
ELEMENT CHAR(1),
PRIMARY KEY (ELEMENT)
);
DROP TABLE SET_2;
CREATE COLUMN TABLE SET_2
(
ELEMENT CHAR(1),
PRIMARY KEY (ELEMENT)
);
INSERT INTO SET_1 VALUES ('A');
INSERT INTO SET_1 VALUES ('B');
INSERT INTO SET_1 VALUES ('C');
INSERT INTO SET_1 VALUES ('D');
INSERT INTO SET_1 VALUES ('E');
INSERT INTO SET_2 VALUES ('D');
INSERT INTO SET_2 VALUES ('E');
INSERT INTO SET_2 VALUES ('F');
INSERT INTO SET_2 VALUES ('G');
INSERT INTO SET_2 VALUES ('H');
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
9 | |
5 | |
4 | |
4 | |
4 | |
4 | |
3 | |
3 | |
3 | |
3 |