As explained in this blog on Omni-channel consumer engagement, the focus is increasingly on solutions that can solve unique yet complex usecases for Marketing Analysts and E-Commerce applications: making sense of the sheer volume of data streaming in from various channels (think the 3 Vs: Volume, Variety, Velocity).
This is one area where HANA can be applied innovatively to previously unexplored problem domains. In this blog, I trace through one such problem that we solved, building on the blog mentioned above: how we expanded the usecase, which approaches we experimented with, and how we invented a unique algorithm on HANA to meet the requirements.
The problem originated from a simple question: what if an e-Commerce site were to recommend to the user what people with similar interests/Likes have already bought? This looked like an easy problem to solve, but when we look at it realistically from a social media point of view, there can be hundreds of users for each similar Like, each having bought completely different products, and we may end up recommending the entire product catalog to the user!
After some more research, we realized that a user's behavior can be matched much more closely if we combine more than one Like and see which users overlap that combination the most. For example, people who like Wildlife, National Geographic and Camera probably have more in common when choosing a camera lens than users matched on individual Likes. This solves two problems: identifying a behaviorally similar set of people, and reducing the choice of recommendations in a rational way.
It finally resulted in the problem statement: “If the current user x has y Likes (Lx1..Lxy), and there are other users in the same domain with their own Likes (L11..Lmn), what is the most overlapped group of Likes, considering both the number of Likes in that group and the number of people following that group?”
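The problem statement can be sketched in a few lines of Python (the actual solution ran on HANA; the function name and sample data here are purely illustrative). Each other user contributes one candidate group — the intersection of their Likes with the current user's — and each group carries two scores: its size (match strength) and how many users share it (support).

```python
# A minimal sketch of the problem statement, assuming Likes are plain strings.
from collections import Counter

def overlap_groups(current_likes, other_users):
    current = frozenset(current_likes)
    groups = Counter()
    for likes in other_users.values():
        shared = current & set(likes)
        if shared:
            groups[frozenset(shared)] += 1
    # Rank by group size (match) first, then by user count (support).
    return sorted(groups.items(), key=lambda kv: (len(kv[0]), kv[1]), reverse=True)

users = {
    "u1": ["Wildlife", "Camera", "Travel"],
    "u2": ["Wildlife", "Camera"],
    "u3": ["Cooking"],
}
for group, count in overlap_groups(["Wildlife", "Camera", "Hiking"], users):
    print(sorted(group), count)  # → ['Camera', 'Wildlife'] 2
```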
The following example gives an illustration: have a look at the current user's Likes, and how many other users in the target group share how many similar Likes.
Out of the problem domain mentioned above, the key challenge is: how can we find, out of the whole space, the top combinations that are common? To do that naively, we would need to compare the current user against each and every combination of every other user, which would be combinatorially taxing. So we needed a quicker, more performant approach.
Also, depending on the usecase, some would give more weight to the number of Likes in the group, and some to the number of users per group, so we extended the problem to support both metrics. Say there are 5 users sharing 10 Likes with the current user: that indicates a small number of users but a strong similarity (we call this Match). Or there could be 100 users sharing 3 Likes, which indicates a huge trend with limited similarity (we call this Support). Which metric matters more depends on the usecase and the Likes involved, so we provide both in sorted order and let the analyst make the better choice.
The following diagram explains the entire workflow of the usecase we want to solve, from the raw data at the bottom to the product recommendations at the top:
We started looking into the Apriori algorithm, as it solves a similar problem in the domain of market basket analysis: individual baskets are recorded as users buy them, and the algorithm can predict the probability of the products likely to be bought next.
Say we record the following purchases with Apriori:
- {P1, P2, P3}
- {P2, P3}
- {P1, P2}
Now we can ask the probability of someone buying {P2} if they have already bought {P1}, and calculate how much support and confidence we have in that.
In this example, for {P1} -> {P2}, the Support is 2 and the Confidence is 100%, i.e., people always buy P2 when they buy P1, based on 2 occurrences in the recordset.
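The support and confidence numbers above can be checked with a short helper (a sketch, not the real Apriori implementation; support is counted as absolute occurrences, as in the text):

```python
# Compute support and confidence for a single association rule lhs -> rhs.
def rule_stats(transactions, lhs, rhs):
    lhs, both = set(lhs), set(lhs) | set(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= set(t))    # baskets with LHS
    both_count = sum(1 for t in transactions if both <= set(t))  # baskets with LHS and RHS
    confidence = both_count / lhs_count if lhs_count else 0.0
    return both_count, confidence

baskets = [{"P1", "P2", "P3"}, {"P2", "P3"}, {"P1", "P2"}]
print(rule_stats(baskets, {"P1"}, {"P2"}))  # → (2, 1.0)
```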
But Apriori has no concept of an input set, as required by our problem (i.e., take the Likes of the current user and compare the rest of the sets for matches). If we could somehow bring {OctoberFest, Ferrari, Photography, Nikon} onto the LHS, Apriori would automatically spit out the necessary support combinations, solving our primary problem of bubbling up all the combinations. But that cannot happen, as at least one value of that set has to be on the RHS (see the Normal method in the diagram below)!
The trick we used to make that happen is to add a dummy Like called “Target”, which ensures all 4 Likes land on the LHS. From here, Apriori takes over and gives us all the support combinations. In fact, it gives every combination, but we can set a pre-defined RHS filter of “Target”. It works, but with a major disadvantage: the algorithm still has to run through far more combinations than necessary for our purpose. We were even considering extending Apriori itself at the code level until we struck upon a different path altogether.
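A plain-Python imitation of the “Target” trick (not the real Apriori code) may make it concrete: the dummy Like is appended to every transaction, so every rule X -> {Target} holds, and filtering rules on RHS == "Target" surfaces each LHS combination X with its raw support. The nested loops below enumerate exactly those combinations, which also illustrates the stated disadvantage: all subsets get generated, necessary or not.

```python
# Sketch: count the support of every LHS combination for rules X -> {"Target"}.
from itertools import combinations
from collections import Counter

def support_via_target(transactions, current_likes):
    current = set(current_likes)
    support = Counter()
    for basket in transactions:
        basket = set(basket) | {"Target"}          # the dummy Like, in every basket
        shared = (basket - {"Target"}) & current   # Likes shared with the current user
        for size in range(1, len(shared) + 1):
            for lhs in combinations(sorted(shared), size):
                support[lhs] += 1                  # support of lhs -> {"Target"}
    return support

others = [["OctoberFest", "Photography"], ["Photography", "Nikon"]]
counts = support_via_target(others, ["OctoberFest", "Photography", "Nikon"])
print(counts[("Photography",)])  # → 2
```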
As we were brainstorming for a solution that is both economical and performant, we thought: why not solve the problem from the DB/SQL point of view rather than as a programmatic algorithm? That suits us fine anyway, as it showcases the capabilities of HANA and its number-crunching abilities.
So we built the solution on the premise of giving each Like a unique identity, which in turn gives each combination a unique identity; then we simply count them. In the above example, let us map each of the input Likes to L1, L2, L3 and L4; the resulting calculations would then look like below. With this, we can calculate the direct combinations quite simply and quickly.
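One natural way to realize the “unique identity” idea (our choice here, mirroring the binary comparison used later) is to assign each input Like one bit, e.g. L1=1, L2=2, L3=4, L4=8. A user's shared combination is then the bitwise OR of those bits, and identical combinations can be counted with a plain GROUP BY style aggregation:

```python
# Sketch: map Likes to bits, derive one integer id per combination, count ids.
from collections import Counter

like_bit = {"L1": 1, "L2": 2, "L3": 4, "L4": 8}

def combination_ids(user_likes):
    ids = Counter()
    for likes in user_likes:
        combo = 0
        for like in likes:
            combo |= like_bit.get(like, 0)  # Likes outside the input set are ignored
        if combo:
            ids[combo] += 1
    return ids

print(combination_ids([["L1", "L2"], ["L2", "L1"], ["L1", "L3"]]))
# → Counter({3: 2, 5: 1})   i.e. {L1,L2} twice, {L1,L3} once
```

Note that the id is order-independent: ["L1", "L2"] and ["L2", "L1"] collapse to the same integer 3, which is exactly what makes the aggregation a simple count.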
Though this aggregation gives us the direct combinations of Likes, the “hidden” combinations still need to be found. For example, if we have
{L1, L2, L3} = 10 occurrences
{L1, L2, L4} = 20 occurrences,
then we must deduce that {L1, L2} = 30 occurrences. We overcame this problem using L-script within HANA with a binary comparison method.
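With the bitmask identities (L1=1, L2=2, L3=4, L4=8), the binary comparison reduces to one bitwise test: a combination S is contained in a recorded combination C exactly when C & S == S, so the hidden total for S is the sum over all its supersets. A Python sketch of the idea (the original ran as L-script inside HANA):

```python
# Sketch: deduce the count of a "hidden" sub-combination from direct counts.
def hidden_count(direct_counts, subset_mask):
    return sum(count for mask, count in direct_counts.items()
               if mask & subset_mask == subset_mask)  # mask is a superset of subset_mask

direct = {0b0111: 10,   # {L1, L2, L3} seen 10 times
          0b1011: 20}   # {L1, L2, L4} seen 20 times
print(hidden_count(direct, 0b0011))  # {L1, L2} → 30
```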
The key advantage of this approach is performance, especially when run on a flat data model without joins. In our experiments on a typical development box, 1 million Like transactions could be processed in under a second to calculate the Matches and Supports (with the L-script for hidden matches obviously taking most of the time).
Once we have these combinations, it is only a matter of fetching the right products used by each combination and parameterizing the query so results can be ranked for a high Support and/or a high Match.
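That final step can be sketched as follows; the `recommend` helper and the sample purchase data are hypothetical, standing in for the product lookup described above:

```python
# Sketch: recommend the products bought by the users behind the chosen
# Like combination, weighted by purchase frequency.
from collections import Counter

def recommend(combo_users, purchases, top_n=3):
    # purchases: dict of user -> list of products bought
    counts = Counter(p for u in combo_users for p in purchases.get(u, []))
    return [product for product, _ in counts.most_common(top_n)]

purchases = {"u1": ["Lens-A", "Tripod"], "u2": ["Lens-A"], "u3": ["Bag"]}
print(recommend(["u1", "u2"], purchases))  # → ['Lens-A', 'Tripod']
```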
So the key take-away, at least for us, is that we can take seemingly simple problems and extend their scope in such a way that the problem itself is as innovative as the solution made possible on HANA.
Your thoughts are welcome, especially around additional usecases in this space, both functional and technical, which can enrich the marketing domain.
(Table residue from the illustration above: users in the target group and the number of Likes each shares with the current user — 14, 11, 9, 9, 7, 6, 6, 5, 5 and 5; the user identifiers themselves were not recoverable.)