In this blog, I will discuss an algorithm used to create an analytic application that would predict risks using ML algorithm.
Note: This is my first blog ever, so if you find any mistakes, please leave a comment. I will try to correct it to the best of my ability.
Overview
There are two types of the algorithm discussed in this blog post.
1. Aggregation algorithm: This is the main algorithm aggregating the data residing in the database.
2. Machine learning algorithm: Bayes theorem is used in this blog primarily to demonstrate the flexibility of using any algorithm that fits the requirement.
This blog discusses combining these two algorithms in SAC analytic applications to create a tax prediction tool.
Figure 1: It shows the prediction tool in SAC; Source: Author’s own illustration
As seen in the figure (Figure 1), It is a risk prediction tool created in the SAP analytics cloud which lets the user select a hypothesis, risk type and risk name based on this information, and it gives a prediction using Bayes theorem.
The main advantages of such a tool are as follow:
- The application automatically adjusts the output as new data is ingested into the database. It saves the need to train the data to get insight. The end user can leverage the benefit of such tools without knowing much about machine learning.
- It provides real-time prediction, i.e. the changes are reflected in the application as soon as the changes take place in the database table.
- It allows implementing of a wide range of machine learning algorithms to solve business problems such as linear regression, multiple regression and more.
Unfortunately, the actual code is very long, so I chose only to share the algorithm. I hope the algorithm helps in other use cases.
The algorithm for creating prediction tool in SAC
I used the following algorithm in the SAC Analytic application for data aggregation and applied the output from the data aggregation to the Bayes formula for calculating risk prediction. The algorithm is as follows:
Step 1: Create array A1 which constitutes the data from the fact table (Figure 2).
Step 2: Create an array B1 which comprises an array of dimension members from the data source.
Step 3: Loop over each element of B1 and count the number of data of that dimension in A1.
Step 4: Create an array C1 to store the count of the array created in the previous step.
Step 5: Get the value for each element of B1[x] at C1[x]
Step 6: Get the length of A1
Step 7: Calculate the probability of risk
Step 8: Get the count for the selected impact (Hypothesis)
Step 9: Calculate the probability of the impact
Step 10: Count the element of B1[x] when A2[x] is equal to the given impact (Hypothesis) (Figure 3).
Step 11: Store count from the previous step in a variable.
Step 12: Calculate P(Risk|Impact)
Step 13: Perform arithmetic calculation for the formula of Bayes Theorem.
Step 14: Return the output
Figure 2: The figure shows the logic for Algorithm steps 1 to 4; Source: Author’s own illustration.
Description of the Algorithm logic
The first four steps of the algorithm are used to bring all the required data from the model dimension and fact table and store these data in the array. The arrays are later used for creating another array that holds the count of data found in the fact table concerning the dimensions. After these tasks are achieved, the next step is to create a subset of this sample data based on the hypothesis selection by the end user. The selection made by the user using the second dropdown (which allows one to choose the risk name) is similar to an element at B1[x], where x is the address of the data in the array. It retrieves the count value of the risk B1[x] from C1[x], where x is the address of the data in the array. Calculates the length of the fact table and uses the length to calculate the Probability of Risk P(Risk) (Figure 2).
The user chooses a hypothesis for the impact using the radio button. The selection made on the radio button is compared on the A2 array (dimension having binary data), and the count is saved in a variable used to calculate the Probability of the Impact P(Impact) (Figure 3).
Moreover, P(Risk|Impact) is calculated using the number of risks B1[x] in column A2, where A2 equals the impact. Finally, after calculating P(Risk), P(Impact), and P(Risk|Impact), this calculation is passed down to a function that uses the formula for the Bayes theorem and returns the output of the posterior probability.
Figure 3: The figure shows the logic for Algorithm steps 10 to 12; Source: Author’s own illustration.
Testing the output of prediction tool
In this section, The correctness of the output from the risk prediction application is validated using the manual method of calculating the Bayes theorem. The first step is to check the data points from the fact table in the SAC model, and the next step is to calculate the probabilities used in the Bayes theorem and then use these probabilities to get the final prediction.
Figure 4: It shows the output of the risk prediction tool based on given input; Source: Author’s own illustration.
As shown in the figure above (Figure 4), the tax risk prediction tool predicts that the hypothesis of impact being "High" for the risk type "compliance" and risk "Data Integrity" is 35%.
Figure 5: The figure shows the fact table data in the model; Source: Author’s own illustration.
Manual calculation: The manual calculation is required to check the correctness of the tool. The Bayes theorem requires the calculation of posterior probability (P(Impact|Risk)), which requires probabilities such as P(Risk), P(Impact), and P(Risk|Impact). The manual calculation of data points requires manually calculating each data point from the table. In the above figure (Figure 5), it can be seen that the total number of rows is equal to 19. The compliance risk "Data Integrity" comes up only three times, and the probability of the "Data Integrity" turning into a high impact risk is two times out of 19 rows, i.e., 2/19.
P(H)= 10/19= 0.52
P(E)= 3/19= 0.157
P(E|H)= 22/19= .105
P(H|E)= (P(E|H)*P(H))/P(E)
=(0.105*0.52)/0.157
=0.347 ~ 0.35 = 35%
Testing the AutoML capability of prediction tool
As discussed earlier, the Risk prediction tool may update the prediction values in real-time as new information is added to the database. This demonstration shows that the initial prediction value changes when new data is entered into the database.
In the below figure (Figure 6), It can be seen that the risk type is "Management", and the risk name is "Experienced tax people leaving the company". The probability of the impact of this risk being high is 52%.
New risk data is added to the database (Figure 7). The data added is only for the "Experienced tax people leaving the company", and the impact is set to High. as shown in the figure
The tax prediction tool updates the predicted value in real-time to 55% for the same risk type and risk name (Figure 8).
It demonstrates the tax risk prediction tool's capability to give a real-time prediction as the data changes in the database. The query automatically updates the result without retraining data. It shows the ability of AutoML in the SAC analytic application.
Figure 6: It show the risk prediction based on the given input.
Figure 7: It shows adding new data using CRUD APIs; Source: Author’s own illustration
Figure 8: It shows the risk prediction based on the given input; Source: Author’s own illustration
Conclusion:
This blog demonstrates that the SAC Analytic application allows the flexibility of creating applications with AutoML capabilities. The algorithm and the use case shown and discussed in this blog aren't limited to creating a risk prediction tool. The algorithm can leverage other machine learning algorithms (such as Naive Bayes Classifiers, Linear Regression, Multiple Regression etc.) to solve classification and regression problems.
Please share your thoughts and feedback in the comment. If you like this blog, please follow me for more similar content.
Further Reading:
- Creating a predictive analytic application using R in SAP Analytics Cloud