Introduction:
Hello everyone, I am back with an actual requirement that I had to develop on Web Intelligence which got my head spinning for a while.
I was asked to visualize the data distribution for some scores on a report, in other words, users needed to see the histogram of a certain data set.
In this blog I will focus on 2 points:
- Preparing the data and plotting the histogram (Column and line chart).
- Calculating the normal distribution function, known as NORM.DIST() function on excel.
I will be using Web Intelligence 4.2 SP7.
Step By Step Guide:
Preparing the data and plotting the histogram:
So, what's a histogram? It's basically a plot for the occurrence of certain values within a certain range or interval, the key words in the solution is in these 2 words intervals (Bins) and occurrences (Count), this is an example of how my data looks like, this is a sample data of 500 records that I generated on excel to be used on this exercise:
Let's assume that ID is an employee idand and Scores is an employee's score in a certain task, we have an upper and lower range for the scores in my case it's 50 to 100.
First we will need to create the bins:
Create a dimension variable and type in the following formula:
[Bins] =
If [Scores] Between(50;54) Then 0
ElseIf [Scores] Between(55;59) Then 1
ElseIf [Scores] Between(60;64) Then 2
ElseIf [Scores] Between(65;69) Then 3
ElseIf [Scores] Between(70;74) Then 4
ElseIf [Scores] Between(75;79) Then 5
ElseIf [Scores] Between(80;84) Then 6
ElseIf [Scores] Between(85;89) Then 7
ElseIf [Scores] Between(90;94) Then 8
ElseIf [Scores] Between(95;99) Then 9
ElseIf [Scores] >=100 Then 10
You are basically telling it if you find a score within the range 50 to 54, put 0 (assign bin number zero).
We stopped at 54 in order to avoid redundant count between each bin and the next bin, I applied this on percentages data only, I always choose a step of 4, it worked on 4 data examples for me until now, you can change the bin size, given that you will increase or decrease the number of binning conditions accordingly.
I tried thinking of a more generic way to automatically bin them, it could be through loading the data into BW and doing a start or end routine with a loop that assigns a bin for a certain range on a certain step then stops at the maximum score, I will think of that later when I have some time, let me know if you have any ideas.
Note that if your percentages have decimal values you will need to round them to avoid losing samples between integer gaps, for example you could have a 54.5 value, it will end up unassigned in the conditions above. You can use Round([Scores];0), Floor([Scores]) or Celi([Scores]), read the differences between them and decide which one is more suitable for your case.
Note that this will limit the application of this solution on certain cases, I can imagine manual bucketing would be very hard on big statistical data sets.
The bins are a must in order to keep your graph smooth, no sudden drops due to lack of values, suppose you have 50, 51, 52, 55 scores only, no one scored 53, or 54, you will have a sudden drop in your graph, it will look something like this:
Count the occurrence of scores within each bin:
[Count] = Count([Scores]) ForEach([Bins];[ID])
What ForEach does is force the data to be broken down by Bins and ID, you can read more about it, in a nut shell it's like you are giving any visual a table of the scores aggregated by the Bins and ID.
Plotting the histogram:
If you want it to be displayed as bars, you will simply drag a column chart assign the [Bins] to the Category Axis and the [Count] to the Value Axis 1.
There you go:
Now it's all a matter of formatting, let's remove the X axis, Y axis and narrow those bars and the spacing between them a little bit, here is how we can do it:
Removing X axis:
Simply right click the chart>Format Chart>Category Axis>Design>Untick Visible.
Removing Y axis:
Right click the chart>Format Chart>Value Axis>Design>Untick Visible.
Reduce spacing between bars:
Right click the chart>Format Chart>Plot Area>Design>Change spacing between groups value to 2%.
Finally removing the dotted background horizontal gridlines:
Right click the chart>Format Chart>Plot Area>Background>Set the slider of Category Axis Grid Color to 0%.
This is the result:
I also calculated the Mode, Median and Mean, you can calculate the Percentiles, Min and Max put them in a horizontal table to help whoever is doing the analysis, also note that we will need to calculate the Mean and StDev for the normal distribution excercise, here are the equations:
[Mean] =Average([Scores] ForEach([ID]))
[Median] =Median([Scores] ForEach([ID]))
[Mode] = Mode([Scores] ForEach([ID]))
[StDev]= StDev([Scores] ForEach([ID]))
You can also embed the calculations in the chart category axis or title whichever you prefer, here is a sample on the category axis:
Title formula: = "μ ="+[Mean]+" σ="+[StDev]
Right click the chart>Format Chart>Category Axis>Design>Title>Tick visible>Choose custom title put the formula above>make sure you go to design>Tick the show category axis in general, In case you want to hide the bin numbers, untick Display axis and show under layout if you still want to hide the bin numbers.
We can also change the plot to a smooth line chart, we will need to adjust the following options:
Right click the chart>Format Chart>Global>Palette and Style>Untick Line Symbol under Marker and Tick Spline line under Line Effects to change it from a broken line chart to a spline, this is the result:
Calculating the normal distribution function, known as NORM.DIST() function on excel.
I have seen many people asking about this in some old threads, so I searched for the actual formula that NORM.DIST computes, and I was able to calculate it successfully in my report.
The Normal Equation. The value of the random variable
Y is:
Y = { 1/[ σ * sqrt(2π) ] } * e
-(x - μ)2/2σ2
where
X is a normal random variable, μ is the mean, σ is the standard deviation, π is approximately 3.14159, and
e is approximately 2.71828.
Now I will take this as a chance to introduce a very useful new feature in Webi 4.2, it's called cell reference, what it does is exactly as it sounds, references the value within a certain cell in your report, I used this on the table in the picture above to reference the Mean and Standard deviation values, you can also hide the table you are referencing from without any issues.
You simply reference a cell by right clicking it and selecting assign reference, give it a name, here is where you will find them:
Now for the formula:
=(1/([R.StDev]*Sqrt(2*3.14159)))*Exp(-(Power([Scores per ID]-[R.Mean];2))/(2*Power([R.StDev];2)))
If you want to avoid using cell reference or for older versions of Web Intelligence, use this formula:
=(1/([StDev] ForAll([ID])*Sqrt(2*3.14159)))*Exp(-(Power([Scores per ID]-[Mean] ForAll([ID]);2))/(2*Power([StDev] ForAll([ID]);2)))
Here is the webi result compared to the excel NORM.DIST() result, as you can see only 1 decimal difference, most probably because of the pie approximation:
This is everything, I hope you found this useful, Web Intelligence is a very powerful tool and a lot can be done with it, specially after the 4.2 release, I hope SAP is planning to introduce R or Python to it.
I believe that we can do a lot by getting the most out of our tools, you can do far more than normal with the proper mix and match.
If you like this blog post you can check my other ones through these links:
https://blogs.sap.com/2018/12/18/monitoring-bw-loading-status-through-web-intelligence-via-email/
https://blogs.sap.com/2019/03/10/automatically-report-rejected-rows-from-error-stack/
Please feel free to post any questions you have.