First, I want to thank my mentor and supervisor sarah.detzler for her help and advice.
As a data scientist I put a lot of work into analyzing my data and creating the right model to predict the future. Time is often of the essence, and depending on the target group the results also have to be visualized in different ways. Imagine having a very complex model and explaining your results to your manager or a business group on a regular basis. An easy and efficient solution is to create a dashboard, which can even be connected to live data. Once set up, it saves us valuable time, which we can instead spend on creating awesome models in R or Python. We can even keep our familiar R environment with the R Visualization feature in SAP Analytics Cloud (SAC). Further, the R Visualization tool can handle local datasets and, even better, live data using the live connectivity for certain systems, see this link. In addition, we can filter the plots created with the R Visualization tool in the SAC, and we can even link plots created in the SAC story mode with our R Visualizations. Hence, we can insert R Visualizations into our story, interact with them using the SAC controls and share them with other users.
As an example, we will tackle a use case based on simulated financial data and create a histogram with different reference distributions using ggplot2. A histogram is a graphical tool to represent the distribution of numerical data. The entire range of a single continuous variable is divided into intervals, and then we count how many observations fall into each interval. We focus on the ggplot2 package since it is probably the most commonly used library in R for creating visualizations. Other R libraries are also available on the SAP R server. A list can be found under the following link.
Before we get started with our visualizations, we should check a couple of things. First, make sure you are connected to an R server. Different options are available: for example, you may use your own R server or the SAP R server. Checking the connection requires admin rights; if your user does not have them, please ask your system admin for help. To check, go to the main menu and select “Administration” under “System”:
Then proceed to “R Configuration” at the top of the screen.
Check whether you are connected to the SAP R server runtime environment or to your remote R connection. If this is the case, we can proceed with our tutorial using the R Visualization in the SAC. For more information on how to set up the R server, please refer to this blog.
Second, let’s have a look at our profile settings and make sure the number formatting is set to “1,234.56”. Otherwise, the dataset won’t be recognized correctly in the SAC.
Before we create our amazing plot, we should have a look at our dataset. The dataset can be found
here.
The first column labels our data, which is important for mapping later when we create our model. For our example we simulated fictitious log returns from a t-distribution, shown in the third column of the dataset. A t-distribution is a commonly used parametric probability distribution. The second column gives the log price of our imaginary stock. When data is scarce or restricted due to data protection rights, a possible solution is to make the data unrecognizable or to simulate it. If you are interested in how the data is simulated, have a look at the R code, which can be found here.
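To give an idea of what such a simulation might look like, here is a minimal sketch. It is not the linked script. The sample size, the degrees of freedom, the scaling factor, the starting price and the column name logprice are all assumptions made for illustration; only the column names Nr and logreturns and the file name Simulated_data.csv are taken from this tutorial. The log price is built here as the cumulative sum of the log returns, which is one common convention.

```r
# Hypothetical simulation of a dataset shaped like the one in this tutorial.
set.seed(123)                      # make the example reproducible

n    <- 1000                       # number of observations (assumption)
df_t <- 4                          # degrees of freedom of the t-distribution (assumption)

# Log returns drawn from a scaled t-distribution
logreturns <- 0.01 * rt(n, df = df_t)

# Log price: assumed start at log(100), then accumulate the log returns
logprice <- log(100) + cumsum(logreturns)

# Column order follows the text: label, log price, log returns
Simulated_data <- data.frame(
  Nr         = seq_len(n),
  logprice   = logprice,           # hypothetical column name
  logreturns = logreturns
)

# write.csv() uses "." as the decimal separator, which fits the
# "1,234.56" number formatting set in the SAC profile.
out_file <- file.path(tempdir(), "Simulated_data.csv")
write.csv(Simulated_data, out_file, row.names = FALSE)
```

Writing to tempdir() keeps the sketch free of side effects; for an actual upload you would point out_file to a folder of your choice.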
Let’s upload the data set into the SAC. To do so, we click on the menu on the top left, select “Create” and click on “Dataset”.
We then select “Data uploaded from a file”.
We select the source file “Simulated_data.csv”, click “Import” and then “Ok”.
After importing the dataset, we want to convert our data into a model. Hence, click “Create” and then select “Model”. The following window will open:
We choose “Dataset” under “Acquire data” and find our uploaded dataset. Now we must check that all the variables have the correct type. We need to convert the variable “Nr” (the left column in the dataset) into a generic dimension. Click on the left column and choose “Generic Dimension” under “Type”. This step is important for mapping the data correctly later; otherwise, the histogram cannot be plotted. The other variables should be set as measures.
Then press “Create Model” in the bottom right corner. Save the model as “Simulated_data” so that the prepared R script runs correctly later. The model is the basis for creating our story.
Now let’s create our story in the SAC. Click on the menu on the top left, select “Create” and click on “Story”.
There are several options available to start our story, shown on the right: a Smart Discovery, a Canvas Page, a Responsive Page or a Grid Page. With a Smart Discovery, for example, we get a great starting point in the form of suggestions on how to visualize our data. It automatically creates a dashboard for us, which we can flexibly adjust and extend. But we want to start from scratch with a Canvas or Responsive Page. Since we focus on plotting a histogram, we start with a Responsive Page to create a flexible dashboard that automatically adjusts to mobile devices. Hence, choose “Add a Responsive Page” to start our story.
This will give us the following responsive page, where we can insert different visualizations like scatterplots, bubble charts, bar charts, heat maps and many more.
But first we need to link our story with our model. To do so, click “Data” in the top left and choose “Data acquired from an existing model”. Then find your previously created model.
Then go back to the story mode by clicking “Story” in the top left. Now we want to create an R Visualization. Hence, click on the plus sign under “Insert” and choose “R Visualization” at the bottom of the available options.
The builder opens, and we first must add the input data. Hence, press “+Add Input Data”.
We create a dimension from the number label which we saved earlier as a generic dimension in the step where we created the model. Then press “OK”.
Now we want to add the script for our R Visualization. The simplest way is to prepare it in your usual R coding environment, such as RStudio, and then copy it into the builder. For this use case the R script is already prepared here as an R file.
Now press “Add Script” and expand the window of the R editor. The uploaded data from our model can be seen in the environment.
General information about the ggplot2 package can be found under the following link. Before we can create our histogram, we have to load the ggplot2 library. Next, we create a ggplot object with ggplot() and then add further layers. The aesthetic mappings describe how the variables in our data are mapped to visual properties. We set the aesthetic in the ggplot() function to the variable logreturns. In the next step we use the geom_histogram() function to visualize the distribution of a single continuous variable. With geom_density() and stat_function() we add the kernel density estimate and the density of the normal distribution to the plot. The normal distribution is commonly used as a reference distribution. With real-life data we wouldn’t know the underlying parametric distribution of the process.
Further, we want our R Visualization to look like the plots created in the SAC. Therefore, we add a specific theme to our plot, which we describe next. In the theme() function we set the panel grid to dashed and light grey. The panel background is grey by default, but we set it to white to match the plots in the SAC. In addition, we can set a panel border. In the last step we add the legend to our plot and label the x- and y-axis. Now, copy the R code into the “Editor” window and click “Execute”.
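The steps described above could be sketched roughly as follows. This is not the prepared script linked earlier but a minimal illustration. It assumes that the SAC exposes the model as a data frame named Simulated_data with a column logreturns; the stand-in data at the top, the number of bins, the colours and the axis labels are assumptions. Note that after_stat() requires ggplot2 3.3.0 or newer; on older versions ..density.. would be used instead.

```r
library(ggplot2)

# Stand-in data for running the sketch outside the SAC (assumption).
# In the SAC editor the data frame comes from the model, so this is skipped.
if (!exists("Simulated_data")) {
  set.seed(42)
  Simulated_data <- data.frame(Nr = 1:1000,
                               logreturns = 0.01 * rt(1000, df = 4))
}

# Parameters of the normal reference distribution:
# same mean and standard deviation as the log returns
mu    <- mean(Simulated_data$logreturns)
sigma <- sd(Simulated_data$logreturns)

p <- ggplot(Simulated_data, aes(x = logreturns)) +
  # Histogram on the density scale so the curves are comparable
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 fill = "grey80", colour = "white") +
  # Kernel density estimate
  geom_density(aes(colour = "Kernel density")) +
  # Normal density with matching mean and standard deviation
  stat_function(aes(colour = "Normal distribution"),
                fun = dnorm, args = list(mean = mu, sd = sigma)) +
  # Legend entries and colours
  scale_colour_manual(name = NULL,
                      values = c("Kernel density" = "red",
                                 "Normal distribution" = "green")) +
  labs(x = "Log returns", y = "Density") +
  # Theme adjustments to approximate the look of SAC charts
  theme(
    panel.grid.major = element_line(colour = "lightgrey", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white"),
    panel.border     = element_rect(colour = "grey", fill = NA),
    legend.position  = "bottom"
  )

print(p)
```

Mapping the colour inside aes() to a constant label is what makes ggplot2 create a legend entry for each curve; setting the colour outside aes() would draw the curves without a legend.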
In the bottom right, we can see a preview of our plot, and in the top right the environment shows our data just like in RStudio. If everything worked correctly, press “Apply” to add your plot to your story.
It is best practice to add a title or a subtitle to our plot. Further, we could change, e.g., the background with the “Designer”.
In our histogram we can see the probability distribution of the log returns. The red curve is the kernel density, a non-parametric estimate of the probability density function of a random variable. The green curve represents the normal probability distribution with the same mean and variance as the log returns. The normal distribution is added as a reference point to get an intuition of how the log returns are distributed. Here we know that they are simulated from a t-distribution, but in practice the underlying distribution is often unknown. Therefore, adding several reference distributions can provide us with more information on how we must treat the data.
A histogram can provide us with different business insights. First, we can check whether the distribution is skewed to the left or to the right, in which case we should rely on the median rather than the mean to represent the center of the data. Or you may be interested in the span of the data, e.g. if you are analyzing the spread of salaries in your company. Further, we can identify outliers, which might represent unusual cases, e.g. data entry errors or extremely high or low values.
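The visual impressions from the histogram can be backed up with a few numbers. The following base R sketch compares mean and median, computes a simple moment-based sample skewness, reports the span and flags outliers. The 1.5 × IQR rule used for the outliers is just one common convention and is an assumption here, as is the stand-in vector x, which would be replaced by Simulated_data$logreturns.

```r
# Stand-in for Simulated_data$logreturns (assumption for a self-contained example)
set.seed(1)
x <- 0.01 * rt(1000, df = 4)

# Center of the data: a large gap between mean and median hints at skewness
centre <- c(mean = mean(x), median = median(x))

# Simple moment-based sample skewness (negative: left-skewed, positive: right-skewed)
skewness <- mean((x - mean(x))^3) / sd(x)^3

# Span of the data
span <- range(x)

# Outliers according to the 1.5 * IQR rule
q      <- quantile(x, c(0.25, 0.75), names = FALSE)
lower  <- q[1] - 1.5 * IQR(x)
upper  <- q[2] + 1.5 * IQR(x)
outliers <- x[x < lower | x > upper]

print(centre)
print(skewness)
print(span)
print(length(outliers))
```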
Congratulations, you have finished this hands-on tutorial successfully! Have fun creating your own awesome R Visualizations.
If you want to learn more about R Visualizations in the SAC, have a look at these blog posts:
How to Leverage R Visualization Feature in SAP Analytics Cloud