In CUBIST (see last blog post), the main means for conducting analytics is based on a theory called Formal Concept Analysis (FCA). Analytics based on FCA is quite different from traditional BI analytics: the focus in FCA is not on quantitative data analysis (“show me the numbers”) but on qualitative data analysis (“show me dependencies and meaningful clusters”, so to speak). Another means currently under discussion for future BI tools is graph-based Visual Analytics (keep in mind that I come from an SAP Research background; I cannot give or even commit to any details of future SAP BI products). This blog entry provides some preliminary thoughts of mine on different ways of analyzing data, in order to compare the following Visual Analytics means:
- Traditional BI Visual means (here: a bar chart)
- A graph-based visualization (here: force-based layout)
- A visualization based on Formal Concept Analysis (here: concept lattices)
In order to compare these means, let us consider the following fictitious toy data set (though skills like “IE -> Information Extraction”, “ST -> Semantic Technologies”, etc. are indeed skills needed in CUBIST 😉 ).
| Skill | Persons with that skill |
|---|---|
| IE | Anja, Ben, Ernst, Fred, Ken |
| ETL | Chris, Fred, Mark |
| BI | Ben, Chris, Fred, Lemmy, Mark, Naomi |
| ST | Anja, Diana, Ernst, Fred, Gerald, Harriet, Ken, Owen |
| FCA | Anja, Diana, Gerald, Harriet, Ian, John, Ken, Owen |
| VIZ | Anja, Diana, Ian |
There are different possible information needs for a dataset like this. E.g. the following questions might be asked:
- Show me the count of people for a given skill.
- Show me the skills and how many people share each pair of skills, so that I get an idea of how strongly skills are related.
- Show me the skills and people such that I get an idea of the distribution of skills among people and dependencies between skills.
To me, each of these three questions is best answered with a different Visual Analytics means. The first question is best answered with a traditional chart, e.g. a bar chart. So we transform the initial dataset into the number of people per skill and then build the corresponding bar chart.
Of course, it is easy to immediately read off some information, e.g.
- ST and FCA are the skills most people have
- ETL and VIZ are the skills the fewest people have
However, the chart is slightly misleading, as most people are counted several times (once for each skill they have).
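As a minimal sketch of the transformation behind such a bar chart, the toy data set can be reduced to one count per skill. Python is chosen here purely for illustration and is not the tooling used in CUBIST; the variable names (`skills`, `counts`, `total_bars`, `distinct_people`) are made up for this example.

```python
# Toy data set from the table above: skill -> persons with that skill
skills = {
    "IE": ["Anja", "Ben", "Ernst", "Fred", "Ken"],
    "ETL": ["Chris", "Fred", "Mark"],
    "BI": ["Ben", "Chris", "Fred", "Lemmy", "Mark", "Naomi"],
    "ST": ["Anja", "Diana", "Ernst", "Fred", "Gerald", "Harriet", "Ken", "Owen"],
    "FCA": ["Anja", "Diana", "Gerald", "Harriet", "Ian", "John", "Ken", "Owen"],
    "VIZ": ["Anja", "Diana", "Ian"],
}

# Transformation for the bar chart: one number per skill
counts = {skill: len(persons) for skill, persons in skills.items()}

# A text-only stand-in for the bar chart, longest bar first
for skill, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{skill:<4}{'#' * n} {n}")

# Why the chart is slightly misleading: the bars add up to more
# than the number of distinct people, because people with several
# skills are counted once per skill.
total_bars = sum(counts.values())
distinct_people = set().union(*skills.values())
print(total_bars, "bar units for", len(distinct_people), "distinct people")
```

The gap between the two numbers in the last line is exactly the multiple counting mentioned above: the bars sum to 33, although the data set contains only 15 different people.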
The second question is better answered by inspecting a graph-based visualization. For each pair of skills, we count the number of people who have both skills and build a corresponding graph, in which the strength of a link reflects that count.
Now we can easily read off information like this:
- The skills FCA and ST are strongly related (because the link between them is strong)
- The skills BI and ST are only weakly related (because the link between them is weak)
- No one has knowledge of both FCA and ETL (because there is no link between FCA and ETL)
But we don’t get information about the distribution of people amongst skills.
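The edge weights of such a graph can be sketched as follows: for every pair of skills, count the people who have both, and keep only pairs with at least one shared person. This is only an illustrative Python sketch; the name `edges` and the choice to drop zero-weight pairs are assumptions of this example, and the force-based layout itself is not computed here.

```python
from itertools import combinations

# Toy data set from the table above: skill -> set of persons with that skill
skills = {
    "IE": {"Anja", "Ben", "Ernst", "Fred", "Ken"},
    "ETL": {"Chris", "Fred", "Mark"},
    "BI": {"Ben", "Chris", "Fred", "Lemmy", "Mark", "Naomi"},
    "ST": {"Anja", "Diana", "Ernst", "Fred", "Gerald", "Harriet", "Ken", "Owen"},
    "FCA": {"Anja", "Diana", "Gerald", "Harriet", "Ian", "John", "Ken", "Owen"},
    "VIZ": {"Anja", "Diana", "Ian"},
}

# One weighted edge per pair of skills that share at least one person.
# Pairs are keyed in the order the skills appear in the table.
edges = {}
for a, b in combinations(skills, 2):
    shared = skills[a] & skills[b]
    if shared:
        edges[(a, b)] = len(shared)

# Strongest links first
for (a, b), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: {weight}")
```

In this sketch the ST/FCA edge gets weight 6, while the ETL/FCA pair produces no edge at all. Note that the edge weights no longer tell us which people are behind each number.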
Finally, for the last question, FCA comes into play.
The diagram is read as follows: If you want to know which persons have a specific skill, find the node with that skill and look at all persons in nodes you can reach from the skill node going downward. Vice versa, to find the skills of a person, find the node with that person and look at all skills in nodes you can reach from the person node going upward. We can read off information like this:
- Owen, Harriet and Gerald have exactly the same skills (because they belong to the same node)
- Whoever is skilled in ETL is skilled in BI, too (because the BI-node is above the ETL-node)
- Anja has more skills than Ken, and Ken has more skills than Ernst (because the nodes are ordered that way)
But this diagram is somewhat unfamiliar, and it is hard to read off the number of persons per skill.
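To make the nodes of such a diagram a bit more tangible, here is a minimal Python sketch that enumerates the formal concepts (pairs of a person set and a skill set) of the toy data set. It uses a simple approach that is fine for small data: every set of persons in a concept is an intersection of skill columns, so we close the columns under intersection. The helper names (`intent`, `extents`, `same_node`) are made up for this example, and no lattice layout is computed.

```python
from collections import defaultdict

# Toy data set from the table above: skill -> set of persons with that skill
skills = {
    "IE": {"Anja", "Ben", "Ernst", "Fred", "Ken"},
    "ETL": {"Chris", "Fred", "Mark"},
    "BI": {"Ben", "Chris", "Fred", "Lemmy", "Mark", "Naomi"},
    "ST": {"Anja", "Diana", "Ernst", "Fred", "Gerald", "Harriet", "Ken", "Owen"},
    "FCA": {"Anja", "Diana", "Gerald", "Harriet", "Ian", "John", "Ken", "Owen"},
    "VIZ": {"Anja", "Diana", "Ian"},
}
persons = set().union(*skills.values())

def intent(ext):
    """Skills shared by every person in `ext` (all skills if `ext` is empty)."""
    return frozenset(s for s, ps in skills.items() if ext <= ps)

# Close the skill columns under intersection, starting from "all persons"
extents = {frozenset(persons)}
for ps in skills.values():
    extents |= {e & ps for e in extents}

# Each concept (one node of the lattice) pairs a person set with its shared skills
concepts = sorted(((e, intent(e)) for e in extents), key=lambda c: -len(c[0]))
for ext, skl in concepts:
    print(sorted(skl), "<-", sorted(ext))

# Persons with exactly the same skills end up at the same node
same_node = defaultdict(set)
for p in persons:
    same_node[intent({p})].add(p)
print(sorted(same_node[frozenset({"ST", "FCA"})]))

# "Whoever is skilled in ETL is skilled in BI, too"
print(skills["ETL"] <= skills["BI"])

# Anja has strictly more skills than Ken, and Ken strictly more than Ernst
print(intent({"Ernst"}) < intent({"Ken"}) < intent({"Anja"}))
```

The last three checks correspond to the three observations listed above. For larger data sets the number of concepts can grow very quickly, so this brute-force closure is only meant as an illustration.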
Having said that, all the diagrams certainly have their pros and cons, and each of them is suited for specific types of questions. Here is a first comparison of my own:
| | Pros | Cons |
|---|---|---|
| Bar chart (and similar means) | Many well-known visualizations; good (readable and comprehensible) layouts; good for analyzing numbers | Loss of information (which people); misleading for overlapping attributes (people are counted several times); does not utilize relationships between entities |
| Graphs | Attractive visualizations; (relatively) easy to understand; utilizes and shows links between entities (skills) | Loss of information (which people); bad for analyzing numbers |
| FCA visualizations | No loss of information; meaningful clusters in one node; shows dependencies between entities (both people and skills) | Number of nodes might explode; finding a good layout is unsolved (the nice layout in the example is accidental and was created manually); unfamiliar means for analytics; scalability; bad for analyzing numbers |
To me, the different visualizations really complement each other; future BI tools should provide all of these types of visualization (for example, side by side with linking and brushing).
Anyhow, my knowledge is limited, so I want to conclude this post with some questions to my readers:
- Do you agree with my thoughts, or do you have different viewpoints on the value of different Visual Analytics means?
- What value do you see in non-standard Visual Analytics means (like graphs or FCA)? Or, to put it differently: When do you feel so limited by traditional means that you would prefer a graph-based or even FCA-based Visual Analytics approach?
Best
Frithjof