This is the sixth article of a series of posts on Worker and People Behavioral Analytics and the “Connected Workforce”, which started with the introductory post (
link to the post is here) which covered the following fundamental questions:
Q1) WHERE IS THE “CONNECTED WORKFORCE” APPLICABLE?
Q2) WHAT MAKES THE “CONNECTED WORKFORCE” VALUABLE?
Q3) WHAT MAKES THE “CONNECTED WORKFORCE” CHALLENGING?
Q4) BUT THEN, WHAT MAKES THE “CONNECTED WORKFORCE” POSSIBLE?
Q5) WHAT WE WILL SEE IN THE NEXT POSTS OF THE SERIES?
After the introductory article, the second article (
link to the post is here) described a specific use case for a client, and covered the questions (also providing our main
ALGORITHM PART 1 towards Episode Extraction):
Q6) WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?
Q7) HOW TO EXAMINE THE ENTRY/EXIT LOG (X1) IN FURTHER DETAIL?
And then, the fourth article (
link to the post is here), continued and also provided the main
ALGORITHM PART 2 (Investigation for unclear episodes and classification into probable non-violation or violation), while answering the questions:
Q8) HOW TO EXAMINE THE TRACKER LOG (X2) TOWARDS OUR INITIAL GOALS?
Q9) HOW TO UTILIZE THE TRACKER LOG (X2) FOR OUR INITIAL GOALS?
Q10) WHAT ARE THE DESIDERATA, STEPS, AND DETAILS OF ALGORITHM PART 2?
And finally, in the fifth article (
link to the post is here), we provided HANA-ML (Machine Learning) code, supported by Jupyter notebooks using Python, and most importantly, provided a glimpse at the repertory of design choices, mathematical formulas and computational algorithms towards behavioral profiles and similarity assessment, and answered the questions (using concrete code from our real-world case study):
Q11) HOW TO CONNECT PYTHON JUPYTER NOTEBOOKS TO HANA?
Q12) HOW CAN WE DEFINE WORKER BEHAVIORAL PROFILES?
Q13) HOW CAN WE CREATE WORKER BEHAVIORAL PROFILES?
Q14) HOW CAN WE DEFINE AND CREATE THE SIMILARITY OF BEHAVIORAL PROFILES?
In this article, the sixth in the series, as promised, we will focus on a number of possible extensions, and investigate:
Q15) IS THERE A USABLE RELATION BETWEEN ENTRY GATES AND FIRST TRACKER SIGHTINGS?
Q16) HOW CAN ONE CLUSTER WORKERS USING SIMILARITY MATRICES?
Q17) HOW CAN ONE VISUALIZE THE BEHAVIORAL SIMILARITY OF ALL WORKERS BY EMBEDDING THEM IN A LOWER-DIMENSIONAL SPACE?
Let us thus start!
Q15) IS THERE A USABLE RELATION BETWEEN ENTRY GATES AND FIRST TRACKER SIGHTINGS?
This question was proposed by our client. The potential benefit of such a relation would be, for example, to use the first tracker sighting as a "proxy" for entering through the gates, even when the gate entry is not registered. Of course, we also investigated the complementary problem: whether the last tracker sighting in an episode has a usable relation with exit gate events.
We started by collecting the "first" tracker sightings of each episode, as well as their time differences from the gate entries:
#EXTENSION #1:
# Check the first beacon (or plant) that a worker is ticked in just after GATE IN in an episode
# Check the last beacon (or plant) that a worker is ticked in just before GATE OUT in an episode
# Find greatest timestamp to use as end-time for open-ended events
ytempTS = y.select("TIMESTAMP")
ytempTSmax = ytempTS.max()
print("ytempTSmax", ytempTSmax)
maxcount = 200 #UID.count()
count = 0
This is followed by our main loop:
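A minimal sketch of what such a loop could look like is given here, on plain Python data rather than on the HANA dataframe. The input format (one list of episodes per worker, each with a Gate In time and time-ordered tracker sightings), the helper names collect_first_sightings and summarize, and the example values are all assumptions made for illustration; only the output variable names follow the ones used in this post.

```python
from collections import Counter

def collect_first_sightings(episodes_per_uid):
    """Per worker and episode, keep the first tracker sighting after Gate In.

    episodes_per_uid is a hypothetical input: one list of episodes per worker,
    each episode a dict with 'gate_in_ts' (seconds) and 'sightings', a
    time-ordered list of (timestamp_seconds, beacon_id, plant_name) tuples.
    """
    time_diffs, plant_names, beacon_ids = [], [], []
    for episodes in episodes_per_uid:
        diffs, plants, beacons = [], [], []
        for ep in episodes:
            if ep["sightings"]:
                ts, beacon_id, plant_name = ep["sightings"][0]
                diffs.append(ts - ep["gate_in_ts"])
                plants.append(plant_name)
                beacons.append(beacon_id)
            else:
                # Keep index j aligned with the episode even without sightings
                diffs.append(None)
                plants.append(None)
                beacons.append(None)
        time_diffs.append(diffs)
        plant_names.append(plants)
        beacon_ids.append(beacons)
    return time_diffs, plant_names, beacon_ids

def summarize(values_per_uid):
    """Per worker: set of values, number of distinct values, and value counts."""
    sets, sizes, counts = [], [], []
    for values in values_per_uid:
        seen = [v for v in values if v is not None]
        sets.append(set(seen))
        sizes.append(len(set(seen)))
        counts.append(Counter(seen))
    return sets, sizes, counts

# Tiny made-up example with a single worker and three episodes
episodes_per_uid = [[
    {"gate_in_ts": 0.0, "sightings": [(5.4, "beacon-A", "mine gate"), (300.0, "beacon-B", "Road")]},
    {"gate_in_ts": 1000.0, "sightings": [(1426.3, "beacon-B", "Canteen")]},
    {"gate_in_ts": 2000.0, "sightings": []},
]]

FirstBeaconTimeDiffsPerUID, FirstPlantNamesPerUID, FirstBeaconIDsPerUID = collect_first_sightings(episodes_per_uid)
FirstPlantNameSetPerUID, nFirstPlantNameElemPerUID, FirstPlantNameCountPerUID = summarize(FirstPlantNamesPerUID)
FirstBeaconIDSetPerUID, nFirstBeaconIDElemPerUID, FirstBeaconIDCountPerUID = summarize(FirstBeaconIDsPerUID)
print(FirstPlantNamesPerUID[0], FirstPlantNameCountPerUID[0])
```

Episodes without any sighting are kept as None in this sketch, so that index j still refers to the same episode across all the lists-of-lists.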
And again, after the processing loop, we can save the output variables:
%store FirstBeaconTimeDiffsPerUID
%store FirstPlantNamesPerUID
%store FirstPlantNameSetPerUID
%store nFirstPlantNameElemPerUID
%store FirstPlantNameCountPerUID
%store FirstBeaconIDsPerUID
%store FirstBeaconIDSetPerUID
%store nFirstBeaconIDElemPerUID
%store FirstBeaconIDCountPerUID
Now, what remains is to also collect, in analogous structures, the Entry/Exit Gates per episode.
First, we empty the output lists:
GateInElemPerUID = []
GateOutElemPerUID = []
And then, we fill them:
Now, we can start examining the results.
We concentrate on six of our Python lists-of-lists (easily indexable arithmetically):
GateInElemPerUID[i][j]: where i = index of the worker (200 workers in our case, so i = 0...199 in Python's zero-based indexing) and j = index of the episode
FirstBeaconIDsPerUID[i][j]
FirstPlantNamesPerUID[i][j]
LastPlantNamesPerUID[i][j]
LastBeaconIDsPerUID[i][j]
GateOutElemPerUID[i][j]
And of course, we also have at our disposal the relative time difference between the Gate In event and the first BeaconID/PlantName tracker sighting: FirstBeaconTimeDiffsPerUID
But how "constant" are the FirstPlantNames or FirstBeaconIDs per person, especially for those who almost always use the same Gate for Entry?
Unfortunately, not constant enough, especially the BeaconIDs (the situation is a little better for the PlantNames). Let us first see an example, and then investigate entropies:
If one concentrates on Person #0, then his GateInElemPerUID has entries for 54 episodes: 53 of them are through the MINES GATE, with only one through the TOWN GATE (episode 39 of 1...54). Thus, the entry gate is very nearly constant, so one would ideally want the first tracker sighting to be constant enough too. But is it? Let's see:
set(GateInElemPerUID[0]) gives only "MINES GATE" and "TOWN GATE"
And if one displays GateInElemPerUID[0] clearly 53/54 values are "MINES GATE" and only one is "TOWN GATE"
But then:
FirstPlantNameSetPerUID[0] gives:
{'mine gate', 'Canteen', 'IBMD', 'Bus', 'R and D Lab', 'Road', 'Security'}
Which is 7 different first Plant Names for almost always the same Entry Gate! And the distribution is:
({'Canteen': 8,
'mine gate': 22,
'Road': 12,
'R and D Lab': 9,
'IBMD': 1,
'Bus': 1,
'Security': 1})
I.e. their distribution is far from uniform, but it is also far from being concentrated on a single Plant Name.
And if one looks at the First Beacon IDs, then the situation is even worse:
FirstBeaconIDCountPerUID[0] gives:
({'5714397': 6,
'13499586': 3,
'2427005': 1,
'9632434': 2,
'16492267': 3,
'12118927': 9,
'11656797': 1,
'7649429': 1,
'10937610': 5,
'5254328': 1,
'7439': 1,
'2451642': 4,
'3247971': 2,
'6687308': 1,
'15451062': 1,
'8919434': 1,
'652132': 1,
'7505207': 1,
'3781758': 1,
'14292570': 5,
'15508333': 1,
'14253419': 1,
'6976968': 1,
'12473227': 1})
But how general is this observation (the fact that if we take First PlantID or First BeaconID alone, we cannot really predict the Entry Gate)? A classic mathematical way to measure the uniformity of distributions (and here we would like them to be as spiked as possible, and not uniform!) is through Shannon Entropy (an information theory concept). Note also that the maximum entropy of a discrete distribution over n possible values is log2(n), so we can also talk about "Scaled" entropies, with values 0...1, by dividing by this maximum entropy for the given n. This is exactly what we do below, when we calculate the Entropies of FirstBeaconIDCountPerUID and FirstPlantNameCountPerUID across all workers:
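As a minimal sketch of this calculation (not necessarily the exact code used in the case study), a scaled entropy can be computed directly from a histogram of counts. The function name scaled_entropy is an assumption made for illustration; the counts are those of Person #0's first Plant Names shown earlier.

```python
import math
from collections import Counter

def scaled_entropy(counts):
    """Shannon entropy (base 2) of a histogram, divided by its maximum log2(n)."""
    values = [c for c in counts.values() if c > 0]
    n = len(values)
    if n <= 1:
        return 0.0  # zero or one observed value: perfectly "spiked"
    total = sum(values)
    entropy = sum(-(c / total) * math.log2(c / total) for c in values)
    return entropy / math.log2(n)

# First Plant Name counts of Person #0, as listed earlier in this post
person0_plants = Counter({
    "Canteen": 8, "mine gate": 22, "Road": 12, "R and D Lab": 9,
    "IBMD": 1, "Bus": 1, "Security": 1,
})
print(round(scaled_entropy(person0_plants), 2))  # 0.77
```

Applying such a function to every element of FirstPlantNameCountPerUID and FirstBeaconIDCountPerUID would give one scaled entropy per worker, in lists analogous to FirstPlantNamesEntropyScaledPerUID and FirstBeaconIDsEntropyScaledPerUID.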
For the scaled entropies, a value close to 1 means "close to uniform distribution"; which is exactly what we don't want, if we expect the Entry gates to be constant (or at least, for those workers that have nearly-constant entry gates). So let's see what distributions of Scaled Entropies we get:
import statistics
print(statistics.quantiles(FirstBeaconIDsEntropyScaledPerUID))
print(statistics.quantiles(FirstPlantNamesEntropyScaledPerUID))
What we get is:
[0.8249528221342328, 0.8713710372266097, 0.9101859010533151]
[0.6452143429161679, 0.7201279904660798, 0.8351088474653205]
Which means that:
Median FirstBeaconID scaled entropy = 0.87
(very high! as is, we can never use First BeaconID as predictor of Entry GateID)
Median FirstPlantName scaled entropy = 0.72
(a little better, but still prohibitive!)
But not all is lost. We still haven't examined a very important measurement:
The time difference between Gate Entry and the First Tracker Sighting!
(be it BeaconID or PlantName)
So, let's have a look again at Person #0, at the time difference between entry and First Tracker Sighting (FirstBeaconTimeDiffsPerUID[0]), and try to see if it has any apparent relation to the First Plant Name (FirstPlantNamesPerUID[0]):
(Times given in Seconds:)
'Canteen', 426.332,
'mine gate', 5.409,
'Road', 1193.278,
'R and D Lab', 159.349,
'mine gate', 32.29,
'mine gate', 6.232,
'Road', 1350.091,
'R and D Lab', 209.039,
'Canteen', 435.111,
'mine gate', 8.167,
'mine gate', 8.927,
'Canteen', 467.855,
We are very fortunate! Although the entropy is large overall, if one imposes a time filter (i.e. restricts the observations to those that are within, for example, a minute (threshold = 60 seconds) of the Gate Entry), then indeed we start to get small entropy (which means, most importantly, pretty good constancy, and thus easy mutual predictability, for almost-constant Entry Gates!).
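The effect of the time filter can be checked on the twelve (Plant Name, seconds) pairs listed above. The following is only an illustrative sketch; the helper name entropy_bits and the variable names are assumptions.

```python
import math
from collections import Counter

# (first Plant Name, seconds from Gate In) pairs, copied from the listing above
first_sightings = [
    ("Canteen", 426.332), ("mine gate", 5.409), ("Road", 1193.278),
    ("R and D Lab", 159.349), ("mine gate", 32.29), ("mine gate", 6.232),
    ("Road", 1350.091), ("R and D Lab", 209.039), ("Canteen", 435.111),
    ("mine gate", 8.167), ("mine gate", 8.927), ("Canteen", 467.855),
]

def entropy_bits(labels):
    """Shannon entropy (base 2) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

threshold_s = 60.0  # the 60-second threshold discussed in the text
all_names = [name for name, _ in first_sightings]
near_gate = [name for name, dt in first_sightings if dt <= threshold_s]

print(Counter(near_gate))
print(entropy_bits(all_names), entropy_bits(near_gate))
```

On this small sample, all five sightings within 60 seconds are 'mine gate', so the filtered entropy drops to zero, while the unfiltered sample is spread over four Plant Names.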
And from here, three observations can be made, pointing to next steps:
O12) The appropriate threshold that minimizes the entropy of the First Plant / Beacon ID distribution while keeping it of good size could be algorithmically optimized (instead of just observing and choosing θ5 = 60 seconds, for this case)
O13) One could even now create a model that gives the "most probable Beacon sighting" (at the PlantID or BeaconID granularity) as a function of time-from-entry, maybe using temporal bins (i.e. Bin1=0...1 min, Bin2=1...10 min, Bin3=10...60 min and so on), with appropriately optimized bin edges (as per a generalization of the method sketched out in O12 above)
O14) In this specific person's case, we are lucky because the First Plant Name ("mine gate") is almost the same as the Entry Gate ID ("MINES GATE"), but this is not necessarily always the case.
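One possible way to make the threshold choice of O12 algorithmic is a simple sweep over candidate thresholds. The sketch below is only one illustrative selection rule, not a method taken from the case study: among thresholds that keep at least a minimum fraction of the episodes, it picks the one with the lowest entropy, breaking ties in favor of larger coverage. The function names, the candidate list and the min_fraction parameter are assumptions.

```python
import math
from collections import Counter

# (first Plant Name, seconds from Gate In) pairs of Person #0, from the listing above
samples = [
    ("Canteen", 426.332), ("mine gate", 5.409), ("Road", 1193.278),
    ("R and D Lab", 159.349), ("mine gate", 32.29), ("mine gate", 6.232),
    ("Road", 1350.091), ("R and D Lab", 209.039), ("Canteen", 435.111),
    ("mine gate", 8.167), ("mine gate", 8.927), ("Canteen", 467.855),
]

def entropy_bits(labels):
    """Shannon entropy (base 2) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def pick_threshold(samples, candidates, min_fraction=0.4):
    """Return the candidate threshold (seconds) with the lowest entropy,
    among those keeping at least min_fraction of the samples."""
    best_key, best_threshold = None, None
    for t in candidates:
        kept = [name for name, dt in samples if dt <= t]
        fraction = len(kept) / len(samples)
        if not kept or fraction < min_fraction:
            continue  # too few observations left to be useful
        key = (entropy_bits(kept), -fraction)  # low entropy first, then high coverage
        if best_key is None or key < best_key:
            best_key, best_threshold = key, t
    return best_threshold

print(pick_threshold(samples, candidates=[10, 60, 300, 600, 1800]))
```

On this sample, a 10-second threshold keeps too few episodes, while 60 seconds keeps five of twelve with zero entropy, so the sweep selects 60 seconds.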
Also, if one views this as an input-output model (input = Entry Gate, output = First Tracker Sighting within a pre-defined time period, with a known histogram of the input distribution and a known histogram of the output distribution), then one could move from Entropies to Cross-Entropies and KL-divergences (Kullback-Leibler Divergence, named after the famous cryptologists and mathematicians S. Kullback and R. Leibler), and thus could also start examining uni-directional and bi-directional predictivity. And of course, one can also use machine learning methods, treating this as a classification problem with (Worker ID, Entry Gate, Time Since Entry) as inputs and the "Most probable PlantID/BeaconID at that time" as the output, thus creating a simple relative-time-dependent model, which is also a nice starting point for more complex models! One can also create similar classification (or alternatively, regression) problems by interchanging the input and output variables: for example, the input could be (Worker ID, Time Since Entry, PlantID at this time) and the output the "Most probable Entry Gate", and so on. We can thus create multiple models that "fill in the gaps" of what we assume to be unknown, using what is known as the input.
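To illustrate the asymmetry that makes KL-divergence useful for examining uni-directional versus bi-directional relations, here is a small sketch. The two histograms compared are the first Plant Names of Person #0 with and without the 60-second filter (counted from the twelve pairs listed earlier), which is just one illustrative choice of distributions. The function name and the additive smoothing constant (needed to avoid division by zero for unseen labels) are assumptions.

```python
import math
from collections import Counter

def kl_divergence_bits(p_counts, q_counts, smoothing=1e-6):
    """KL(P || Q) in bits between two histograms over the union of their labels.

    A small additive smoothing keeps the result finite when a label
    has zero count in one of the histograms.
    """
    labels = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.values()) + smoothing * len(labels)
    q_total = sum(q_counts.values()) + smoothing * len(labels)
    kl = 0.0
    for label in labels:
        p = (p_counts.get(label, 0) + smoothing) / p_total
        q = (q_counts.get(label, 0) + smoothing) / q_total
        kl += p * math.log2(p / q)
    return kl

# Histograms counted from the twelve (Plant Name, seconds) pairs listed earlier
all_sightings = Counter({"Canteen": 3, "mine gate": 5, "Road": 2, "R and D Lab": 2})
within_60s = Counter({"mine gate": 5})

print(kl_divergence_bits(within_60s, all_sightings))
print(kl_divergence_bits(all_sightings, within_60s))
```

The two directions give very different values: the filtered distribution is well "covered" by the unfiltered one, but not the other way around, and the second value depends strongly on the chosen smoothing.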
Q16) HOW CAN ONE CLUSTER WORKERS USING SIMILARITY MATRICES?
Given that we now have similarity matrices, we can apply various clustering methods. One of them is hierarchical clustering, which gives as its output "Dendrograms" with inclusion relations of clusters of various granularities: One needs to just decide upon a vertical "height" threshold, and cut with a straight horizontal line across the dendrogram with y="the chosen threshold", in order to select a specific clustering. Below follow some examples from our case study dataset:
Dendrograms created by Hierarchical Clustering applied to the Similarity Matrices across all workers: Entry & Exit Set-level Similarity (ABOVE) and Cumulative Histogram-level Similarity using Weighted Entry/Exit and Plant Names as per our code in the previous post (BELOW)
Notice the differences: the lower dendrogram provides nicer transitions to groups of increasing size, due to the more informative differentiation provided by the histogram-level similarity matrix, which combines elements through a weighted sum (Entry/Exit and Tracker Plant ID ticks).
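For readers who want to reproduce this kind of analysis, a minimal sketch using SciPy follows. The 5x5 similarity matrix is made up for illustration, and converting similarity to distance as 1 - similarity, the "average" linkage method and the cut height of 0.5 are all assumptions rather than the settings used for the dendrograms above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric similarity matrix for 5 workers (values in 0...1)
S = np.array([
    [1.00, 0.90, 0.80, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.15],
    [0.20, 0.15, 0.20, 1.00, 0.90],
    [0.10, 0.10, 0.15, 0.90, 1.00],
])

# Assumed conversion: distance = 1 - similarity (zero on the diagonal)
D = 1.0 - S

# linkage expects the condensed (upper-triangle) form of the distance matrix
Z = linkage(squareform(D, checks=False), method="average")

# "Cutting" the dendrogram with a horizontal line at height 0.5
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```

The linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree; different cut heights t then correspond to clusterings of different granularity.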
Q17) HOW CAN ONE VISUALIZE THE BEHAVIORAL SIMILARITY OF ALL WORKERS BY EMBEDDING THEM IN A LOWER-DIMENSIONAL SPACE?
Finally, one can employ Multi-Dimensional Scaling (MDS) Techniques to "embed" all workers as points in a lower-dimensional space (for example, a 2-D plane), often after a suitable transformation from the similarity matrix to a distance matrix having certain properties. For example:
Here, you can see such an embedding to 2-D, on the basis of the "Most Frequent Plant Name" visited by each worker (a single-valued summary). Of course, when using MDS, one needs to take into account how much of the variance is really captured by a specific choice of dimensionality for the low-D space. It might or might not be the case that a 2-D embedding preserves adequate structure for input data that is usually much higher-dimensional; this can be verified by looking at the R squared values, and one might then decide to increase the dimensionality of the embedding space appropriately (for example, from 2-D to 3-D). The problem, though, is that 2-D is straightforward for direct viewing, while 3-D usually necessitates an interactive interface that enables the user to rotate, zoom in etc. in order to get a better understanding of the underlying geometrical structure. Moreover, even 3-D might not capture enough of the variance; one then needs to enter the field of how to create meaningful visualizations of points in 4-D, 5-D and higher-dimensional spaces, which is a complex topic, and in these cases MDS is often abandoned and other ways are sought to visualize the relative similarity/distance structure between the points. Last but not least, it is worth mentioning that there exist several varieties of MDS algorithms, each with its own benefits and domains of applicability (metric mMDS and non-metric nMDS, classical PCoA, all the way to GMD).
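As an illustration of the simplest of these variants, the sketch below implements classical MDS (PCoA) with NumPy only. It is not the code behind the figure above: the function name classical_mds, the made-up similarity matrix and the 1 - similarity conversion are assumptions. The returned "explained" fraction is the share of positive eigenvalue mass captured by the chosen dimensions, a diagnostic related to, but not identical with, the R squared values mentioned in the text.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS (PCoA): embed points so that Euclidean distances approximate D.

    Returns (coords, explained), where explained is the fraction of the
    positive eigenvalue mass captured by the first n_components dimensions.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = np.clip(eigvals[:n_components], 0.0, None)
    coords = eigvecs[:, :n_components] * np.sqrt(top)
    total = np.clip(eigvals, 0.0, None).sum()
    explained = float(top.sum() / total) if total > 0 else 0.0
    return coords, explained

# Hypothetical similarity matrix for 5 workers, converted to distances as 1 - S
S = np.array([
    [1.00, 0.90, 0.80, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.15],
    [0.20, 0.15, 0.20, 1.00, 0.90],
    [0.10, 0.10, 0.15, 0.90, 1.00],
])
coords, explained = classical_mds(1.0 - S, n_components=2)
print(coords.shape, explained)
```

If the explained fraction is low, that is a hint that two dimensions do not preserve enough of the structure, and n_components can be increased accordingly.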
RECAP – WHAT WE HAVE SEEN IN THIS ARTICLE
In this article, which is the sixth in the blog post series (links to the other articles can be found in the introduction of this article), we have discussed the following questions:
Q15) IS THERE A USABLE RELATION BETWEEN ENTRY GATES AND FIRST TRACKER SIGHTINGS?
We considered specific examples from our real-world dataset, and saw that, even for workers with very constant entry gates, the first Plant ID after entry can vary greatly, and the first Beacon ID even more. This was verified overall by looking at the scaled entropies of the discrete probability distributions estimated by the histograms of "first sub-plant (i.e. Plant Name) after entry" and "first Beacon ID sighting after entry". However, when the time interval from entry to first tracker sighting was considered, especially at the more coarse-grained "Plant Name" level (instead of the more fine-grained "Beacon ID" level), then a considerable amount of constancy was apparent, if one for example restricts one's view to only those first sightings that happen within a short time from entry.
And following this observation, suggestions for binning the tracker sightings across time were given, as well as for utilizing cross-entropy or KL-divergence as a measure for examining the cases where the entry gate varies too. Furthermore, it was explained how traditional ML classification models could be used to predict one or more of (Entry Gate, First Tracker ID, Worker ID, time from entry) from the rest. Such models could also be explored for the Second, Third, and other Tracker sightings in the sequence as counted from the Entry time onwards, either as per their discrete order in the sequence (1st, 2nd, ...) or as per their time from entry.
Q16) HOW CAN ONE CLUSTER WORKERS USING SIMILARITY MATRICES?
Here, we briefly looked into hierarchical clustering and the dendrograms it produces, as well as some of the semantics of the difference of appearance of two such dendrograms, which differ in which similarity matrix was used (and also in what was contained in the behavioral profiles out of which the similarities were calculated).
Q17) HOW CAN ONE VISUALIZE THE BEHAVIORAL SIMILARITY OF ALL WORKERS BY EMBEDDING THEM IN A LOWER-DIMENSIONAL SPACE?
And finally, we briefly touched upon yet another way to visualize the relative distances / similarities of the behavioral profiles of workers, by viewing each worker as a point in a low-dimensional embedding space (usually 2-D, sometimes also 3-D), after utilizing techniques of Multi-Dimensional Scaling (MDS).
In the next articles of this blog series, we will further extend these ideas, and especially models taking into account more of (sequence, relative timing, and absolute timing), showing how not only traditional classification/regression methods of supervised learning can be employed, but also other more advanced and/or specialized methods. Furthermore, code for various levels of abnormality detection will be explored, by assessing the similarity of a person's behavioral profile over a recent time period (a few hours, one day, a few days) to his previous behavioral profile and/or his peer group's behavioral profiles. This heavily re-uses the code already presented in this series, and builds incrementally on it. Finally, moving from math, algorithms and technical views back to business value, a roadmap for further extensions of the "Connected Workforce" will be given in my future posts.