This is the fourth article in a series of posts on Worker and People Behavioral Analytics and the "Connected Workforce". The series started with the introductory post (link to the post is here), which covered the following fundamental questions:
Q1) WHERE IS THE “CONNECTED WORKFORCE” APPLICABLE?
Q2) WHAT MAKES THE “CONNECTED WORKFORCE” VALUABLE?
Q3) WHAT MAKES THE “CONNECTED WORKFORCE” CHALLENGING?
Q4) BUT THEN, WHAT MAKES THE “CONNECTED WORKFORCE” POSSIBLE?
Q5) WHAT WILL WE SEE IN THE NEXT POSTS OF THE SERIES?
After the introductory article, the second article (link to the post is here) described a specific use case for a client, and covered the questions:
Q6) WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?
Q7) HOW TO EXAMINE THE ENTRY/EXIT LOG (X1) IN FURTHER DETAIL?
The second article covered the first part of the dataset (Entry/Exit Log X1), as well as the first part of our basic processing algorithm (ALGORITHM PART 1) towards shift analytics and overstay estimation using data that is very sparse in space and time (i.e. with large "gaps" in both). As promised there, we will now see how to perform the "investigation" of the third type of episodes, in order to decide whether they should be classified as "probably no violation" vs. "probably violation", by using the Tracker Log (X2) and ALGORITHM PART 2, which will be described below. Kindly note that the third article of the series has also been posted; it addresses the question: "Can data modeling be enhanced by incorporating business knowledge?" (link to the post is here). So, let us start this fourth article by asking:
Q8) HOW TO EXAMINE THE TRACKER LOG (X2) TOWARDS OUR INITIAL GOALS?
Having seen the data schema and some basics (such as the number of rows) for X2 in the answer to the first question of this series' second post (Q6: WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?), let us now look at it in more detail, through two more observations:
O7) In terms of spatial coverage: recall observation O2 made above ("only 30 specific pairs of (latitude, longitude) existed, and one of these was assigned to each beacon, i.e. multiple beacons were sharing the same (latitude, longitude) pair - which of course was not their true location"). Given the roughly 10-meter radius of each beacon, as compared to the several-kilometer sides of the plant, the spatial coverage is very sparse: almost no two beacon regions overlap or "touch one another", and the distances between these 30 points are larger than 100 meters or so. Consequently, any attempt to directly and purposefully use the given (longitude, latitude) coordinates of the beacons, either as they are or converted to a planar projection, is doomed to be of very limited use, if not useless.
However, if one forgets about coordinates and thinks only in terms of beacon IDs, then one can attempt to use the sequences of the observed beacon IDs and their syntactical patterns, with or without temporal information (absolute or relative, i.e. time durations only). For example, assume for simplicity that we have beacons with IDs 1001, 1002, ..., 1100; then it might be the case that, for a specific worker, we very frequently see Beacon 1010 appearing as the first beacon upon his gate entry, followed by Beacon 1021 for a few ticks, and then by 1027 for more ticks. That is, many of the observed daily sequences might have the form:
(1010, 1021, 1021, ... (10-15 times) 1021, 1027, 1027 ... (70-130 times) 1027 ...,)
Such patterns can easily be modelled using techniques borrowed from Natural Language Processing (NLP): for example, the so-called n-grams could come in handy (n-grams capture the probabilities of single beacons, pairs of consecutive beacons, triads of consecutive beacons, all the way to n-tuples of consecutive beacon ticks). One could even create Hidden Markov Models for the production of such sequences, and one could also incorporate temporal distributions (absolute or relative) with appropriate modifications of such models: for example, by tuning the probability p of transition from a state back to itself, the number of repetitions of the state follows a geometric distribution with mean 1/(1-p); but one could also explicitly incorporate state-duration and/or transition-duration probability distributions as part of such models. Before delving further into this in other posts of this series, let us move to a second observation, this time regarding not spatial but temporal coverage:
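As a concrete sketch of the n-gram idea above: the snippet below counts n-grams of beacon IDs and turns the bigram counts into empirical transition frequencies. The daily sequence and the helper name `ngram_counts` are our own illustrative assumptions, shaped like the example sequence in the text.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Count n-grams (tuples of n consecutive beacon IDs) in one daily sequence."""
    return Counter(zip(*(sequence[i:] for i in range(n))))

# Hypothetical daily sequence, shaped like the example in the text:
# one tick of 1010, then 1021 for a while, then 1027 for longer
day = [1010] + [1021] * 12 + [1027] * 90

unigrams = ngram_counts(day, 1)
bigrams = ngram_counts(day, 2)

# Relative bigram frequencies -> empirical beacon-to-beacon transition probabilities
total = sum(bigrams.values())
probs = {pair: c / total for pair, c in bigrams.items()}
```

The same `ngram_counts` call with n = 3, 4, ... yields the higher-order statistics mentioned above.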
O8) In terms of temporal coverage: although our client had originally suggested that we would have one beacon "tick" per minute, accumulated with 7 ticks being packetized and sent every 7 minutes, the reality was much worse. Let us first look at a typical example, and then illustrate the general situation through appropriate statistical analysis:
A typical example of tracker data for a worker over a day, after gate entry and before gate exit. Notably there exist many long periods (up to almost two hours!) with no data at all; see for example the time period from 10am to circa 11:50am
In the graph above, the temporal points at which a beacon "tick" exists are shown as blue asterisks, plotted on or above the x-axis. The y-values correspond to the worker's estimated motion "velocity", or rather apparent walking velocity, computed from the only 30 locations that were given for the beacons and the time differences between the relevant ticks; thus this is only a very rough and inaccurate estimate of velocity, both because of the error introduced by using only 30 (latitude, longitude) points, and because of the irregular time intervals, which are often quite large, as we shall see.
Let's now have a better look at the blue asterisk points, and the time differences between them; and move from this typical example to the whole data set. So, what are the stats of the "breaks" in tracker data?
If we collect, across all people, all complete episodes (i.e. (Gate/Biometric) IN to (Gate/Biometric) OUT), which have <13 hours duration (i.e. certainly not overstay), then:
There exist 500312 tracker point time differences, out of which, 402993 (80%) are almost 1 minute (as had been originally the client's impression!)
But the other 20% is different!
31333 (6%) are less than 30 sec
65986 (14%) are above 90 sec,
(of which 64193 are less than 1 hour, and 1793 are outliers above 1 hour!)
(median = 3 minutes, mean = 10.8 minutes)
But, if we view the above, not in terms of data points, but in terms of TIME PERIODS, we get:
Only 14.6% of the time with >=1 sample/min
Only 23.4% of the time with >=1 sample every 10 min
And …
10.9% of the time with <1 sample per hour!
Thus, the general observation to be made (O8) is crystal clear: although we do have some "dense" periods with one tick per minute, there exist many periods with only one sample every ten minutes, and for more than 10% of the time we have less than one sample per hour! Thus, as we shall see, not only is the spatial coverage of the tracker data very problematic, but its temporal "gaps" also need special treatment: we therefore utilize special "filling" techniques for some gaps, in the second part of our algorithm.
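The "time periods" view above can be computed in a few lines. The sketch below is a hypothetical helper (our own name, `coverage_stats`), assuming a sorted list of at least two tick timestamps for one episode; it reports which fraction of the elapsed time falls in inter-tick gaps of a given length.

```python
from datetime import datetime, timedelta

def coverage_stats(ticks):
    """Fraction of an episode's elapsed time lying in inter-tick gaps of
    various lengths (ticks: sorted datetimes, at least two of them)."""
    gaps = [(b - a).total_seconds() for a, b in zip(ticks, ticks[1:])]
    total = sum(gaps)
    in_gaps_up_to = lambda limit: sum(g for g in gaps if g <= limit) / total
    return {
        "<= 1 min": in_gaps_up_to(60),        # time with >= 1 sample/min
        "<= 10 min": in_gaps_up_to(600),      # time with >= 1 sample every 10 min
        "> 1 hour": sum(g for g in gaps if g > 3600) / total,
    }
```

Run over all complete episodes of all workers, this kind of tally produces exactly the percentages quoted above.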
So, after having made these two new observations (O7 regarding the spatial coverage of the tracker samples, and O8 regarding the temporal gaps), let us now get back to where we were before this section started, i.e. before asking Q8 (HOW TO EXAMINE THE TRACKER LOG (X2) TOWARDS OUR INITIAL GOALS?). What we had observed was that:
4% of episodes have a duration above 24 hours, and thus need further investigation, as to whether they indeed indicate overstay or just arose out of "missing" in/out events:
And what we had promised was that: "...which is exactly what the second part of our algorithm will do, on the basis of the tracker log (X2), as we shall see below!"
So, now the time is ripe to address the question that arises:
Q9) HOW TO UTILIZE THE TRACKER LOG (X2) FOR OUR INITIAL GOALS?
Let us see four typical examples of workers and their data quality, and then illustrate for two of them how we utilize the tracker data, and thus first motivate and then spell out the steps of our ALGORITHM PART 2.
Let's start with a "quite good" case, Person #1:
Episode Start (Entry = Green Asterisk) and Episode End (Exit = Blue Asterisk) events, plotted over Time (x-axis), with height (y-axis) representing Episode Duration (hours)
As you can see, we are covering the one-month period roughly between September 10 and October 10. Since the height represents Episode Duration, the red line, placed at 13 hours (the maximum allowable stay time), has all episodes of type "certain no violation" (i.e. no overstay) below it. Above it, we have episodes whose duration is either below 24 hours (i.e. of type "certain violation") or above 24 hours (i.e. of type "needs further investigation").
For this specific Person #1, there are 27 episodes in total during this month, with 26 of them being of type "certain no violation" (and a median of 11h17m duration), and only one of type "needs further investigation" (with roughly 35 hours duration, placed in a Blue Box in the diagram above). More specifically, this episode for investigation, upon taking its original events (showing the entry-exit log before keeping only IN/OUT), is:
06-Oct-2021
06:55:34 GATE IN
06:58:42 BIOMETRIC IN
(NOTE: No Exit registered on Oct 6th!!!)
07-Oct-2021
06:53:45 GATE IN
06:56:48 BIOMETRIC IN
18:07:18 GATE OUT
But how can we decide whether this is a probable violation or not? Well, let's have a look at the tracker log (X2), corresponding to this episode "under investigation":
Looking into the Tracker Log (X2) corresponding to this Episode under Investigation
By looking at the Tracker Log, even without taking into account anything regarding the locations of the beacons and their IDs, it becomes immediately evident that there are no beacon samples at all during the period between roughly 6pm on October 6 and 7am on the next day, October 7. Thus, it could well be the case that an unregistered "Gate Out" event had taken place around 6pm on the first day; the existing beacon data would then be highly congruent with this hypothesis: these might well have been two normal working shifts, the first on October 6 between roughly 7am-6pm, followed by one on October 7 between 7am-6pm. It might well be that, just because of a single missing "Gate Out" event row (which should have occurred roughly at 6pm on October 6), we are now getting a single episode of 35 hours, instead of two episodes of less than 13 hours each. Of course, we can never be totally certain of this, due to the "spatial" and "temporal" gaps of the tracker stream X2 that we extensively analyzed above.
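The key signal in this case is the long overnight silence in the tracker data. A minimal sketch of that check (our own hypothetical helper, `longest_silence`, assuming a sorted list of tick timestamps; this is only the intuition, not the full algorithm of this post):

```python
from datetime import datetime, timedelta

def longest_silence(ticks):
    """Longest gap between consecutive beacon ticks (ticks: sorted datetimes)."""
    return max((b - a for a, b in zip(ticks, ticks[1:])), default=timedelta(0))

# Hypothetical ticks mimicking the episode above: data until ~6pm on Oct 6,
# then nothing until ~7am on Oct 7
ticks = [datetime(2021, 10, 6, 17, 58), datetime(2021, 10, 6, 18, 0),
         datetime(2021, 10, 7, 7, 0), datetime(2021, 10, 7, 7, 1)]
gap = longest_silence(ticks)  # a ~13-hour silence, congruent with a missed "Gate Out"
```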
But how can we turn the above argumentation into an algorithm, and even better, into a quick algorithm that can massively process data of tens of thousands of workers over periods of many months? This is exactly where we are going; but let us first look at some more cases of persons and tracker logs, namely Persons #4 and #9. Here are their tracker logs:
Tracker Logs (X2) for two workers: Person #4 (Upper Graph) and Person #9 (Lower Graph), showing Episode Start (Entry = Green Asterisk) and Episode End (Exit = Blue Asterisk) events, plotted over Time (x-axis), with height (y-axis) representing Episode Duration (hours)
Notice that:
Person #4 (Upper Graph): has only 2 episodes with <13 hours; all the other 10 long episodes (14...94 hours) need investigation!
Person #9 (Lower Graph): has all three types of episodes: 14 of type "certain no violation" (below the red 13-hour line, with median duration 9.3hr), some of type "certain violation", as well as one very long episode (115 hours!, the big blue box at the beginning) that requires investigation.
So, let us look deeper into Person #4:
Looking into the Tracker Log (X2) corresponding to this Episode under Investigation
His first episode, which requires investigation, shows quite a different picture compared to the episode of Person #1 into which we zoomed in order to investigate (see the two diagrams above in this post). There, it was clear that no "sightings" through beacon trackers were made at times when there should be none, because the worker was supposed to be out of the plant; in that case of Person #1, for example, there were no beacon tracker measurements between 6pm of the first day and 7am of the second. But now, in this new case (Episode #1 of Person #4, shown just above), things are different: there exist weird clusters of beacon sightings; a first one between 10:46am and 11:00am (with one "tick" per minute) on September 10; and then silence, all through the end of the next day, i.e. September 11. However, the day after, on September 12, there is a single sighting at 2:07am, followed by two clusters of beacon data: 7:56am to 8:39am (19 samples), and then 11:52am to 12:22pm (17 samples), and THEN, again a sighting, at 19:42 of the same day. This indeed looks weird: if we assume that this worker's September 12 was a normal shift, then 7:56 to 19:42 are OK, but what about his presence INSIDE the plant at 2am of the same day? Thus, this case smells more like an overstay violation: for example, one possible hypothesis would be that the worker entered (or even stayed from the evening before) overnight, before his normal day shift - with high probability violating the overstay limit.
So, we have seen two cases of episodes of type-3 (i.e. "requires further investigation"), one of which seems to be "probably no violation" and one of which seems to be "probably violation", as examples. But then the question arises: How can we turn the above reasoning into a quick algorithm?
Q10) WHAT ARE THE DESIDERATA, STEPS, AND DETAILS OF ALGORITHM PART 2?
There are three desiderata (desired qualities - a qualitative set of "functional requirements") that we have chosen for our algorithm:
DESIDERATA FOR ALGORITHM PART 2:
D1) BLACK-BOX I/O DESCRIPTION: The algorithm will be fed with the episodes of type-3 ("under investigation") and will decide whether they are classified as "probably no violation" versus "probably violation".
D2) DEALING WITH GAPS: The algorithm should be able to handle the very bad data quality of the tracker data (the temporal "gaps" and the inadequate spatial characteristics)
D3) ALL-SHIFT 24-HOUR CRITERION: The algorithm should work both for day shifts as well as night shifts, or any other shift; and should determine whether in ANY possible 24-hour period within the episode, there seems to be a stay of more than 13 hours (i.e. probable overstay)
In order to create a quick and simple algorithm, an idea from image processing comes in handy: since we need to deal with temporal "gaps", we can use a simplified form of an operation used for "hole filling" in image processing, namely the so-called morphological operation of "Dilation" (usually presented together with its complementary operation, "Erosion").
The simplest form of Dilation is binary dilation; if we have a black and white figure in an image, and we assume that the "body" pixels of a region are white, then we "Dilate" by applying the following rule:
OutputImagePixel(x,y) = 1 (white) if the pixel itself or any of its 8 neighbors in the input image is white, else 0 (black)
Thus, "Dilation" effectively performs "hole covering" in binary images, by "growing" the boundary of regions with a new "enlarged" boundary of thickness 1 pixel each time it is applied. For example:
INPUT IMAGE: OUTPUT IMAGE (AFTER ONE DILATION):
00000000 01111100
00111000 01111100
00101000 01111100
00111000 01111100
00000000 01111100
Note how the single-pixel "hole" in the center of the single connected region of the input image was effectively "covered" through the single dilation. By applying more dilations (or a dilation with a larger "structuring element", as it is called), one can cover larger holes too; but of course, one also gets "enlargement" of the boundaries of regions.
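The dilation example above can be checked directly in code. Here is a minimal sketch in Python (pure lists, no image library), applying the stated rule: a pixel becomes white if it or any of its 8 neighbors is white.

```python
def dilate(img):
    """One binary dilation with a 3x3 structuring element:
    a pixel becomes 1 if it or any of its 8 neighbors is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(img[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w):
                out[y][x] = 1
    return out

# The input image from the example above, with the single-pixel "hole"
grid = [[int(c) for c in row] for row in
        ["00000000", "00111000", "00101000", "00111000", "00000000"]]
dilated = dilate(grid)  # every row becomes "01111100": the hole is covered
```

Applying `dilate` again would grow the region by one more pixel in every direction, covering larger holes at the cost of further "enlargement".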
So, by a simplification of the above technique for one-dimensional "images" (i.e. sequences of one-minute intervals which either contain a beacon "tick" (value = 1) or do not (value = 0)), we can start covering the "gaps" (holes) of our tracker data beacon ticks over time. This is thus the second part of our algorithm, using the above idea as well as conforming to desiderata D1-D3:
ALGORITHM PART 2
(Automatic classification of type-3 Episodes requiring investigation)
For those episodes arising out of Step 5 which have duration >24 hours:
STEP 6) Fill the small "gaps" between the red points (the temporal "holes" in the tracker samples), and thus create continuous "red lines", by using 1-D dilation with a structuring-element "radius" of θ1/2 = 1 hour: through the bilateral contributions of one hour from each of a gap's two endpoints, gaps of up to θ1 = 2 hours are covered; i.e. we chose as the maximum fillable gap length the maximum allowable "short duration leave" time.
STEP 7) Slide a 24-hour window and check whether we ever have more than θ2+θ1 hours of "red" within any 24-hour period: if we never do, classify the episode as "probably no violation", else classify it as "probably violation".
Thus, with the above steps, we can create a reasonable estimate of which episodes "under investigation" should be classified as "probably no violation" and which as "probably violation", in congruence with the reasoning analyzed in the two examples given above:
Classification of Episodes under investigation into type Probably-No-Violation vs. Probably-Violation
Detailed Algorithmic Implementation of Steps 6 and 7:
To implement Steps 6-7 in code, a straightforward way (not the most efficient, but simple, and it works) is the following:
STEP 6a) CREATE BIN ARRAY: Choose a temporal bin "width", for example θ3 = 1 minute. Take the whole duration between the start time of the episode (the "IN" timestamp, keeping only days/hours/minutes and stripping away the seconds) and the end time (the "OUT" timestamp, with seconds stripped). Create n1 bins to cover all the minutes from the minute of the start time (let's call it BIN(1)) to the minute of the end time (BIN(n1)).
STEP 6b) INITIALIZE BIN ARRAY: Then, assign a value of "0" to all of the bins, to initialize them.
STEP 6c) FILL IN BIN ARRAY: Then, iterate across all tracker data for the episode, and assign a value of 1 to the bin each row falls in; for example, if a tracker row occurs at the 35th minute of the bin sequence, then BIN(35)=1.
STEP 6d) DILATE THE "1s" IN THE BIN ARRAY: Then, create a new OUTPUT array initialized to all zeros, and iterate across all temporal bins of the input bin array. For each bin in the input array that is "1", assign "1" to the corresponding bin in the output array, and also to all bins up to time θ1/2 before it and θ1/2 after it, if they exist in the array (i.e. if they are not "out of bounds"). For example, with a bin width of 1 minute and BIN(123) = 1 in the input array, we assign 1's in the output array to all of BIN(123-60), ..., BIN(123+60), in order to dilate by θ1/2 = 1 hour = 60 minutes on each side - as long, of course, as these bins exist in the output array (i.e. as long as 1 ≤ 123-60 and 123+60 ≤ n1).
STEP 7) SLIDE A 24-HOUR WINDOW AND CHECK FOR OVERSTAY: To slide a 24-hour window and check whether we ever have more than θ2+θ1 hours of "red" in any 24-hour period, we calculate the number of bins corresponding to the sliding-window length (WL = 24*60 for bins of 1-minute width), and the number of bins corresponding to the maximum coverage before overstay (MCL = (θ2+θ1)*60 for bins of 1-minute width). Then, we set a flag OVERSTAY = False, start with the sliding window at MIN_BIN = 1 and MAX_BIN = WL, and count how many "1"s we find in the bins BIN(MIN_BIN), ..., BIN(MAX_BIN). If the count > MCL, then OVERSTAY becomes True, else it keeps its previous value. Then we move to the next sliding window, i.e. MIN_BIN = 2 and MAX_BIN = WL+1, and do the same. We continue in this fashion, taking the next windows and setting OVERSTAY if needed, until we reach MAX_BIN = n1. Finally, we examine the value of the flag OVERSTAY: if it is True, we classify the episode as "probably violation", else we classify it as "probably no violation".
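Putting Steps 6a-7 together, here is a minimal sketch in Python, assuming 1-minute bins, a dilation radius of one hour, θ2 = 13 hours and θ1 = 2 hours; the function name `classify_episode` and its signature are our own illustration, not the HANA/SQL implementation used in the project.

```python
from datetime import datetime, timedelta

THETA2_HOURS = 13                         # max allowable stay (theta2)
THETA1_HOURS = 2                          # max fillable gap (theta1); radius = theta1/2
WL = 24 * 60                              # sliding-window length, in 1-minute bins
MCL = (THETA2_HOURS + THETA1_HOURS) * 60  # max "red" bins before flagging overstay

def classify_episode(t_in, t_out, ticks):
    """Steps 6a-7: bin the tracker ticks, dilate, then slide a 24-hour window."""
    start = t_in.replace(second=0, microsecond=0)   # strip seconds (Step 6a)
    end = t_out.replace(second=0, microsecond=0)
    n1 = int((end - start).total_seconds() // 60) + 1

    # Steps 6a-6c: create, initialize and fill the bin array
    bins = [0] * n1
    for t in ticks:
        i = int((t - start).total_seconds() // 60)
        if 0 <= i < n1:
            bins[i] = 1

    # Step 6d: 1-D dilation by 60 bins (theta1/2 = 1 hour) on each side
    radius = 60
    out = [0] * n1
    for i, v in enumerate(bins):
        if v:
            for j in range(max(0, i - radius), min(n1 - 1, i + radius) + 1):
                out[j] = 1

    # Step 7: slide the 24-hour window, updating the "red" count incrementally
    window = min(WL, n1)
    count = sum(out[:window])
    overstay = count > MCL
    for idx in range(window, n1):
        count += out[idx] - out[idx - window]
        overstay = overstay or count > MCL
    return "probably violation" if overstay else "probably no violation"
```

With hypothetical per-minute ticks shaped like the two cases discussed above (two ~11-hour shifts separated by an overnight silence, versus one ~21-hour continuous presence), the first yields "probably no violation" and the second "probably violation".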
RECAP – WHAT WE HAVE SEEN IN THIS ARTICLE
In this article, which is the fourth in the blog post series (links to the other articles can be found in the introduction of this post), we have discussed the following questions:
Q8) HOW TO EXAMINE THE TRACKER LOG (X2) TOWARDS OUR INITIAL GOALS?
Here, we looked at the data quality of the tracker log (spatial coverage, temporal gaps, and so on), and also provided statistics on the situation it is in. Then, we asked:
Q9) HOW TO UTILIZE THE TRACKER LOG (X2) FOR OUR INITIAL GOALS?
Here, we saw four typical examples of workers and the output of our ALGORITHM PART 1 ("Episode Creation") for them, with quality ranging from good cases (where few investigations are required) to really messy ones. Then, we illustrated in more depth how we could utilize the tracker data, and what reasoning can drive the results of our investigations. And thus we reached the point of being ready to spell out the steps of our ALGORITHM PART 2.
Q10) WHAT ARE THE DESIDERATA, STEPS, AND DETAILS OF ALGORITHM PART 2?
And finally, we spelled out three desiderata (D1-D3), and provided the two steps of ALGORITHM PART 2 ("Investigation of type-3 Episodes and classification into 'probably no violation' vs. 'probably violation'"). We then illustrated them for two of the cases of Q9, and finally we spelled them out with substeps and a high level of detail, so that they can be readily implemented.
All of the above steps, of Algorithm Part 1 as well as Part 2, were implemented in our case in HANA, using primarily SQL; the details will be available in the blog post that can be found here. As a result of these steps, a new table was created in HANA (beyond the input tables X1 (Entry/Exit Log) and X2 (Tracker Data)), which we will call the Episode Table (X3).
On the basis of the three tables (X1, X2, and the derived X3), the SAP Analytics Cloud (SAC) was utilized to create a multi-page, adjustable-granularity interactive dashboard (described in the blog post that can be found here). Then, HANA-ML (Machine Learning), the Predictive Analytics Library (PAL), and Python libraries were used for further processing, towards the initial steps of the roadmap for more advanced use cases (beyond worker analytics and overstay estimation). In the next article of this series, which can be found here, we illustrate how they can be used to build worker behavioral profiles and perform behavioral similarity assessments, and from there to support hierarchical clustering as well as abnormality detection, at multiple levels, among other use cases.