After searching the case files related to the last ABAP Detective
post on data cleansing, I decided to branch out and wire some new circuits.
The data stream I had been processing contained temperature measurements from a micro-device, with an unfortunate data acquisition quality issue: the occasional data point was simply wrong. If you looked at the graph, you'd see it immediately. Coding the correction turned out to be a thornier issue than I first suspected.
Temperature charts (L) and (R)
A blip up or down could be found with a SQL statement (at least in PostgreSQL):
select
    clock,
    value,
    lag(value) over (order by clock)
from history;
The lag function shows the value from the previous row. While I didn't uncover (yet) ways to do math on those columns, dropping the content into a temporary table works well enough to see the iterations.
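For what it's worth, wrapping the window function in a subquery is one way to get at that math; a minimal sketch against the same history table:

-- compute the change between consecutive readings
select clock, value, prev, value - prev as delta
from (
    select clock, value,
           lag(value) over (order by clock) as prev
    from history
) t;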
clock | value | lag | to_timestamp
------------+----------+----------+------------------------
1649292931 | 89.7046 | 70.5562 | 2022-04-07 00:55:31+00
1649293051 | 70.4734 | 89.7046 | 2022-04-07 00:57:31+00
1649311051 | 89.6956 | 69.5554 | 2022-04-07 05:57:31+00
1649311171 | -45.5078 | 89.6956 | 2022-04-07 05:59:31+00
1649311291 | 69.4114 | -45.5078 | 2022-04-07 06:01:31+00
Hopefully, you can see the challenge in automating this search. What is a "reasonable" value change, and what is "unreasonable"? A 10-degree (F) drop is possible in a 2-minute timeframe if a blast of cold air comes across the sensor.
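One way to automate the hunt builds on the subquery above; the threshold is strictly a guess and would need tuning against the sensor's real behavior:

-- flag jumps bigger than a (guessed) 15-degree threshold
select clock, value, prev, value - prev as delta
from (
    select clock, value,
           lag(value) over (order by clock) as prev
    from history
) t
where abs(value - prev) > 15;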
However, the "clock cleaning" in the post title relates to a second set of data values to be superimposed on the first. In other words, just seeing the ambient temperature is not enough to tell if energy conservation is working; we also need to see the correlation with energy use.
As it turns out, the local utility has installed a "smart meter" and keeps records at the one-hour detail level. This lets them give consumers hints about whether usage is higher or lower than expected, sooner than a monthly billing cycle would.
Electric use and temperature
The image above shows peak energy use when the temperature drops, due to an electric heat pump working. It's helpful, but I want to show the data my own way. It seems easier to understand trends and ranges when you can control the data retrieval and visualization.
Besides the semi-static charts, the utility offers data downloads in both CSV and XML format, like so:
TYPE,DATE,START TIME,END TIME,USAGE,UNITS,COST,NOTES
Electric usage,2022-03-06,00:00,00:59,0.40,kWh,$0.05,
Electric usage,2022-03-06,01:00,01:59,0.72,kWh,$0.09,
A short Perl script cleaned this information and reformatted it to meet the Zabbix manual data upload requirements. The result looks like this:
param ElectricUsage 1646546340 0.40
param ElectricUsage 1646549940 0.72
param ElectricUsage 1646553540 0.40
param ElectricUsage 1646557140 0.40
The "param" is the entity I created to hold the "zabbix_sender" content, essentially externally generated data rather than agent collected. Links/references below.
In testing the trapper data load, I used simple command lines before moving on to data files.
zabbix_sender -vv -z data.base -p $portno -s "param" -k ElectricUsage -o "1.1"
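For the data-file path, the same tool can take the reformatted file directly; the -T switch tells zabbix_sender that each line carries its own timestamp (the file name is my placeholder):

zabbix_sender -vv -z data.base -p $portno -T -i usage_2022-03.txt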
Here's the issue: the database server runs on GMT (also known as UTC), while the energy use data is in the local time zone. When I first loaded the data and viewed a combined energy & temperature chart, it did not look right.
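The shift is easy to demonstrate straight from psql, using one of the epoch values above; 'America/New_York' stands in for whatever the local zone really is:

select to_timestamp(1646546340) as utc_view,
       to_timestamp(1646546340) at time zone 'America/New_York' as local_view;
-- utc_view:   2022-03-06 05:59:00+00  (as the GMT server displays it)
-- local_view: 2022-03-06 00:59:00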
Before
After I fixed the data load, it looks like this:
After
If you can't see any significant difference between the two images above, don't worry. I didn't either at first, until I zoomed in looking for confirmation that the data load was valid. Because these visuals are laid out differently than the commercial vendor's, the error is not apparent. What I was looking for was higher energy use when the temperature drops, with adjustments due to knowledge of the programmed thermostat (which can be manually changed at any time).
Here's a close-up of part of the first chart.
Before, zoomed
To help understand the content, I added a "20 degrees Celsius" label at the 68 F line. The arrow points to a 7 PM measurement; the temperature readings are correctly aligned with the local wall clock. But the energy use (in orange) does not align correctly.
How do I "know" this is incorrect? In data science, you can't use hunches to validate data. And you need to understand how to detect and correct bias in the measurements. The correlation I expected to see is increased temperature when the "heat goes on". Common sense, right? But we're dealing with not just a flaky sensor, I've introduced calculations (code) to the chain. And creating the graph shown requires making assumptions about the data. My first graphs were not useful because the watts values are "high" and the temperature values are "low," necessitating creating a chart with 2 Y-axes. Fortunately, the Zabbix platform allows this customization.
After pondering the image a bit, I recognized there is a 5-hour shift (or 6, depending on when the daylight saving time change hit). Then it dawned on me that I had loaded the data from a shell session with the "TZ" variable unset, so the time values were computed as GMT, the same as the database.
What was needed to undo this error and get the values at the correct time? I couldn't just load again, because that would either fail (due to duplicate keys, maybe) or allow multiple readings at the same time, with unpredictable side effects. I had to wipe the energy use history and load it once more, with the session time zone set to local time. The good design of the Zabbix data made this easy once I found the "itemid" key used to store the watts.
After reloading in the proper time zone, the result is the main "After" image shown above, with the zoomed-in image here:
After, zoomed
So is this perfect, or did the DST change muck things up? I think this is right, since the energy use goes up first, and then the room is heated. Also, one parameter has hourly buckets, while the other has one- or two-minute intervals.
Conclusions
The data bias I've shown is clear enough: when comparing multiple samples, be sure the X-axis time values are viewed using the same time zone. It's not enough to say the data was collected at a point in time; validate it.
Extrapolating this to other measurements, identify error sources. Know accuracy versus precision, and check your watch. I mean math.
Notes