This post is part of a series: Hana Smart Data Integration - Overview
Another rather common scenario is to retain the history of the data throughout the loads. In short: build a Slow Changing Dimension table of Type 2. The idea is simple: we get changed rows from the source, and whenever something interesting changed, we do not want to overwrite the old record but create a new version of it. As a result, you can see the entire history of the record: how it was created initially, all subsequent changes, and perhaps that it was deleted at the end.
I want to use the RSS feed as a realtime source again, but instead of simply updating each post, I want to see whether corrections were made to the title. Granted, other sources make more sense. A more common example would be to keep the history of a customer in order to correctly assign a sales order to the country the customer was living in at the time the sales order was created.
But since the principle is the same....
The flow is similar to the previous one: all source rows are compared with the target table, but this time the result of the comparison is sent through the History Preserving transform before being loaded. (This time I am using the WebIDE, hence the screenshot looks different. But you can open and edit a .hdbflowgraph file with either tool, and even switch between them.)
In such a Slow Changing Dimension there are many options you can choose from: for example whether to use a surrogate key, whether to maintain valid-from/to dates, whether to keep a current-indicator column, and how deletes should be handled.
Putting all of this into a single transform would make it very hard to understand. Instead, the individual transforms work together, each covering one part of the logic.
(1) In the Filter transform the source data is prepared and aligned to match the target table. In this case the source and target tables have the same columns in name and datatype; the target table merely has four additional fields: the surrogate key (as the target will hold multiple rows per source record over time), the valid-from and valid-to dates, and the current-indicator column.
None of these columns needs to be set in the Filter transform, as they will be dealt with in the Table Comparison and History Preserving transforms. The sole exception is the VALID_FROM column: the downstream transforms need to know the valid-from value, and that is set in this Filter transform. Hence I have added one more column and mapped it to now().
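To make this concrete, the target table could look roughly like the sketch below. The table name RSSFEED_HISTORY and the exact datatypes are my own assumptions for illustration, not taken from the actual flowgraph; what matters are the four extra columns.

    create column table RSSFEED_HISTORY (
        SURROGATE_KEY      bigint primary key,        -- filled from a Hana sequence by the loader
        URI                nvarchar(500),             -- logical key of the source (RSSFEED)
        TITLE              nvarchar(500),
        DESCRIPTION        nvarchar(2000),
        UPDATEDATE         timestamp,
        VALID_FROM         timestamp,                 -- set to now() in the Filter transform
        VALID_TO           timestamp,                 -- maintained by the History Preserving transform
        CURRENT_INDICATOR  nvarchar(1) default 'Y'    -- 'Y' for the latest version, 'N' for older ones
    );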
(2) In the Table Comparison transform we deal with the question of what we want to compare. In this case we have columns like the URI (the primary key of the RSSFEED table), a title, a description and more.
The output of the Table Comparison transform has the structure of the selected compare(!) table, not the input structure. How else would we know, for example, the SURROGATE_KEY of the row to update?
The first and most important settings are the columns to compare and the definition of the logical primary key.
Let's consider the column URI, the primary key of the source. This column should not only be compared, it is the key column for the comparison. In other words, the transform should execute something like "select * from compare_table where URI = Input.URI". Hence the URI column is not only added to the Compare Attributes but also marked as Primary Key=True.
All other columns are listed as Compare Attributes as well, hence the transform will compare the result of the above "select * ..." column by column in order to find a change in the values. If all column values are identical, the transform will discard the row - no need to update a row that is current already.
The interesting part is what happens when a column is not part of the Compare Attributes list. It is simply not compared, but what does that mean in practice? Imagine UPDATEDATE is not part of the list. If the transform finds that the TITLE changed, it will output the row. When the DESCRIPTION changed, it will send the row. But if all other values are the same and only the UPDATEDATE column differs, the transform will treat the row as unchanged. Maybe somebody opened the record and saved it again without changing anything - all values are the same, only UPDATEDATE is different.
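Expressed in SQL terms, the logic per incoming row is roughly the following sketch. The :input_* parameters and the RSSFEED_HISTORY name are illustrative only, this is not what the transform literally executes:

    -- step 1: look up the row to compare against, using the column marked Primary Key = True
    select * from RSSFEED_HISTORY where URI = :input_URI;

    -- step 2: compare column by column; the row is passed on only if at least one
    -- Compare Attribute differs, e.g. TITLE or DESCRIPTION. UPDATEDATE is not in the
    -- list, so a row where only UPDATEDATE changed is discarded as unchanged.
    -- If step 1 finds no row at all, the incoming row is flagged as a new insert.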
For the above to work a few obvious rules apply to the TC transform: the columns listed as Compare Attributes must exist in both the input and the compare table, the logical primary key must be among them, and the output carries the structure of the compare table. These rules sound complex at first sight, but actually all of them are quite natural and what you would do anyhow. Still, it helps to understand the logic in case of an error.
In our use case we have the problem that one URI returns potentially multiple matching rows from the compare table, all the various past versions. We need to specify which row to compare with.
We have two options for that, both are in the first tab. Either we filter the compare table or we specify a generated-key column.
The latter builds on the assumption that a surrogate key is always increasing, hence by specifying one, we tell the transform to compare with the highest one, that is the row that was inserted most recently.
The filter option would be to make the compare table appear as if it held only the latest records, e.g. by adding the filter condition CURRENT_INDICATOR='Y'. Assuming there is only one current record per URI, that would work as well - except for deletes. Deleted rows are not current, hence an incoming row would look as if no such record had ever been loaded before and would be marked as a brand new insert. So be careful when choosing this option.
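As a sketch, and again with the assumed names from above, the two options boil down to something like this:

    -- option 1: generated-key column -> compare against the row with the highest
    -- surrogate key, i.e. the most recently inserted version
    select *
      from RSSFEED_HISTORY c
     where c.URI = :input_URI
       and c.SURROGATE_KEY = (select max(SURROGATE_KEY)
                                from RSSFEED_HISTORY
                               where URI = :input_URI);

    -- option 2: filter on the compare table -> only current versions are visible
    select *
      from RSSFEED_HISTORY c
     where c.URI = :input_URI
       and c.CURRENT_INDICATOR = 'Y';
    -- caveat: a deleted row is no longer current, so a re-appearing URI looks brand new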
The History Preserving transform gets all the information it needs from the Table Comparison transform, that is the new values from the input and the current values of all columns from the compare table, and uses those to produce the output data.
In the simplest case, when neither a valid-from/to date nor a current indicator is used, all the transform does is compare the new values with the current values of all columns listed in the Compare Attributes. If one or more differ, it outputs an insert row with the new values and the table loader will insert that. If all these columns are the same and the input was an update row, it sends an update row. Insert rows are simply inserted.
If the input is a delete row, either a delete row is sent or, if the checkbox at the bottom called "Update Attribute on Deletion" is checked, an update is sent.
In case a valid-from/to column and/or a current flag is used, the transform has to create a second row of type update to modify the current row in the target table. From the TC transform it knows the surrogate key of the compare row and all its current values, hence it can update that row to the very same values except for the valid-to date and the current indicator: these are changed to the new version's valid-from date and from 'Y' to 'N' respectively.
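Put into SQL, the pair of rows produced for a changed record corresponds roughly to the statements below. The parameter names are placeholders; the real statements are generated by the table loader:

    -- update row: close the previous version, addressed via the surrogate key
    -- known from the Table Comparison transform
    update RSSFEED_HISTORY
       set VALID_TO          = :new_VALID_FROM,
           CURRENT_INDICATOR = 'N'
     where SURROGATE_KEY = :old_SURROGATE_KEY;

    -- insert row: the new version becomes the current record
    insert into RSSFEED_HISTORY
           (SURROGATE_KEY, URI, TITLE, DESCRIPTION, UPDATEDATE,
            VALID_FROM, VALID_TO, CURRENT_INDICATOR)
    values (:new_SURROGATE_KEY, :URI, :TITLE, :DESCRIPTION, :UPDATEDATE,
            :new_VALID_FROM, null, 'Y');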
Same thing happens for delete rows in case the "Update Attribute on Deletion" is set.
The table loader should be a regular opcode writer, meaning its Writer Type option is left at the default.
In the Sequence tab the surrogate key column is selected and the name of the Hana sequence to use is specified. This sequence has to exist already.
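Creating such a sequence is plain Hana SQL, for example (the sequence name is assumed):

    create sequence RSSFEED_HISTORY_SEQ start with 1 increment by 1;
    -- for insert rows the loader fills the surrogate key with RSSFEED_HISTORY_SEQ.NEXTVAL;
    -- update/delete rows keep the surrogate key delivered by the Table Comparison transform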
All rows that are inserted will get a new value in the surrogate key column, regardless of the input, but for update/delete rows the surrogate key from the TC transform is used. That is the reason why the HP transform can output two rows: an insert for the new version and an update to alter the latest version currently in the target table. Both, insert and update, will carry a surrogate key, e.g. SURROGATE_KEY=5, and therefore the update statement will look like "update table set current_ind='N', valid_to_date=... where SURROGATE_KEY=5". But for the insert row, the 5 will be replaced by the sequence's next value.
A typical transformation requirement is to retain history so we can query the data as it looked back then, something that is not possible with the source tables themselves - they do not contain history. The Hana time travel feature would be too expensive and is too limited to serve as a generic method. Hence we need a transformation where we can define precisely what should trigger a history record and what should simply be updated.
All of these settings can be made in the individual transforms, and together they allow building all variants of Slow Changing Dimensions of Type 2.