My original plan for this post was to wrap up source- and target-based change data capture in a single instalment. Unfortunately, I seem to have got carried away, so I will cover target-based CDC in a follow-up post.
Once the method of data delivery has been established (push or pull), the next consideration is how change data capture (CDC) can be applied within Data Services. More often than not, when a project requires data to be extracted repeatedly from the same source system using Data Services, CDC will come up for discussion. As the name suggests, CDC is all about identifying what has changed in the source system since the previous extraction and then pulling only the new or modified records in the next run. The net result is that effective CDC enables users to build efficient data processing routines within Data Services, reducing batch windows and overall processing requirements.
With source-based change data capture, record changes are identified at the source and only those records are processed by the Data Services engine. These changes can be pre-identified using techniques such as SAP IDocs or Sybase Replication Server (now integrated into the latest version of Data Services), or identified dynamically using pushdown logic, timestamps and so on.
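To illustrate the timestamp idea outside of Data Services itself, here is a minimal sketch in plain Python against an in-memory SQLite table. The table and column names (`customers`, `last_modified`) are hypothetical; the point is that the `WHERE` clause is pushed down to the source database, so only new or modified rows ever cross the wire:

```python
import sqlite3

def extract_changes(conn, last_run_ts):
    # Pushdown: the WHERE clause is evaluated in the source database,
    # so only rows changed since the last extraction are returned.
    cur = conn.execute(
        "SELECT id, name, last_modified FROM customers "
        "WHERE last_modified > ?",
        (last_run_ts,),
    )
    return cur.fetchall()

# Demo with a hypothetical in-memory source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "2024-01-01"), (2, "Globex", "2024-02-01")],
)

rows = extract_changes(conn, "2024-01-15")
print(rows)  # → [(2, 'Globex', '2024-02-01')]
```

In Data Services the equivalent filter would be written in the source query transform so the engine can push it down to the database; the sketch just shows the delta logic itself.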
There are various methods available with Data Services, and I have used a number of them in many different scenarios. With Data Services there are nearly always two or more ways to achieve the required result, and the choice often comes down to the ETL developer's level of creativity and caffeine on the day. The following are just rules of thumb; they don't work in every scenario, as there are usually many variables to take into consideration, but these are the stages I go through when trying to identify the best method for change data capture.
So the customer has said they want to do CDC. The first questions I always ask are:
What is the source database/application?
How much data is in the source system tables we are extracting from?
How often do we need to extract, and what is the batch window?
In the latest version of Data Services (4.2), the timestamp approach described above can be configured within the Workbench as part of the replication wizard. If the source system is SAP, I would also look at using the CDC functions available within the content extractors, as this is preconfigured functionality and doesn't require any of the above job configuration steps.
If data needs to be extracted at various points throughout the day, the pushdown method could still be an option. However, I am always very cautious about impacting performance on the source systems, and if there is a chance that performance degradation will affect business transactions, I opt for a different approach where possible.
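When extractions run repeatedly throughout the day, each run needs a persisted watermark so it pulls only the delta since the previous run. A minimal sketch of that bookkeeping, using a hypothetical `cdc_control` table (the table and job names are my own, not Data Services objects):

```python
import sqlite3

# Hypothetical control table holding the last successful run per job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdc_control (job TEXT PRIMARY KEY, last_run TEXT)")
conn.execute(
    "INSERT INTO cdc_control VALUES ('customers_job', '1970-01-01T00:00:00')"
)

def get_watermark(conn, job):
    # Read the timestamp of the last successful extraction for this job.
    return conn.execute(
        "SELECT last_run FROM cdc_control WHERE job = ?", (job,)
    ).fetchone()[0]

def set_watermark(conn, job, ts):
    # Advance the watermark only after the load has committed, so a
    # failed run simply re-extracts the same window on the next attempt.
    conn.execute("UPDATE cdc_control SET last_run = ? WHERE job = ?", (ts, job))
    conn.commit()

start = get_watermark(conn, "customers_job")
# ... extract rows changed since `start` and load the target here ...
set_watermark(conn, "customers_job", "2024-03-01T06:00:00")
```

The same pattern is typically implemented in Data Services with a script at the start of the job reading the control value into a global variable, and a script at the end updating it.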
When working on site with customer data, the source systems and infrastructure will nearly always determine which methods of change data capture can be put to best advantage with Data Services. Given free rein, however, the best and most efficient approach is without doubt to carry out the change identification process as close to the source system as possible, though that isn't always an available option. In the next blog post I will dig a little deeper into the target-based change data capture method.