As I said in the previous article, I have the feeling we are moving backwards in the Data Integration space. In this post I would like to look at workflows.
If multiple steps must be executed in a particular order, this sequence is specified explicitly. That is standard in all workflow engines.
In this example, the customer data must be loaded first; because the subsequent dataflow requires current customer data, it is executed afterwards.
Other objects might be loaded in parallel because they are fully independent. In BODS, those would be two unconnected objects.
This approach is called Data Orchestration. The user is calling the shots, specifying the exact order explicitly. For simple scenarios this is doable, but it has two issues: modifying the execution order when new objects are added later, and varying execution times.
Modifying the execution order
In a sequential case, inserting a new dependency at the proper place is simple. If C must be executed after A but before D, it is placed somewhere between B and D. That means removing the existing dependency line, adding the C workflow and drawing the two dependency lines connecting it.
What if the starting point is different and D was executed in parallel to A and B? Now C has to be inserted at the proper place, so that A is executed before C and D after it. That requires some redesign: more lines, and a graph that is less obvious to understand.
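To make the redesign tangible, here is a minimal sketch of the same change expressed as a dependency graph in Python. It is purely illustrative - the names A to D come from the example above, and BODS of course expresses this graphically:

```python
# Illustrative only: each workflow maps to the set of workflows it must wait for.

# Sequential case: A -> B -> D. Inserting C between B and D means removing
# the B -> D line and drawing B -> C plus C -> D instead.
sequential = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"C"}}

# Parallel case: D used to run independently of A and B. Requiring
# "A before C, C before D" forces a redesign - D is no longer independent,
# and the graph becomes harder to read.
parallel = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"C"}}
```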
These examples get more and more complex with every additional workflow. Real-life projects have hundreds of workflows that must be executed in perfect harmony.
Varying execution times
The majority of the workflows can be executed in parallel, for sure. This saves time: why should data from two totally different systems be read one after the other instead of in parallel, assuming there is no logical reason against it? But this can have the downside that far too many connections are opened for a source system to cope with, the source system gets overwhelmed by the data requests, and so on.
In Data Orchestration, the user has to resolve those issues manually, usually by forming arbitrary groups of parallel streams. This is tough work, especially if one workflow must be moved to a completely different place. Most importantly, the result will never be perfect: execution times differ from day to day due to different amounts of data, and moving one task to a different place impacts all others.
It would be way better if the system did that automatically - in the form of a Data Choreography: the workflows organise themselves and adapt to the current situation. And the best solution is to combine both.
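To illustrate the principle, here is a small Python sketch - my assumption of how a choreography works in general, not BODS code - that derives the execution order purely from the declared dependencies:

```python
# Illustrative sketch: from the declared dependencies alone, the engine derives
# which workflows can run together - the user never draws a full execution sequence.
def execution_waves(dependencies):
    """dependencies: name -> set of prerequisite names.
    Returns successive groups of workflows that can run in parallel."""
    done, waves = set(), []
    while len(done) < len(dependencies):
        ready = {name for name, deps in dependencies.items()
                 if name not in done and deps <= done}
        if not ready:
            raise ValueError("circular dependency detected")
        waves.append(ready)
        done |= ready
    return waves

# The parallel example from above: B and C only need A, D needs C.
print(execution_waves({"A": set(), "B": {"A"}, "C": {"A"}, "D": {"C"}}))
# -> [{'A'}, {'B', 'C'}, {'D'}]
```

A real engine would go further and start the next workflow the moment any running one finishes instead of waiting for a whole wave - more on that below.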
User interface
The challenge is building a user interface that allows both: specifying the logical dependencies where they matter, and letting the system decide what to execute when in all other cases. It took the BODS team quite a while, but I like the end result we came up with a lot.
In this example the CostCenter requires the ControllingArea to be loaded upfront.
The first simplification for the user is the ability to group tasks into workflows. The job "load the entire data warehouse" then consists of workflows like "load sales orders" and "load shipping docs", which either have a dependency (a connection line) between them or not. The "load sales orders" workflow in turn consists of "load all sales order master tables" and "load the orders themselves".
This creates an easy-to-understand execution graph, as we no longer need to understand all dependencies globally. The local dependencies are enough; each workflow shows only the dependencies valid for itself. Example: I must add a new master table for the sales orders? I go to the workflow "load all sales order master tables" and add it there.
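A minimal Python sketch of that nesting idea - the class and the field names are made up for illustration, BODS keeps this as graphical metadata:

```python
# Illustrative data structure: a workflow contains dataflows or other workflows,
# and dependencies are declared only locally, inside the owning workflow.
from dataclasses import dataclass, field

@dataclass
class Workflow:
    name: str
    children: list = field(default_factory=list)    # dataflows or sub-workflows
    local_deps: dict = field(default_factory=dict)  # child name -> prerequisite names

sales_orders = Workflow(
    "load sales orders",
    children=[Workflow("load all sales order master tables"),
              Workflow("load the orders themselves")],
    local_deps={"load the orders themselves": {"load all sales order master tables"}},
)
warehouse = Workflow(
    "load the entire data warehouse",
    children=[sales_orders, Workflow("load shipping docs")],  # unconnected -> run in parallel
)
```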
What happens now is that some workflows are used as a dependency in multiple other workflows. The customer master data will be a requirement for sales orders, billings, shipments and many more fact tables. In BODS, these workflows are added as dependencies in all places - so it is still obvious where they are needed - but the customer master workflow itself is marked as "execute only once".
This ensures the workflow is executed exactly once, before its dependents, regardless of which dependent is started first and when.
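The principle can be sketched like this in Python - an illustration of the "execute only once" idea, not the actual implementation:

```python
import threading

class RunOnce:
    """Wraps a shared workflow so that many dependents can require it,
    but it is executed a single time only."""
    def __init__(self, func):
        self.func = func
        self._lock = threading.Lock()
        self._executed = False

    def ensure_executed(self):
        with self._lock:             # dependents may arrive concurrently
            if not self._executed:
                self.func()
                self._executed = True

customer_master = RunOnce(lambda: print("loading customer master ..."))
# Sales orders, billings and shipments each declare the dependency ...
for dependent in ("sales orders", "billings", "shipments"):
    customer_master.ensure_executed()   # ... but the load happens only once
```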
A workflow also has the option to catch specific errors in case they should not halt the execution.
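As a small sketch, with a made-up error class standing in for whatever the workflow is allowed to tolerate:

```python
# Illustrative only: a specific, expected error is caught and logged,
# everything else still halts the execution as usual.
class SourceTemporarilyUnavailable(Exception):
    pass

def run_tolerant(workflow_body):
    try:
        workflow_body()
    except SourceTemporarilyUnavailable as err:
        print(f"caught, job continues: {err}")
```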
Finally, there is an upper limit on how many dataflows a BODS job executes in parallel - the default is eight - and whenever a dataflow finishes, the next one is started immediately.
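A sketch of that scheduling behaviour - assumed semantics expressed in Python with a hypothetical run_dataflows helper, not how the engine is actually built:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dataflows(dataflows, dependencies, max_parallel=8):
    """dataflows: name -> callable; dependencies: name -> prerequisite names."""
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while len(done) < len(dataflows):
            # Submit every dataflow whose prerequisites are finished,
            # but never more than max_parallel at a time.
            for name, func in dataflows.items():
                if (name not in done and name not in running
                        and dependencies.get(name, set()) <= done
                        and len(running) < max_parallel):
                    running[name] = pool.submit(func)
            if not running:
                raise RuntimeError("circular or unsatisfiable dependency")
            # The moment any dataflow finishes, loop around and refill the pool.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, fut in running.items() if fut in finished]:
                running.pop(name).result()   # re-raise errors from the dataflow
                done.add(name)
```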
All advantages, no disadvantages
The end user has all the advantages now:
- Orchestration for cases where there is a business reason to execute one object before another.
- Automatic choreography of all non-dependent objects at any workflow level.
- Dataflows are executed as early as possible.
- Dependencies are easy to create, understand and maintain.
- Using the same workflow in multiple places still causes only a single execution.
- No more than the configured upper limit of dataflows is executed in parallel at any time, even if more would be logically possible.
Some things never made it into the product. For example, if a dataflow performs a lookup on a table that is loaded by another dataflow, chances are high the lookup data should be current - an implicit dependency. The engine could resolve that automatically as well. On the other hand, what to do in the rare cases where this is not wanted? I am no big fan of the bypass feature. Still...
Frankly I don't know a single other workflow engine that is equally powerful and easy to use.
Questions when evaluating any Data Integration tool that claims to have workflow support:
- Does it allow controlling sequential and parallel execution?
- Demonstrate how complex it is to modify dependencies.
- How does it ensure that 100 tasks are not all executed in parallel when that is not feasible for technical reasons?
- How is recovery supported, i.e. restarting a job at the point it failed without re-executing the previously successful tasks?
- How is error handling supported, i.e. doing something else in case of a particular type of error?
- Show me the workflow network for a typical data warehouse project where 100 dimension tables and five fact tables are loaded, some have dependencies, some don't.