I am not sure if it is just my impression, but I have the feeling we are going backwards in the Data Integration market. Many new tools have gained traction recently but are worse than SAP Data Services (aka "BODS") in almost every aspect. Or maybe BODS was just ahead of its time?
Time to look into some little-known details of BODS...
My reasoning is around three areas:
- Throughput: A Data Integration tool's most important property is speed. It will deal with massive amounts of data, and every piece of optimization must be utilized. Or what would you say about a tool that takes more than 24 hours to load the daily delta?
- Developer efficiency: Of course, everything can be programmed or scripted manually. But this takes time, and future changes will be very expensive to make. It should be the opposite: one glance and the developer knows what is being done, and modifications are simple.
- Enterprise readiness: This covers documentation of the created logic, error handling, automation, multi-user development, moving to production, and so on.
Source/Target connectivity
The amount of development effort that went into Data Services is immense. With regard to performance, the first question is how a tool interacts with the connected systems. For a database it might sound simple: a JDBC/ODBC driver is used to execute SQL commands. That surely is good enough, isn't it?
When reading data, two systems are involved, and both should be utilized while reading. Hence the bare minimum is using array fetches with every supported database. But even that might not be enough, because all data is still streamed through a single connection. The result is a single utilized CPU core on a large 128-CPU server.
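To make the array fetch point concrete, here is a minimal sketch using the Python DB-API (pyodbc is assumed as the driver; the DSN, table name and batch size are made up for illustration, not BODS defaults):

```python
import pyodbc  # assumption: any DB-API compliant driver behaves the same way

def process(rows):
    """Placeholder for handing the batch to the next transformation step."""
    pass

conn = pyodbc.connect("DSN=SourceDB")          # hypothetical DSN
cursor = conn.cursor()
cursor.arraysize = 10_000                      # rows per fetch round trip (illustrative value)
cursor.execute("SELECT * FROM sales_orders")   # hypothetical table

while True:
    batch = cursor.fetchmany()                 # returns up to cursor.arraysize rows per call
    if not batch:
        break
    process(batch)                             # hand the whole batch onward, not row by row
```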
Next: For large amounts of data, reading partitioned and in parallel is a must. If the table is large, it will very likely be partitioned already. If so, the DI tool should read each partition via its own connection (see e.g. Hana select with partition_restriction on how to do that when reading Hana tables).
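And a sketch of the partitioned, parallel read: one dedicated connection per partition, each doing array fetches. This assumes an ODBC DSN to the Hana system and the partition-restricted select syntax referenced above; the table name and partition count are made up:

```python
from concurrent.futures import ThreadPoolExecutor
import pyodbc  # assumption: an ODBC DSN pointing to the Hana system exists

PARTITIONS = range(1, 21)  # the 20 partitions of the example table

def read_partition(part_no: int) -> int:
    """Open a dedicated connection and read exactly one partition."""
    conn = pyodbc.connect("DSN=HanaSource")   # hypothetical DSN
    cursor = conn.cursor()
    cursor.arraysize = 10_000                 # array fetch as above
    # Partition-restricted select as described in the linked article;
    # the exact syntax depends on the Hana version.
    cursor.execute(f"SELECT * FROM sales_orders PARTITION ({part_no})")
    rows = 0
    while batch := cursor.fetchmany():
        rows += len(batch)                    # real code would stream the batch onward
    conn.close()
    return rows

# One connection per partition, all reading in parallel.
with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    total = sum(pool.map(read_partition, PARTITIONS))
print(f"read {total} rows from {len(PARTITIONS)} partitions")
```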
Now the entire system can be utilized until the network bandwidth will be the limiting factor.
For loading, the same concepts are available – partitioning, array loads, multiple connections – but databases often have a quick import method as well. This locks the entire table partition and imports data directly into the database file. Loading this way has limitations, but if data only needs to be inserted, it is much faster than SQL commands. ODBC/JDBC does not provide any APIs for that; it is vendor specific. BODS does not use ODBC/JDBC; it rather implemented the low-level APIs of every single database vendor. For Oracle this is the OCI library and the direct path API.
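For the loading side, a minimal array insert sketch via the DB-API's executemany (again with made-up DSN, table and data). Note that a true direct path load is not reachable through this interface at all, which is exactly the point made above:

```python
import pyodbc  # assumption: a DB-API driver for the target database

conn = pyodbc.connect("DSN=TargetDB")   # hypothetical DSN
cursor = conn.cursor()
cursor.fast_executemany = True          # pyodbc option to ship parameter arrays in bulk

rows = [(1, "A"), (2, "B"), (3, "C")]   # in reality: the next batch from the reader
# One round trip for the whole batch instead of one INSERT per row.
cursor.executemany("INSERT INTO target_table (id, code) VALUES (?, ?)", rows)
conn.commit()
# A vendor-specific direct path load (e.g. Oracle OCI direct path) bypasses
# SQL processing entirely and has no ODBC/JDBC equivalent.
```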
RFI questions: For which databases do you support ...
- array fetches?
- reading data partitioned in parallel automatically?
- array insert/update/delete/upsert?
- vendor-specific, faster import APIs like Oracle's direct path API?
- partitioned parallel loading (with faster import methods)?
If the answer is no to any of these questions, the performance penalty will be huge. Estimate a factor of 10 minimum. In other words, what takes 1 hour with BODS takes 10 hours with the other tool.
ETL/ELT
The first tools in the DI space were file based. A batch program extracts the data from the source into a file. Then the file is read and transformed into the desired target structure, often via multiple steps, and finally loaded. Looking at the processing pattern where reading is a synonym for data extraction, it would look like:
EL-ETL-ETL-…-ETL-EL
The data is read from the source and written to a file. The file is read, transformed, and a new file is written. This happens multiple times; finally the last file is read and loaded.
That the performance of such an approach would be unbearable is obvious. And the solution is simple as well: no files in between; instead the data is streamed through the transformation steps. This streaming has another positive side effect: the data is read, transformed and loaded simultaneously. The consequence is that if the read takes one hour, the transformation one hour and the loading one hour as well, the overall execution time is closer to one hour, not the three hours it would take if one step were performed after the other.
This would be a pattern like:
E-TT...TT-L
Or in short
ETL.
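To illustrate why streaming overlaps the three phases, here is a toy Python pipeline built from generators. The source and target are simulated; the point is only that a row is transformed and loaded while the next one is still being read:

```python
import time

def extract():
    """Simulated source: yields rows one by one instead of writing a file."""
    for i in range(5):
        time.sleep(0.1)                 # pretend reading takes time
        yield {"id": i, "name": f"row {i}"}

def transform(rows):
    """Streams rows through: each row is transformed as soon as it arrives."""
    for row in rows:
        row["name"] = row["name"].upper()
        yield row

def load(rows):
    """Simulated target: consumes rows as they are produced upstream."""
    for row in rows:
        print("loaded", row)

# E-T-L as one pipeline: extraction, transformation and loading overlap,
# so the runtime is governed by the slowest step, not by the sum of all steps.
load(transform(extract()))
```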
Database vendors and trivial Data Integration tools called that old-school technology and coined the term ELT instead. Data is extracted from the source, loaded into a database, and transformed there using SQL statements. The main argument is that a database is much more powerful in terms of hardware and SQL optimization, compared to any ETL tool.
But looking at the read/write steps, the ELT term is misleading. It should rather be:
EL-ETL
Data is read from the source and loaded into a target database table. Then it is read using a SQL statement that does all transformations and loads the result into the final table. The second part would be an insert..select.. statement. But we all know the limitations of SQL. For even mid-complex transformations, multiple select statements must be executed or nested, each creating a temporary table inside the database. Leaving marketing aside, a typical transformation inside the database is almost always an
EL-ETL-ETL-…-ETL
This is more like the first version where processing steps happen one after the other and with lots of read/write operations.
Which one is better from a practical point of view? Both can come up with examples where the one is better and the other sucks and vice versa. That is a good foundation for a nice, fierce and long-lasting religious war.
The objective answer is clear, however: Both use cases have their sweet spot.
If the data is in the target database already and the transformation can be done with a single, non-nested SQL statement, then executing this single SQL statement ELT-like would be faster for sure. In comparison, an ETL tool would read the data out of the database and apply a trivial transformation, just to load it back into the same database. That does not make sense performance-wise.
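For this simple case, the ELT approach boils down to sending one insert..select.. to the target database. A sketch, with made-up connection and table names:

```python
import pyodbc  # assumption: source and final table live in the same target database

conn = pyodbc.connect("DSN=TargetDB")   # hypothetical DSN
cursor = conn.cursor()

# The entire transformation is a single, non-nested SQL statement,
# so the database does all the work and no data ever leaves it.
cursor.execute("""
    INSERT INTO sales_summary (customer_id, order_count, total_amount)
    SELECT customer_id, COUNT(*), SUM(amount)
    FROM   sales_orders
    GROUP  BY customer_id
""")
conn.commit()
```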
With complex transformations, particularly ones that require multiple SQL statements to achieve the transformation, the ETL approach will be faster most of the time.
And in the normal scenario, a mixture of both will be the fastest. Perform as many actions as possible in the source and target database and do everything the database is not well suited for in the ETL tool.
Example: Two tables should be joined, and the result compared with the current target table in another database. Only the few changed rows should be applied. It would be ideal to push the join down into the source database; that is what the database is very good at, and it has all the supporting structures like indexes on the tables. The table comparison, on the other hand, would require two to three SQL statements, whereas an ETL tool does it in a single transform.
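A rough sketch of that mixed approach, assuming DB-API connections to both databases and deliberately simplifying the table comparison to an in-memory dictionary; all names are made up:

```python
import pyodbc  # assumption: ODBC DSNs exist for both databases

src_conn = pyodbc.connect("DSN=SourceDB")   # hypothetical DSNs and tables
tgt_conn = pyodbc.connect("DSN=TargetDB")
src, tgt = src_conn.cursor(), tgt_conn.cursor()

# 1. Push the join down into the source database; that is what it is good at.
src.execute("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM   orders o JOIN customers c ON c.customer_id = o.customer_id
""")

# 2. Read the current target state once (simplified: assumed to fit in memory).
tgt.execute("SELECT order_id, amount, customer_name FROM target_orders")
current = {order_id: (amount, name) for order_id, amount, name in tgt.fetchall()}

# 3. Table comparison in the engine: apply only new or changed rows.
for order_id, amount, name in src.fetchall():
    if order_id not in current:
        tgt.execute("INSERT INTO target_orders (order_id, amount, customer_name) VALUES (?, ?, ?)",
                    order_id, amount, name)
    elif current[order_id] != (amount, name):
        tgt.execute("UPDATE target_orders SET amount = ?, customer_name = ? WHERE order_id = ?",
                    amount, name, order_id)
tgt_conn.commit()
```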
What does BODS do? Exactly that. It has an optimizer built in that tries to push down as many (sub-) operations as possible to the source or target database and does the rest in the engine. For simple cases, it generates an insert..select.. statement and behaves like an ELT tool, as this would be the fastest.
In other words,
BODS is both an ELT and an ETL tool,
with both approaches incorporated, plus an optimizer that decides which to use and the capability of breaking a data flow apart into an ELT part and an ETL part. The BODS team invested a lot of money to give its users the best of both worlds.
In-engine throughput
For the in-engine performance the goal is to get into millions of rows per second. Obviously, every overhead required for getting the rows from one transformation to the next will impact performance significantly. Whenever the rows cross system boundaries there must be a conversion of some kind.
The first conversion happens when reading the data. It comes in one format, e.g. as a SQL result set, and must be converted into the engine format: a SQL decimal into a C++ data structure, a SQL date into a C++ date. This involves CPU power for the actual conversion and memory access time for the copy operation. These are nanosecond operations, but at millions of rows per second this trivial process alone takes 0.1% of the execution time. If these conversions happen between every single transformation and there are 100 transformation steps, the engine's data copy operations alone are already responsible for 10% of the execution time, without any of the actual transformation CPU time yet.
When implementing a streaming engine, the naïve approach is to take the row from the input, allocate memory for the output row and then apply all mappings. The first field is a 1:1 copy, so its content is copied to the output. The second column has a mapping of substring(col2,1,1) to extract the first character, and that result is copied to the output. The consequence, at one million rows per second, 100 transformations and a 1 kB row size, is that 100 GB/s of memory bandwidth is required. That is … a lot.
Data movement without mem-copy
The better approach is to move the row pointer only to the output. The input row is at address X, output is the same row. For 1:1 mappings the input and output is the same, so nothing must be done. In the case where column #2 is changed, the input row value is overwritten with the new value. With this approach no memory copy is needed!
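A toy comparison of the two approaches in Python. Python dictionaries are of course not C++ row buffers, but the difference between copying the whole row per step and mutating it in place is the same idea:

```python
from copy import copy

row = {"col1": "unchanged", "col2": "abcdef", "col3": 42}   # one incoming row

# Naive engine: every transformation allocates a new output row and copies
# all columns, even the ones mapped 1:1.
def transform_with_copy(in_row):
    out_row = copy(in_row)               # full row copy -> memory bandwidth
    out_row["col2"] = in_row["col2"][0]  # substring(col2,1,1)
    return out_row

# Pointer-style engine: the output row *is* the input row; only the changed
# column is written, 1:1 mappings cost nothing.
def transform_in_place(row):
    row["col2"] = row["col2"][0]
    return row                           # same object, no copy

assert transform_with_copy(dict(row)) == transform_in_place(dict(row))
```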
This approach works if the input and output have the same columns. Here the optimizer comes into play again. Let’s assume the output has an additional column. What would happen if this additional column were moved to the very beginning, the point where the data is read from the source? It does not exist in the source and therefore its value will be null. But the column itself does not hurt. The same argument applies if the output has fewer columns: yes, the output row at memory address X has an additional column available, but it is simply not used when loading into the target database. With this trick the condition “input row and output row must have the same columns” can be kept true for many more transformation types. This can be seen when looking at the optimized execution plan BODS generates.
The next thing the optimizer does is combining and splitting transformations. A trivial example would be three transformation steps that each modify columns only. All these transformations can be collapsed into a single transformation step.
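A sketch of what collapsing column-only transformations means, with plain Python functions standing in for transform steps:

```python
# Three column-only transformation steps ...
def step1(row):
    row["name"] = row["name"].strip()
    return row

def step2(row):
    row["name"] = row["name"].upper()
    return row

def step3(row):
    row["country"] = row.get("country") or "unknown"
    return row

# ... which the optimizer can collapse into a single step, so the row is
# touched once instead of three times.
def collapsed(row):
    row["name"] = row["name"].strip().upper()
    row["country"] = row.get("country") or "unknown"
    return row

sample = {"name": "  alice ", "country": None}
assert collapsed(dict(sample)) == step3(step2(step1(dict(sample))))
```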
The consequence is extreme in-engine performance in BODS. The CPU power and memory bandwidth spent on data movement is effectively zero, leaving all resources for the actual transformations.
Different engines
The worst thing a tool can do is use different engines with different row formats and/or transfer the data across the network. For example, a tool that has a structured engine, a Python operator, a JavaScript operator, and so on. In that case the structured data must be serialized into a JSON string, just so the JavaScript engine can deserialize it into its internal format. Then the JavaScript logic applies its transformations and the resulting JSON structure is serialized into a JSON string again and handed over to the Python operator so it can deserialize it. Lots of back-and-forth conversions for every single row!
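To make that serialization cost tangible, a small timing sketch of the JSON round trip every row would go through at each such operator boundary (the row and the figures are made up; results vary by machine):

```python
import json
import time

row = {"order_id": 1, "amount": 99.90, "customer": "ACME", "items": list(range(20))}
N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    # structured engine -> JavaScript operator -> Python operator:
    # serialize, deserialize, serialize, deserialize ... per row.
    wire = json.dumps(row)
    json.loads(wire)
elapsed = time.perf_counter() - start
print(f"{N} rows, one serialize/deserialize hop each: {elapsed:.1f} s")
# Every additional operator boundary adds another round trip like this,
# and a network hop on top makes it far worse.
```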
If there is a network involved between two operators, things go from bad to really, really horrible. Do you happen to know a Data Integration tool that is built that way? If yes, compare it with BODS!
Funny enough, even Data Services has the option to split a data flow into multiple processes. It is a very effective way of slowing down a dataflow.
These performance numbers are not easy to get from tool vendors, and they would be hard to compare. It is better to evaluate them in the context of a Proof of Concept, where a more complex scenario is implemented, preferably with SAP Data Services as one explicit contender. As I have shown before, the argument that tool xyz has an ETL engine included does not prove it is fast. If it has this engine only, not multiple engines and operators and programming languages, then maybe. But software where every single row is moved from one container to the next must use memory copy operations in the best case, and the network in the normal case.
Developer efficiency
Another development I have noticed recently is the move back to scripting. Transformations are implemented as JavaScript code, entire dataflows are written in Scala or Python and executed as Spark jobs. On the surface these are equivalent to BODS dataflows. I mean, a BODS dataflow is nothing else than a sequence of CPU instructions as well. The difference is the usability or, as I called it in this chapter, the developer efficiency.
As a first example, let’s use partitioned reading. Suppose the source table has 20 partitions and each should be read via its own connection. Building a dataflow that does that is doable; it would contain the same source table plus a partition filter 20 times. Not nice, but okay. The trouble really starts when the number of partitions changes. The database might even add more partitions automatically, e.g. if a table is partitioned by year. The dataflow must still work, must read all data including the new partition, and should preferably utilize all 21 partitions now. The first requirement, still reading all data, is doable if the developer pays attention. The second one would require manually adding the 21st partition and moving the code to test and production.
In the case of BODS, the user adds the source table just once - a great time saver already - and ticks the flag to read the table partitioned. Then the dataflow optimizer creates the modified plan with n readers, each reading the data from one partition. Here BODS could do a little better in aligning the partitions with the degree of parallelism, but this simple change never made it into the product, for whatever reason.
This kind of feature, where the user just sets a flag and the optimizer changes the dataflow into something that does logically the same but is optimized for performance, pays off in the actual project, when more complex requirements are to be implemented. Things like a lookup of a row in another table that returns not all matching rows but only the most recent one. Or loading tables designed for slowly changing dimensions. One click in BODS versus a lot of code the user has to maintain.
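For illustration, here is roughly what "return only the most recent matching row" looks like when hand-coded; in BODS this is a single setting on the lookup. The lookup data and column names are made up and assumed to fit in memory:

```python
from datetime import date

# Hand-coded version of a lookup that must return only the most recent match.
lookup_rows = [
    {"customer_id": 1, "segment": "silver", "change_date": date(2020, 1, 1)},
    {"customer_id": 1, "segment": "gold",   "change_date": date(2021, 6, 1)},
    {"customer_id": 2, "segment": "bronze", "change_date": date(2019, 3, 1)},
]

def most_recent_segment(customer_id):
    matches = [r for r in lookup_rows if r["customer_id"] == customer_id]
    if not matches:
        return None
    return max(matches, key=lambda r: r["change_date"])["segment"]

assert most_recent_segment(1) == "gold"
# ... plus the caching, null handling and pushdown considerations that the
# one-click lookup in the tool would take care of.
```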
My point here is that having the option to implement something fancy is one thing. But if the costs are too high, it cannot be done because of the amount of time it would take. Hence it is very important how easily such advanced functionality can be utilized.
Enterprise readiness
The final point is all about the surroundings of a Data Integration project. Error handling must be implemented, for example. In my experience, the dataflows and workflows in a BODS project are designed once and then they simply run. The product is very stable by itself, and if something fails, the next time the job is executed it recovers from the previous error automatically and e.g. processes yesterday’s and today’s data together.
Other features like AutoDocumentation allow the user to create hundreds of pages of project documentation from the information the dataflows contain: what has been designed, but also the comments added at various places.
Is everything perfect then? Of course not; some points I have listed above already. The biggest issue I have is the development speed. All work seems to go into SAP Data Intelligence, and the “What’s new in BODS” document for each release is just basic maintenance. Even SAP’s plan to integrate BODS into SAP Data Intelligence does not help, as it will only allow using the BODS reader and a query transform. Almost none of the above-mentioned advantages of BODS can be utilized then. Lots of memory copy operations are required, all data must be converted from the BODS in-engine format to one of the many SAP Data Intelligence internal formats, each row is transported across the network multiple times, etc.
So I probably have my answer to “Or maybe BODS was just ahead of its time?”: Yes, it was ahead of its time and still benefits from those concepts.