2024 Apr 22 9:16 AM
Hello,
I have a question about what happens when an error occurs during a data load in SAP Datasphere.
If we use a data flow to load data into a DSP Local Table, is all of the processed data handled within a single database transaction that is rolled back on failure? Or is each batch treated as an individual transaction? I don’t see any option or flag to control this (as there is in SAP Data Services), and I’m not sure what the default behavior is.
Regards,
2024 Apr 23 3:56 PM - edited 2024 Apr 23 3:57 PM
Hi @XaviPolo,
I see there is no reply yet! It's a very good question! Maybe you could run an experiment and analyze the results: intentionally introduce some bad data to make a batch fail, and see what happens. 🙂
Looking forward to seeing more replies to your question as we try to make the Data and Analytics group more active! 😉
Regards,
Tuncay
2024 Apr 23 9:27 PM
Hi XaviPolo,
Excellent question.
I haven't been able to find documentation on the topic.
Since all activity involving loading data into database tables falls under two-phase commit and is timestamped, I would assume everything has to succeed or a full or point-in-time rollback occurs. (But you know what they say when you assume.)
2024 Apr 24 9:19 AM
@TuncayKaraca ok, I'll do the tests 🙂
TL;DR: Commits are done for each block defined by the fetch size.
The test consists of loading a CSV with 30,000 rows via a Data Flow and provoking an error in the middle of the process, to see what we end up with in the destination table.
The CSV has 30,000 rows, with two fields:
The source is an SFTP connection, with the CSV configured as:
The destination is a DSP local table with:
The execution of the data flow ends with an error.
In the destination table, data has been saved: specifically, the IDs ranging from 1 to 19,000.
So I draw two conclusions:
- The most important is that the DF execution is not “all or nothing”: commits have been made during the process.
- The second is that, since there are two fetch controls (one on the source and one on the destination), the block of rows 19,001-20,000 was not loaded even though it contained no error. Most likely, when the block of 1,000 rows containing the error was read, the exception was thrown before the destination side had processed the 19,001-20,000 block, which is why we do not have it (see the toy sketch below).
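To illustrate that hypothesis (this is a toy model of my assumption, not Datasphere internals), imagine a reader/writer pipeline where the writer always lags one block behind the reader; when the reader hits the bad block, the run aborts before the writer has flushed the block it was still holding, so a perfectly valid block is lost:

```python
committed = []  # blocks already written and committed to the target table

def run_pipeline(blocks, committed):
    buffered = None  # block handed to the writer but not yet flushed
    for block in blocks:
        if any(row == "BAD" for row in block):
            raise ValueError("bad row found while reading")  # reader fails here
        if buffered is not None:
            committed.append(buffered)  # writer commits the previous block
        buffered = block                # current block waits in the writer's buffer
    if buffered is not None:
        committed.append(buffered)      # flush the last block on a clean run

# 21 blocks of 1,000 rows each, with the bad row in block 21:
blocks = [list(range(n * 1000 + 1, n * 1000 + 1001)) for n in range(20)]
blocks.append(["BAD"])
try:
    run_pipeline(blocks, committed)
except ValueError:
    pass

print(len(committed))  # 19 -> rows 1-19,000 saved; block 19,001-20,000 lost
```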
I did a second test, increasing the rows to 110,000, placing the bad row at 105,001, and removing the fetch size control on both sides, to check whether the theoretical maximum limit of the fetch size is 100,000.
But it saved rows 1 to 104,000, implying that the fetch size used was not 100,000. I don't know whether the value it takes depends on the source, the combination of source and target, etc.
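For anyone who wants to reproduce the tests, here is a minimal sketch of how such an input file could be generated (the file name and column names are just examples; the trick is to put a non-numeric value into the ID column so the Data Flow fails on that batch):

```python
import csv

def write_test_csv(path, total_rows, bad_row=None):
    """Write a CSV with ID and TEXT columns and, optionally, one invalid row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "TEXT"])
        for i in range(1, total_rows + 1):
            # A non-numeric ID makes the mapping to the integer target column fail.
            row_id = "NOT_A_NUMBER" if i == bad_row else i
            writer.writerow([row_id, f"row {i}"])

# Second test: 110,000 rows with the bad row at 105,001.
write_test_csv("test_110k.csv", 110_000, bad_row=105_001)
```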
When doing ETLs with delta loads that are managed manually based on, e.g., timestamps, there is a risk that rows from a failed load are left in the target table.
This can make the calculation of the next timestamp to load incorrect, so these rows should be deleted before that calculation (a cleanup sketch follows below).
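For illustration, this is the kind of pre-step I mean, assuming a connection via the hdbcli Python client and hypothetical object names (a target table SALES_DELTA with a LOAD_TS column and a run-log table ETL_RUN_LOG); it is not a built-in Datasphere feature:

```python
from hdbcli import dbapi  # SAP HANA Python client

conn = dbapi.connect(address="<host>", port=443, user="<user>", password="<password>")
cur = conn.cursor()

# 1. Remove rows left behind by runs marked as failed in the (hypothetical) run log,
#    so they cannot inflate the delta watermark.
cur.execute("""
    DELETE FROM SALES_DELTA
    WHERE LOAD_TS IN (SELECT LOAD_TS FROM ETL_RUN_LOG WHERE STATUS = 'FAILED')
""")
conn.commit()

# 2. Only then derive the timestamp for the next delta load.
cur.execute("SELECT MAX(LOAD_TS) FROM SALES_DELTA")
next_watermark = cur.fetchone()[0]

cur.close()
conn.close()
```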
Regards,
2024 Apr 24 2:53 PM
@XaviPolo What a great experiment! 😉 And what great scientific proof of how DFs behave with fetch size controls in error situations. ✔️ Very valuable findings!
It's interesting that, unexpectedly, even the previous block (in your example, 19,001-20,000) was not loaded. This proves the value of the experiment! I think you've explained the reason very well.
There still seems to be a question about the theoretical maximum limit of the fetch size being 100,000. Where has it been mentioned that the maximum limit of fetch size is 100,000? Is it an official limit?
The risk associated with delta loads is so critical. That's why monitoring and managing ETLs is so important.
Thank you
2024 Apr 25 11:48 AM
The form has a validation for "Batch Size" that shows the message "Enter a number between 0 and 100000." if the criterion is not met.
2024 Apr 25 3:18 PM
Hello XaviPolo,
Just a follow-up on data loads into Datasphere with data flows.
The ETL processes for loading data into Datasphere should be based on the SAP HANA data transfer model, which includes the persistence layer (responsible for the durability and atomicity of transactions) and the transaction manager (responsible for coordinating all database transactions and keeping track of all running and closed transactions). When a transaction completes or fails, the transaction manager notifies the relevant data engine to take the necessary actions.
Your thoughts?
2024 Apr 25 3:30 PM
I think the question is what the system considers a transaction: the whole process or each individual batch.
From the tests, it is clear to me that the whole process is not managed as a single DB transaction.
Actually, this is fine in some cases, since with large loads keeping everything in a single transaction consumes a lot of resources, but for smaller loads it can be important to use a single transaction to preserve data integrity.
And the problem gets more complicated because in DSP we don't have a failure path, for example to do cleanup after a failure.
Maybe a HANA-to-HANA data load behaves differently, but I am mainly concerned about loading data from external systems, where there is a greater chance of failure.
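One workaround I can imagine (not a built-in DSP option, just a sketch with the hdbcli client and hypothetical table names STG_ORDERS / ORDERS) is to let the data flow write into a staging table and then move the rows into the final table in a single SQL transaction:

```python
from hdbcli import dbapi  # SAP HANA Python client

conn = dbapi.connect(address="<host>", port=443, user="<user>", password="<password>")
conn.setautocommit(False)  # keep the copy and the cleanup in one transaction
cur = conn.cursor()
try:
    cur.execute("INSERT INTO ORDERS SELECT * FROM STG_ORDERS")
    cur.execute("DELETE FROM STG_ORDERS")
    conn.commit()          # either both statements take effect or neither does
except dbapi.Error:
    conn.rollback()        # the final table stays untouched on any failure
    raise
finally:
    cur.close()
    conn.close()
```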
Regards,
2024 Apr 25 3:20 PM
I got you, thanks! So with the fetch size control removed on both sides, it saved rows 1 to 104,000 even though the error was at 105,001. Can we say there may be a default fetch size of 1,000?
2024 Apr 25 3:38 PM
I would say yes