Data and Analytics Blog Posts
XaviPolo
Active Contributor

You are using DSP as a Data Fabric to load multiple data sources, everything works great, and you can enjoy life (you know, teleworking from some heavenly place)... until something goes wrong. You receive an email saying that a Data Flow has failed. That SFTP server doesn't seem to be as reliable as it should be, so now what?

Leave your Virgin Mojito and open DSP.

Test definition

The test consists of loading a CSV with 30,000 rows through a Data Flow and provoking an error in the middle of the process, to see what ends up in the destination table.

The CSV has 30,000 rows, with two fields (a sketch for generating such a file follows the list):

  • ID: consecutive numbers.
  • VAL: a string with the value '1' in every row except row 20,101, which has 'AAA'.
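
A minimal sketch of how such a test file could be generated (the file name and delimiter are my own choices, not part of the original setup):

```python
# Generate the 30,000-row test CSV: consecutive IDs, VAL = '1' everywhere
# except row 20,101, which gets 'AAA' and will not fit in a String(1) column.
import csv

with open("test_30000.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "VAL"])
    for i in range(1, 30001):
        writer.writerow([i, "AAA" if i == 20101 else "1"])
```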

The source is SFTP, with the CSV configured as follows:

  • ID as Int32.
  • VAL as String(1).
  • Control Fetch Size enabled, set to 1,000 rows.
  • Fail Run on String Truncate = On. This will make row 20,101 raise an error for exceeding the defined String(1) size.

The destination is a DSP local table with:

  • ID: Int32
  • VAL: String(1)
  • Control Fetch Size enabled, set to 1,000 rows.

 

Test Results

The execution of the Data Flow ends with an error.

In the destination table, data has been saved: specifically, the IDs from 1 to 19,000.
So I draw two conclusions:
- The most important one: the Data Flow execution is not an “all or nothing” operation; commits are made during the process.
- The second: since there are two fetch controls, one at the source and one at the destination, the block of rows 19,001-20,000 was not loaded even though it contained no error. Most likely, the exception was thrown while the block of 1,000 rows containing the bad value was being read, and the writing side had not yet processed the 19,001-20,000 block, which is why it is missing (a small simulation of this behavior follows).
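
This is not DSP's internal engine, of course, but the observed result can be reproduced with a small, self-contained simulation: a reader that validates blocks of 1,000 rows and fails on the bad value, and a writer that commits one block behind the reader. The batch size and the one-block lag are assumptions inferred from the result, not documented DSP behavior.

```python
# Toy simulation of the observed commit behavior (not DSP internals):
# the reader validates blocks of 1,000 rows, the writer commits each block
# to the target, and the writer runs one block behind the reader.
import sqlite3

rows = [(i, "AAA" if i == 20101 else "1") for i in range(1, 30001)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, val TEXT)")

def read_blocks(data, size=1000):
    for start in range(0, len(data), size):
        block = data[start:start + size]
        for _, val in block:
            if len(val) > 1:                    # Fail Run on String Truncate
                raise ValueError("string truncation")
        yield block

pending = None                                   # block read but not yet written
try:
    for block in read_blocks(rows):
        if pending is not None:
            con.executemany("INSERT INTO target VALUES (?, ?)", pending)
            con.commit()                         # one commit per fetch block
        pending = block
except ValueError:
    pass                                         # the run fails; committed blocks stay

print(con.execute("SELECT MAX(id) FROM target").fetchone())  # -> (19000,)
```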

I ran a second test, increasing the rows to 110,000, placing the bad row at 105,001, and removing the fetch size control on both sides, to check whether the theoretical maximum fetch size is 100,000.
But it saved rows 1 to 104,000... implying that the fetch size was not 100,000. I don't know whether the value it takes depends on the source, the combination of settings, etc.
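
If you want to see exactly what a failed run left behind, a quick query against the target table is enough. A minimal sketch, assuming you have a database user with read access to the space schema in the underlying SAP HANA Cloud database; host, credentials, schema and table names are placeholders.

```python
# Check which IDs a failed run actually committed to the target table.
# Connection details, schema and table names are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(
    address="<your-dsp-host>",
    port=443,
    user="<db_user>",
    password="<password>",
    encrypt=True,
)
cur = conn.cursor()
cur.execute('SELECT MIN("ID"), MAX("ID"), COUNT(*) FROM "MY_SPACE"."TEST_TARGET"')
print(cur.fetchone())            # e.g. (1, 19000, 19000)
cur.close()
conn.close()
```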

 

What this implies

When doing ETLs with delta loads that are managed manually based on, for example, timestamps, there is a risk that rows from a failed load are left in the target table when an error occurs.
This can make the calculation of the next timestamp to load incorrect, so those rows should be deleted before it is calculated.
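
A minimal sketch of that cleanup idea, using an in-memory SQLite table as a stand-in for the target: the table and column names are made up for illustration, and the "last good watermark" is assumed to come from a log of the last successful run.

```python
# Why leftover rows from a failed load break a MAX(timestamp) watermark,
# and how deleting them first fixes it. Table/column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, src_ts TEXT)")

# Rows from the last successful load, plus one row left behind by a failed run.
con.executemany("INSERT INTO target VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-01-01"), (3, "2024-01-02")])

last_good_watermark = "2024-01-01"   # taken from the log of the last successful run

# Naive next watermark: the orphaned row pushes it forward, so the next delta
# load would silently skip everything between 2024-01-01 and 2024-01-02.
print(con.execute("SELECT MAX(src_ts) FROM target").fetchone())   # ('2024-01-02',)

# Delete the rows left behind by the failed load before recalculating.
con.execute("DELETE FROM target WHERE src_ts > ?", (last_good_watermark,))
con.commit()
print(con.execute("SELECT MAX(src_ts) FROM target").fetchone())   # ('2024-01-01',)
```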

1 Comment
TuncayKaraca
Active Contributor

Hi @XaviPolo,

It's a good idea to turn the Data Load Integrity Control discussion into a blog post. It's a very useful experiment.

Regards,
Tuncay