
DOX - Process of training an extended Invoice schema - country specific


G'day all,

I've read some questions along the same lines with great responses from Tomasz Janasz but I'm still unclear.

I want to extend the standard Invoice Schema to include a few Australian standard fields (BSB/BPay Biller/BPay Ref).

The out of the box pretrained model seems to do pretty well on a wide variety of Invoices thrown at it. No training required as the name would suggest. Happy days.

But to add in the required additional fields I need to create an extension schema, and then that needs training.

A few questions:

What are the field level "Default Extractors" and how can I make one? How can I instruct the general logic on how to look for a BSB or BillerCode without disrupting all the good default logic of "Pretrained" (knowing that the Template logic still sits on top of Pretrained)?

Given that I want to use "Detect Automatically" (because the alternative is unthinkable), how do I get my desired result given that "BSB" and "BPayBiller/BPayRef" are mutually exclusive? No invoice would ever have both, but all Australian Invoices will have one or the other.
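To make the "one or the other" rule concrete, here is a minimal Python sketch of the check I'd want applied to whatever comes back from the extraction. The `fields` dict and the `payment_method` helper are hypothetical names for illustration only; this is not the service's actual response format, just the field names from my extended schema.

```python
def payment_method(fields):
    """Return "BSB" or "BPAY" for a dict of extracted values.

    `fields` is an assumed shape: field name -> extracted string (or missing).
    """
    bsb = fields.get("BSB")
    biller = fields.get("BPayBiller")
    ref = fields.get("BPayRef")

    # The two payment styles are mutually exclusive on a real invoice.
    if bsb and (biller or ref):
        raise ValueError("both BSB and BPay fields were extracted")
    if bsb:
        return "BSB"
    # BPay only makes sense when the biller code and reference come together.
    if biller and ref:
        return "BPAY"
    raise ValueError("neither a BSB nor a complete BPay pair was extracted")
```

A check like this would sit after extraction, so a document that violates the rule gets flagged for manual review rather than posted.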

My concern is that if I start adding Templates for some vendors to train for these fields, I'll have to have a Template per vendor. This seems crazy given how well it works with no training to begin with. There is very little "I" in the "AI" if that's the case.

I do wonder if a custom field level extractor is the answer, but I haven't found any doco on it, or any real understanding from other answers I've seen.

Thanks for any and all advice!

Have fun,


Accepted Solutions (0)

Answers (1)



Hi Mark,

The current Extension feature is layout-based (Template). In other words, to extract additional custom fields you need to annotate each distinct layout to show the system where the key-value pair is located on the document.

Default Extractors correspond to the fields that are managed by SAP (the ones that we train the models on). This means that you have the option to combine your own manual annotation with the power of the underlying ML algorithms.

"Detect Automatically" helps to find/detect the correct template that corresponds with the layout of the document that you upload to the service. The value is that you need to create a Template and annotate the corresponding (custom) fields only once, and the extraction is then done automatically by the service.

I hope that you will find some "I" in Document Information Extraction 🙂

Feel free to reach out if you have further questions.

Tomasz (PM)


G'day Tomasz,

Thank you so much for such a quick response and sharing the above I 🙂

I did raise another question before I'd seen your answer, but it's in a similar vein.

Having to create and train Templates seems to detract from the power of the ML? We go from not having to do any training, and then once we add a couple of fields we're on the hook for loading templates that cover every Vendor Invoice layout.

As there is the option of 1000 Templates per Schema, what is the expectation? Is it really a Template per Vendor? If not, how do you know what's 'close enough' in terms of layout to include different but similar documents in an existing Template?

If it's really 1 vendor per Template, why the need for 50 documents per Template? They'll be identical? That doesn't sound right, and so that leaves me with the 'close enough' question above.

My concern is that once we're up to live volumes of Vendors there will be too many Templates, and the "Detect Automatically" logic will crumble like a DB optimiser overloaded by too many Indexes.

If we are unable to extend at field level by defining our own "Default Extractor", is it potentially better to leave the schema alone and manually scrape the few additional fields with the "Get Text After (PDF)" step, asking specifically for each field? In that case it doesn't matter where the field appears in the document. So for e.g.

In this case we can still rely on what appears so far to be pretty reliable - the Pre-Trained option.
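To show what I mean by scraping the text after a label, here is a rough Python sketch of the same idea. It is not the "Get Text After (PDF)" step itself; it assumes the PDF text is already available as a plain string, and the label spellings and digit patterns in `PATTERNS` are my own guesses at how these fields typically appear, not anything defined by the service.

```python
import re

# Assumed label spellings and value shapes; adjust to the invoices you actually see.
PATTERNS = {
    "BSB": r"\bBSB\b\D{0,15}?(\d{3})[-\s]?(\d{3})\b",
    "BPayBiller": r"\bBiller\s*Code\b\D{0,15}?(\d{3,10})\b",
    "BPayRef": r"\bRef(?:erence)?\b\D{0,15}?(\d[\d ]*\d)",
}


def extract_payment_fields(text):
    """Return whichever payment fields can be found after their labels."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Join the capture groups and drop spaces so "062 000" and "062-000" agree.
            found[name] = "".join(match.groups()).replace(" ", "")
    return found


eft_invoice = "Pay by EFT\nBSB: 062-000 Account: 12345678"
bpay_invoice = "BPAY\nBiller Code: 12345\nRef: 9876 5432 10"
print(extract_payment_fields(eft_invoice))
print(extract_payment_fields(bpay_invoice))
```

The point is that the lookup is keyed on the label rather than on where the label sits on the page, so one rule covers every vendor layout. The obvious weakness is that a loose label like "Ref" can also match an unrelated "Invoice Ref", so the patterns would need tightening against real documents.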

I'm very nervous about this technology, which is great for a demo, but I'm not sure it will live up to Productive deployment without a lot of effort.

Any advice is greatly appreciated. There is a lot of magic happening and I get nervous when I don't understand how to most effectively influence that magic.

Have fun,