
Challenges with Field Extraction in Indian Invoices Using Document Information Extraction

koushik1
Explorer

Hello,

I am currently working on extracting fields from over 100 different templates of Indian invoices using the SAP Document Information Extraction service. However, I am facing challenges in obtaining accurate results across multiple scenarios.

Here are the two approaches I have tried:

Scenario 1:

  • I modified the default SAP invoice schema and added custom fields such as CGST, SGST, IGST (header fields), and HSN, batch number and so on (line items).
  • I created templates for each vendor/different invoice template and trained them using 2-3 sample invoices.

Issues Encountered:

  • Extra line items below the table are being captured as there is no way to declare a footer for the table.
  • Custom fields like CGST rate, CGST value, and HSN are not extracted properly in most cases, especially when there is even a slight difference between the trained template and the input invoice.
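One possible workaround for the extra rows is to trim them after extraction. The sketch below is not a feature of the Document Information Extraction service; it assumes the line items have already been converted into plain dictionaries, and the keys `description` and `quantity` as well as the footer keywords are hypothetical and would need adjusting to the actual invoices.

```python
import re

# Hypothetical footer markers; adjust to the wording used on your invoices.
FOOTER_PATTERN = re.compile(
    r"\b(sub\s*total|total|cgst|sgst|igst|round\s*off|amount in words)\b",
    re.IGNORECASE,
)


def drop_footer_rows(line_items):
    """Keep rows up to the first one that looks like a table footer."""
    kept = []
    for row in line_items:
        description = str(row.get("description", ""))
        has_quantity = row.get("quantity") not in (None, "")
        # A footer row is assumed to mention a footer keyword and carry no quantity.
        if FOOTER_PATTERN.search(description) and not has_quantity:
            break  # everything from here on is treated as footer content
        kept.append(row)
    return kept
```

A row such as "Total station tripod" with a quantity is kept, because the keyword alone is not enough to stop the scan.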

Scenario 2:

  • I chose a custom schema and created a list of fields that I needed, such as vendor name, vendor GST, PO number, invoice number and date, CGST, SGST, freight, packaging & forwarding, with the setup type as “auto” for all fields.

A few examples of the issues encountered:

  • CGST is getting captured even when IGST is present on the invoice.
  • The vendor GST number is being captured as the receiver GST number.
  • Delivery note numbers are being captured as invoice numbers, and material codes are being captured as HSN codes in line items.
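Mismatches like the ones listed above can at least be flagged after extraction with a few rule-based checks. The following is a minimal sketch, not part of the SAP service: it assumes the extracted values are available as a plain dictionary, and the keys `vendor_gstin`, `cgst`, `sgst`, `igst`, `line_items` and `hsn` are hypothetical. The GSTIN pattern checks only the 15-character layout, not the checksum, and the HSN check assumes codes of 4, 6 or 8 digits.

```python
import re

# Layout only: 2-digit state code, 10-character PAN, entity code, 'Z', check character.
GSTIN_PATTERN = re.compile(r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")
# Assumption: HSN codes on the invoices are 4, 6 or 8 digits long.
HSN_PATTERN = re.compile(r"^(\d{4}|\d{6}|\d{8})$")


def check_invoice(fields, own_gstin):
    """Return a list of warnings for one extracted invoice."""
    warnings = []

    vendor = str(fields.get("vendor_gstin", "")).strip().upper()
    if not GSTIN_PATTERN.match(vendor):
        warnings.append("vendor GSTIN has an unexpected format")
    elif vendor == own_gstin.strip().upper():
        warnings.append("vendor GSTIN equals the receiver GSTIN")

    # CGST/SGST and IGST are normally not charged on the same invoice.
    if fields.get("igst") and (fields.get("cgst") or fields.get("sgst")):
        warnings.append("IGST and CGST/SGST were both captured")

    for index, item in enumerate(fields.get("line_items", []), start=1):
        hsn = str(item.get("hsn", "")).strip()
        if not HSN_PATTERN.match(hsn):
            warnings.append(f"line {index}: HSN '{hsn}' is not 4, 6 or 8 digits")

    return warnings
```

Invoices that produce warnings could then be routed to manual review instead of being posted automatically.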

When I create templates for each invoice under the custom schema and train them, the results are even worse than with the untrained setup.

Could anyone help with tackling these issues or suggest a better approach to achieve more accurate results in field extraction?

Thank you!

Best regards,
Koushik

Sankara1
Product and Topic Expert
Hello Koushik, thank you for reaching out to us. Since this involves customized invoices, we recommend training individual templates to get the best extraction results.
yogesh-001
Discoverer

You need to create an individual document template for each different format, and don't use the schema you created. Try to use OCR with AI capability.

Alternatively, for the documents that fail, try extracting the data with the built-in OCR (PDF) in the pretrained model and compare the results.

I think you will get better results that way.
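
The comparison suggested above could be done field by field once both results are available. This is only a sketch under the assumption that each extraction run has been flattened into a dictionary of field name to value; the function name and the field names in the example are hypothetical.

```python
def compare_extractions(template_result, pretrained_result):
    """Return {field: (template_value, pretrained_value)} for fields that differ."""
    differences = {}
    # Look at every field that appears in either result.
    for name in sorted(set(template_result) | set(pretrained_result)):
        template_value = template_result.get(name)
        pretrained_value = pretrained_result.get(name)
        if template_value != pretrained_value:
            differences[name] = (template_value, pretrained_value)
    return differences
```

Fields missing from one of the two runs show up with `None` on that side, which makes gaps as visible as conflicting values.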