Invoices are essential, but validating them manually can be tedious and error-prone. In this blog post, we’ll walk through a Python-based workflow that leverages SAP’s Document Information Extraction (DOX) API to extract invoice data and intelligently compare it against field records. The goal? Validate invoices and highlight discrepancies—all in one go.
Supported document types for the SAP Document Information Extraction Service include supplier invoice, purchase order, payment advice, and business cards. More to come in the roadmap.
This blog post builds on an excellent introductory guide by Joni, which covers getting started with the SAP DOX API and testing it via Swagger. Here, we add Python logic that does the heavy lifting of authenticating against SAP BTP, submitting an invoice, retrieving the extracted fields, and checking them against a list of mandatory fields.
TL;DR: Automate invoice data extraction and validation using SAP DOX API and Python. OCR your PDFs, validate against expected fields with confidence scoring, and flag mismatches—all in one streamlined Python pipeline.
This guide is for: SAP BTP developers, automation engineers, finance teams exploring DOX integration, or anyone dealing with high-volume invoice processing.
Step 1: Authentication with SAP BTP
First, we authenticate against SAP BTP using OAuth 2.0 to retrieve an access token.
Update <TENANT> to your SAP BTP tenant name.
The Bearer Token is required for all subsequent API calls to the DOX endpoint.
import requests
import json

client_id = ""
client_secret = ""
oauth_url = "https://<TENANT>.authentication.ap10.hana.ondemand.com/oauth/token?grant_type=client_credentials"

# Get access token
def get_access_token():
    token_response = requests.post(
        oauth_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        headers={
            "Content-Type": "application/x-www-form-urlencoded"
        }
    )
    if token_response.status_code == 200:
        return token_response.json().get("access_token")
    else:
        raise Exception(f"Failed to get token: {token_response.status_code} - {token_response.text}")
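Access tokens expire, so a batch that processes many invoices may want to reuse one token until shortly before it expires rather than request a new one for every call. The following is a minimal sketch, assuming the token response carries the standard OAuth 2.0 `expires_in` field (in seconds). The `TokenCache` class and its `fetch`, `margin_seconds`, and `clock` parameters are hypothetical names introduced for illustration; they are not part of the SAP API.

```python
import time


class TokenCache:
    """Reuse an access token until shortly before it expires (illustrative helper)."""

    def __init__(self, fetch, margin_seconds=60, clock=time.monotonic):
        # fetch: callable returning the parsed token response as a dict,
        # e.g. {"access_token": "...", "expires_in": 3600} (assumed shape).
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload["access_token"]
            # Refresh a little early so a token does not expire mid-request.
            self._expires_at = now + payload.get("expires_in", 0) - self._margin
        return self._token
```

In this post's code, `fetch` could wrap the `requests.post` call from `get_access_token()` and return `token_response.json()` instead of only the token string.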
Step 2: Submitting an Invoice PDF for Document Information Extraction
Next, we upload a sample invoice PDF (in this case, Thai_Invoice_with_TH_date.pdf) to the DOX API. The POST /document/jobs endpoint submits a document (e.g., a PDF, PNG, or JPEG file) for asynchronous processing. If the submission succeeds, the API returns a job ID, which we use to track the job and fetch its results. The same ID works with other endpoints, including GET /document/jobs/{id} and DELETE /document/jobs/{id}.
import requests

url = "https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs"
access_token = get_access_token()
file_path = "Thai_Invoice_with_TH_date.pdf"
headers = {
    "Authorization": f"Bearer {access_token}",
    "accept": "application/json"
}

# Extraction options, sent as a JSON form part
options = '''{
    "schemaName": "SAP_invoice_schema",
    "clientId": "c_00",
    "documentType": "invoice",
    "receivedDate": "2020-02-17",
    "enrichment": {
        "sender": {
            "top": 5,
            "type": "businessEntity",
            "subtype": "supplier"
        }
    }
}'''

# simulate -F flags in curl; the with block closes the file after the upload
with open(file_path, 'rb') as pdf_file:
    files = {
        'file': (file_path, pdf_file, 'application/pdf'),
        'options': (None, options, 'application/json')
    }
    response = requests.post(url, headers=headers, files=files)

if response.status_code in [200, 201]:
    print("Document submitted successfully!")
    print("ID:", response.json().get("id"))
else:
    print("Submission failed")
    print("Status Code:", response.status_code)
    print("Response:", response.text)
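The `options` form part is a JSON string, and writing it by hand invites quoting mistakes. As a sketch, it could instead be built from a Python dictionary with `json.dumps`. The `build_options` helper is a hypothetical name introduced here; the keys and default values are copied from the submission example above.

```python
import json


def build_options(received_date, schema_name="SAP_invoice_schema",
                  client_id="c_00", document_type="invoice"):
    """Return the JSON string for the 'options' form part (illustrative helper)."""
    options = {
        "schemaName": schema_name,
        "clientId": client_id,
        "documentType": document_type,
        "receivedDate": received_date,
        "enrichment": {
            "sender": {"top": 5, "type": "businessEntity", "subtype": "supplier"}
        },
    }
    return json.dumps(options)


options_json = build_options("2020-02-17")
```

The result can then be passed as the second element of the `'options'` tuple, for example `(None, options_json, 'application/json')`.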
Step 3: Retrieving Extracted Data
Using the job ID, we retrieve the results. Replace the ID placeholder in the request URL with the ID from the previous output. The SAP Document Information Extraction interface also shows the extracted fields in a formatted view.
The extracted fields (like receiver address, receiver name, document date, and net amount) are returned with confidence scores, allowing us to evaluate the reliability of each extracted field.
Unlike the interface, which shows only an extraction confidence range, the DOX API returns the actual confidence score for each field.
job_id = "<ID>"  # replace with the ID printed in Step 2
url = f"https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs/{job_id}"
response = requests.get(url, headers=headers)
if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])
    if not fields:
        print("No header fields extracted.")
    else:
        print("Extracted Fields:")
        for field in fields:
            print(f"- {field['name']}: {field['value']} (confidence: {round(field['confidence'], 2)})")
else:
    print("Failed to retrieve job data")
    print("Status:", response.status_code)
    print("Response:", response.text)
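Because processing is asynchronous, the results may not be ready immediately after submission. One way to handle this is to poll the job until it finishes. The sketch below assumes the job response contains a `status` field where `"DONE"` means the results are ready and `"FAILED"` means extraction failed; please verify these values against the API reference for your service version. `wait_for_job` and its `get_job` parameter are hypothetical names introduced for illustration.

```python
import time


def wait_for_job(get_job, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the job is finished (illustrative helper).

    get_job: callable returning the parsed job JSON as a dict.
    Assumes a 'status' field; "DONE" and "FAILED" are treated as final,
    any other value as still in progress.
    """
    for attempt in range(max_attempts):
        job = get_job()
        status = job.get("status")
        if status == "DONE":
            return job
        if status == "FAILED":
            raise RuntimeError(f"Extraction failed: {job}")
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"Job not finished after {max_attempts} attempts")
```

With the request from this step, a call might look like `job = wait_for_job(lambda: requests.get(url, headers=headers).json())`.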
Step 4: Document Confidence Scoring and Mandatory Field Check
We define a list of mandatory fields and set a minimum confidence threshold (e.g., 0.7, or 70%).
The Document Confidence Score, printed at the end of the output, is the percentage of mandatory fields that were found in the extraction.
min_confidence = 0.7  # threshold to flag low-match fields
mandatory_fields = {
    "receiverName": "Buyer Name",
    "receiverAddress": "Buyer Address",
    "documentDate": "Date Issued",
    "documentNumber": "Invoice Number",
    "currencyCode": "Currency",
    "taxAmount": "VAT Value",
    "senderAddress": "Company Address",
    "senderName": "Company Name",
    "taxId": "Tax ID"
}

if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])
    # Convert extracted fields into a dictionary: name => field
    extracted = {f["name"]: f for f in fields}
    total_score = 0
    max_score = len(mandatory_fields)
    low_match_flags = []
    print("Mandatory Field Check:\n")
    for key, label in mandatory_fields.items():
        field = extracted.get(key)
        if field:
            total_score += 1  # Presence of field
            print(f"{label} ({key}): {field['value']}")
        else:
            print(f"{label} ({key}): MISSING")
    print("\nResult Summary:")
    print(f"Fields Found: {total_score}/{max_score}")
    if total_score < max_score:
        print("\nLow Match Fields:")
        # Check for missing fields and flag low match
        for key, label in mandatory_fields.items():
            if key not in extracted:
                print(f" - {label} ({key})")
    document_confidence_score = round((total_score / max_score) * 100, 2)
    print(f"\nDocument Confidence Score: {document_confidence_score}%")
else:
    print("Failed to fetch job results.")
    print(f"Status Code: {response.status_code}")
    print(f"Response: {response.text}")
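The check above counts whether each mandatory field is present. One possible extension is to also apply the confidence threshold, so that fields which were found but extracted with low confidence are reported separately from missing ones. This is a sketch only: `check_mandatory_fields` is a hypothetical helper, it assumes each header field carries `name`, `value`, and `confidence` as in the Step 3 output, and the sample values are made up for illustration.

```python
def check_mandatory_fields(fields, mandatory_fields, min_confidence=0.7):
    """Split mandatory fields into missing and low-confidence (illustrative helper)."""
    extracted = {f["name"]: f for f in fields}
    missing, low_confidence = [], []
    for key in mandatory_fields:
        field = extracted.get(key)
        if field is None:
            missing.append(key)
        elif field.get("confidence", 0.0) < min_confidence:
            low_confidence.append(key)
    # Same score as above: share of mandatory fields that are present.
    found = len(mandatory_fields) - len(missing)
    score = round(found / len(mandatory_fields) * 100, 2) if mandatory_fields else 0.0
    return {"missing": missing, "low_confidence": low_confidence, "score": score}


# Made-up sample data, not real extraction output
sample_fields = [
    {"name": "documentNumber", "value": "INV-001", "confidence": 0.95},
    {"name": "currencyCode", "value": "THB", "confidence": 0.55},
]
result = check_mandatory_fields(
    sample_fields,
    {"documentNumber": "Invoice Number", "currencyCode": "Currency", "taxId": "Tax ID"},
)
```

Keeping the two lists separate lets a reviewer treat "not found" and "found but uncertain" differently, for example by routing only the latter to a manual check.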
Future Expansion
For integration and tracking, you'll want your enterprise system to be part of the Document Information Extraction process. While keeping all validation logic in Python, you can expand this use case with fuzzy matching, error handling, and real back-end systems such as SAP S/4HANA or SAP Build Process Automation.
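As a rough idea of what fuzzy matching could look like, the sketch below compares extracted values against an expected record using `difflib.SequenceMatcher` from the Python standard library. The `expected` record stands in for data that would come from your enterprise system; the function names and the 0.85 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(str(text).lower().split())


def compare_to_record(extracted, expected, threshold=0.85):
    """Return fields whose extracted value differs too much from the expected one."""
    mismatches = {}
    for key, expected_value in expected.items():
        actual = extracted.get(key)
        if actual is None:
            mismatches[key] = {"expected": expected_value, "actual": None,
                               "similarity": 0.0}
            continue
        ratio = SequenceMatcher(None, normalize(actual),
                                normalize(expected_value)).ratio()
        if ratio < threshold:
            mismatches[key] = {"expected": expected_value, "actual": actual,
                               "similarity": round(ratio, 2)}
    return mismatches
```

Here `extracted` would be a plain `name -> value` dictionary built from the header fields, and any key returned by `compare_to_record` is a candidate discrepancy to review.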
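All in all, this blog post provides a preliminary look into authenticating with SAP BTP, submitting an invoice to the DOX API, retrieving the extracted fields with their confidence scores, and checking mandatory fields in Python.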
If you process high volumes of invoices, SAP Document Information Extraction can help you save countless hours and reduce errors.
Leave your thoughts or questions in the comment section below!