Technology Blog Posts by SAP
jing_wen
Associate

Invoices are essential, but validating them manually can be tedious and error-prone. In this blog post, we’ll walk through a Python-based workflow that leverages SAP’s Document Information Extraction (DOX) API to extract invoice data and intelligently compare it against field records. The goal? Validate invoices and highlight discrepancies—all in one go.

Supported document types for the SAP Document Information Extraction service include supplier invoices, purchase orders, payment advice, and business cards, with more planned on the roadmap.

This blog post builds on an excellent introductory guide by Joni, which covers getting started with the SAP DOX API and testing it via Swagger. Here, we add Python logic that does the heavy lifting of:

  • OCR-based document post and extraction (via SAP DOX API)
  • Confidence scoring and validation of key fields
  • Automatic detection and reporting of mismatches

TL;DR: Automate invoice data extraction and validation using SAP DOX API and Python. OCR your PDFs, validate against expected fields with confidence scoring, and flag mismatches—all in one streamlined Python pipeline.

This guide is for: SAP BTP developers, automation engineers, finance teams exploring DOX integration, or anyone dealing with high-volume invoice processing.

Step 1: Authentication with SAP BTP

First, we authenticate against SAP BTP using OAuth 2.0 to retrieve an access token.
Update <TENANT> to your SAP BTP tenant name.

The Bearer Token is required for all subsequent API calls to the DOX endpoint.

import requests
import json
 
client_id = ""
client_secret = ""
oauth_url = "https://<TENANT>.authentication.ap10.hana.ondemand.com/oauth/token?grant_type=client_credentials"
 
# Get access token
def get_access_token():
    token_response = requests.post(
        oauth_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        headers={
            "Content-Type": "application/x-www-form-urlencoded"
        }
    )
    
    if token_response.status_code == 200:
        return token_response.json().get("access_token")
    else:
        raise Exception(f"Failed to get token: {token_response.status_code} - {token_response.text}")
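
Since every subsequent call needs the bearer token, it can be worth reusing one token until shortly before it expires instead of requesting a new one per call. The sketch below is one possible way to do that. It assumes the token response carries the standard OAuth 2.0 `expires_in` field (in seconds); `TokenCache`, `fetch_token`, and the 60-second safety margin are illustrative names and values, not part of the code above.

```python
import time


class TokenCache:
    """Caches an OAuth token and refreshes it shortly before expiry."""

    def __init__(self, fetch_token, margin_seconds=60, clock=time.monotonic):
        # fetch_token: hypothetical callable returning the parsed token response,
        # e.g. {"access_token": "...", "expires_in": 3600}
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a cached token, fetching a new one if it is missing or stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            # Assumption: expires_in is present; without it, nothing is cached
            lifetime = float(payload.get("expires_in", 0))
            self._expires_at = now + max(lifetime - self._margin, 0.0)
        return self._token
```

To use it with the request above, `fetch_token` would be a small function that posts to `oauth_url` and returns `token_response.json()` rather than only the `access_token` string.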

Step 2: Submitting an Invoice PDF for Document Information Extraction

Next, we upload a sample invoice PDF (in this case, Thai_Invoice_with_TH_date.pdf) to the DOX API. The POST /document/jobs endpoint submits a document (e.g., a PDF, PNG, or JPEG file) for asynchronous processing. If the submission succeeds, the API returns a job ID, which we use to track the job and fetch the results. The same ID works with other endpoints, including GET /document/jobs/{id} and DELETE /document/jobs/{id}.

Sample Invoice:
Thai_Invoice_with_TH_date.png

Doc Submitted.png

import requests

url = "https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs"
access_token = get_access_token()
file_path = "Thai_Invoice_with_TH_date.pdf"

headers = {
    "Authorization": f"Bearer {access_token}",
    "accept": "application/json"
}

# Job options, sent as a multipart form part next to the file (like -F flags in curl)
options = '''{
    "schemaName": "SAP_invoice_schema",
    "clientId": "c_00",
    "documentType": "invoice",
    "receivedDate": "2020-02-17",
    "enrichment": {
        "sender": {
            "top": 5,
            "type": "businessEntity",
            "subtype": "supplier"
        }
    }
}'''

# Open the PDF in a context manager so the file handle is closed after the upload
with open(file_path, 'rb') as pdf_file:
    files = {
        'file': (file_path, pdf_file, 'application/pdf'),
        'options': (None, options, 'application/json')
    }
    response = requests.post(url, headers=headers, files=files)

if response.status_code in [200, 201]:
    print("Document submitted successfully!")
    print("ID:", response.json().get("id"))
else:
    print("Submission failed")
    print("Status Code:", response.status_code)
    print("Response:", response.text)
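
A hand-written JSON string is easy to break with a stray comma or quote. As a minimal sketch, the same options can be built from a Python dictionary and serialized with `json.dumps`. The helper name `build_options` and its parameters are hypothetical; the option keys and values are the ones used in the request above.

```python
import json


def build_options(schema_name, client_id, document_type,
                  received_date=None, enrich_sender_top=None):
    """Return the 'options' form part as a JSON string."""
    options = {
        "schemaName": schema_name,
        "clientId": client_id,
        "documentType": document_type,
    }
    if received_date is not None:
        options["receivedDate"] = received_date
    if enrich_sender_top is not None:
        # Same enrichment block as in the request above
        options["enrichment"] = {
            "sender": {
                "top": enrich_sender_top,
                "type": "businessEntity",
                "subtype": "supplier",
            }
        }
    return json.dumps(options)


options_json = build_options("SAP_invoice_schema", "c_00", "invoice",
                             received_date="2020-02-17", enrich_sender_top=5)
```

The resulting string can then be passed as the options part of the multipart request, for example `'options': (None, options_json, 'application/json')`.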

Step 3: Retrieving Extracted Data

Using the job ID, we retrieve the results. Replace the ID placeholder in the request URL with the ID returned in the previous output. The SAP Document Information Extraction interface also shows your extracted fields in a formatted view.

The extracted fields (like receiver address, receiver name, document date, and net amount) are returned with confidence scores, allowing us to evaluate the reliability of each extracted field.

The DOX API returns the exact confidence score for each field, whereas the interface shows only an extraction confidence range.

Picture 1.png

Extracted Fields.png

job_id = "<ID>"  # replace with the ID printed in Step 2
url = f"https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs/{job_id}"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])

    if not fields:
        print("No header fields extracted.")
    else:
        print("Extracted Fields:")
        for field in fields:
            print(f"- {field['name']}: {field['value']} (confidence: {round(field['confidence'], 2)})")
else:
    print("Failed to retrieve job data")
    print("Status:", response.status_code)
    print("Response:", response.text)
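
Because the document is processed asynchronously, a GET sent right after submission may return a job that has not finished yet, in which case no header fields are available. A minimal polling sketch is shown below. It assumes the job payload exposes a `status` field and that `"DONE"` and `"FAILED"` are the terminal values, which should be checked against the API reference; `wait_for_job` and `fetch_job` are hypothetical names.

```python
import time


def wait_for_job(fetch_job, interval_seconds=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return its payload.

    fetch_job: hypothetical callable returning the parsed JSON of the job,
    e.g. lambda: requests.get(url, headers=headers).json()
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        status = job.get("status")
        # Assumption: "DONE" and "FAILED" are the terminal status values
        if status == "DONE":
            return job
        if status == "FAILED":
            raise RuntimeError(f"Extraction job failed: {job}")
        # Wait before the next attempt, but not after the last one
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Job not finished after {max_attempts} attempts")
```

The returned payload can be handled exactly like `data` in the code above, for example `wait_for_job(...).get("extraction", {}).get("headerFields", [])`.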

Step 4: Document Confidence Scoring and Mandatory Field Check

We define a list of mandatory fields and set a minimum confidence threshold (e.g., 70%).

The Document Confidence Score, the percentage of mandatory fields found in the extraction, is printed at the end of the output after the check for present and missing fields.

Field Check.png

min_confidence = 0.7  # threshold to flag low-match fields

mandatory_fields = {
    "receiverName": "Buyer Name",
    "receiverAddress": "Buyer Address",
    "documentDate": "Date Issued",
    "documentNumber": "Invoice Number",
    "currencyCode": "Currency",
    "taxAmount": "VAT Value",
    "senderAddress": "Company Address",
    "senderName": "Company Name",
    "taxId": "Tax ID"
}

if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])
    
    # Convert extracted fields into a dictionary: name => field
    extracted = {f["name"]: f for f in fields}
    
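    total_score = 0
    max_score = len(mandatory_fields)
    missing_fields = []   # mandatory fields absent from the extraction
    low_match_flags = []  # fields found, but below min_confidence

    print("Mandatory Field Check:\n")

    for key, label in mandatory_fields.items():
        field = extracted.get(key)
        if field:
            total_score += 1  # Presence of field
            print(f"{label} ({key}): {field['value']}")
            # Flag fields whose extraction confidence is below the threshold
            if field.get("confidence", 0) < min_confidence:
                low_match_flags.append((label, key, field.get("confidence", 0)))
        else:
            missing_fields.append((label, key))
            print(f"{label} ({key}): MISSING")

    print("\nResult Summary:")
    print(f"Fields Found: {total_score}/{max_score}")

    if missing_fields:
        print("\nMissing Fields:")
        for label, key in missing_fields:
            print(f" - {label} ({key})")

    if low_match_flags:
        print(f"\nLow Match Fields (confidence < {min_confidence}):")
        for label, key, confidence in low_match_flags:
            print(f" - {label} ({key}): {round(confidence, 2)}")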

    document_confidence_score = round((total_score / max_score) * 100, 2)
    print(f"\nDocument Confidence Score: {document_confidence_score}%")

else:
    print("Failed to fetch job results.")
    print(f"Status Code: {response.status_code}")
    print(f"Response: {response.text}")
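
The check above confirms that mandatory fields are present. To detect mismatches against your own records, the extracted values also need to be compared with expected ones. The sketch below shows one possible approach using `difflib.SequenceMatcher` from the standard library for a tolerant string comparison. The names `find_mismatches`, `normalize`, and `expected_record`, as well as the 0.85 similarity threshold, are illustrative assumptions rather than part of the DOX API.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(str(value).lower().split())


def find_mismatches(extracted, expected_record, min_similarity=0.85):
    """Compare extracted field values with expected ones.

    extracted: dict of field name -> field dict with a "value" key
    expected_record: hypothetical dict of field name -> expected value
    Returns a list of (field name, expected, actual, similarity) tuples.
    """
    mismatches = []
    for name, expected in expected_record.items():
        field = extracted.get(name)
        actual = field["value"] if field else None
        if actual is None:
            # Field not extracted at all: report it with similarity 0.0
            mismatches.append((name, expected, None, 0.0))
            continue
        similarity = SequenceMatcher(
            None, normalize(expected), normalize(actual)
        ).ratio()
        if similarity < min_similarity:
            mismatches.append((name, expected, actual, round(similarity, 2)))
    return mismatches
```

Fuzzy comparison suits names and addresses, where OCR and formatting differences are common. For identifiers such as invoice numbers or tax IDs, calling the function with `min_similarity=1.0` enforces an exact match after normalization.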

Future Expansion

For integration and tracking, you will want your enterprise system to be part of the Document Information Extraction process. While keeping all validation logic in Python, you can extend this use case with fuzzy matching, error handling, and real data sources such as SAP S/4HANA or SAP Build Process Automation:

  1. Pull real data dynamically
  2. Push results (e.g. approved/rejected invoices)
  3. Trigger workflows (emails, dashboards, auto-approval/review flagging etc.)
  4. Store validated invoice results in your backend
  5. Log confidence scores, mismatches, timestamps
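
As a starting point for the logging item above, the sketch below builds one JSON line per validated invoice, which could be appended to a log file or sent to a backend. The function name `build_log_record`, the record keys, and the `"review"`/`"approved"` status values are hypothetical choices for illustration; `fields` is assumed to be the list of header fields returned by the extraction, each with a `name` and a `confidence`.

```python
import json
from datetime import datetime, timezone


def build_log_record(job_id, fields, mismatches, now=None):
    """Return one JSON line summarising a validated invoice."""
    # Use the supplied timestamp if given, else the current UTC time
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "jobId": job_id,
        "timestamp": timestamp,
        "confidences": {f["name"]: round(f["confidence"], 2) for f in fields},
        "mismatches": list(mismatches),
        # Any mismatch routes the invoice to manual review
        "status": "review" if mismatches else "approved",
    }
    return json.dumps(record, ensure_ascii=False)
```

Writing one self-contained JSON object per line keeps the log easy to append to and easy to load later for dashboards or audits.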

All in all, this blog post provides a preliminary look into:

  • Invoice document posting and extraction with the SAP DOX API
  • Invoice validation
  • Confidence scoring

If you process high volumes of invoices, SAP Document Information Extraction can help you save countless hours and reduce errors.

Leave your thoughts or questions in the comment section below!