What You Will Build
A single-page web application where you can:
- Upload Sales SOP documents (PDF or DOCX, including scanned PDFs via OCR)
- Extract named entities (roles, products, process steps, systems) with Claude
- Review the results in an SAP Fiori-styled single-page UI

---
Architecture Overview
The stack is intentionally minimal:
- Flask backend (all routes in app.py)
- Vanilla JavaScript frontend styled with SAP Fundamental Styles (Fiori Horizon theme)
- Claude via the Anthropic API, or SAP AI Core on BTP, for entity extraction
- pdfplumber, pdf2image, and pytesseract for document parsing
- In-memory session storage (no database)

---
Prerequisites
System dependencies (install once):
# macOS
brew install tesseract poppler
# Ubuntu / Debian
sudo apt install tesseract-ocr poppler-utils
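Before going further, you can confirm both system binaries are actually on your PATH: pdf2image shells out to poppler's pdftoppm, and pytesseract shells out to tesseract. A quick optional stdlib check (the function name is mine, not part of the project):

```python
import shutil

def check_system_deps() -> dict[str, bool]:
    """Return {binary: found} for each system dependency the app shells out to."""
    # pdftoppm comes from poppler-utils; tesseract from tesseract-ocr
    return {name: shutil.which(name) is not None
            for name in ("tesseract", "pdftoppm")}

missing = [name for name, ok in check_system_deps().items() if not ok]
if missing:
    print(f"Missing system dependencies: {', '.join(missing)}")
```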
Project Structure
sales-sop-extractor/
├── app.py                     # Flask app, all routes
├── services/
│   ├── document_parser.py     # PDF + DOCX → text (with OCR fallback)
│   ├── entity_extractor.py    # Claude API integration
│   └── storage.py             # In-memory session store
├── static/
│   └── app.js                 # Frontend state and API calls
├── templates/
│   └── index.html             # SAP Fiori single-page UI
├── requirements.txt
└── .env.example

---
Step 1 — Clone and Install
git clone https://github.tools.sap/<MY SAP ID>/document_extractor
cd sales-sop-extractor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

requirements.txt:
flask==3.0.3
python-dotenv==1.0.1
anthropic>=0.49.0
pdfplumber==0.11.4
python-docx==1.1.2
pdf2image==1.17.0
pytesseract==0.3.13
gunicorn==22.0.0

---
Step 2 — Configure Your API Connection
Copy the example env file:
cp .env.example .env
Option A — Anthropic API (direct)
ANTHROPIC_API_KEY=sk-ant-your-key-here
FLASK_SECRET_KEY=change-this-in-production
FLASK_DEBUG=true
MAX_UPLOAD_SIZE_MB=20

Option B — SAP AI Core on BTP
If you are running this inside an SAP BTP environment, you can connect through SAP AI Core instead of calling the Anthropic API directly. Replace the entity extractor's HTTP call with the AI Core OAuth2 + inference endpoint:
AICORE_AUTH_URL=https://<subaccount>.authentication.sap.hana.ondemand.com
AICORE_CLIENT_ID=<client-id>
AICORE_CLIENT_SECRET=<client-secret>
AICORE_RESOURCE_GROUP=default
AICORE_BASE_URL=https://api.ai.<region>.cfapps.sap.hana.ondemand.com/v2
AICORE_MODEL=claude-opus-4-6

The entity_extractor.py service handles the OAuth2 client-credentials flow and token caching automatically when these variables are present:
def _get_access_token() -> str:
    auth_url = os.environ["AICORE_AUTH_URL"].rstrip("/") + "/oauth/token"
    resp = requests.post(
        auth_url,
        data={"grant_type": "client_credentials"},
        auth=(os.environ["AICORE_CLIENT_ID"], os.environ["AICORE_CLIENT_SECRET"]),
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()
    return data["access_token"]

---
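The token caching mentioned above can be sketched like this (illustrative only, not the repo's exact code; in the real service `fetch` would wrap `_get_access_token` and the token endpoint's `expires_in` field):

```python
import time

# Module-level cache: token plus the wall-clock time at which to refresh it.
_token_cache: dict = {"token": None, "expires_at": 0.0}

def get_cached_token(fetch, now=time.time) -> str:
    """Return a cached access token, refreshing it 60 s before expiry.

    `fetch` must return (token, expires_in_seconds). Injecting it keeps this
    sketch testable without a live OAuth server.
    """
    if _token_cache["token"] is None or now() >= _token_cache["expires_at"]:
        token, expires_in = fetch()
        _token_cache["token"] = token
        # Refresh a minute early so an in-flight request never carries a stale token
        _token_cache["expires_at"] = now() + expires_in - 60
    return _token_cache["token"]
```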
Step 3 — Run the App
python3 app.py
Open http://localhost:5000 in your browser.
---
How It Works
Document Parsing — Text Layer + OCR Fallback
The parser first attempts to extract the text layer using pdfplumber. If fewer than 50 characters are returned (indicating a scanned or image-only PDF), it automatically falls back to OCR:
def parse_pdf(file_bytes: bytes) -> tuple[str, bool]:
    # 1. Try text layer first
    text_parts = []
    with pdfplumber.open(io.BytesIO(file_bytes)) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text_parts.append(page_text)
    text = "\n".join(text_parts)

    # 2. Fall back to OCR if text layer is empty
    if len(text.strip()) < 50:
        images = convert_from_bytes(file_bytes, dpi=300)
        text = "\n".join(
            pytesseract.image_to_string(img.convert("L"))
            for img in images
        )
        return text, True  # ocr_used=True
    return text, False

When OCR is used, an orange 📷 OCR badge appears in the sidebar so the user knows the document was image-based.
Entity Extraction — Claude Prompt Design
The system prompt is kept tight and unambiguous:
SYSTEM_PROMPT = (
    "You are an expert business analyst specializing in sales processes. "
    "Extract named entities from Sales SOP documents. "
    "Respond with valid JSON only — no explanation, no markdown, no code fences."
)

The user prompt asks Claude to return a JSON array with category, name, and description fields across four categories:
- people_roles — job titles, teams, functions
- products_services — products, SKUs, pricing tiers
- processes_steps — sales stages, workflow steps, approvals
- systems_tools — CRM, platforms, software mentioned
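The user prompt itself can be assembled along these lines (the wording is illustrative, not necessarily the repo's exact prompt):

```python
CATEGORIES = {
    "people_roles": "job titles, teams, functions",
    "products_services": "products, SKUs, pricing tiers",
    "processes_steps": "sales stages, workflow steps, approvals",
    "systems_tools": "CRM, platforms, software mentioned",
}

def build_user_prompt(document_text: str, max_chars: int = 80_000) -> str:
    """Build the extraction prompt, truncating the document to fit context."""
    category_lines = "\n".join(f"- {k}: {v}" for k, v in CATEGORIES.items())
    return (
        "Extract all named entities from the Sales SOP below.\n"
        f"Use exactly these categories:\n{category_lines}\n"
        'Return a JSON array of objects with "category", "name", '
        'and "description" fields.\n\n'
        f"Document:\n{document_text[:max_chars]}"
    )
```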
Documents are truncated to 80,000 characters before sending to stay within context limits. The response is parsed defensively — code fences are stripped, each item is validated, and a UUID is assigned:
def _parse_response(raw: str) -> list[dict]:
    text = re.sub(r"^```(?:json)?\s*", "", raw.strip())
    text = re.sub(r"\s*```$", "", text).strip()
    data = json.loads(text)
    return [
        {
            "id": str(uuid.uuid4()),
            "category": item["category"],
            "name": item["name"],
            "description": item.get("description"),
            "source": "extracted",
        }
        for item in data
        if item.get("name") and item.get("category") in VALID_CATEGORIES
    ]

Front-End UI
<link rel="stylesheet"
href="https://unpkg.com/@sap/fundamental-styles@0.30.0/dist/fundamental-styles.css" />
<link rel="stylesheet"
href="https://unpkg.com/@sap-theming/theming-base-content/content/Base/baseLib/sap_horizon/css_variables.css" />

The layout uses fd-shell, fd-table, fd-busy-indicator, fd-dialog, and fd-object-status — giving it a genuine Fiori Horizon look without any SAP system dependency.
---
Testing with a Scanned PDF
To test the OCR path without needing a real scanned document, the repo includes a script that generates one. It renders text onto a PDF with reportlab, rasterizes the pages to images with pdf2image, adds a slight scanner-like blur, then re-embeds the images as a PDF with no text layer:
venv/bin/python3 generate_scanned_pdf.py
# Creates: sample_scanned_sop.pdf (2 pages, ~250 KB, zero text layer)
Upload this file and you will see:
- The orange 📷 OCR badge in the sidebar
- All entities extracted correctly despite no text layer
Test results from the sample document:
28 entities extracted. Zero missed.
---
API Reference
The backend exposes a clean REST API, so it can be consumed by other tools or integrated into existing SAP workflows:
- POST /upload: accepts a multipart file upload, returns an upload_id
- POST /extract: takes a JSON body with the upload_id, returns the extracted entities
Example — upload and extract via curl:
# Upload
UPLOAD_ID=$(curl -s -X POST http://localhost:5000/upload \
-F "file=@my_sop.pdf" | python3 -c "import sys,json; print(json.load(sys.stdin)['upload_id'])")

# Extract
curl -s -X POST http://localhost:5000/extract \
-H "Content-Type: application/json" \
-d "{\"upload_id\": \"$UPLOAD_ID\"}"

---
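The same two calls can be driven from Python with only the standard library (assuming the /upload and /extract routes above; urllib has no multipart encoder, so this sketch hand-rolls one):

```python
import json
import urllib.request
import uuid

def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode a single file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def upload_and_extract(base_url: str, path: str) -> dict:
    """Upload a document, then request entity extraction for it."""
    with open(path, "rb") as f:
        body, ctype = encode_multipart("file", path, f.read())
    req = urllib.request.Request(f"{base_url}/upload", data=body,
                                 headers={"Content-Type": ctype})
    upload_id = json.load(urllib.request.urlopen(req))["upload_id"]
    req = urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps({"upload_id": upload_id}).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))
```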
Known Limitations
- Session data is held in memory, so uploads and extracted entities are lost on restart
- Documents are truncated to 80,000 characters before extraction
- Uploads are capped at MAX_UPLOAD_SIZE_MB (20 MB by default)
- OCR accuracy depends on scan quality and resolution

---
Running in Production
For SAP BTP Cloud Foundry deployment, add a manifest.yml and push as a standard Python buildpack app. Point AICORE_* environment variables to your AI Core service binding.
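A minimal manifest.yml sketch (the name, memory setting, and gunicorn command are illustrative; adjust to your subaccount):

```yaml
applications:
  - name: sales-sop-extractor
    memory: 512M
    buildpacks:
      - python_buildpack
    command: gunicorn --bind 0.0.0.0:$PORT app:app
```

Note that the Python buildpack does not ship tesseract or poppler, so the OCR path on Cloud Foundry may require a multi-buildpack setup (for example with the community apt buildpack) or disabling OCR entirely.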
---
In this blog post we built a complete, plug-and-play Sales SOP entity extractor:
- PDF and DOCX parsing with automatic OCR fallback for scanned documents
- Claude-powered entity extraction into four business categories
- An SAP Fiori Horizon single-page UI with no SAP system dependency
- A REST API backend that runs against the Anthropic API or SAP AI Core
The full source code is structured to drop into any Python environment and be running in under 5 minutes. It is equally at home as a standalone internal tool or as a microservice feeding structured SOP data into larger SAP workflows.
I hope this is useful for anyone working with sales knowledge management or application development in the SAP ecosystem. Feel free to extend it: export to SAP Knowledge Base, feed entities into SAP CRM, or pipe the output into SAP Analytics Cloud.
Questions and feedback welcome in the comments below.
Happy building!