What You Will Build
A single-page web application where you can:
- Upload Sales SOP documents (PDF or DOCX, including scanned PDFs via OCR)
- Extract named entities (roles, products, process steps, systems) with Claude
- Review the results in an SAP Fiori-styled single-page UI

---
Architecture Overview
The stack is intentionally minimal:
- Flask backend (all routes in app.py)
- Vanilla JavaScript frontend styled with SAP Fundamental Styles (Fiori Horizon theme)
- Claude via the Anthropic API, or SAP AI Core on BTP, for entity extraction
- pdfplumber, pdf2image, and pytesseract for document parsing
- In-memory session storage (no database)

---
Prerequisites
System dependencies (install once):
# macOS
brew install tesseract poppler
# Ubuntu / Debian
sudo apt install tesseract-ocr poppler-utils
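Before going further, you can confirm both system binaries are actually on your PATH: pdf2image shells out to poppler's pdftoppm, and pytesseract shells out to tesseract. A quick optional stdlib check (the function name is mine, not part of the project):

```python
import shutil

def check_system_deps() -> dict[str, bool]:
    """Return {binary: found} for each system dependency the app shells out to."""
    # pdftoppm comes from poppler-utils; tesseract from tesseract-ocr
    return {name: shutil.which(name) is not None
            for name in ("tesseract", "pdftoppm")}

missing = [name for name, ok in check_system_deps().items() if not ok]
if missing:
    print(f"Missing system dependencies: {', '.join(missing)}")
```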
Project Structure
sales-sop-extractor/
├── app.py                     # Flask app, all routes
├── services/
│   ├── document_parser.py     # PDF + DOCX → text (with OCR fallback)
│   ├── entity_extractor.py    # Claude API integration
│   └── storage.py             # In-memory session store
├── static/
│   └── app.js                 # Frontend state and API calls
├── templates/
│   └── index.html             # SAP Fiori single-page UI
├── requirements.txt
└── .env.example

---
Step 1 — Clone and Install
git clone https://github.tools.sap/<MY SAP ID>/document_extractor
cd sales-sop-extractor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

requirements.txt:
flask==3.0.3
python-dotenv==1.0.1
anthropic>=0.49.0
pdfplumber==0.11.4
python-docx==1.1.2
pdf2image==1.17.0
pytesseract==0.3.13
gunicorn==22.0.0

---
Step 2 — Configure Your API Connection
Copy the example env file:
cp .env.example .env
Option A — Anthropic API (direct)
ANTHROPIC_API_KEY=sk-ant-your-key-here
FLASK_SECRET_KEY=change-this-in-production
FLASK_DEBUG=true
MAX_UPLOAD_SIZE_MB=20

Option B — SAP AI Core on BTP
If you are running this inside an SAP BTP environment, you can connect through SAP AI Core instead of calling the Anthropic API directly. Replace the entity extractor's HTTP call with the AI Core OAuth2 + inference endpoint:
AICORE_AUTH_URL=https://<subaccount>.authentication.sap.hana.ondemand.com
AICORE_CLIENT_ID=<client-id>
AICORE_CLIENT_SECRET=<client-secret>
AICORE_RESOURCE_GROUP=default
AICORE_BASE_URL=https://api.ai.<region>.cfapps.sap.hana.ondemand.com/v2
AICORE_MODEL=claude-opus-4-6

The entity_extractor.py service handles the OAuth2 client-credentials flow and token caching automatically when these variables are present:
def _get_access_token() -> str:
    auth_url = os.environ["AICORE_AUTH_URL"].rstrip("/") + "/oauth/token"
    resp = requests.post(
        auth_url,
        data={"grant_type": "client_credentials"},
        auth=(os.environ["AICORE_CLIENT_ID"], os.environ["AICORE_CLIENT_SECRET"]),
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()
    return data["access_token"]

---
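The token caching mentioned above can be sketched like this (illustrative only, not the repo's exact code; in the real service `fetch` would wrap `_get_access_token` and the token endpoint's `expires_in` field):

```python
import time

# Module-level cache: token plus the wall-clock time at which to refresh it.
_token_cache: dict = {"token": None, "expires_at": 0.0}

def get_cached_token(fetch, now=time.time) -> str:
    """Return a cached access token, refreshing it 60 s before expiry.

    `fetch` must return (token, expires_in_seconds). Injecting it keeps this
    sketch testable without a live OAuth server.
    """
    if _token_cache["token"] is None or now() >= _token_cache["expires_at"]:
        token, expires_in = fetch()
        _token_cache["token"] = token
        # Refresh a minute early so an in-flight request never carries a stale token
        _token_cache["expires_at"] = now() + expires_in - 60
    return _token_cache["token"]
```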
Step 3 — Run the App
python3 app.py
Open http://localhost:5000 in your browser.
---
How It Works
Document Parsing — Text Layer + OCR Fallback
The parser first attempts to extract the text layer using pdfplumber. If fewer than 50 characters are returned (indicating a scanned or image-only PDF), it automatically falls back to OCR:
def parse_pdf(file_bytes: bytes) -> tuple[str, bool]:
    # 1. Try text layer first
    text_parts = []
    with pdfplumber.open(io.BytesIO(file_bytes)) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text_parts.append(page_text)
    text = "\n".join(text_parts)

    # 2. Fall back to OCR if text layer is empty
    if len(text.strip()) < 50:
        images = convert_from_bytes(file_bytes, dpi=300)
        text = "\n".join(
            pytesseract.image_to_string(img.convert("L"))
            for img in images
        )
        return text, True  # ocr_used=True
    return text, False

When OCR is used, an orange 📷 OCR badge appears in the sidebar so the user knows the document was image-based.
Entity Extraction — Claude Prompt Design
The system prompt is kept tight and unambiguous:
SYSTEM_PROMPT = (
    "You are an expert business analyst specializing in sales processes. "
    "Extract named entities from Sales SOP documents. "
    "Respond with valid JSON only — no explanation, no markdown, no code fences."
)

The user prompt asks Claude to return a JSON array with category, name, and description fields across four categories:
- people_roles — job titles, teams, functions
- products_services — products, SKUs, pricing tiers
- processes_steps — sales stages, workflow steps, approvals
- systems_tools — CRM, platforms, software mentioned
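The user prompt itself can be assembled along these lines (the wording is illustrative, not necessarily the repo's exact prompt):

```python
CATEGORIES = {
    "people_roles": "job titles, teams, functions",
    "products_services": "products, SKUs, pricing tiers",
    "processes_steps": "sales stages, workflow steps, approvals",
    "systems_tools": "CRM, platforms, software mentioned",
}

def build_user_prompt(document_text: str, max_chars: int = 80_000) -> str:
    """Build the extraction prompt, truncating the document to fit context."""
    category_lines = "\n".join(f"- {k}: {v}" for k, v in CATEGORIES.items())
    return (
        "Extract all named entities from the Sales SOP below.\n"
        f"Use exactly these categories:\n{category_lines}\n"
        'Return a JSON array of objects with "category", "name", '
        'and "description" fields.\n\n'
        f"Document:\n{document_text[:max_chars]}"
    )
```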
Documents are truncated to 80,000 characters before sending to stay within context limits. The response is parsed defensively — code fences are stripped, each item is validated, and a UUID is assigned:
def _parse_response(raw: str) -> list[dict]:
    text = re.sub(r"^```(?:json)?\s*", "", raw.strip())
    text = re.sub(r"\s*```$", "", text).strip()
    data = json.loads(text)
    return [
        {
            "id": str(uuid.uuid4()),
            "category": item["category"],
            "name": item["name"],
            "description": item.get("description"),
            "source": "extracted",
        }
        for item in data
        if item.get("name") and item.get("category") in VALID_CATEGORIES
    ]

Front-End UI
<link rel="stylesheet"
href="https://unpkg.com/@sap/fundamental-styles@0.30.0/dist/fundamental-styles.css" />
<link rel="stylesheet"
href="https://unpkg.com/@sap-theming/theming-base-content/content/Base/baseLib/sap_horizon/css_variables.css" />

The layout uses fd-shell, fd-table, fd-busy-indicator, fd-dialog, and fd-object-status — giving it a genuine Fiori Horizon look without any SAP system dependency.
---
Testing with a Scanned PDF
To test the OCR path without needing a real scanned document, the repo includes a script that generates one. It renders text onto a PDF with reportlab, rasterizes the pages to images with pdf2image, adds a slight scanner-like blur, then re-embeds the images as a PDF with no text layer:
venv/bin/python3 generate_scanned_pdf.py
# Creates: sample_scanned_sop.pdf (2 pages, ~250 KB, zero text layer)
Upload this file and you will see:
- The orange 📷 OCR badge in the sidebar
- All entities extracted correctly despite no text layer
Test results from the sample document:
28 entities extracted. Zero missed.
---
API Reference
The backend exposes a clean REST API, so it can be consumed by other tools or integrated into existing SAP workflows:
- POST /upload: accepts a multipart file upload, returns an upload_id
- POST /extract: takes a JSON body with the upload_id, returns the extracted entities
Example — upload and extract via curl:
# Upload
UPLOAD_ID=$(curl -s -X POST http://localhost:5000/upload \
-F "file=@my_sop.pdf" | python3 -c "import sys,json; print(json.load(sys.stdin)['upload_id'])")

# Extract
curl -s -X POST http://localhost:5000/extract \
-H "Content-Type: application/json" \
-d "{\"upload_id\": \"$UPLOAD_ID\"}"

---
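The same two calls can be driven from Python with only the standard library (assuming the /upload and /extract routes above; urllib has no multipart encoder, so this sketch hand-rolls one):

```python
import json
import urllib.request
import uuid

def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode a single file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def upload_and_extract(base_url: str, path: str) -> dict:
    """Upload a document, then request entity extraction for it."""
    with open(path, "rb") as f:
        body, ctype = encode_multipart("file", path, f.read())
    req = urllib.request.Request(f"{base_url}/upload", data=body,
                                 headers={"Content-Type": ctype})
    upload_id = json.load(urllib.request.urlopen(req))["upload_id"]
    req = urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps({"upload_id": upload_id}).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))
```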
Known Limitations
- Session data is held in memory, so uploads and extracted entities are lost on restart
- Documents are truncated to 80,000 characters before extraction
- Uploads are capped at MAX_UPLOAD_SIZE_MB (20 MB by default)
- OCR accuracy depends on scan quality and resolution

---
Running in Production
For SAP BTP Cloud Foundry deployment, add a manifest.yml and push as a standard Python buildpack app. Point AICORE_* environment variables to your AI Core service binding.
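A minimal manifest.yml sketch (the name, memory setting, and gunicorn command are illustrative; adjust to your subaccount):

```yaml
applications:
  - name: sales-sop-extractor
    memory: 512M
    buildpacks:
      - python_buildpack
    command: gunicorn --bind 0.0.0.0:$PORT app:app
```

Note that the Python buildpack does not ship tesseract or poppler, so the OCR path on Cloud Foundry may require a multi-buildpack setup (for example with the community apt buildpack) or disabling OCR entirely.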
---
In this blog post we built a complete, plug-and-play Sales SOP entity extractor:
- PDF and DOCX parsing with automatic OCR fallback for scanned documents
- Claude-powered entity extraction into four business categories
- An SAP Fiori Horizon single-page UI with no SAP system dependency
- A REST API backend that runs against the Anthropic API or SAP AI Core
The full source code is structured to drop into any Python environment and be running in under 5 minutes. It is equally at home as a standalone internal tool or as a microservice feeding structured SOP data into larger SAP workflows.
I hope this is useful for anyone working with sales knowledge management or application development in the SAP ecosystem. Feel free to extend it: export to SAP Knowledge Base, feed entities into SAP CRM, or pipe the output into SAP Analytics Cloud.
Questions and feedback welcome in the comments below.
Happy building!