Hands-On Tutorial: Building an Autonomous Agent for Competitive Intelligence

CanAbdulla · 2025 Oct 29

In a fast-moving market, knowing what your competitors are up to, including product launches, pricing changes, and new content, is essential for strategic decision-making.

But manually tracking competitor websites and news is time-consuming. In this tutorial, we will build a hands-on solution on SAP BTP.

We will build an LLM-powered system that automatically crawls competitor websites for updates and answers your questions about those findings using a conversational agent.

The system combines an intelligent web crawler with an AI Agent to deliver the most relevant updates.

Note: Always ensure web crawling complies with each site's Terms of Service and robots.txt. Avoid crawling content that is private or protected. Use responsible rates and respect legal restrictions when collecting data.

Image 1: Agent for Competitive Intelligence

A demonstration of the system's capabilities can be found here.

Solution Overview

The solution has two main components that work together.

1. Automated Web Crawler + Summarizer: A Python crawler periodically visits a competitor's site and extracts both structured data (links and page hierarchy) and unstructured data (the most relevant page content). A Large Language Model decides which links to follow and classifies pages as static (unlikely to change, e.g. press releases) or dynamic (frequently updated). All page data is persisted in an SAP HANA Cloud database, where content embeddings for semantic search are generated automatically. After crawling, the system produces a "Competitor Intelligence Briefing", a concise summary of the latest relevant competitor activity.

2. AI Analyst Agent: An interactive AI Agent that answers questions using two primary knowledge sources: (a) the internal database of crawled pages and (b) the live internet. The agent uses tools and reasoning capabilities following the ReAct paradigm. More in-depth information on such an agent can be found in this blog I wrote on a conversational agent with human-in-the-loop control, or in this blog by Andreas Forster on a Q&A agent.

Overall, the crawler keeps our internal knowledge base fresh, and the AI Agent lets us query this internal data plus the latest public information.

Image 2: Conceptual Overview

Prerequisites and Setup

Before we take a closer look at the code, make sure you have the following ready:

Python 3.12
Access to SAP HANA Cloud (with database credentials to store data)
Access to SAP AI Core
A target website URL

Installation: Clone the project repository and install the dependencies:

pip install -r requirements.txt

This will pull the necessary libraries, e.g. streamlit, langgraph, playwright, hana-ml.

Configuration: Create a .env file in the project root with your HANA DB connection details:

DB_ADDRESS=<your_HANA_hostname>
DB_USER=<your_db_user>
DB_PASSWORD=<your_db_password>
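
As a minimal sketch of how these three values might be read at startup, the helper below parses a .env-style file with the standard library only. The names load_env_file, get_hana_settings and REQUIRED_KEYS are hypothetical and not part of the project, which may well rely on a package such as python-dotenv instead.

```python
import os
from pathlib import Path

# The three settings named in the .env example above
REQUIRED_KEYS = ("DB_ADDRESS", "DB_USER", "DB_PASSWORD")


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hana_settings(path=".env"):
    """Return the connection settings; real environment variables win over the file."""
    file_values = load_env_file(path)
    settings = {key: os.environ.get(key, file_values.get(key)) for key in REQUIRED_KEYS}
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return settings
```

The resulting dictionary could then be handed to the database client, for example hana_ml.dataframe.ConnectionContext with address, user and password (HANA Cloud typically listens on port 443). Failing early on a missing value avoids a less obvious connection error later in the crawl.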

Make sure the SAP Generative AI Hub SDK is configured correctly so that init_llm() can connect to the LLM (visit this blog for more information).

With the setup done, let's walk through the components in more detail.

Part 1: Intelligent Web Crawler for Competitor Websites

The crawling system is implemented in start_crawling.py and helper modules. Its job is to explore the competitor site and record the content of each important page in the database.

Key features include:

Selective Site Exploration: Starting from the seed URL, the crawler follows the page hierarchy using breadth-first search. It doesn't blindly grab every link. Instead, it uses LLM-based filtering to pick only follow-up URLs that likely lead to meaningful content, e.g. articles or product pages, while avoiding irrelevant links.

Static vs. Dynamic Page Detection: Each page is analyzed to determine whether it is static or dynamic. Static pages are flagged so that the crawler can skip them on later runs.

Change Detection and Versioning: The crawler computes a hash of each page's content. If a page has been seen before, it compares the hash to detect updates. Changed pages get their content updated in the database with the old content saved as previous_content to keep a history of changes.

Database Storage (SAP HANA Cloud): All data lands in a HANA Cloud database: a PAGES table for the crawl data and a SUMMARIES table for the daily summary briefing.

Daily Summary Generation: After crawling, the system automatically composes an executive summary of updates. This summary is saved to the DB and is later provided to the AI agent as initial context.

Crawling Strategy and LLM-Assisted URL Filtering

The crawler begins at the TARGET_URL and uses a queue to perform a breadth-first traversal of the site, going no deeper than MAX_DEPTH. Here is the code of the crawl loop:

# In start_crawling.py (excerpt)
from collections import deque

# Queue entries are (url, parent_url, depth); the seed URL has no parent
queue = deque([(base_url, None, 0)])
visited = {base_url}
while queue:
    current_url, parent_url, depth = queue.popleft()

    # Skip pages already flagged as static in the database
    static = retrieve_page(conn, current_url, ['static'])
    if static:
        continue

    html_content = extract_site_content(current_url, return_raw_html=True)

    # Parse HTML and compute content hash (not shown in this excerpt; this step
    # provides page_exists, page_id, old_content_hash and new_content_hash)
    if page_exists:
        # Known page: rewrite the stored content only if the hash changed
        if new_content_hash != old_content_hash:
            content = get_content(html_content)
            update_page(conn, page_id, new_content_hash, content)
    else:
        # New page: extract the content, classify it, and insert it
        content = get_content(html_content)
        static = identify_static_site(current_url, html_content)
        add_new_page(conn, current_url, parent_url, static, new_content_hash, content)

    # Record processed pages

    # Queue the LLM-filtered follow-up links until the depth limit is reached
    if depth < MAX_DEPTH:
        follow_up_urls = get_urls(html_content, current_url, filter_prompt)
        for url in follow_up_urls:
            clean_url = url_normalize(url['url'])
            if clean_url not in visited:
                visited.add(clean_url)
                queue.append((clean_url, current_url, depth + 1))

In this loop, the crawler fetches a page, stores it in the DB, and then discovers new links to follow. Note the call to get_urls(), which extracts candidate links from the page and passes them through an LLM filter to identify legitimate follow-up links. The filter_prompt gives the LLM the criteria for selecting them. With this AI-based filtering, the crawler focuses only on relevant content pages.

This intelligent filtering is very useful on complex sites as it tries to mimic what a human analyst would do while clicking through a site. You can adjust the filter_prompt to fit the specific structure of your target site.
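
To make the two stages of get_urls() more concrete, here is a rough sketch of how link extraction and filtering could be separated. It is not the project's implementation: candidate_links, filter_links and the select callable are hypothetical names, and select merely stands in for the LLM call driven by filter_prompt.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect href targets and anchor text from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"url": self._href, "text": "".join(self._text).strip()})
            self._href = None


def candidate_links(html, page_url):
    """Return same-domain links as absolute URLs, without fragments or duplicates."""
    collector = _LinkCollector()
    collector.feed(html)
    domain = urlparse(page_url).netloc
    seen, result = set(), []
    for link in collector.links:
        absolute, _fragment = urldefrag(urljoin(page_url, link["url"]))
        if urlparse(absolute).netloc != domain or absolute in seen:
            continue
        seen.add(absolute)
        result.append({"url": absolute, "text": link["text"]})
    return result


def filter_links(links, select):
    """Keep only the links chosen by `select`, a callable standing in for the LLM."""
    chosen = set(select(links))
    return [link for link in links if link["url"] in chosen]
```

Passing the anchor text along with the URL gives the model more signal than the URL alone. Restricting the result to URLs that were actually among the candidates guards against a model returning links that were never on the page.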

Image 3: Flow Chart of Crawler Strategy

Identifying Static Pages to Avoid Recrawling

Some pages are essentially archives, and continuously recrawling them would be wasteful. Therefore, we use identify_static_site(url, content) to classify pages. The prompt for this tool looks at URL patterns and page content cues. For example, a URL containing /news/2025-08-29 or a page with a visible "Published on August 29, 2025" line probably belongs to a static news article. When adding a new page to the database, we store a Boolean flag static so that later crawls can skip it.
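
The tutorial delegates this decision to an LLM prompt. Purely as an illustration of the two cues mentioned above, a rule-based check could look like the sketch below. The function looks_static and both patterns are hypothetical; such a check might serve as a cheap pre-filter before the LLM is called, not as a replacement for it.

```python
import re

# Hypothetical cues modelled on the two examples in the text
DATE_IN_URL = re.compile(r"/\d{4}[-/]\d{2}[-/]\d{2}\b")
PUBLISHED_ON = re.compile(r"\bPublished on [A-Z][a-z]+ \d{1,2}, \d{4}\b")


def looks_static(url, text):
    """Return True if the URL or the visible text carries a publication date."""
    return bool(DATE_IN_URL.search(url) or PUBLISHED_ON.search(text))
```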

Storing Pages and Detecting Changes in SAP HANA

All crawled data is saved in a HANA Cloud database with the following structure:

Image 4: Database Tables

Whenever the crawler fetches a page, it computes the new hash and compares it with the stored one. If they differ, the page is updated and the old content is moved to previous_content.
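
The compare-and-update step can be sketched as follows. This is an illustration under stated assumptions, not the project's update_page() or add_new_page(): it uses an in-memory SQLite table instead of HANA, SHA-256 as the hash (the tutorial does not name one), and hypothetical column names apart from previous_content.

```python
import hashlib
import sqlite3


def content_hash(text):
    """Hash the page text; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_page(conn, url, content):
    """Insert a new page or update a changed one.

    Returns 'new', 'updated' or 'unchanged'.
    """
    new_hash = content_hash(content)
    row = conn.execute(
        "SELECT content_hash FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO pages (url, content, content_hash) VALUES (?, ?, ?)",
            (url, content, new_hash),
        )
        return "new"
    if row[0] == new_hash:
        return "unchanged"
    # The right-hand side sees the old row, so previous_content
    # receives the content as it was before this update.
    conn.execute(
        "UPDATE pages SET previous_content = content, content = ?, content_hash = ? "
        "WHERE url = ?",
        (content, new_hash, url),
    )
    return "updated"


# In-memory stand-in for the PAGES table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT, "
    "previous_content TEXT, content_hash TEXT)"
)
```

Comparing hashes rather than full text keeps the check cheap, since only one short string per page has to be read from the database.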

Semantic Embeddings for RAG: Notice that we also added an EMBEDDING column to the PAGES table. This is a special vector column that HANA generates automatically using a pretrained language model. Essentially, as we insert the page content, HANA computes a vector for the text via the function VECTOR_EMBEDDING(CONTENT, 'DOCUMENT', 'SAP_NEB.20240715'), where 'SAP_NEB.20240715' is the specific embedding model used. With this functionality in place, we can later perform semantic similarity search directly in SQL to power the agent's answers.

Finally we create a SUMMARIES table. Each completed crawl will generate a new summary record.

Summarizing Updates with LLM

After the crawling loop finishes, the script calls create_summary(). This function gathers from the database all pages that were newly discovered or updated today.

A detailed prompt then instructs the model to generate a concise intelligence report. It explicitly asks the model to synthesize new insights, group related changes into themes, distinguish new from updated content, and infer a potential strategic intent behind the changes. The model outputs the summary in a structured Markdown format.

This summary gets stored in the database. It is essentially an automated daily brief for your team. Even on its own, this feature is valuable. Employees could read the brief each morning to stay informed. But our next step is to use this summary as an entry point for an interactive Q&A Agent.
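
The actual prompt used by create_summary() is not reproduced in this post. As a rough sketch of how such a prompt might be assembled from the day's changes, consider the following, where build_summary_prompt, the page dictionary keys and the prompt wording are all assumptions made for illustration.

```python
from datetime import date


def build_summary_prompt(pages, today=None):
    """Assemble a briefing prompt from changed pages.

    pages: list of dicts with the hypothetical keys 'url',
    'status' ('new' or 'updated') and 'content'.
    """
    today = today or date.today()
    sections = {"new": [], "updated": []}
    for page in pages:
        # Truncate long pages so the prompt stays within the context window
        excerpt = page["content"][:500]
        sections[page["status"]].append(f"- {page['url']}\n  {excerpt}")
    lines = [
        f"You are a competitive intelligence analyst. Today is {today.isoformat()}.",
        "Write a concise briefing in Markdown. Group related changes into themes,",
        "separate new from updated content, and suggest the likely strategic intent.",
        "",
        "NEW PAGES:",
        *(sections["new"] or ["- none"]),
        "",
        "UPDATED PAGES:",
        *(sections["updated"] or ["- none"]),
    ]
    return "\n".join(lines)
```

Keeping new and updated pages in separate sections of the prompt makes it easier for the model to preserve that distinction in its output.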

Part 2: Building the AI Analyst Agent

With the crawler populating our internal knowledge base, the next component is an interactive AI Analyst Agent that end users can query for insights.

Tools for Knowledge Retrieval and Reasoning

To enable this hybrid knowledge search, we give the agent three main tools:

Internal Data Retrieval rag(): Queries the SAP HANA Cloud database of crawled pages for relevant information. This tool finds pages from the competitor’s site that semantically match the user’s query. For instance, a question about “pricing changes” might retrieve the latest pricing page or a blog post from the competitor’s website that we collected. It uses vector similarity search: it finds the top 5 pages whose content embedding is closest to the query embedding. Under the hood, we run a SQL query that applies COSINE_SIMILARITY to the EMBEDDING column.
Live Web Search search_web(): Performs a real-time web search using a DuckDuckGo API to fetch recent public information. This is crucial for anything not on the competitor’s site, e.g. news articles or industry reports about the competitor. We typically limit results to a handful of relevant URLs to keep things focused.
Page Content Extraction extract_site_content(): Given a URL, this tool fetches that webpage and extracts the text content. The agent uses this to read external pages on the fly. For example, after getting a URL from search_web(), it can call extract_site_content() to grab the details of that page, and even follow on-page links if needed.
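
In the tutorial, the ranking behind rag() runs inside HANA via COSINE_SIMILARITY. To show what that ranking does, here is a plain Python sketch with hypothetical helper names and toy two-dimensional vectors standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, pages, k=5):
    """Return the URLs of the k pages most similar to the query.

    pages: iterable of (url, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, emb), url) for url, emb in pages]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [url for _score, url in scored[:k]]
```

Because cosine similarity depends only on the direction of the vectors, a long page and a short page on the same topic can score equally well against a query.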

Defining the Agent with LangGraph

LangGraph provides a prebuilt ReAct agent framework.

In agent.py we initialize the agent with our custom tools and a detailed system prompt. Here is how we create the agent:

# In agent.py
from gen_ai_hub.proxy.langchain import init_llm
from langgraph.prebuilt.chat_agent_executor import create_react_agent
...
tools = [extract_site_content, rag, search_web]
llm = init_llm('gemini-2.0-flash', max_tokens=1024, temperature=0)
agent = create_react_agent(llm, tools=tools, prompt=(system_message + briefing))

We use init_llm() to load the large language model via the SAP Generative AI Hub. Then create_react_agent() from LangGraph sets up the conversational agent that will use the provided LLM and tools. In this blog you can find a more detailed explanation of how to build an AI agent using LangGraph.

Behind the scenes, LangGraph orchestrates these tools and the agent's reasoning process. The agent follows the ReAct loop: the language model analyzes the question, decides which tools to use, executes them to gather information, and incorporates that information into the next step. The model iterates until it has enough evidence to answer.
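
The loop itself is easy to picture without any framework. The toy sketch below is not how LangGraph is implemented: run_react_loop and scripted_decide are hypothetical, and scripted_decide is a hard-coded stub in place of the LLM's decision. It only shows the alternation between deciding, calling a tool, and observing the result.

```python
def run_react_loop(question, decide, tools, max_steps=5):
    """Alternate between deciding and acting until an answer is produced.

    decide(question, observations) returns either
    ('call', tool_name, argument) or ('answer', text).
    """
    observations = []
    for _ in range(max_steps):
        action = decide(question, observations)
        if action[0] == "answer":
            return action[1], observations
        _kind, tool_name, argument = action
        result = tools[tool_name](argument)
        # Each observation is fed back into the next decision
        observations.append((tool_name, argument, result))
    return "No answer within the step limit.", observations


def scripted_decide(question, observations):
    """Stub standing in for the LLM: query internal data once, then answer."""
    if not observations:
        return ("call", "rag", question)
    return ("answer", f"Based on internal data: {observations[-1][2]}")


# A fake rag tool; the real one would query the HANA database
tools = {"rag": lambda query: f"stored page matching '{query}'"}
answer, trace = run_react_loop("pricing changes", scripted_decide, tools)
```

The step limit matters in practice, since a model that keeps requesting tools would otherwise never return an answer.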

We won’t dive into the implementation code here. For readers interested in how the ReAct reasoning and LangGraph architecture work, see the previously published Human-in-the-Loop AI Agent blog post, which covers the agent loop and design in depth.

Why Combining Structured Crawling with Live Search Is Powerful

By now, the value of our hybrid approach should be evident. The internal crawler ensures that even if the competitor quietly updates their website, you have that information. The AI agent can surface those insights on demand. Meanwhile, the web search ensures you don’t miss external perspectives.

Another benefit is that the agent can answer questions that go beyond the explicit content of one page by synthesizing multiple pieces.

Conclusion and Next Steps

In this tutorial, we built a practical AI agent for competitor analysis that automates website monitoring and provides an interactive Q&A interface for deeper analysis.

By using a crawler to build an internal database and an AI agent to generate insights, organizations can respond faster to competitor moves with well-informed decisions.

With the architecture in place, we have a flexible foundation. You can customize the LLM prompts to better fit your domain, or adjust the crawling depth and frequency. Altogether, this automation means your team spends less time gathering information and more time acting on it, turning competitor analysis into a proactive capability.

Ready to outperform the competition with AI built on SAP BTP at your side?
