Scraping RSS Feeds with SAP Data Hub

Ian_Henry · ‎11-09-2018

I had a request to retrieve RSS data using SAP Data Hub and store this in SAP Vora.
There are many ways to achieve this; here's how I did it.

Data Hub Pipeline

Docker with Beautiful Soup 4 & Pandas

Python Operator using Beautiful Soup 4

Vora Avro Ingestor

Vora Disk Table

Figure 1: Data Intelligence Pipeline

Python is great for scraping RSS feeds. We can wrap our code in a custom operator and then associate it with a suitable Docker image that contains the required libraries.
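To make "scraping an RSS feed" concrete, here is a minimal sketch that uses only the Python standard library rather than Beautiful Soup. The sample feed is invented for illustration, and `parse_items` and `FIELDS` are hypothetical names; the column names match the ones the operator uses later in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <description>Something happened.</description>
      <link>http://example.com/1</link>
      <pubDate>Fri, 09 Nov 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

# Output column name -> RSS child element of <item>
FIELDS = {
    "RSS_TITLE": "title",
    "RSS_DESC": "description",
    "RSS_LINK": "link",
    "RSS_DATE": "pubDate",
}


def parse_items(xml_text):
    """Return one dict per <item>; missing elements become empty strings."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        row = {}
        for column, tag in FIELDS.items():
            row[column] = item.findtext(tag, default="").strip()
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_items(SAMPLE_RSS):
        print(row)
```

Not every feed supplies every element, so the sketch falls back to an empty string instead of failing when, say, a description is missing.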

Create a Docker Image

First we need to create a Docker image that contains the required Python libraries and give it appropriate tags, which we will later link to our operator.

Figure 2: Docker Image

# Use an official Python 3.6 image as a parent image
FROM python:3.6.4-slim-stretch

# Data Intelligence requires Tornado
RUN python3 -m pip --no-cache install tornado==5.0.2
RUN python3 -m pip install requests
RUN python3 -m pip install pandas
RUN python3 -m pip install beautifulsoup4
RUN python3 -m pip install lxml

# Add vflow user and vflow group to prevent error
# container has runAsNonRoot and image will run as root
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow

If the docker build fails, you can get more details through the Diagnostic Information.

Figure 3: Download Diagnostic Logs
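Once the image builds, it can be worth confirming that the libraries are actually importable before attaching the image to an operator. The following is only a sketch: `missing_modules` is a hypothetical helper, and the module list is my reading of the Dockerfile above (note that the `beautifulsoup4` package is imported as `bs4`). It would need to run with the image's own Python interpreter.

```python
import importlib.util

# Import names for the packages installed in the Dockerfile (assumed mapping).
REQUIRED_MODULES = ["tornado", "requests", "pandas", "bs4", "lxml"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```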

Custom SAP Data Hub Python Operator

I have tested the operator with various RSS feeds and it appears to be reliable.

Figure 4: Create Custom Python Operator

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://feeds.bbci.co.uk/news/rss.xml"

# Download the feed and parse it as XML (the "xml" parser relies on lxml)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")

# Each <item> element is one news entry
items = soup.find_all('item')

news_items = []
for each_item in items:
    news_item = {}
    news_item['RSS_TITLE'] = each_item.title.text
    news_item['RSS_DESC'] = each_item.description.text
    news_item['RSS_LINK'] = each_item.link.text
    news_item['RSS_DATE'] = each_item.pubDate.text
    news_items.append(news_item)

# Use a Pandas DataFrame to serialise the items as semicolon-separated CSV
df = pd.DataFrame(news_items)
csv_body = df.to_csv(index=False, header=True, sep=";")

# Create Data Hub Message (`api` is provided by the Python operator runtime)
attr = dict()
attr["message.commit.token"] = "stop-token"
messageout = api.Message(body=csv_body, attributes=attr)
api.send("outmsg", messageout)

If we connect this to the WireTap component we can quickly see that data is being retrieved and structured as required.

Figure 5: WireTap Output
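The same check can be done outside the pipeline. This is a minimal sketch, assuming the message body is the semicolon-separated CSV text produced by the operator. `check_csv_body` and the sample row are hypothetical and use only the standard library.

```python
import csv
import io

EXPECTED_COLUMNS = {"RSS_TITLE", "RSS_DESC", "RSS_LINK", "RSS_DATE"}


def check_csv_body(body, delimiter=";"):
    """Parse the CSV text and fail if an expected column is missing."""
    reader = csv.DictReader(io.StringIO(body), delimiter=delimiter)
    rows = list(reader)
    header = set(reader.fieldnames or [])
    missing = EXPECTED_COLUMNS - header
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return rows


# Hypothetical sample body; the description contains the delimiter,
# so it is quoted, as a CSV writer would normally do.
SAMPLE_BODY = (
    "RSS_DATE;RSS_DESC;RSS_LINK;RSS_TITLE\n"
    '"Fri, 09 Nov 2018 10:00:00 GMT";"Rain; then sun";'
    "http://example.com/1;First story\n"
)

if __name__ == "__main__":
    for row in check_csv_body(SAMPLE_BODY):
        print(row["RSS_TITLE"], "|", row["RSS_DESC"])
```

Reading by column name rather than by position means the check does not depend on the order in which the DataFrame happens to write its columns.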

Vora Avro Ingestor

Using the Vora Avro Ingestor is a great way to receive structured information into Vora.
I needed to use fixed-length fields in the schema below, which has the advantage of working with HANA Smart Data Access (SDA).

{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {
      "name": "RSS_TITLE",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DESC",
      "type": "fixed",
      "size": 2500
    },
    {
      "name": "RSS_LINK",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DATE",
      "type": "fixed",
      "size": 16
    }
  ]
}
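The fixed sizes in the schema above put a hard limit on each value. I have not verified how the Vora Avro Ingestor treats values that are longer or shorter than the declared size, so the following is only a sketch of what fitting a value to a fixed width could look like. `fit_fixed` is a hypothetical helper, and UTF-8 encoding with space padding are my assumptions.

```python
# Sizes copied from the Avro schema above (in bytes).
FIXED_SIZES = {
    "RSS_TITLE": 128,
    "RSS_DESC": 2500,
    "RSS_LINK": 128,
    "RSS_DATE": 16,
}


def fit_fixed(value, size, encoding="utf-8", pad=b" "):
    """Truncate or pad a string so its encoded form is exactly `size` bytes."""
    raw = value.encode(encoding)[:size]
    # Drop a trailing partial multi-byte character left by the byte slice.
    raw = raw.decode(encoding, errors="ignore").encode(encoding)
    return raw.ljust(size, pad)


if __name__ == "__main__":
    pub_date = "Fri, 09 Nov 2018 10:00:00 GMT"
    print(fit_fixed(pub_date, FIXED_SIZES["RSS_DATE"]))
```

Under these assumptions, a 16-byte RSS_DATE keeps only the date part of a typical pubDate value and drops the time and time zone. Feeds with long links or titles would also lose characters beyond 128 bytes.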

For completeness I have captured the properties of the Vora Avro Ingestor, and highlighted the fields that I changed.

Figure 6: Vora Avro Ingestor Configuration

Executing this pipeline will now retrieve the RSS data and automatically create the table within SAP Vora. We can easily verify that the table has been created with the SAP Vora Tools or the Metadata Explorer.

Figure 7: Metadata Explorer Fact Sheet of RSS_FEED table

The Data Preview shows us what is now stored in the SAP Vora disk engine.

Figure 8: Metadata Data Preview