Technology Blogs by SAP
Ian_Henry
Product and Topic Expert
I had a request to retrieve RSS data using SAP Data Hub and store this in SAP Vora.
There are many ways to achieve this; here's how I did it.

Data Hub Pipeline



  • Docker with Beautiful Soup 4 & Pandas

  • Python Operator using Beautiful Soup 4

  • Vora Avro Ingestor

  • Vora Disk Table



Figure 1: Data Intelligence Pipeline


Python is great for scraping RSS feeds. We can wrap our code in a custom operator and associate it with a suitable Docker image that contains the required libraries.

 

Create a Docker Image


First we need to create a Docker image that contains the required Python libraries, and give it appropriate tags that we will later link to our operator.


Figure 2: Docker Image



# Use an official Python 3.6 image as a parent image
FROM python:3.6.4-slim-stretch

# Data Intelligence requires Tornado
RUN python3 -m pip --no-cache install tornado==5.0.2
RUN python3 -m pip install requests
RUN python3 -m pip install pandas
RUN python3 -m pip install beautifulsoup4
RUN python3 -m pip install lxml

# Add vflow user and vflow group to prevent error
# container has runAsNonRoot and image will run as root
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow

If the docker build fails, you can get more details through the Diagnostic Information.
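Once the build succeeds, it can be useful to confirm that the image really contains the libraries the operator imports. The following is a minimal sketch of such a check, not part of the original pipeline: the helper name `missing_modules` is hypothetical, and the module list is derived from the pip installs in the Dockerfile (note that `beautifulsoup4` is imported as `bs4`). It could be run inside the container or in any Python environment.

```python
import importlib

# Import names derived from the pip installs in the Dockerfile.
# beautifulsoup4 installs the package that is imported as "bs4".
REQUIRED = ["tornado", "requests", "pandas", "bs4", "lxml"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```

An empty result suggests the image is ready to be tagged and linked to the operator; any listed name points at a pip install that did not take effect.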


Figure 3: Download Diagnostic Logs



Custom SAP Data Hub Python Operator


I have tested the operator with various RSS feeds and it appears to be reliable.


Figure 4: Create Custom Python Operator



import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://feeds.bbci.co.uk/news/rss.xml"

resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")

items = soup.find_all('item')

news_items = []

# Collect the title, description, link and publication date of every item
for each_item in items:
    news_item = {}
    news_item['RSS_TITLE'] = each_item.title.text
    news_item['RSS_DESC'] = each_item.description.text
    news_item['RSS_LINK'] = each_item.link.text
    news_item['RSS_DATE'] = each_item.pubDate.text
    news_items.append(news_item)

# Use a Pandas DataFrame to serialise the items as semicolon-separated CSV
df = pd.DataFrame(news_items)
csv_body = df.to_csv(index=False, header=True, sep=";")

# Create the Data Hub message and send it to the operator's output port
attr = dict()
attr["message.commit.token"] = "stop-token"
messageout = api.Message(body=csv_body, attributes=attr)
api.send("outmsg", messageout)

If we connect this to the WireTap component we can quickly see that data is being retrieved and structured as required.
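The same structure can also be checked locally, without the pipeline or network access. The sketch below is an assumption-laden stand-in rather than the operator itself: it swaps Beautiful Soup and Pandas for the standard library (`xml.etree.ElementTree` and `csv`) so it runs anywhere, and `SAMPLE_RSS`, `parse_items` and `to_csv` are hypothetical names introduced for illustration. Unlike the operator code, it tolerates items that lack one of the four elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed standing in for a live RSS response.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <description>Short text; with a semicolon</description>
      <link>http://example.com/1</link>
      <pubDate>Mon, 06 Jan 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
      <pubDate>Mon, 06 Jan 2020 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# Output column name -> RSS element name, matching the operator's columns.
FIELDS = {
    "RSS_TITLE": "title",
    "RSS_DESC": "description",
    "RSS_LINK": "link",
    "RSS_DATE": "pubDate",
}


def parse_items(xml_text):
    """Return one dict per <item>, with an empty string for absent elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        row = {}
        for column, tag in FIELDS.items():
            row[column] = (item.findtext(tag, default="") or "").strip()
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise rows as semicolon-separated CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(FIELDS), delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_items(SAMPLE_RSS)
csv_text = to_csv(rows)
print(csv_text)
```

The first sample item deliberately contains a semicolon in its description: the CSV writer quotes that field, which is worth keeping in mind when the separator is also a character that feed text can contain.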


Figure 5: WireTap Output



Vora Avro Ingestor


Using the Vora Avro Ingestor is a great way to receive structured information into Vora.
I needed to use fixed-length fields in the schema below; this has the advantage of working with HANA Smart Data Access (SDA).
{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {
      "name": "RSS_TITLE",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DESC",
      "type": "fixed",
      "size": 2500
    },
    {
      "name": "RSS_LINK",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DATE",
      "type": "fixed",
      "size": 16
    }
  ]
}
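Fixed-length fields mean every value has to fit a set number of bytes. This post does not say whether the ingestor pads or truncates incoming values itself, so the following is only an illustration of what fitting a row into the widths of the schema above involves. The helpers `fit_fixed` and `short_date` are hypothetical, and the 16-character date format is an assumption chosen to match the size of RSS_DATE (a raw RSS pubDate is longer than 16 characters).

```python
from email.utils import parsedate_to_datetime

# Byte widths copied from the Avro schema above.
FIELD_SIZES = {"RSS_TITLE": 128, "RSS_DESC": 2500, "RSS_LINK": 128, "RSS_DATE": 16}


def fit_fixed(value, size):
    """Return value as exactly `size` UTF-8 bytes, truncated or space-padded."""
    raw = value.encode("utf-8")[:size]
    # Truncation may cut a multi-byte character in half; drop the partial tail.
    raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(size, b" ")


def short_date(pub_date):
    """Convert an RFC 822 pubDate into a 16-character 'YYYY-MM-DD HH:MM' string."""
    return parsedate_to_datetime(pub_date).strftime("%Y-%m-%d %H:%M")


# Hypothetical row in the shape produced by the operator.
row = {
    "RSS_TITLE": "First headline",
    "RSS_DESC": "Short text",
    "RSS_LINK": "http://example.com/1",
    "RSS_DATE": short_date("Mon, 06 Jan 2020 10:00:00 GMT"),
}

fixed_row = {name: fit_fixed(row[name], size) for name, size in FIELD_SIZES.items()}
print({name: len(value) for name, value in fixed_row.items()})
```

Working on encoded bytes rather than characters matters here because headlines with accented or non-Latin characters occupy more bytes than their character count suggests.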

For completeness I have captured the properties of the Vora Avro Ingestor, and highlighted the fields that I changed.


Figure 6: Vora Avro Ingestor Configuration


Executing this pipeline will now retrieve the RSS data and automatically create the table within SAP Vora. We can easily verify that the table has been created with the SAP Vora Tools or the Metadata Explorer.


Figure 7: Metadata Explorer Fact Sheet of RSS_FEED table


The Data Preview shows us what is now stored in the SAP Vora disk engine.


Figure 8: Metadata Data Preview