Introduction
While it is not very difficult to check out the latest Q&A on SAP Community Network (SCN), what if we could automate this simple task and keep watch for interesting questions in the community? Although the bot could be hosted on AWS or a similar cloud platform, let us keep things simple here and explore such options later in the series.
Target
Our target here is the SAP Questions and Answers forum at "https://answers.sap.com/index.html", which lists recent questions and answers on a wide variety of SAP topics. The URL can be tailored to suit any particular topic of interest.
Fig 1: SAP Question Answer Community Page
Create an Application
We will create a Python application using the simple syntax of Python 3. Until we get it working, you wouldn't believe how easy it is. No big drama about web service protocols (WSDL/REST), confusing syntax such as $ (jQuery), typical model definitions as in MVC architecture, or restrictions such as CORS. The steps are very simple: use the relevant libraries, write a few lines of code (ten or less), filter the results, take what is required, and that is it. Job done. OK, let us get started.
Importing the library
We use Python 3's 'urllib', a standard library package, to retrieve data from our target website.
Define our target, namely the web site URL:-
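The import and the target definition described above can be sketched as follows. This is only a minimal sketch: the name TARGET_URL is our own choice for illustration, and building a Request object is an optional extra that does not contact the site yet.

```python
import urllib.request

# The page we want to read (URL taken from the Target section above).
# TARGET_URL is an illustrative name, not necessarily the one used in the repo.
TARGET_URL = "https://answers.sap.com/index.html"

# Building a Request object does not send anything over the network;
# the actual call happens only when it is passed to urlopen().
request = urllib.request.Request(TARGET_URL)
print(request.full_url)
```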
Test Connection
Let us quickly check whether our request to the website succeeded. If we get an 'OK' (HTTP status code 200), the request succeeded.
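One way to run this check is sketched below. The helper name fetch_status and the 10-second timeout are assumptions made for illustration; only urlopen() and the status attribute come from the standard library.

```python
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code of a GET request to url."""
    # urlopen() raises urllib.error.HTTPError for 4xx/5xx responses,
    # so reaching the return statement means the server answered successfully.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


# Usage (needs network access):
# print(fetch_status("https://answers.sap.com/index.html"))
```

Using the response in a with block closes the connection as soon as the status has been read.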
The output shows 200. Everything is fine, as expected.
Fig 2: Establishing Connection
Display output
Let us proceed to view the data available on the site by simply printing it to the console.
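A minimal sketch of this step could look as follows. The helper name fetch_html is hypothetical, and decoding the page as UTF-8 is an assumption about the site rather than something we have verified.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download url and return the page body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, exactly as sent by the server
    # Assumption: the page is UTF-8 encoded; undecodable bytes are replaced
    # with a placeholder character instead of raising an error.
    return raw.decode("utf-8", errors="replace")


# Usage (needs network access):
# print(fetch_html("https://answers.sap.com/index.html"))
```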
And the result is a stream of HTML data.
Fig 3: HTML Data from our target(formatted for easy verification)
Filtering the Result
With the complete page data in hand, let us focus on our area of interest.
If we inspect the source code of our target, we see a common pattern here.
Fig 4: Inspecting our target
The HTML elements on the page appear as below. All the questions start with an ordered list, followed by a few other repeating tags in the body section.
Fig 5: Start of Questions
To fetch only what is human readable, we would like to pick out just the questions. Multiple libraries are available for this, but we stick to a generic approach using regular expressions (which we were already comfortable with in other languages such as Java and JavaScript). Let us import Python's regular expression library, 're'.
Now, using this library, we fetch the <div> tags in our HTML. Notice that all these HTML sections follow a particular pattern, with a <div> and a CSS class as below:
With the information in hand, we proceed to utilize our regex.
What does this regex do?
It fetches all data inside a <div class="dm-contentListItem__title"> tag. Inside the parentheses, it matches any character except a newline, repeated any number of times. Then comes the closing </div> tag. Finally, we print what we get.
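The filtering step can be sketched roughly as below. The sample markup is a hypothetical stand-in for the real page (only the class name dm-contentListItem__title comes from the article), and the non-greedy (.*?) form is our own choice so that each match stops at the first closing tag.

```python
import re

# Hypothetical stand-in for the downloaded page, one question per line.
sample_html = """<ol>
<li><div class="dm-contentListItem__title"><a href="/questions/1/sample.html">How to debug an ABAP report?</a></div></li>
<li><div class="dm-contentListItem__title"><a href="/questions/2/sample.html">Fiori app does not load</a></div></li>
</ol>"""

# (.*?) captures any characters except a newline, as few as possible,
# up to the next closing </div> tag.
TITLE_DIV = re.compile(r'<div class="dm-contentListItem__title">(.*?)</div>')

blocks = TITLE_DIV.findall(sample_html)
for block in blocks:
    print(block)
```

Because '.' does not match a newline, this pattern only finds blocks whose opening and closing tags sit on the same line. If the real markup spans several lines, passing the re.DOTALL flag would be one option.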
Fig 6: Filtered HTML Content, formatted for readability.
Content Creation
So far we have achieved somewhat human-readable HTML content. Let us quickly alter the output so that it is easy to consume.
We pass the above content to a for loop and apply a regular expression similar to the one above, but this time we search only for the titles.
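A sketch of such a loop is shown below. It assumes that each filtered block wraps the question text in a link (an <a> tag), which may not match the real markup exactly. The blocks list and the LINK_TEXT name are hypothetical.

```python
import html
import re

# Hypothetical output of the previous filtering step.
blocks = [
    '<a href="/questions/1/sample.html">How to debug an ABAP report?</a>',
    '<a href="/questions/2/sample.html">Fiori app &amp; OData error</a>',
]

# Capture the text between the opening <a ...> tag and the closing </a>.
LINK_TEXT = re.compile(r"<a[^>]*>(.*?)</a>")

titles = []
for block in blocks:
    match = LINK_TEXT.search(block)
    if match:  # skip blocks that do not contain a link
        # html.unescape turns entities such as &amp; back into characters.
        titles.append(html.unescape(match.group(1)).strip())

# One question per line.
print("\n".join(titles))
```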
Final Output
And we get all the questions separated by a new line.
Fig 7: Final output, one question per line.
Yes, the questions are different from those we started with, which shows how active SCN is.
Conclusion
While we were not able to completely automate the process, we were able to manually execute it and get results in a human-readable form. We could check out the formatting options as well as hosting options later in the series.
Git Repo:
https://github.com/jakes2255/ScnQuestionRead
Series: Part 2, with local automation, is available here.
Thank you,
Jakes