Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
shivamshukla12
Contributor
Hello Everyone,

Hope you are doing well and safe too 🙂

Since long i am seeing various post on COVID-19 data analysis and forecasting and having the same thought i decided to post my recent work mainly on text mining & text analysis on COVID-19 tweets & data.

IDEA is to connect twitter from python and insert data into SAP HANA and utilize the built-in powerful capabilities in SAP HANA for text mining & text analysis & displaying word cloud and using few text mining functions as well.

we have seen people are expressing their views and thoughts onto twitter with different sentiments and you know we have lot of unstructured data & to dig into that is really a very difficult job hence i pick this area to try my hands on and thought to showcase few insights of it hope you will like it and please feel free to share your suggestion and feedback.

Technical flow for twitter and HANA communication - we have now python which is going to make our job easy and provides a smooth integration with SAP HANA + so many built-in python functions used for mining and analysis.

HANA ML Python APIs - For more information about HANA Machine learning API please refer SAP HANA document.


Pre-requisites: HANA ML Installation environment + SAP HANA DB (UP and Running)

HANA ML Installation - https://developers.sap.com/tutorials/hxe-ua-install-python-ml-api.html

Let's go step by step

  • First of all we need one twitter developer account this is pretty easy and straightforward go to twitter for developer url and setup one app for yourself and save all the tokens mentioned there somewhere in notepad

  • After successful creation of twitter account you will see below & please have them handy & save them somewhere in notepad


 

Twitter developer tokens




  • Go to your conda/python command prompt and install the twitter API (there are few APIs for python install based on your comfort or like)

  •  execute the below command
    pip install tweepy

    ## if this is successful execute the below command to check

    import tweepy as twitter



  • Initialize all the tokens , give them the value we noted from twitter developer account this is going to be used for connecting with twitter
    consumer_key = 'e6QvccFkjlkdsjlksadjlkjdsa'
    consumer_secret = 'aslkdjsalkdjsalkdjlsakd'
    token='1384909104-2alakdlsakdlsakdlakdslkdsalkdsaladkads'
    token_secret='rT0ZEUdasldkasdkjsadlkalkdjlkdajlkdsa'

     

  • Start with twitter authentication if connection is successful test for downloading few tweets based on the Tags


  • # Authenticate to Twitter passing key and secret tokens
    auth = twitter.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(token, token_secret)

    # Create API object - Wait on rate limit is true so that we can download the tweets
    #without any connection break
    api = twitter.API(auth, wait_on_rate_limit=True)

    # Create a tweet - let' see this
    api.update_status("Hi from HANA ML API to Twitter - Analysing tweets")


  • Tweeting using python -  last command will post a tweet using my account



 



 

 

 

  • The connectivity between python and twitter is fine now - let's try to download few tweets and see how it's coming at client and then cleansing and some formatting will be required before sending them to SAP HANA DB


 

  • Try fetching first 10 tweets based on Hash tag (COVID19) & lets what we get :
    hashtag = '#COVID19'
    date = '2020-06-30'
    tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date, ).items(10)
    for tweet in tweets:
    for i in range(len(tweet.entities.get('hashtags'))):
    print(tweet.entities.get('hashtags')[i]['text'])



 

  • Here we are passing date as well so it's going to give us 10 items since 30th June & lets have a look at output.It's up to you what parameter you choose & you are free to change Hash Tag .


 

  • If you see in below image mainly people are talking about mask /  air travel etc. we will see this in text analysis in detail this is just to check if we can download the tweets or not.


  • Now execute the same command for 5K items


  • tweets = twitter.Cursor(api.search, q=hashtag,lang='en',since=date).items(5000)
    tweet_data = [[tweet.geo,tweet.text,tweet.user.screen_name,tweet.user.location,tweet.created_at,tweet.entities['hashtags']]for tweet in tweets]​


  • This is going to download 5000 tweets for you and surely we can do some text analysis with this data - you can go for 50K items as well it's gonna take sometime & save those 50K times somewhere in the excel-sheet , and off-course you can try more function for text mining.


 

Columns 

  • GEO (Geo Location)

  • Tweet Text

  • User ID

  • Location

  • Date

  • tags


 

Create a dataframe for storing the tweets 
import pandas as pd
tweet_df = pd.DataFrame(data=tweet_data,columns=['geo','text','user_name','loc','date','tags'])
pd.set_option('max_colwidth',800)

 

Define a function to fetch Hash Tags 
def hash_tag(tdata):
tags = ''
for i in range(len(tdata)):
tags = tags + str(tdata[i]['text']) + ','
tags = tags.strip(',')
return tags


  • Tweets Processing only Hash Tag column


#process tweets , to have clean data for hastags

for i in range(len(tweet_df['tags'])):
tweet_df['tags'][i] = hash_tag(tweet_df['tags'][i])


 

Let's check how many records are tweet_df is having ?  its (5000 X 6)



 

Regex function for cleaning the text

Let's build now some regex to clean the text before any text mining and also before inserting the data into SAP HANA
import re

re.sub("RT @[\w]*:", "" , tweet_df.text[0])

Output
' BREAKING NEWS Kimberly Guilfoyle Trump Jr s girlfriend has tested positive for COVID19'

# RT and @ user name replaced with space , lets clean more

##Complete function which is going to remove @ , RT , HTTPS , New lines , # Tags , Idea is to clean the ##text.
def clean_data(tweet):
tweet = re.sub("RT @[\w]*:", "", tweet)
tweet = re.sub("@[\w]*", "", tweet)
tweet = re.sub("https://[A-Za-z0-9./]", "", tweet)
tweet = re.sub("\n", "", tweet)
tweet = re.sub("&amp", "", tweet)
tweet = re.sub("#", "", tweet)
tweet = re.sub(r"[^\w]", ' ', tweet )
return tweet

 

Apply the regex function and clean the text
tweet_df['text'] = tweet_df['text'].apply(lambda x: clean_data(x))

Print the text

tweet_df['text']

Output

if you look at the text i see most of things are clean but still scope is always there for improvement , i see here if i can remove the "HEY" but still HANA is going to remove it using the stop-words


 

Save DataFrame

Save this dataframe in your local file so that you have it for later use -
tweet_df.to_csv('test_new.csv')

Establish the connection to SAP HANA System 
from hana_ml import dataframe
from data_load_utils import DataSets, Settings
url, port, user, pwd = Settings.load_config(r"C:\Users\ABC\config\abc.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

##if connection is successful it will print

True

 

Go to your HANA System now - 

  • Choose the schema where you want to create your table for storing tweets go to sql console
    CREATE COLUMN TABLE TWEETS_DATA(
    "ID" INTEGER NOT NULL,
    "USER_NAME" NVARCHAR(100),
    "CREATED_AT" DATE,
    "TEXT" NVARCHAR (140),
    "HASH_TAGS" NVARCHAR (100),
    "GEO" NVARCHAR(100),
    "LOCATION" NVARCHAR(100),
    PRIMARY KEY("ID")
    );


  • Note Once this is done create the full text index and text mining as well for this table


  • CREATE FULLTEXT INDEX index_table_name ON table_name (column_name)
    ASYNC
    FAST PREPROCESS OFF
    SEARCH ONLY OFF
    TEXT ANALYSIS ON
    TEXT MINING ON
    TEXT MINING CONFIGURATION MYTMCONFIGURATION.textminingconfig
    LANGUAGE DETECTION ('EN')
    CONFIGURATION 'LINGANALYSIS_FULL'



 

Come back to Jupyter notebook (I am using jupyter for all the python work i have posted above)

  • Create an empty list example data = [ ]

  • Append all the tweets into this list and insert this data into tweet table


data = []
for indx , row in tweet_df.iterrows():
data.append([indx , row['geo'], row['text'], row['user_name'], row['loc'], row['date'], row['tags']])

sql = ('insert into ABC.TWEETS("ID", "GEO", "TEXT", "USER_NAME", "LOC", "DATE", "HASH_TAGS")' 'VALUES (?, ?, ?, ?, ?, ?, ?)')
with connection_context.connection.cursor() as cur:
rows_inserted = cur.executemany(sql, data)
print(rows_inserted)

 

Note - you can also use csv upload approach instead of it up to you which approach to follow

 

Text Mining /  Text Analysis 

Let's see word cloud first from HANA -


 

Let's try the same thing using Python which will be talking to SAP HANA using Dataframe.

 
# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
% matplotlib inline

 
## Prepare SQL for Tokens 

sql_new = """SELECT STRING_AGG( "TA_TOKEN", ',' ) AS TOKENS FROM "ABC"."$TA_TWEET_DATA" """

 
##Collect will fetch the data
df_new_sql = connection_context.sql(sql_new).collect()

 

Generate word cloud using python

Set the properties for word cloud before display
wordcloud = WordCloud(width = 800, height = 800, 
background_color ='white',
min_font_size = 12).generate(df_new_sql.TOKENS[0])

# plot the WordCloud image
plt.figure(figsize = (12, 12), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 5)

plt.show()

 

Note :  -  you may see some bad data as well (it's not 100 % cleaned) but yes you will see the important text mining output as well

 

Execute the above python code to generate word cloud 


 

  • if you see the word cloud  - data coming directly from SAP HANA  - you see the covid 19 tag off-course and addition to this you see words like "severe" , "respiratory" , "sick" , "neurological",  " cases" , "mask" etc.

  • Important words - banned ,  plasma , President


So using this way you can insert more information and for sure it's going to give you more relevant data for displaying into word cloud. 

 

Fetching terms and checking importance using SAP HANA (text mining) - 

 

Function Used -  GET_SUGGESTED_TERM (See SAP HANA SQL for more info)

  • Let's see how text mining capability of SAP HANA is going to score the probability of word "CORONA"


input_tokens

['CORONA']


  • Call the function


##Create DataFrame to Store Result
df_terms = pd.DataFrame()

## Call TM function for to fetch score
for token in input_tokens:
df_new = GET_SUGGESTED_TERMS(token,"TEXT")
df_terms = df_terms.append(df_new)

 

Output


See the score for first record is 0.75 means 75% probability that this is getting used in text so ranking of this term is very high

Let's check for different token -
input_tokens = ['trump']

Output
df_terms

Ranking for this word is really High 97 %


 

We can also do the analysis for sentence as well and you will see the different ranking and probability of each word in sentence.

 
input_text = "corona is spreading in india during lockdown as well"

 

  • India is also ranked very high with 98% so you can say recent tweets are more related to india and you know people are also talking about lock-down its 45% around



 

Note - We have taken only 5000 tweets for analysis i am sure this can be better if we take more data , may be in my next post i will be uploading dataset from kaggle and will be playing more with TA & TM functions.

 

i will be sharing the notebook on my GitHub

 

Thanks.

 

 
1 Comment