Introduction
There are basically two approaches to text analysis:
- a brute-force, black-box, single-step approach, or
- a step-by-step approach
The brute-force approach takes a text as a whole and classifies it using 'Deep Learning' techniques. The most familiar classifications are the
sentiment and
subjectivity polarity. Depending on the type of corpus the results can be quite good, even for short texts such as tweets or movie and product reviews. But because language by its very nature conveys all kinds of information in numerous ways, and because you usually need an extensive amount of training data, this approach is of limited practicality.
The second, step-by-step approach can also use models trained with
deep learning techniques, but in a more controlled way and combined with other techniques in a kind of process pipeline. For example, for the grammatical analysis a deep-learning-trained model can be of great use to create word bags that can subsequently be evaluated to find similar topics in texts by applying clustering algorithms.
For a proof of concept we have done both: a simple sentiment scoring of the texts and a word-indexing pipeline. The word index can then be used for
further research, such as
1. the trend of positively annotated brands in news or forums, or
2. word groups with a high correlation probability
As a text corpus we are using online media articles that we scrape on a daily basis. For a start we selected two French (Le Figaro, Le Monde), two Spanish (El Mundo, El Pais) and three German newspapers (Der Spiegel, FAZ, Süddeutsche). For more details on how we scraped the online media websites, have a look at the blog
Web-site Scraping with SAP Data Intelligence.
You find all the code (custom operators, test scripts and pipelines), the Dockerfile and the README.md file in the public GitHub repository
SAP-samples/data-intelligence-text-analysis.
The project was accomplished as a joint effort together with
Lijin Lan, Eduardo Vellasques and
Cornelius Schaefer. Without their enthusiasm, creativity and expertise we could not have accomplished this project successfully.
Sentiment Analysis
As a first step we used the sentiment algorithm of the NLP package
textblob for a quick win. Without further text cleansing we used the
textblob algorithms for French, German and English to get the polarity (-1: negative to 1: positive) and the subjectivity (0: neutral to 1: subjective). The underlying algorithm is a model trained on movie reviews. Taking this caveat into account, we learnt that the results are at least a good indicator.
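Just to illustrate what textblob returns (a minimal standalone snippet, not part of the pipeline):

from textblob import TextBlob

# polarity in [-1, 1], subjectivity in [0, 1]
blob = TextBlob("The new phone is surprisingly good, I really like it.")
print(blob.sentiment.polarity, blob.sentiment.subjectivity)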
Docker Image
For the Dockerfile, just a few additional lines need to be added to any Python-supporting Docker image that also has the
pandas package installed.
# nltk
RUN python3 -m pip --no-cache-dir install nltk --user
# textblob with language supporting packages
RUN python3 -m pip --no-cache-dir install textblob --user
RUN python3 -m pip --no-cache-dir install textblob-de --user
RUN python3 -m pip --no-cache-dir install textblob-fr --user
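Depending on the base image, the nltk/textblob corpora may also have to be downloaded; an additional line like the following could be needed (an assumption, not required in every setup):

# download the textblob/nltk corpora (may not be needed in every setup)
RUN python3 -m textblob.download_corpora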
Operator text_sentiment
All sentiment processing is encapsulated in one operator
text_sentiment with one inport and two outports (1. data, 2. logging). The datatype of the data is a pandas DataFrame with at least 3 columns:
- text_id - text index
- text - plain text
- language - containing the language code ISO 639-1 in capital letters
Currently only the three languages English, French and German are supported, but this can easily be extended to other languages if textblob supports them. There is one function in the operator that performs the corresponding textblob call for the supported language and is applied to each row of the DataFrame.
from textblob import TextBlob, Blobber
from textblob_de import TextBlobDE
from textblob_fr import PatternTagger as PatternTaggerFR, PatternAnalyzer as PatternAnalyzerFR

def get_sentiment(text, language):
    # returns [polarity, subjectivity] for the given ISO 639-1 language code
    if language == 'DE':
        blob = TextBlobDE(text)
        return [blob.sentiment.polarity, blob.sentiment.subjectivity]
    elif language == 'FR':
        tb = Blobber(pos_tagger=PatternTaggerFR(), analyzer=PatternAnalyzerFR())
        blob = tb(text)
        return blob.sentiment
    else:
        blob = TextBlob(text)
        return [blob.sentiment.polarity, blob.sentiment.subjectivity]
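Applied row-wise to the incoming DataFrame, this could look roughly as follows (a sketch; the actual operator code may differ in detail):

import pandas as pd

# df is the incoming DataFrame with at least the columns 'text_id', 'text' and 'language'
df[['polarity', 'subjectivity']] = df.apply(
    lambda row: pd.Series(list(get_sentiment(row['text'], row['language']))), axis=1)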
Sentiment Pipeline
The pipeline in our case reads the stored news articles from an object store (JSON format) and, after some processing, saves the results to a HANA database. You see two format transformation operators (JSON_df, df_csv) and a conditional termination gate (gate) that can be downloaded from my GitHub repository
sdi_utils, where I collect all my developed utilities. Most probably these will become obsolete in coming releases, in particular when the vtypes are available.
The
text_preparation operator does the necessary setup of the data before it is sent to
text_sentiment, like renaming columns, removing HTML tags from the text, setting the language, etc. This intermediate step became necessary once this pipeline was used with differently formatted input, and it keeps the core operator unchanged.
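The kind of preparation done there could look roughly like this (a minimal sketch with assumed source column names; the actual operator is configurable):

import pandas as pd

def prepare(df):
    # rename source-specific columns to the 'standard' format (assumed names)
    df = df.rename(columns={'id': 'text_id', 'body': 'text'})
    # remove HTML tags from the text
    df['text'] = df['text'].str.replace(r'<[^>]+>', ' ', regex=True)
    # set the language code (ISO 639-1 in capital letters)
    df['language'] = 'DE'
    return df[['text_id', 'language', 'text']]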
Words from Text
During the project we learnt that it makes sense to split the task into a two-stage process:
- tokenising the text and tagging the words by grammatical position and entity type,
- indexing the words by selecting types, removing blurring words and adding mappings to semantically similar words.
This two-stage approach is reflected by two pipelines.
Tokenising the Text
For splitting the text into grammatical entities the open-source framework
spaCy is used. In addition, entity detection (person, organisation, location) can be applied, although only the entity type person (PER) delivered satisfying results. This is the more time-consuming step of the two-stage process.
There are basically two operators supporting this stage:
- Pre-processing (operator: doc_prepare, formerly text_prepare) - formatting the text, removing tags and common typos and converting it into our 'standard' format: 'text_id', 'language' and 'text' with data type DataFrame. This operator is also used for the text sentiment analysis pipeline.
- Text split into words (operator: text_words) - tokenising the text and creating word bags of different types
It is tempting, but there should not be too much pre-selection at this point, because this would later be sorely regretted. All operators are "Custom Python" operators that can easily be edited. Although the pre-processing operator
doc_prepare has been designed to be as generally usable as possible, it is prone to adjustments. Currently two kinds of documents have been used as a starting point for the design: news articles in a HANA database that still have HTML tags included, and plain text from online media provided as JSON documents.
The output of the pipeline is stored to the
database table word_text with the following structure:
- text_id - reference to text,
- language - containing the language code ISO 639-1 in capital letters
- type - type of word (proper noun, noun, verb, location, person, ...)
- word - the lemmatised word (base form)
- count - number of occurrences of the word in the text
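A rough sketch of how such word bags can be built with spaCy (assuming a German model and simplified compared to the actual text_words operator):

import spacy
from collections import Counter

nlp = spacy.load('de_core_news_sm')  # assumed model, in practice one model per language

def text_to_words(text_id, language, text):
    doc = nlp(text)
    counter = Counter()
    # count lemmata of content words by part-of-speech type
    for token in doc:
        if token.pos_ in ('NOUN', 'PROPN', 'VERB') and token.is_alpha:
            counter[(token.pos_, token.lemma_)] += 1
    # count detected entities, e.g. persons (PER), organisations (ORG), locations (LOC)
    for ent in doc.ents:
        counter[(ent.label_, ent.text)] += 1
    return [{'text_id': text_id, 'language': language, 'type': word_type,
             'word': word, 'count': count}
            for (word_type, word), count in counter.items()]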
The operator
text_words has the following ports:
- inports
- 'doc' for the DataFrame with the texts
- 'sentimentlist' (optional) for an alternative sentiment analysis based on counting words annotated with a sentiment score. The basic idea is published at PNAS.org (a simplified sketch follows the ports list).
- outports
- 'log' - for on-the-fly logging information
- 'sentiments' for getting the result of the word-sentiment analysis
- 'data' - DataFrame with the same structure as the final table
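The word-list-based scoring behind the 'sentimentlist' inport can be illustrated like this (a simplified sketch of the counting idea, not the actual operator code):

def wordlist_sentiment(words, score_map):
    # words: list of (word, count) tuples from one text
    # score_map: dict mapping annotated words to a sentiment score
    total, n = 0.0, 0
    for word, count in words:
        if word in score_map:
            total += score_map[word] * count
            n += count
    return total / n if n else None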
Indexing Words
The second stage can either be done with SQL statements or, for more elaborate processes, with Python scripts. This provides the flexibility to adjust the text analysis to the desired outcome. There are four operators supporting this step:
- sql_word_index - Selects words from the base word table that has been created in the previous stage. A limit can be passed for each word type to only select words that appear more frequently than the given limit. This constraint eliminates a lot of words that passed the pre-selection although they contain numbers or special characters.
- word_regex - Removes words with certain patterns or replaces patterns. For both configuration parameters, "removing" and "replacing", a list of regular expressions can be passed. There is an outport 'removed' that exports the changes made by the regular expressions in order to verify that they work as intended (a sketch of this kind of cleansing follows the list).
- word_blacklist - Removes words that are on a 'blacklist' because they are very common and distort the outcome, e.g. the country, the publishing location or the name of the medium.
- word_lexicon - Maps words according to a lexicon file to predefined keywords, synonyms, etc. Example: 'corona', 'corona-virus' and 'corona-pandemic' can all be mapped to 'corona'.
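A rough sketch of the kind of cleansing and mapping these operators perform on the word DataFrame (simplified, with hypothetical patterns and mappings):

import pandas as pd

def cleanse_words(df, remove_patterns, blacklist, lexicon):
    # df has at least a 'word' column, as produced by the previous stage
    df = df.copy()
    # remove words matching any of the regular expressions, e.g. words containing digits
    for pattern in remove_patterns:
        df = df[~df['word'].str.contains(pattern, regex=True)]
    # remove words that are on the blacklist
    df = df[~df['word'].isin(blacklist)]
    # map semantically similar words to a common keyword
    df['word'] = df['word'].replace(lexicon)
    return df

# hypothetical example: cleanse_words(df, [r'\d'], ['germany'], {'corona-virus': 'corona'})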
A pipeline using all operators would look like the following:
Example: News Media Cockpit
Nothing is better for grasping the opportunities of data science than visualising the data with appealing charts. Many thanks to Cornelius, who has put a lot of creativity and skill into developing a News Media Cockpit with
SAP Analytics Cloud. In the following you see some screenshots:
Co-occurrence
"Corona" with Words of all Entity Types
with a colour coding of the connected sentiments
"Angela Merkel' with Entity Type: Person
Word Occurrence over Time
Example: German Parties "CDU" and "SPD" and Persons: "Angela Merkel" and "Donald Trump"
Entity Person Frequency of a given Time Period and Correlated Sentiment
Here you learn that the sentiment correlation has to be interpreted carefully. A positive or negative score is not linked to the person but to the articles in which the person appears, e.g. the negative score of George Floyd is due to his tragedy. Another finding is that persons in sports score positively above average, presumably because sports articles in general are written in a style that is classified as positive.
Cockpit in Action
- Selecting a "Word" in the Bubble-Chart (Sentiments-Number Articles)
- On left the charts of "Number of Articles"-chart and the "Sentiments"-chart over time are update
- Choose a time-snapshot on the left and the "Bubble" displays the detail of all "Words"
Conclusion
You have seen how easily you can leverage open-source solutions and publicly available ML models with SAP Data Intelligence to analyse texts and create new insights. This does not need to be a one-time academic project but can run productively and highly automated.