Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
Ian_Henry
Product and Topic Expert
Product and Topic Expert
4,650
Disclaimer:  We have not verified how fake the "fake news" really is, or how the true the "real news" really is.  This process is repeatable with ANY data.


Having seen a fake news dataset on Kaggle, we wanted to see if we could differentiate between "fake news" and "real news".  Four of the SAP HANA capabilities enabled us to attempt this challenge.  Text Analytics, Predictive (External Machine Learning), Data Modeling and Data Virtualization.

Main steps covered here

  1. Data Acquisition & Data Preparation

  2. Model Creation

  3. Model Deployment

  4. Model Consumption

  5. Validation

  6. Lessons Learned


Data Acquisition & Data Preparation


For this challenge we combined 2 datasets

  1. Real News from Twitter (using reputable news agencies) - 26,425 tweets

  2. Fake News from Kaggle - 12,994 articles


We used the data virtualization capability, HANA Smart Data Integration (SDI) Twitter Adapter to acquire some "real news" tweets from Twitter. We used 9 Twitter news feeds - BBC, CNN, Bloomberg, Reuters, ABC News, Wall Street Journal, New York Times, The Guardian and Sky News. The Twitter API limits queries to 3200 tweets, so we captured latest 3200 Tweets from each.  if needed we could parameterize these queries to return more historical Tweets.
-- GetUserTimeline Function already built as per Twitter Adapter setup
CREATE COLUMN TABLE NEWS_REAL AS
(select ID, SCREENNAME, TWEET, CREATEDAT from GetUserTimeline('BBCNews',3500,null,null)
where RETWEET = 0);

The Twitter feeds appeared to be pretty good for the chosen news outlets, so minimal data cleanup was needed for the "real news".  We did remove all RTs (re-tweets).



The "fake news" dataset is freely available from Kaggle, with some example pieces of code.
We found multiple different issues with the fake news dataset.  This is not unusual, some time was spent correcting and filtering the issues.  The issues found include datatype mismatches, columns offset and incorrect UTF-8 character encoding.  Python was then used to load the clean fake news.
# Python 3 Code
df = pd.read_csv('../data/Fake.News/Fake/fake.csv',
usecols =['uuid','author','title','text','language','site_url'])
# Filter out the records with UTF-8 issues
df = df[df.uuid !='97b1c7d89d0c2856afb7d729ad79038ea88b6943']
df = df[df.uuid !='e8077b77ccacf1493c94c9f85d9d47d949b340cb']
df = df[df.uuid !='effbcf6de55f9d987e4d8c5619843aff0219333d']
df = df[df.uuid !='fce977b83e53b59c297a8ef5b3523186bd6837fd']
df = df[df.uuid !='42ac7f1707d16e3c95317025e0cf5d5cd009e373']
df['title'].fillna(value="", inplace=True)
df.dropna(axis=0, inplace=True, subset=['text'])

# Trim Column Lengths
df.title = df.title.str.slice(0, 253)
df.text = df.text.str.slice(0, 4999)
df.author = df.text.str.slice(0,127)

print(df.shape)
df.head()



Having filtered out the rubbish, we used the HANA sqlalchemy driver - available from SAP's Github to load into HANA. If you use the sqlalchemy library, advice is to use the GitHub source - currently the repository one does not support Python 3, but the git source does.
from sqlalchemy import create_engine
hanaeng = create_engine('hana+pyhdb://USER:PASS@myHANAServer.com:30015')

df.to_sql('NEWS_FAKE', con=hanaeng, index=False,
dtype={'uuid': sqlalchemy.types.NVARCHAR(length=100),
'author': sqlalchemy.types.NVARCHAR(length=128),
'title': sqlalchemy.types.NVARCHAR(length=256),
'text': sqlalchemy.types.NVARCHAR(length=5000),
'language': sqlalchemy.types.NVARCHAR(length=24),
'site_url': sqlalchemy.types.NVARCHAR(length=255)})

Now we have both datasets in HANA we can make a combined table with an indicator whether it is "fake" or "real".  We trimmed both strings to 140 characters, randomised the order and created a  sequential ID.
--Create Combined Temp Table
CREATE COLUMN TABLE TITLE_TEMP AS
(SELECT "UUID", CASE WHEN TITLE ='' THEN LEFT("TEXT",140) ELSE LEFT(TITLE,140) END as TITLE, 1 as FAKE
FROM "FAKENEWS"."NEWS_FAKE");

--Add Tweets
INSERT INTO TITLE_TEMP
SELECT ID, LEFT("TWEET",140), 0 FROM "FAKENEWS"."NEWS_REAL";

--Randomise Order of Texts and generate ID
CREATE COLUMN TABLE TITLE_140 AS (SELECT *,ROW_NUMBER() OVER () as ID from (SELECT RAND() as RND1, * FROM TITLE_TEMP ORDER BY RND1));

-- Text Analaysis requires a primary Key
ALTER TABLE TITLE_140 ADD CONSTRAINT PRIM_KEY_ID PRIMARY KEY (ID);



The combined table is 39,373 rows.  Checking the data distribution shows a 1/3rd vs 2/3 split.  With more real than fake, we didn't balance the data but this could be done in HANA or python if necessary.



We tried to remove any obvious suspicious or bias texts in the data.  For example tweets often include URLs - http/s, where as the fake news did not.  We also removed the news Twitter handles, again another potential giveaway in the data.

We created a full text index on the title, this takes the unstructured text and restructures it in to a "Text Analysis" $TA_ table.  In doing so, we tokenise the raw text data.
CREATE FULLTEXT INDEX "TITLE_140_IDX" on 
"FAKENEWS"."TITLE_140"("TITLE")
LANGUAGE DETECTION ('EN')
ASYNC PHRASE INDEX RATIO 0.0
CONFIGURATION 'LINGANALYSIS_BASIC'
SEARCH ONLY OFF
FAST PREPROCESS OFF
TEXT MINING ON
TEXT ANALYSIS ON;
TOKEN SEPARATORS '\/;,.:-_()[]<>!?*@+{}="&#$~|';



Creating the tokenised data in HANA allows us to create a dictionary of words mapped to numbers (TOKEN_IDs).  These TOKEN_IDs are exactly what Deep Learning models such as TensorFlow require.
CREATE VIEW "FAKENEWS"."V_TOKEN_ID" ( "TOKEN_ID", "TA_TOKEN" ) AS
SELECT
ROW_NUMBER() OVER() AS TOKEN_ID,
TA_TOKEN
FROM ( SELECT, TA_TOKEN, COUNT(*)
FROM "FAKENEWS"."$TA_TITLE_140_IDX"
GROUP BY TA_TOKEN
ORDER BY COUNT(*) DESC )


Model Creation


Some Data Science colleagues, nidhi.sawhney and stojanm are the brains behind the Model Creation, as you may know, this is not in isolation - they also needed to perform further data prep and then evaluate the outcomes and iterate multiple times.  We can easily adapt this process for any text classification scenario.

The TensorFlow models are built in python, and we used Jupyter notebooks for this.
# Connect to HANA and retrieve data
connection = pyhdb.connect(
host="10.0.21.182",
port=30015,
user="MY_TOP_SECRET_USERNAME",
password="MY_TOP_SECRET_PASSWORD"
)
cursor = connection.cursor()
cursor.execute("SELECT ID,D.TA_COUNTER, V.TA_TOKEN,V.TOKEN_ID FROM TBL_TOKEN_ID V INNER JOIN TBL_TA_TOKENS D ON V.TA_TOKEN = D.TA_TOKEN ORDER BY ID,D.TA_COUNTER")
# Show First 5 Reords for Testing
#cursor.fetchmany(5)

To build the actual text classifier we used a Keras (TensorFlow) Convolution Neural Network (CNN)  with a number of Layers, similar to those taught by fast.ai.  Below we have specified 5000 as our vocabulary size and 40 is the maximum number of words coming from our 140 character text.
model = Sequential([
Embedding(5000, 32, input_length=40),
SpatialDropout1D(0.2),
Dropout(0.25),
Convolution1D(64, 5, padding='same', activation='relu'),
Dropout(0.25),
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid',name='prediction')])

The model summary is shown below.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 40, 32) 160000
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 40, 32) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 40, 32) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 40, 64) 10304
_________________________________________________________________
dropout_5 (Dropout) (None, 40, 64) 0
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 20, 64) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 1280) 0
_________________________________________________________________
dense_2 (Dense) (None, 100) 128100
_________________________________________________________________
dropout_6 (Dropout) (None, 100) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 101
=================================================================
Total params: 298,505
Trainable params: 298,505
Non-trainable params: 0
__________________________


Train on 31498 samples, validate on 7874 samples
Epoch 1/14
31498/31498 [==============================] - 2s 60us/step - loss: 0.3160 - acc: 0.8539 - val_loss: 0.2802 - val_acc: 0.9271
Epoch 2/14
31498/31498 [==============================] - 1s 39us/step - loss: 0.0790 - acc: 0.9726 - val_loss: 0.3442 - val_acc: 0.9053
Epoch 3/14
31498/31498 [==============================] - 1s 39us/step - loss: 0.0457 - acc: 0.9846 - val_loss: 0.3334 - val_acc: 0.9347

We then trained this model with an 80/20 training/validation split over a number of Epochs.  Above you can see the accuracy of the model improves as the output back propagates.  The val_acc (Validation Accuracy) is above 90%, we can therefore be confident based on the data we provided that the model is able to identify the different types of news.

After further validation using F1 scores, Confusion Matrix's and analysis of individual records, we began to trust the model (but not the fake news!). We then could save the trained (Keras) model.
model.save('FakeNews-v2.h5')

Model Deployment


To deploy a TensorFlow model with HANA you need to create a Saved Model.
This is easier said than done! Unfortunately, the Keras model.save (as above) is not what TensorFlow Serving requires.  We therefore need to rebuild the Keras model as a pure TensorFlow model.

We had also performed some pre-processing steps during the model creation.  These pre-processing steps need to be captured and included with the saved model.  Below is the pre-processing function we required for TensorFlow Serving.
def preprocess (txt_input):
sparse_tokenized_input = tf.string_split(txt_input,delimiter=' ')
tokenized_input = tf.sparse_tensor_to_dense(sparse_tokenized_input, default_value='0')
token_idxs = tf.string_to_number(tokenized_input, out_type=tf.float32)
inputlength = tf.size(token_idxs)
# Max Number of Words in Sentance 40
padding = 40 - inputlength
token_idxs_padded = tf.pad(token_idxs, [[0,0],[padding,0]])
token_idxs_embedding = tf.slice(token_idxs_padded, [0,0], [1,40])
return token_idxs_embedding;

The save code that we used is below, this captured our pre-processing function.
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import utils
from tensorflow.python.saved_model import tag_constants, signature_constants
from tensorflow.python.saved_model.signature_def_utils_impl import build_signature_def, predict_signature_def
from tensorflow.contrib.session_bundle import exporter

export_path = 'FakeNews-Serving/17'
builder = saved_model_builder.SavedModelBuilder(export_path)

signature = predict_signature_def(inputs={'text': txt_input},
outputs={'labels': model.output})

with K.get_session() as sess:
builder.add_meta_graph_and_variables(sess=sess,
tags=[tag_constants.SERVING],
signature_def_map={'predict': signature})
builder.save()

Saving the model requires a clean TF session, so be prepared to iterate here, if you are frequently exporting.  Often when saving models we experienced errors, for example.
AssertionError: Export directory already exists. Please specify a different 
export directory: FakeNews-Serving/16

If an error has been raised during export, it often breaks the TF session, we found that it is best to "Restart the Kernal" from within jupyter.

Once exported, to validate the "saved_model.pb" we can use the TF saved_model_cli.  Here you see a warning as we had a small version mis-match, but these warnings are generally ok.

Our signature below shows a single input with a string datatype, the shape of this is unknown because this relates to a TF placeholder value that is only known once the model is executed.  We will pass in a single string from HANA of numbers separated by spaces, our preprocess function expects this and translates it to a padded tensor. The output returned is a single tensor with one value - this contains the fake 0/1 prediction stored as a float.
ubuntu@ip-10-0-31-145:~$ saved_model_cli show --dir FakeNews-Serving/17 --signature_def 'serving_default' --all
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: compiletime version 3.6 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.5
return f(*args, **kwds)

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
The given SavedModel SignatureDef contains the following input(s):
inputs['text'] tensor_info:
dtype: DT_STRING
shape: unknown_rank
name: txt_input:0
The given SavedModel SignatureDef contains the following output(s):
outputs['labels'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 1)
name: prediction_1/Sigmoid:0
Method name is: tensorflow/serving/predict
ubuntu@ip-10-0-31-145:~$

When the model is running via TensorFlow Serving the output would look something like
ubuntu@ip-10-0-31-145:tensorflow_model_server --port=9000 --model_name=fakenews --model_base_path=/home/ubuntu/FakeNews-Serving
2017-12-19 16:05:18.560516: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /home/ubuntu/FakeNews-Serving/17
2017-12-19 16:05:18.560541: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /home/ubuntu/FakeNews-Serving/17
2017-12-19 16:05:18.598911: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle.
2017-12-19 16:05:18.623654: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle.
2017-12-19 16:05:18.625504: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 64901 microseconds.
2017-12-19 16:05:18.626184: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: fakenews version: 17}
E1219 16:05:18.628689844 15683 ev_epoll1_linux.c:1051] grpc epoll fd: 3
2017-12-19 16:05:18.630423: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:9000 ...

Model Consumption


Using the HANA EML we can now consume the model, for completeness I have pasted all the code we used here, but 95% of this came from the HANA Academy EML by philip.mugglestone
-- register model
INSERT INTO "_SYS_AFL"."EML_MODEL_CONFIGURATION" VALUES ('fakenews', 'RemoteSource', 'TensorFlowModelServer');
SELECT * FROM "_SYS_AFL"."EML_MODEL_CONFIGURATION";

-- create parameters table (used in subsequent calls)
CREATE TABLE "PARAMETERS" ("Parameter" VARCHAR(100), "Value" VARCHAR(100));

-- apply registered models
CALL "_SYS_AFL"."EML_CTL_PROC" ('UpdateModelConfiguration', "PARAMETERS", ?);

-- verify model is up and running on remote source
TRUNCATE TABLE "PARAMETERS";
INSERT INTO "PARAMETERS" VALUES ('Model', 'fakenews');
CALL "_SYS_AFL"."EML_CHECKDESTINATION_PROC" ("PARAMETERS", ?);

CALL "SYS"."AFLLANG_WRAPPER_PROCEDURE_DROP" ('FAKENEWS', 'CLASSIFY_NEWS');

-- create table types
CREATE TYPE "T_PARAMS" AS TABLE ("Parameter" VARCHAR(100), "Value" VARCHAR(100));
CREATE TYPE "T_DATA" AS TABLE ("text" VARCHAR(256));
CREATE TYPE "T_RESULTS" AS TABLE ("Score1" FLOAT);

-- create signature table then generate stored procedure
CREATE COLUMN TABLE "SIGNATURE" ("POSITION" INTEGER, "SCHEMA_NAME" NVARCHAR(256), "TYPE_NAME" NVARCHAR(256), "PARAMETER_TYPE" VARCHAR(7));
INSERT INTO "SIGNATURE" VALUES (1, 'FAKENEWS', 'T_PARAMS', 'IN');
INSERT INTO "SIGNATURE" VALUES (2, 'FAKENEWS', 'T_DATA', 'IN');
INSERT INTO "SIGNATURE" VALUES (3, 'FAKENEWS', 'T_RESULTS', 'OUT');
CALL "SYS"."AFLLANG_WRAPPER_PROCEDURE_CREATE" ('EML', 'PREDICT', 'FAKENEWS', 'CLASSIFY_NEWS', "SIGNATURE");

-- create tables
CREATE TABLE "PARAMS" LIKE "T_PARAMS";
CREATE TABLE "RESULTS" LIKE "T_RESULTS";

-- run time

-- data to be scored
CREATE VIEW "NEWS_SENTANCE" AS (SELECT '445 4999 400 41 3016 334 1806 1160 4999' as "text" from dummy);
CREATE VIEW "NEWS_SENTANCE_FAKE" AS (SELECT '161 45 145 30 4999 107 1112 947 25' as "text" from dummy);

-- params
TRUNCATE TABLE "PARAMS";
INSERT INTO "PARAMS" VALUES ('Model', 'fakenews');
--INSERT INTO "PARAMS" VALUES ('Model', 'saved_model%predict'); -- mandatory: model name (optional: signature name)
--INSERT INTO "PARAMS" VALUES ('Deadline', '1000'); -- optional: max milliseconds to wait

-- scoring : results inline
CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?);

Validation


To perform an end to end validation we should test both python and HANA are behaving correctly.
The validation dataset is all IDs above 31,498 (80% of 39,373), here I will check 31,505 and 31,506.

In the Base Table in HANA we can see the 2 original sentences



Using SQL initially developed by Nidhi, for preparing the input data for training, we can validate our steps.  We can see the key tokens in our sentence.  Any TOKEN_IDs above 5000 are called rare so all become 4999.



In python we can view our data assigned to "trn_data", which was loaded as a Numpy array.
Note the Python IDs start at zero so they will be 1 less than the HANA IDs
print(trn_data[31504])
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 445 4999 400 41 3016 334 1806 1160 4999]

In python we can check the predicted value, we can see it predicted 0 for the 31505 (real news) and almost 1 for 31506 to be classified as fake news.  This exactly matches the source data flags - but we didn't provide this information to our model!
model.predict(trn_data[31504:31506])
array([[ 0.03409993],
[ 0.99994373]], dtype=float32)

We can use these 2 sentences encoded as numbers from the validation data. We can query our TF model from SAP HANA to confirm the results reflect the validation predictions.  Below we pass in '445 4999 400 41 3016 334 1806 1160 4999' and the predicted value is almost 0, therefore it has been identified as real news.  We would typically round this value to 0.



Changing the input data (View) to the next sentence ('161 45 145 30 4999 107 1112 947 25') in our dataset shows it really can tell the difference! Again we would round 0.9999 to 1.



 

Lessons Learned & Issues Encountered


Yes, we did learn a lot, found some issues, learned some Python, Keras, TensorFlow, TensorFlow Serving, AWS and of course the HANA EML integration.

With the EML you are passing TENSORS, so your input table becomes a TF tensor and the Output table corresponds to the output tensor of your model.
The input table/column name DOES matter, the output does not seem to.
The output column datatypes and number of columns do matter

Some errors and resolutions below
-- scoring : results inline
CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?);

java.sql.SQLWarning: wrong number or types of parameters in call: Column mismatch happened at "FAKENEWS.CLASSIFY_NEWS": Declared type "VARCHAR(256)" of attribute "text" not same as assigned type "VARCHAR(11)"

Initially I thought this was causing us an issue, but actually it is just a warning, so can be safely ignored, as long as your data fits within the lower limits.

 
Could not execute 'CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?)' in 137 ms 909 µs . 
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Metamodel" failed (5)

Could not execute 'CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?)' in 137 ms 909 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Metamodel" failed (5)

This error is more problematic, and after checking the indexserver and scriptserver diagnosis files I was stuck for a while.  This error means that the EML cannot find the matching signature on the TensorFlow Server.  We had to change the PARAMS table that defines which model and signature we are calling.  It could indicate a problem with EML metadata, removing the %predict also seemed to fixed this.
TRUNCATE TABLE "PARAMS";
INSERT INTO "PARAMS" VALUES ('Model', 'fakenews%predict');

 

Could not execute 'CALL "CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?)' in 131 ms 198 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Predict" failed (14)

Upon checking the destination, we can see that the TensorFlow Serving appears not to running as expected.



Restarting the Server with a command similar to below will fix this for us.
nohup tensorflow_model_server --port=9000 --model_name=fakenews --model_base_path=/home/ubuntu/FakeNews 

 

Could not execute 'CALL "MULTI_CLASSIFY_NEWS" ("PARAMS", "NEWS_SENTANCE", ?)' in 134 ms 560 µs .
SAP DBTech JDBC: [423]: AFL error: search table error: _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Predict" failed (3)

Looking at the Diagnosis Files, helps us here - Shows that the input value is not in the expected range, we defined the vocab size as 5000, and have supplied 9999.  This should be handled in our pre-processing step.
[1256]{200896}[7/-1] 2017-12-19 14:52:31.262719 e LJIT cePopCustomLjit.cpp(00639) : _SYS_AFL.EML:PREDICT: [423] (range 3) AFL error exception: RPC "Prediction/Predict" failed (3)
[8590]{200896}[3/-1] 2017-12-19 14:54:31.781945 e AFL_EML impl.cpp(00272) : PredictClient::predict failed (3): indices[0,33] = 9999 is not in [0, 5000)%0A%09 [[Node: embedding_1_1/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _output_shapes=[[1,40,32]], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1_1/embeddings/read, embedding_1_1/Cast)]]
[8590]{200896}[3/-1] 2017-12-19 14:54:31.782200 e LJIT cePopCustomLjit.cpp(00620) : Llang Runtime Error: Exception::SQLException423: RPC "Prediction/Predict" failed (3)
7 Comments