Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
karishma_kapur
Employee
Background

As many of us know by now, SAP Datasphere gives customers and users the ability to federate data from different sources in real time into virtual tables in SAP Datasphere. These virtual tables query your remote source on the fly, letting you access the remote data should you want to perform further analytics on it and/or combine data sources. However, querying massive amounts of data can result in slow query times. This blog therefore focuses on some key points to consider to increase query performance when federating data from Big Query. These points cover geographic locations, column selection, joins, partitions, WHERE clauses, reduction of data, and more.

It should be noted that Big Query has its own connection type in SAP Datasphere for federation, making the connection creation seamless.

For this blog, I will be using the combined train and test dataset from the advertisements dataset provided by Kaggle here and here. For the revenue column in the test data, I inserted the predictions given here.

 

SAP Datasphere Performance Features

SAP Datasphere provides several features and tools to help you monitor your queries and increase performance. These features are listed below.

  • System Monitor. Monitor the performance of your system. You can see which tasks failed, dive deeper into the logs, and view the specific statement that failed.

  • Restrict Remote Table Data Loads. Remove unnecessary columns and create filters. This reduces the volume of data loaded into the remote table.

    • Use the Data Builder to remove the unnecessary columns, define a central filter to load only the data that is needed, add new columns available in the source, or reinsert previously excluded columns.



  • Cloud Connector. Improves performance for federation and replication from SAP HANA on-premise.

  • Run in Analytical Mode. May improve view performance, particularly if a union is performed.

  • Remote Table Monitor. Create statistics for your remote table to improve query performance.

  • Remote Query Monitor. Track queries sent to your source and analyze them.

  • Activate the OLAP hint to improve view performance.



 

Additional steps to increase query performance specific to Big Query federation

  1. First and foremost, it is important to note that the geographic locations of your SAP Datasphere instance and your hyperscaler data source matter. If the regions are the same, latency (data transfer time) is reduced, thereby increasing your query performance.


 

  2. Big Query is a columnar storage database, so there are performance benefits if one queries only the columns needed. Because the database is organized by fields (columns), this reduces the amount of data that must be read from disk and held in memory: only the relevant columns are processed in the query.
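To make the columnar idea concrete, here is a minimal, self-contained Python sketch. The column names mirror this blog's dataset, but the dict-of-lists storage model is an illustration only, not Big Query's actual engine: the point is simply that a query touching 2 of the 6 columns only ever reads those 2 arrays.

```python
# Illustrative model of columnar storage: each column is stored as its own
# array, so a column-pruned query never touches the other columns' data.
columnar_table = {
    "adgroup":     ["adgroup 1", "adgroup 2", "adgroup 1"],
    "ad":          ["ad 1", "ad 2", "ad 3"],
    "impressions": [1000, 2500, 400],
    "clicks":      [120, 310, 55],
    "conversions": [12, 40, 3],
    "revenue":     [99.0, 250.5, 20.0],
}

def scan(table, columns):
    """Read only the requested column arrays, tracking what was touched."""
    touched = [c for c in table if c in columns]
    result_rows = list(zip(*(table[c] for c in columns)))
    return result_rows, touched

result_rows, touched = scan(columnar_table, ["ad", "revenue"])
# Only 2 of the 6 column arrays are read; the other 4 are never visited.
```

A `SELECT *` against the same model would have to read all 6 arrays, which is exactly the extra scanning the tip above avoids.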


 

  3. As stated in Big Query's documentation, order by and join on INT64 columns rather than STRING columns. Joining on strings slows down the query because variable-length strings must be compared; INT64 comparisons are faster, thereby increasing your query performance.


SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad


SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue


If we run these queries in Big Query, we see (pictures below) that the slot time and bytes shuffled were reduced when we changed from ordering by ad to ordering by revenue. This is because in Big Query an INT64 takes up 8 bytes of space, whereas a string's size depends on its length. With fewer bytes to scan, the query runs faster.



Figure 1: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad in Big Query


 


Figure 2: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue in Big Query


 

  4. Use partitions and clusters when creating your table. They can reduce the amount of data scanned. If you're not sure what a partitioned table is, please read about it here.


Please note these queries were run using the SQLScript language in the SQL editor.


output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;


This query took 0 seconds in Big Query and 1 second in SAP Datasphere.



Figure 3: Execution time in Big Query and SAP Datasphere when using a non-partitioned table


output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues_partitioned" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;


This query took 0 seconds in Big Query and 846 ms in SAP Datasphere.



Figure 4: Execution time in Big Query and SAP Datasphere when using a partitioned table


Here, the first query was performed on the non-partitioned table and the second query was performed on the partitioned and clustered table. The table was partitioned based on clicks with the range set from 0 to 1772, and the interval as 100. The table was clustered based on adgroup and ad.


If we run this query in Big Query, we can further analyze how the elapsed time and bytes shuffled were reduced. The first picture below shows the query details for ad_revenues and the second shows the query details for ad_revenues_partitioned. The elapsed time was reduced by around 70 ms and the bytes shuffled by over 200 KB.


select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues` GROUP BY adgroup, ad, revenue



Figure 5: Detailed information on execution of query on a non-partitioned table in Big Query


select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues_partitioned` GROUP BY adgroup, ad, revenue



Figure 6: Detailed information on execution of query on a partitioned table in Big Query


 

However, the real benefit of the partitioning and clustering is shown through the following queries:


output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;


This query took 0 seconds in Big Query and 862 ms in SAP Datasphere.



Figure 7: Execution time in Big Query and SAP Datasphere of query on non-partitioned table


 

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues_partitioned " WHERE "clicks" > 1000 LIMIT 200000;
return :output;


This query took 0 seconds in Big Query and 779 ms in SAP Datasphere.



Figure 8: Execution time in Big Query and SAP Datasphere of query on partitioned table


 

Due to the partitioning, the records read were reduced to only those that needed to be read, thereby reducing the query time.
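The pruning behavior described above can be sketched in a few lines of standalone Python. This is an illustrative model only, not Big Query's implementation: the partition ranges mirror the table above (clicks from 0 to 1772 in intervals of 100) and the sample data is invented.

```python
# Range-partition pruning sketch: rows are bucketed by their "clicks" value,
# and a filter such as clicks > 1000 skips every partition whose range
# cannot possibly match, so far fewer records are ever read.
INTERVAL = 100
MAX_CLICKS = 1772

def partition_of(clicks):
    return min(clicks // INTERVAL, MAX_CLICKS // INTERVAL)

# Bucket some invented rows by partition, as the storage layer would.
rows = [{"adgroup": f"adgroup {i % 5}", "clicks": i * 7 % MAX_CLICKS}
        for i in range(1000)]
partitions = {}
for row in rows:
    partitions.setdefault(partition_of(row["clicks"]), []).append(row)

def query_with_pruning(threshold):
    """Scan only partitions whose click range can satisfy clicks > threshold."""
    scanned = 0
    result = []
    for pid, part in partitions.items():
        if (pid + 1) * INTERVAL <= threshold:
            continue  # pruned: every row here has clicks <= threshold
        scanned += len(part)
        result += [r for r in part if r["clicks"] > threshold]
    return result, scanned

matching, scanned = query_with_pruning(1000)
# "scanned" is well below len(rows): pruned partitions were never touched.
```

The same answer comes back either way; the win is purely in how many records had to be inspected to produce it.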


Again, if we further analyze this query in Big Query, we see that the elapsed time, slot time, and records read were all reduced.


SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000



Figure 9: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000 in Big Query


 

SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000



Figure 10: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000 in Big Query


 

  5. According to Big Query's documentation, it is a best practice to use BOOL, INT64, FLOAT64, or DATE columns in the WHERE clause. This is because operations on these data types are faster than operations on STRING or BYTES columns.


output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;


This query took 0 seconds in Big Query and 862 ms in SAP Datasphere.



Figure 11: Execution time in Big Query and SAP Datasphere of query with WHERE clause on int column


 

output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "adgroup" LIKE 'adgroup 1' LIMIT 2000000;
return :output;


This query took 1 second in Big Query and 23 seconds in SAP Datasphere.



Figure 12: Execution time in Big Query and SAP Datasphere of query with WHERE clause on string column


 

Again, if we analyze on the Big Query side, we see that the elapsed time was smaller when we used clicks, rather than adgroup, as the column in the WHERE clause. It is important to note that if you would like a WHERE condition on adgroup, you could create a column holding a numeric representation of the adgroups, or create a view in Big Query that does the calculation automatically, to make the query run faster.


SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000


This query took 161 ms in Big Query.



Figure 13: Detailed information on execution of query with WHERE clause on int column in Big Query


SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE adgroup LIKE 'adgroup 1'


This query took 1 second in Big Query.



Figure 14: Detailed information on execution of query with WHERE clause on string column in Big Query
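The suggestion above of filtering on a numeric representation of adgroup can be sketched with Python's built-in sqlite3 module. The table and column names follow this blog's dataset, but the adgroup_id column and the sample data are invented for illustration.

```python
# Sketch: replace a string filter (adgroup LIKE 'adgroup 1') with a filter
# on an integer surrogate column holding the same information. Comparing a
# small fixed-width integer is cheaper for an engine than comparing
# variable-length strings.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ad_revenues (
    adgroup TEXT, adgroup_id INTEGER, ad TEXT, clicks INTEGER)""")

sample = [(f"adgroup {i % 3 + 1}", i % 3 + 1, f"ad {i}", i * 10)
          for i in range(9)]
con.executemany("INSERT INTO ad_revenues VALUES (?, ?, ?, ?)", sample)

# Both predicates select exactly the same rows...
by_string = con.execute(
    "SELECT ad FROM ad_revenues WHERE adgroup LIKE 'adgroup 1'").fetchall()
by_int = con.execute(
    "SELECT ad FROM ad_revenues WHERE adgroup_id = 1").fetchall()
# ...but the integer comparison is the cheaper one to evaluate per row.
assert sorted(by_string) == sorted(by_int)
```

In Big Query the surrogate column could be materialized in the table or derived in a view, as mentioned above; the filtering pattern is the same.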


 

  6. Reduce the data (if you can) before using a JOIN.


In option 1's query here, we are telling SAP Datasphere to fetch all of the rows from Big Query and then join each row with the advertisement_data table. This slows down the query considerably.


SELECT ad1."ad", ad1."product_focus", sum("revenue")
FROM "advertisement_data" as ad1
JOIN "ad_revenues" as ar on ar."ad"=ad1."ad"
GROUP BY ad1."ad",ad1."product_focus"
ORDER BY sum("revenue") DESC


In option 2's query here, we instead tell SAP Datasphere to fetch a smaller set of data (only the data we really need) to perform the join. As such, the query time is significantly faster, as only 75 rows are federated over instead of over 4 million.


SELECT ad1."ad", ad1."product_focus", sumrevenues
FROM
(
(
SELECT "ad", sum("revenue") as sumrevenues
FROM "ad_revenues"
GROUP BY "ad"
ORDER BY sumrevenues DESC
) as r
JOIN "advertisement_data" AS ad1
ON ad1."ad" = r."ad"
)
ORDER BY sumrevenues DESC


You can view the time difference from the SAP Datasphere queries here:


Option 1's query to Big Query took 1 minute to complete.



Figure 15: Execution time of query in SAP Datasphere with JOIN that does not reduce columns


 

Option 2's query to Big Query took 529 ms to complete.



Figure 16: Execution time in SAP Datasphere of query with JOIN that reduces columns
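The option 2 reshaping above can be reproduced end-to-end with Python's built-in sqlite3 module (the schema mirrors this blog's tables; the data is invented): both query shapes return the same answer, but the pre-aggregated subquery produces only a handful of rows for the join, which in the federated case is all that would travel over the wire.

```python
# Sketch: aggregate revenue per ad BEFORE joining, instead of joining every
# detail row and aggregating afterwards.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ad_revenues (ad TEXT, revenue REAL)")
con.execute("CREATE TABLE advertisement_data (ad TEXT, product_focus TEXT)")
con.executemany("INSERT INTO advertisement_data VALUES (?, ?)",
                [(f"ad {i}", f"focus {i % 3}") for i in range(5)])
# Many detail rows per ad -- this is what option 1 would federate over.
con.executemany("INSERT INTO ad_revenues VALUES (?, ?)",
                [(f"ad {i % 5}", float(i)) for i in range(1000)])

# Option 1 shape: join all detail rows, then aggregate.
option1 = con.execute("""
    SELECT ad1.ad, ad1.product_focus, SUM(ar.revenue)
    FROM advertisement_data ad1 JOIN ad_revenues ar ON ar.ad = ad1.ad
    GROUP BY ad1.ad, ad1.product_focus
    ORDER BY SUM(ar.revenue) DESC""").fetchall()

# Option 2 shape: aggregate first, then join the small result.
option2 = con.execute("""
    SELECT ad1.ad, ad1.product_focus, r.sumrevenues
    FROM (SELECT ad, SUM(revenue) AS sumrevenues
          FROM ad_revenues GROUP BY ad) r
    JOIN advertisement_data ad1 ON ad1.ad = r.ad
    ORDER BY r.sumrevenues DESC""").fetchall()

assert option1 == option2   # same answer either way...
# ...but the aggregated subquery yields only 5 rows, not 1000:
reduced = con.execute(
    "SELECT COUNT(*) FROM (SELECT ad FROM ad_revenues GROUP BY ad)"
).fetchone()[0]
```

The ratio here (5 rows versus 1000) plays the same role as the 75 rows versus 4 million in the federated example above.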


 

  7. SAP Datasphere provides two options for replication: snapshots and real-time replication. Some sources do not support real-time replication (as is the case for Big Query's connector). To see whether other source connectors support real-time replication, please look here.





    • Snapshot - copies the full set of data from the source object into SAP Datasphere.

      • Improves performance when accessing data, but data is now stored in SAP Datasphere and takes up memory.

      • Replicated data is stored in a replica table.






Loading a new snapshot of the ad_revenues_partitioned table from Big Query allows us to query the snapshot table instead of the actual table in Big Query, thus also increasing query performance.


It took about 4 minutes to create this snapshot.



Figure 17: How to load a snapshot


 

To update this table, you can create a schedule. A schedule will run snapshot replication asynchronously and recurrently in the background according to the settings defined in the schedule.



Figure 18: How to create a schedule


 



    • Real-time replication - performs an initial full copy of the data and then copies changes from your source asynchronously in the background. This is good for mirroring data changes into SAP Datasphere in real time. However, real-time replication isn't supported for the Big Query connection.




 

  8. Turn on in-memory storage for your replica table to move the data from disk into memory.



 

Conclusion

In summary, this blog has focused on several ways to analyze and increase query performance when federating data from Big Query into SAP Datasphere. We discussed the features SAP Datasphere provides to help you analyze your queries, as well as 8 different ways to structure your queries and tables to increase federation query performance. While some of these observations may be obvious to some readers, this blog aims to highlight the importance of query structure and the additional features SAP Datasphere provides when federating data.

Please let me know if this blog was helpful to you!

Thank you 🙂