
SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad
SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue
If we run these queries in Big Query, we see (pictures below) that the slot time and the byte shuffled reduced when we changed from ordering by ad to revenue. This is because big query int64 takes up 8 bytes of space, whereas a string takes up space depending on its length. As such, with int64 having smaller bytes to scan, the query will run faster.
Figure 1: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY ad in Big Query
Figure 2: Execution time of SELECT adgroup, ad, impressions, clicks, conversions, revenue FROM `project_name.bq_performance.ad_revenues` ORDER BY revenue in Big Query
Please note these queries were run using the sqlscript language in the sql editor.
output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;
This query took 0 seconds in big query and 1 second in SAP Datasphere.
Figure 3: Execution time in Big Query and SAP Datasphere when using a non-partitioned table
output = select "adgroup", "ad", TO_DECIMAL(avg("clicks")) as averageclicks, "revenue" from "ad_revenues_partitioned" GROUP BY "adgroup", "ad", "revenue" LIMIT 4000000;
return :output;
This query took 0 seconds in big query and 846 ms in SAP Datasphere.
Figure 4: Execution time in Big Query and SAP Datasphere when using a partitioned table
Here, the first query was performed on the non-partitioned table and the second query was performed on the partitioned and clustered table. The table was partitioned based on clicks with the range set from 0 to 1772, and the interval as 100. The table was clustered based on adgroup and ad.
If we run this query in Big Query, we can further analyze and view how the elapsed time and bytes shuffled reduced. The first picture below is the query details from ad_revenues and the second picture shows the query details from ad_revenues_partitioned. The elapsed time reduced by around 70 ms and the bytes shuffled reduced by over 200 KB.
select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues` GROUP BY adgroup, ad, revenue
Figure 5: Detailed information on execution of query on a non-partitioned table in Big Query
select adgroup, ad, avg(clicks) as averageclicks, revenue from `project_name.bq_performance.ad_revenues_partitioned` GROUP BY adgroup, ad, revenue
Figure 6: Detailed information on execution of query on a partitioned table in Big Query
However, the real benefit of the partitioning and clustering is shown through the following queries:
output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;
This query took 0 seconds in Big Query and 862 ms in SAP Datasphere.
Figure 7: Execution time in Big Query and SAP Datasphere of query on non-partitioned table
output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues_partitioned " WHERE "clicks" > 1000 LIMIT 200000;
return :output;
This query took 0 seconds in Big Query and 779 ms in SAP Datasphere.
Figure 8: Execution time in Big Query and SAP Datasphere of query on partitioned table
Due to the partitioning, the records read were reduced to only the records that needed to be read, thereby reducing the query time.
Again, if we further analyze this query in BigQuery, we see that the elapsed time, slot time, and records read reduced.
SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000
Figure 9: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000 in Big Query
SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000
Figure 10: Execution time of SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues_partitioned` WHERE clicks > 1000 in Big Query
output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "clicks" > 1000 LIMIT 200000;
return :output;
This query took 0 seconds in big query and 862 ms in SAP Datasphere.
Figure 11: Execution time in Big Query and SAP Datasphere of query with WHERE clause on int column
output = SELECT "adgroup", "ad", "clicks" FROM "ad_revenues" WHERE "adgroup" LIKE 'adgroup 1' LIMIT 2000000;
return :output;
This query took 1 seconds in big query and 23 seconds in SAP Datasphere.
Figure 12: Execution time in Big Query and SAP Datasphere of query with WHERE clause on string column
Again, if we analyze on the big query side, we see that the elapsed time was smaller when we used clicks as the column in the where clause, as opposed to adgroup. It is important to note that if you would like to use a where condition on the adgroup, one could create a column that has a number representation of the adgroups or create a view in big query that does the calculation automatically, to make the query runtime be faster.
SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE clicks > 1000
This query took 161 ms in Big Query.
Figure 13: Detailed information on execution of query with WHERE clause on int column in Big Query
SELECT adgroup, ad, clicks FROM `project_name.bq_performance.ad_revenues` WHERE adgroup LIKE 'adgroup 1'
This query took 1 second in Big Query.
Figure 14: Detailed information on execution of query with WHERE clause on string column in Big Query
In option1's query here, we are telling SAP Datasphere to fetch all of the rows from big query and then join each row with the advertisement_data table. This slows down the query a lot.
SELECT ad1."ad", ad1."product_focus", sum("revenue")
FROM "advertisement_data" as ad1
JOIN "ad_revenues" as ar on ar."ad"=ad1."ad"
GROUP BY ad1."ad",ad1."product_focus"
ORDER BY sum("revenue") DESC
In option2's query here, we are instead telling SAP Datasphere to fetch a smaller set of data (the data that we really need) to perform the join. As such, the query time is significantly faster, as only 75 rows are federated over instead of over 4 million.
SELECT ad1."ad", ad1."product_focus", sumrevenues
FROM
(
(
SELECT "ad", sum("revenue") as sumrevenues
FROM "ad_revenues"
GROUP BY "ad"
ORDER BY sumrevenues DESC
) as r
JOIN "advertisement_data" AS ad1
ON ad1."ad" = r."ad"
)
ORDER BY sumrevenues DESC
You can view the time difference from the SAP Datasphere queries here:
Option 1's query to big query took 1 minute to complete.
Figure 15: Execution time of query in SAP Datasphere with JOIN that does not reduce columns
Option 2's query to big query too 529 ms to complete.
Figure 16: Execution time in SAP Datasphere of query with JOIN that reduces columns
Loading a new snapshot for the ad_revenues_partitioned table in Big Query, allows us to query from the snapshotted table instead of the actual table in Big Query, thus also increasing query performance.
It took about 4 minutes to create this snapshot.
Figure 17: How to load a snapshot
To update this table, you can create a schedule. A schedule will run snapshot replication asynchronously and recurrently in the background according to the settings defined in the schedule.
Figure 18: How to create a schedule
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
16 | |
14 | |
13 | |
9 | |
9 | |
9 | |
8 | |
7 | |
7 | |
7 |