
Table Paging Performance


I currently have a working solution that pages through a column table created on the fly (not a temporary table) that can contain upwards of 300 million records, using OFFSET and LIMIT to select data in blocks. I'm noticing that performance gets slower as OFFSET increases, and I'm wondering if there's a better way to do this. For example, could I add an indexed column numbered from 1 through the last record and select based on this index instead of using OFFSET and LIMIT, or perhaps periodically delete large numbers of already-processed records from the on-the-fly table to reduce the rows that need to be searched?

Accepted Solutions (1)


lbreddemann
Active Contributor

By definition, OFFSET and LIMIT always work from the "start" of an ordered result set: the number of records defined by OFFSET is skipped, and the remaining records are ignored once LIMIT records have been returned. This explains why the queries get "slower" with increasing OFFSET: more data has to be skipped.

From the question, it is not clear why you loop over the records. The OFFSET approach usually does not make for efficient batch processing, as it does not lend itself easily to parallel processing or to direct access to the records of the current batch.

Deleting records after processing can work, but it involves locking and data changes in the same structure (your table) that you are currently working from - that likely won't perform well.
One approach to improve on this is to take the primary keys of your table and partition them based on your batch size. The batches can then be processed based on the primary-key ranges. For parallel batch-wise processing, you might want to look into the MAP/MERGE SQLScript functions.
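
The key-range approach described here could look something like the sketch below. The table and column names are made up for illustration, not taken from the thread:

```sql
-- OFFSET/LIMIT paging: the database still has to order and skip every
-- row before the OFFSET, so each successive batch gets more expensive.
SELECT NAME, ADDRESS
  FROM STAGING_DOCS
 ORDER BY ID
 LIMIT 50000 OFFSET 250000;

-- Key-range paging: each batch is located directly via the key,
-- so the cost per batch stays roughly constant.
SELECT NAME, ADDRESS
  FROM STAGING_DOCS
 WHERE ID BETWEEN 250001 AND 300000;
```

Non-overlapping key ranges also make it straightforward to hand each batch to a separate worker for parallel processing.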


Hi Lars,

The table in question doesn't have a primary key, just the data, so I'll add a primary key and do the selects based on the key values (1-50000, 50001-100000, etc.) instead of using OFFSET and LIMIT. Individual table partitioning is something I'm new to, but I did find some info about partitioning by a dynamic threshold. Is this what you mean in terms of table partitioning?

lbreddemann

"Just the data" - I believe now would be a good time to share what that data actually is, given that it has no identifier but just exists as one huge set.

Please don’t take the term partitioning as a pointer to the table storage feature. I was only talking about subdividing the data set into non-overlapping parts.


It's basically just name and address/tradestyle data that our app loads into a Solr document store. We create a table on the fly to contain the records we want to index into Solr and then page through those records as they are sent to Solr. The table contains only this data and no primary key. When the process is done, we delete the on-the-fly table.

lbreddemann

I don't know what "Tradestyle" data looks like - so I can only provide some general hints.

If I had to make up a unique ID for each record you create on the fly in this scenario, I'd probably look into using a SEQUENCE with a rather large cache (10,000 or so) and allow it to cycle, with a range larger than anything you would possibly want to load into the ad hoc table.
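
As a hedged sketch, such a sequence might look like the following. The sequence name, limits, and the insert target are made-up examples, not anything confirmed in the thread:

```sql
-- Standalone sequence with a large cache. CYCLE plus a MAXVALUE well
-- above the expected record count means it can wrap around without
-- ever handing out a duplicate ID within a single load.
CREATE SEQUENCE STAGING_ID_SEQ
    START WITH 1
    INCREMENT BY 1
    MAXVALUE 2000000000
    CYCLE
    CACHE 10000;

-- Then, during the load of the ad hoc table, something like:
-- INSERT INTO STAGING_DOCS (ID, NAME, ADDRESS)
--     SELECT STAGING_ID_SEQ.NEXTVAL, NAME, ADDRESS FROM SOURCE_DATA;
```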

As you mentioned that this is basically a staging table, I would also consider disabling logging for this table.


Hey Lars,

I think I found what I need. After creating the table I can do the following, and it will generate an integer primary-key column for the table with values that I can use to select batch blocks by:

ALTER TABLE <TABLE NAME> ADD (ID INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY)

lbreddemann

Yes, you could use the identity column, but you should definitely use the extended syntax options to specify the caching and cycle behaviour of the underlying sequence.


Hi Lars,

I can certainly look into that, but won't that mean I'm creating a separate sequence object that I will then have to clean up as soon as I'm finished with the staging table? Also, I understand what you mean by the cycling behaviour, but not caching. Does caching in this case mean setting the min/max values?

lbreddemann

When you use an IDENTITY column, HANA automatically creates a sequence object. There is no need to "clean up" this object manually when you drop or truncate the table - HANA handles this automatically (just like indexes on tables).

The CACHE parameter for sequences determines how many values are fetched from the sequence object in one go (that is, before the new MAX value is written back to the catalog entry). No caching (the default) means that the current new max value always gets stored back, which is not good for performance. Since you don't need guaranteed gap-free sequences but just unique values, make sure not to "pay for gap-free" without getting any benefit.

And no, caching does not mean setting min/max values at all - that's a completely different thing.
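
As a hedged illustration of the trade-off described above (the sequence name is hypothetical, and the restart behaviour is as I understand cached sequences generally):

```sql
-- With CACHE 1000, values are reserved in blocks of 1000 and only the
-- block's upper bound is persisted in the catalog. If the instance
-- restarts after, say, value 37 was handed out, the unused rest of the
-- block is lost and the next value is 1001 - still unique, but not
-- gap-free. That lost range is the price paid for fewer catalog writes.
CREATE SEQUENCE DEMO_SEQ START WITH 1 INCREMENT BY 1 CACHE 1000;

SELECT DEMO_SEQ.NEXTVAL FROM DUMMY;
```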


Hey Lars,

It took a while to find documentation that had some examples, but I found some, and now I see what you are talking about. Here's the line from above, edited to contain the new parameters:

ALTER TABLE <TABLE NAME> ADD (ID INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MAXVALUE 350000000 NO CYCLE CACHE 1000))

The cache parameter doesn't seem to allow large cache amounts (I can't find the maximum cache size anywhere in the docs), but 1000 is probably better than nothing in this case.

Taesuk
Product and Topic Expert

Hi Steve,

You've mentioned that the cache parameter doesn't seem to allow a large cache size. The documentation for CREATE SEQUENCE mentions that it is an unsigned integer, which would allow values up to 2^31-1. What error did you get when trying to set it higher than 1000?


This is the error I see:

org.springframework.dao.DataIntegrityViolationException: StatementCallback; SQL [ALTER TABLE MDR_PRODUCTION.DATA_INDEX_PERFORMANCE_TEST_TABLE ADD (ID INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1 MAXVALUE 1000000 NO CYCLE CACHE 1000000))]; [314]: numeric overflow: '1000000'; nested exception is com.sap.db.jdbc.exceptions.jdbc40.SQLDataException: [314]: numeric overflow: '1000000'
