on 2014 Mar 19 6:15 AM
Hi folks
I have two questions related to columnar table definition.
1. What is the purpose of column store type.
While defining a columnar table, what is the purpose of column store type (STRING ,CS_FIXEDSTRING,CS_INT etc) , when I define a table using the UI I see that the column is showing STRING but when I goto EXPORT SQL it does not show. Is this mandatory or optional ?
2.VARCHAR Vs. CHAR - In the UI when I create the table I do not see the CHAR option , but I do see lot of discussion where people are using CHAR for defining the columnar table. Not sure why UI dropdown does not show it. I also read that we should avoid using VARCHAR as those columns are not compressed, is that true, I thought the column store gives compression for all the columns. Are there certain columns which cannot be compressed .
Please let me know where I can find more information about these two questions.
Poonam
Hi Poonam
the CS_-data types are the data types that are used internally in the column store. They can be supplied but it is not at all required or recommended to do so.
SAP HANA will automatically use the correct CS_-data type for every SQL data type in your table definitions.
To be very clear about this: don't use the CS_-data types directly. Just stick to the SQL data types.
Concerning VARCHAR vs CHAR: fixed character data types are not supported anymore and don't show up anymore in the documentation.
I have no idea why you believe that VARCHAR columns are not compressed but this is just a myth.
create column table charcompr (fchar char(20), vchar varchar(20));
insert into charcompr (
select lpad ('x', to_int (rand()*20), 'y'), null from objects cross join objects);
-- same data into both columns
update charcompr set vchar = fchar;
-- perform the delta merge and force a compression optimization
merge delta of charcompr;
update charcompr with parameters ('OPTIMIZE_COMPRESSION' ='FORCE');
-- check the memory requirements
select COLUMN_NAME, MEMORY_SIZE_IN_TOTAL, UNCOMPRESSED_SIZE, COUNT, DISTINCT_COUNT, COMPRESSION_TYPE
from m_cs_columns where table_name ='CHARCOMPR'
COLUMN_NAME MEMORY_SIZE_IN_TOTAL UNCOMPRESSED_SIZE COUNT DISTINCT_COUNT COMPRESSION_TYPE
FCHAR 3661 70285738 6692569 20 RLE
VCHAR 3661 70285738 6692569 20 RLE
We see: compression and memory requirements are the same for both fixed and variable character sizes.
- Lars
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Lars,
Thanks for your reply.
Am unable to execute the sql query you have framed. It is giving me some exceptions like this:
So I have used this query:
insert into charcompr (
select lpad ('x', to_int (rand()*20), 'y'), lpad ('x', to_int (rand()*20), 'y') from objects );
And after the merge and compression, This is the result am getting:
It went for INDIRECT Encoding instead of RLE, and I see there is a significance difference in the memory consumed here.
Also while loading data into a table, and if we know that a particular column has always a fixed length of characters if we use CHAR instead of VARCHAR , does it improve the loading performance?
I did tried on some sample, but I don't see any such deviations there, so would like to know your comments as well.
Regards,
Krishna Tangudu
Hey Krishna
The error for the query indicates that you ran out memory (well, the cross product of objects table can become large ), but you found the way out there.
Concerning the compression:
The choice of compression algorithms depends on several factors including the total size of table data and data distribution.
In your example you are actually not storing the same information in both columns, as both rand() function calls will lead to different values for each record.
That has been the reason for my two step approach in the first place.
Loading performance: well, I doubt that it would make a big difference here. VARCHAR is by far the prominent data type nowadays. You also need such an implementation for mult-ibyte character sets anyhow, so there's really no point to work around this.
Besides, fixed length var always means "fixed length" and not maximum length. The client software must allocate for the fixed memory requirement of the char columns for all rows - regardless whether the column is just half full or even NULL.
That's not the case with VARCHAR, so I highly recommend to focus on that.
- Lars
Thank you very much for your reply Lars. Yes it was with Memory, I tested in another local system and it worked fine.
Lars Breddemann wrote:
Besides, fixed length var always means "fixed length" and not maximum length. The client software must allocate for the fixed memory requirement of the char columns for all rows - regardless whether the column is just half full or even NULL.
That's not the case with VARCHAR, so I highly recommend to focus on that.
Exactly this is the point I have in my mind on favoring CHAR.
My bad I forgot about the RAND() am using in the insert query and assumed that it will insert the same data in both the columns.
Now it makes clear why you used 2-step approach. Now If i see the result the below:(It intelligently shifts the compression technique to RLE)
Yep. Compression works in a similar way for both VARCHAR and CHAR. Adding to the list on preferring VARCHAR to CHAR. And if unicode support required we can use NVARCHAR.
Regards,
Krishna Tangudu
Hi Lars
thanks for sharing the steps. here is the link where I read that varchar would not compress. You can check the link and see if that needs to updated to show the correct information.
My object table has less count so when I ran your commands, I got less data - 4493 records from my objects table. You are right -- there is no difference between char and varchar.
Where can I find more info on compression type ?
Regards
Poonam
Hi Poonam,
For more details on Compression, have a look on the below link:
Regards,
Krishna Tangudu
Hi Poonam
Yes I know this blog post.
And this is a good example for that one should be really cautious with what to believe on the web.
As long as you don't find version information and steps to reproduce the claimed facts you should be suspicious.
Even if you do have these information, it always pays out to double check (just as we've seen in this thread) if the claimed facts are actually true. That's regardless of who the author of the article/blog post is.
- Lars
User | Count |
---|---|
62 | |
9 | |
8 | |
6 | |
6 | |
6 | |
6 | |
6 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.