cancel
Showing results for 
Search instead for 
Did you mean: 
Read only

statistics and density

Former Member
0 Likes
952

some fundamental questions regarding usage of density.

if we are looking for a particular column value and it is represented in a frequency cell in a histogram then we can calculate the number of rows involved.

but if a searched column value or values is within a range cell, then to accurately calculate the number of rows involved, then we need density?

is this a right assumption for the need for gathering density statistics?

if so, range cell density makes sense where with a calculation of an average number of duplicates an accurate guess can be made for the number of rows involved for a value or range or values.

but even with this will a single range cell density value for a whole table suffice - or would it be better to have separate range cell density values for each range?

also what purpose does total density which includes frequency cells serve? in the sense that when a particular column value represented by frequency cell gives you the actual number of rows involved with that value, what is the need for a density calculation for such a value?

I also have questions regarding statistics skew which I will raise later.

appreciate the insights.

Accepted Solutions (1)

Accepted Solutions (1)

Former Member
0 Likes

I think Simon Ogden's treatment of estimated row counts deserves some mention here:

http://scn.sap.com/docs/DOC-57087

Former Member
0 Likes

kevin, thanks for the pointer. I am going through the document.

but my questions are primarily conceptual/theoretical. so appreciate any clarification at that level from resident gurus.

simon_ogden
Participant
0 Likes

It's all about trade-offs really. Maintaining statistics and processing them during optimisation can be a costly operation so it is a case of picking a model which given the constraints you have to work with will suffice. However it will still have its limitations. These limitations can be minimised in some areas.

The relevancy/usefulness of the range cell density may depend on:

  • The data
  • The number of steps in the histogram
  • Sampling rate
  • Method used to gather the stats (hash based or sort based)

We'll ignore the data bit. Yes that can be controlled but let's not go redesigning data models (much as they might need it!)

When you search for a single fixed search argument it will use the lower of the range cell density and the individual range cell weight (upper limit) if that value falls within a range cell.

The idea here is it picks the most selective of the the two. It's not foolproof in any way and the bigger the tables and the less uniform the distribution of the values are, the more chances you might have individual value skew within a range cell and large discrepancies between the various weights of each range cell.

You mimimise the chance of this by increasing the number of steps.

By specifying a higher number of steps you increase the chances of frequency cells for values that previously may have skewed weights within range cells. It follows that you will most likely have less range cells with more steps. Increasing histogram step count on large tables can have a dramatic effect on the range cell density (and rightly so).

Total density is the overall measure of the average run of duplicates and in versions prior to 15, it was used to cost joins.

In version15 and 16, it'll still get used (there may be others as well):

  • For joins when one side of the join does not have a histogram available (if there is a histogram on both sides, it'll merge the histograms)
  • Unknowns such as  - select blah where column in (select column2 from table)
  • It'll also be used if you have compatibility_mode enabled.
  • It'll also be used to cost joins in any query with 7 or more tables during the alternative greedy costing (used to prime the search engine)

In an ideal world you have statistics that tell you the weight of every value in every column and every combination of every value of every column. It doesn't take long for the storage required to be in the terabyte range and for maintenance to take weeks (it takes long enough already!).

There are other ways of representing data but at this point, ASE is what it is.

Former Member
0 Likes

thanks for the clarification simon.

so if one table in a join does not have a histogram and the other table does, then the available histogram is used for both tables?

even then why total density and not range cell weight or frequency cell weight or range cell density if a particular value is sought?

also a statistics skew is due to ir-reconciliation between range cell density and total density?

in what kind of situation (join etc) are the two compared?

appreciate the insights.

simon_ogden
Participant
0 Likes

If there is a histogram available for only one side of a join clause then the total density of the column will be used. There is no other option. The total density is in essence a consolidation of a columns available histogram. If you have no knowledge of what values are being passed to the inner side of a join then it has to assume it could be any value, therefore total density is as good as it gets.

It won't optimise based on individual values for a join unless of course one or other side also has a search argument. If that's the case then that of course is taken into account via transitive closure.

Statistical skew is is merely a consequence and extraction of data skew. If you have a very selective range cell density and a much higher total density then this implies the data is skewed toward certain values. I.e. you have some large weight frequency cells.

This skew can be removed from the perspective of the statistics by running sp_modifystats.

This method was more useful pre version 15. Join histograms on 15 eliminated the need to do this in a lot of circumstances.

Former Member
0 Likes

simon thanks for the reply.

let us suppose there is a select on tables tab1 and tab2 and the where clause goes like this :

where tab1.col1 = 1200

and tab1.col1 = tab2.col1

so let us suppose tab1 has an index on col1 but there is no index on tab2 and so no statistics/histogram for tab2.col1.

but still there is a basis for this search because tab1.col1 has a specified searched value of 1200.

so the range cell weight or frequency weight or range cell density for that value from tab1 can still be used as an approximate value for tab2.col1.

right?

next example, let us suppose that the where clause only reads as :

where tab1.col1 = tab2.col1

in which case the data from the entire table is being sought. so the total number of rows from tab1 which has the statistics could still be used even if tab2 contains lesser rows.

and anyway total density which only has an average value of duplicates for all key values, cannot be used as an approximation for the total number of rows in the table.

so where is the necessity for a value like total density? that is my primary question from the opening post.

I can see many meaningful ways in which the number of rows sought can be determined without a value like total density, but can't see a practical use for that particular value.

sorry for beating this issue like the proverbial dead horse 🙂 - just trying to understand this.

also btw appreciate it if you could point me to the manual pages on ver 15 join histograms. thanks.

simon_ogden
Participant
0 Likes

The simple answer to the first part is yes. In terms of the logical processing of the optimizer, this:

where tab1.col1 = 1200

and tab1.col1 = tab2.col1

equates to

where tab1.col1 = 1200

and tab1.col1 = tab2.col1

and tab2.col1 = 1200

Don't assume there is precedence when it comes to the predicate, all available options will be considered. Often people assume that maybe the search argument filtering takes place before a join and the answer to that is sometimes it will sometimes is won't, it's entirely dependant on the plan selection.

I think we've covered total density, but let me clarify.

Take your simple join predicate

where tab1.col1 = tab2.col1

which is equivalent to

where tab2.col1 = tab1.col1

So don't think about one being the outer and one being the inner, that's the job of the optimizer.

If you have statistics for tab1.col1 (which will mean a histogram and total density) then what does the optimizer know?

It knows the row count of both tables, it also *might* know if one or other side is unique (single key unique index maybe). Let's assume there isn't a unique index on tab2.col1 or tab1.col1.

The histogram for tab1.col1 can't be used as it has no clue what values might be getting passed in from tab2.col1 and no search argument to filter it.

Say it picks tab2 as the outer table, that means X rows being passed into the join; for each one of these rows, it has to try to guess how many rows will match from tab1, the only possible way of making this guess (in ASE)  is to use the total density, the value of tab2.col1 could be anything at all so it has to use the average weight which is the total density. Some other vendors use 1/'unique values', but using this value isn't as representative of the data as it assumes an even distribution.

100 rows in tab2 for example, row count of tab1 is 50,  total density of tab1.col1 = 0.2, so for each row in tab2 it has to assume (remember it has no clue about actual values) it'll return 10 rows from tab1, so it estimates 1000 rows come out of the join. In reality this could be correct, somewhat correct or entirely incorrect, it's a total guess, but it has to be.

--------------

I don't think there is any documentation on join histograms (on SAP/Sybase website) but to put it simply, think of it like this rather simple example where a histogram is available for both sides of a join. Total density is best calculated as the sum of the square of all the weights so for tab1.col1 below (0.2^2+0.3^2+0.4^2+0.1^2) = 0.30

Rows=1000, total density =  0.30               Rows=1000, total density =  0.38

1 = 0.2                                                              1 = 0.5

2 = 0.3                                                              3 = 0.2

3 = 0.4                                                              6 = 0.3

4 = 0.1

Join histogram:-

Rows= 1000*1000 = 1000000(i.e. a million rows if no join condition, cartesian product)

1 = 0.5 * 0.2

2 = 0.3 * 0

3 = 0.4 * 0.2

4 = 0.1 * 0

6 = 0 * 0.3

1 = 0.1

2 = 0

3 = 0.08

4 = 0

6 = 0

0.18 * 1000000

= 180,000 rows

So if you had no other filtering in this query, then this join *will* produce 180,000 rows.

The estimate using the total density (which would have been the method used to calculate estimate row count  on ASE 12.5.x) is

1000 * 0.30 * 1000 = 300,000 rows.

As soon as you start adding other entries into the predicate things obviously becomes more complicated.

My advice is to start looking at some outputs if you really want to see what is going on.

set option show long --this is the only trace that will give you all the bits you need (can be huge)

set option show_histograms long --output from this can be huge also

set statement_cache off --if you don't do this, it'll just use a stored plan after the first execution.

set switch on 3604 --required to redirect to the client  from STDOUT on the server

set statistics time,io,plancost, resource on --execution stats, compare actuals to estimates etc.

Take care which client you use, your best and safest option remains plain isql (or other simple ct-lib program); other more complex tools can use multiple connections and or issue additional sql which might give you some unexpected output.

Former Member
0 Likes

If you have statistics for tab1.col1 (which will mean a histogram and total density) then what does the optimizer know?

It knows the row count of both tables, it also *might* know if one or other side is unique (single key unique index maybe). Let's assume there isn't a unique index on tab2.col1 or tab1.col1.

The histogram for tab1.col1 can't be used as it has no clue what values might be getting passed in from tab2.col1 and no search argument to filter it.

>>> so here for each row in tab2 it is trying to find the matching record in tab1 - right?

Say it picks tab2 as the outer table,

>>> is that more likely as tables without indexes are not generally considered optimal as the inner table?

that means X rows being passed into the join; for each one of these rows, it has to try to guess how many rows will match from tab1, the only possible way of making this guess (in ASE)  is to use the total density,

>>> why??? if tab1 has a histogram, then the number of matching rows for each key value from tab2 in tab1 can be easily determined either from frequency weight or range weight or frequency density from the histogram of tab1?

i am not sure if i am missing something basic here.

simon btw, thanks so much for the write up on the join histograms.

simon_ogden
Participant
0 Likes

>>> so here for each row in tab2 it is trying to find the matching record in tab1 - right?

Yep

>>> is that more likely as tables without indexes are not generally considered optimal as the inner table?

It doesn't really matter for this example, but remember there are multiple join types and also that columns can have statistics without the need for an index (if they have been manually created), but yes, you are 'probably' correct.

>>> why??? if tab1 has a histogram, then the number of matching rows for each key value from tab2 in tab1 can be easily determined either from frequency weight or range weight or frequency density from the histogram of tab1?

I think this is the bit I need to be clearer on. We're talking about compilation here, compilation takes place before execution.

At compilation time, it has to pick a query plan to pass to the execution phase. At compilation time (for the first example I gave with no stats for tab2) it has no knowledge of what the values are in tab2, so when it comes to picking a query plan it has to assume that the rows in tab2 could have any value for col1. By that I mean ANY value, it could contain any value legal for the data type, it's not restricted to the values in the histogram for tab1.col1.

If they have any value at all, then it then the best guess for each possible value is the total density.

The estimates originate during compilation, they are not amended during execution when the values in tab2 become known, those values becomes the actual values.

Even with a histogram for tab1.col1 this is of no help at all. Take the below, how would you think it could estimate a row count using the histogram?

tab2.col1                              tab1.col1

Rows=1000                        Rows=1000, total density =  0.38

(no histogram)                   1 = 0.5

                                              3 = 0.2

                                              6 = 0.3

Former Member
0 Likes

thanks for the reply and patience simon.

i understand the dynamics of compilation and i don't think my question is regarding that.

i think my primary question is that if tab2 does not have a histogram, then why cannot tab1's range/frequency weights and range density serve as a basis for evaluating tab2 ?

so if the query is looking for the the key value 1 (from your sample data) from tab2, why not simply use the frequency weight from tab1?

or for values 1 and 3, as per your sample data it could still use the frequency weights of the two. or if one of the values sought is part of a range cell, then could it not use the range density?

instead of using 380 rows from total density, why not use 700 rows from the frequency weights?

simon_ogden
Participant
0 Likes

No worries 🙂

i think my primary question is that if tab2 does not have a histogram, then why cannot tab1's range/frequency weights and range density serve as a basis for evaluating tab2


The total density IS the combined average of tab1's range/frequency weights and range density. If it estimated 700 it would making the assumption tab2.col1 is either 1 or 3, it has no idea what tab2 contains. The example I am using here is about a query with that single join condition only, no other entry in the predicate that is looking for a specific value.

Answers (0)