on 2016 Apr 20 5:07 PM
--- sybase version ---
Adaptive Server Enterprise/15.7.0/EBF 22639 SMP SP52 /P/Sun_svr4/OS 5.10/ase157sp5x/3293/64-bit/FBO/Wed May 21 02:49:14 2014
--- end version ---
Hi community,
I'm starting reading/reading again some chapters of 'ASE 15.7 Performance and Tuning Series: Query Processing and Abstract Plans'.
I think I get the 'big picture' of the need of the optimizer but I'm looking for way to better understand it and the information it needs to provide the best plan to the execution engine.
Is there any document/presentation that will present a kind of 'hands-on understanding the optimizer'?
For example, I get a rough image when I read this
"
The query optimizer processes join predicates the same way it processes search
arguments, in that it uses statistics, number of rows in the table, index heights,
and the cluster ratios for the index and data pages to determine which index and
join method provides the cheapest access. In addition, the query optimizer also
uses join density estimates derived from join histograms that give accurate
estimates of qualifying joining rows and the rows to be scanned in the outer and
inner tables. The query optimizer also must decide on the optimal join ordering
that will yield the most efficient query plan. The next sections describe the key
techniques used in processing joins.
"
Those words would be meaningfull to me if I can see in practice how the output of 'optdiag statistics' helps the optimizer choosing the right plan, how some dbcc traceon(traceflags) can help me understanding the optimizer work.
Thanks in advance for your input.
Simon
Hi all,
Just to complete my remark, after reading the document 'Performance and Tuning Series: Improving Performance with Statistical Analysis' (doc ID DC00976-01-1570-01) I notice the following information on page 29:
Statistics for column group: "pub_id", "type"
Last update of column statistics: Apr 8 2008 9:22:40:963AM
Range cell density: 0.0000000362887320
Total density: 0.0000000362887320
Range selectivity: default used (0.33)
In between selectivity: default used (0.25)
Unique range values: 0.0000000160149449
Unique total values: 0.0000000160149449
Average column width: default used (8.00)
Statistics for column group: "pub_id", "type", "pubdate"
Last update of column statistics: Apr 8 2008 9:22:40:963AM
Range cell density: 0.0000000358937986
Total density: 0.0000000358937986
Range selectivity: default used (0.33)
In between selectivity: default used (0.25)
Unique range values: 0.0000000158004305
Unique total values: 0.0000000158004305
Average column width: 2.0000000000000000
With 5000 rows in the table, the increasing precision of the optimizer’s
estimates of rows to be returned depends on the number of search arguments
used in the query:
• An equality search argument on only pub_id results in the estimate that
0.0335391029690461 * 5000 rows, or 168 rows, will be returned.
• Equality search arguments for all three columns result in the estimate that
0.0002011791956201 * 5000 rows, or only 1 row, will be returned.
I'm wondering where both values 0.0335391029690461 and 0.0002011791956201 come from?
let's assume the optimizer knows the number of rows to be returned (statistics up-to-date), how can I check using a index is cheaper than a tablescan? what's the formulae I could use to compute the costs for each access as an exercise to better understand the mechanism.
Last, are there any traceflags that will help me understanding how the optimized is picking a plan?
Thanks
Simon
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Simon,
There is simply too much to cover here in order for a good understanding looking at your starting point. It requires a base knowledge of statistics and histogramming as well as experience at actually debugging query processing on ASE.
My advice is to start running some queries ( with statement cache off) using set option show_lio_costing on and set statistics plancost. There is no clear documentation that will explain everything you see but the only way to understand it is to work with it day in and day out.
Once you start to make sense of the individual lio costing ( you'll need to look at optdiag output to) you can expand up to set option show_search_engine for join evaluation then finally set option show long for a deep dive. If you want to go further you then have to start looking at show_histograms and so it goes on.
Start simple and look at single search predicate queries and expand up. Create a table, put a row in it, create an index, add another row different values, update stats, and so on and so on. Look at the data available at each iteration.
There is plenty of unofficial documentation around if you look for it ( even on this site).
The figures you mention are histogram frequency cells ( see optdiag), holding essentially a weight which is a proportion of a domain which matches a given value.
Cheers,
Simon
Thanks Simon for you input,
I've started running some tests with a table with a table having two columns and 10001 rows
create table simon (numero numeric(10,6),nom varchar(50))
-- default lock scheme is datarows
this table has an index on simon(numero)
Before executing the next SQL for testing i set the following flags
set statistics io, time, plancost
set showplan on
set option show_lio_costing on
dbcc traceon(3604,9528)
select * from simon where numero between 90 and 94
Maybe you could help me understanding the output?
Beginning selection of qualifying indexes for table 'simon',
Estimating selectivity of index 'simon.simon_nd0', indid 2
numero >= 90.000000
numero <= 94.000000
Estimated selectivity for numero,
selectivity = 0.003413662,
scan selectivity 0.003413662, filter selectivity 0.003413662
34.14003 rows, 1 pages
Data Row Cluster Ratio 0.01049763
Index Page Cluster Ratio 1
Data Page Cluster Ratio 0.07202833
using no index prefetch (size 4K I/O)
in index cache 'default data cache' (cacheid 0) with LRU replacement
using no table prefetch (size 4K I/O)
in data cache 'default data cache' (cacheid 0) with LRU replacement
Data Page LIO for 'simon_nd0' on table 'simon' = 34.14003
I put in bold, the part I'm not sure of.
My understanding is
if I look at the histogram
46 0.00199980 <= 89.977648
47 0.00199980 <= 92.350715
48 0.00199980 <= 94.652860
49 0.00199980 <= 96.226424
I have almost two range corresponding to my between so an estimation of 0.00199980 * 2 = 0.0039996 which is not too far from 0.003413662
If I multiply this selectivity with the total number of rows we basically have 34 rows to be returned which correspond to 1 page, am I correct? (how is compute this 1 pages?)
Then is displayed some information from optdiag and the fact that prefect won't be used.
Then I don't understand how 'Data Page LIO for 'simon_nd0' on table 'simon' = 34.14003' is computed.
Could you highlight me on this?
Thanks in advance.
Simon
You are correct in some of your assumptions. Because the range spans more than one histogram, we have to aggregate the two. However in using histograms, we also use the table level selectivity for the column to predict how many of a given value within the cell there actually is. Hence the selectivity of 0.00341. Since it is a fairly small index, given the average rows per page, all the index LEAF rows are likely all on one page as noted.
However, because you did a select *, then we need to return not only the index key value, but also the columns not contained in the index (non-covered columns). As a result, we need to do an LIO (and possibly a PIO) for each index leaf row - which we already estimated to be 34.14.....Sooooo....we need to do 34 page lookups. Why not just say 34 vs. 34.14.....because when doing joins and other operations, the fractions may provide a bit more accuracy.
Interpolation is the process of determining mathematically where your range of 90-94 falls within the histogram. Basically, you start by finding a percentage of the weight where the lower bound falls between the cells, and multiply that by the weight of the cell. In this case, Cell 47:
0.00199980 * ( 1.000000 - ( (90.000000 - 89.977648)/ (92.350715 - 89.977648)) ) = 0.001981
Then, add the values of the weights of each cell within the range (in this case, there are none as there are only 2 cells, lower bound cell 47, and upper bound cell 48) until you reach the cell of the upper bound, and do another percentage calculation (cell 48):
0.00199980 * ( (94.000000 - 92.350715)/ (94.652860 - 92.350715)) = 0.001433
0.001981 + 0.001433 = 0.003414
This is your scan selectivity (rounded to 6 places in my math because of your displayed statistics precision). Using the precision of the real data it's actually 0.003413662. Multiply scan selectivity times your number of rows:
0.003413662 * 10001 = 34.140033662
So, the optimizer estimates that approx 34.14003 rows match your bounded range criteria. The output above is estimating how much IO will be necessary to read the index values (1 page of non-clustered index pages, as index pages only contain the key value, and a pointer to the data row), and for each matching index row, then accessing an entry in the base table. So, Data Page LIO is the logical io for 34 rows to the base table. Non-clustered index is a separate physical structure, and each entry points to a row in the base table which is itself a separate physical structure.
User | Count |
---|---|
68 | |
8 | |
8 | |
6 | |
6 | |
6 | |
6 | |
6 | |
6 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.