on 2011 Jun 08 1:26 PM
[Just to follow-up on this question]:
I'm doing linear regression tests with SQL Anywhere's builtin OLAP functions. Say, for a very simplified example, I would assume a linear correlation between the columns x and y in a table MyTable.
So I would generate a linear function with
select REGR_COUNT(y, x) as cnt, round(REGR_SLOPE(y, x), 4) as slope, round(REGR_INTERCEPT(y, x), 4) as yIntercept, round(REGR_R2(my, x), 4) as fitness from MyTable where x > 0 and y > 0;
This works well generally. However, what would be a senseful method to exclude outliers?
A simple test for maximum/minimum values (or a ranking) seems inadequate as outliers would be defined as based on their value pairs, not just on the y value.
Currently, Im trying to use the above query as common table expression and then to check for those pairs that have a bigger deviation compared to the generated linear function:
with CTE_LR as (select REGR_COUNT(y, x) as cnt, round(REGR_SLOPE(y, x), 4) as slope, round(REGR_INTERCEPT(y, x), 4) as yIntercept, round(REGR_R2(my, x), 4) as fitness from MyTable where x > 0 and y > 0) select x, y, round(slope * x + yIntercept, 4) as yCalc, abs(yCalc - y) as absDiff, abs(yCalc - y) / y as relDiff from MyTable M, CTE_LR where x > 0 and y > 0 order by relDiff desc, x, y;
However, this helps to detect outliers post-mortem, but obviously they have already influenced the linear regression. I could then build another regression without these outliers (say, those with a certain relative deviation) but that again might exclude the "wrong outliers" based on them being part of the previous regression.
Therefore I would like a way to exclude them beforehand. Is there a (not too complicated) way to do so?
The word percentile comes to my mind, exclude the <5% and the >95% percentile of the data range. See the PERCENT_RANK function and select the set for your regression based on the PERCENT_RANK.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
That seems helpful (and I haven't been aware of that variation of RANK()), but...
...the problem is I'm not looking for absolute values but maximum/minimum y values w.r.t. the according x values.
So basically it's a question how to do a "partition" (or a group by) over a continuous range of double values - and unfortunately not a range which is evenly filled with x values.
User | Count |
---|---|
67 | |
10 | |
10 | |
10 | |
10 | |
8 | |
8 | |
7 | |
5 | |
5 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.