outliers v skew

Campo, Marc
We have developed a least-squares regression model with a three-level categorical predictor (and about 3 covariates) and an outcome (SF-12) that is skewed.  The SF-12 scores are negatively skewed in each group of the primary predictor.  This can be unavoidable if your mean score is above national norms.  The residuals are similarly skewed (in each predictor category), but slightly less so.

The skew results from a series of outliers in each group, almost all of which are negative.  Without the outliers, the residual distributions are somewhat normal.  We have a large sample (about 1,445), and I would say there are about 25-30 observations with standardized residuals between -3 and -5.5.  With the exception of a couple of cases, most of these don't seem to cause much havoc.  They are interesting and may deserve to influence the coefficients.  Hard to tell, even from a clinical perspective.

So, is this a question of non-normality, an outlier problem, or both?  Both robust and median regression result in coefficients that are about 1 less than those from standard regression for one level of the dummy-coded predictor, and similar for the other.  (There is also an age interaction, so we report across 3 levels of age.)  Transformations of the DV don't seem to add much.  Of the variety of transformations we looked at, only the cubic made things even close to normal.  Transformations would also make the results difficult to interpret.
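
For readers following the thread, the gap between least-squares and median regression can be illustrated with a minimal Python sketch. It assumes the simplest possible case, a model containing only the dummy-coded categorical predictor, where the least-squares coefficient is a difference in group means and the median-regression coefficient is a difference in group medians. With covariates or the age interaction this no longer holds exactly. The scores are made up for illustration and are not the SF-12 data.

```python
from statistics import mean, median

# Hypothetical scores for two levels of a categorical predictor (not the SF-12 data).
# The comparison group contains one large negative value.
reference = [52, 54, 55, 56, 58]
comparison = [30, 55, 57, 58, 60]

# With only a dummy-coded predictor in the model, least squares fits group means
# and median regression fits group medians.
ols_coef = mean(comparison) - mean(reference)
median_coef = median(comparison) - median(reference)

print("least-squares dummy coefficient:", ols_coef)
print("median-regression dummy coefficient:", median_coef)
```

In this toy case the single low score pulls the least-squares coefficient down to -3 while the median-based coefficient stays at 2, so a gap between the two fits points to a few tail observations carrying weight in the least-squares estimate.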



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Re: outliers v skew

nshephard
Administrator
On Tue, Nov 30, 2010 at 4:05 PM, Campo, Marc <[hidden email]> wrote:

> We have developed a least squares regression model with a three level categorical predictor (and about 3 covariates) and an outcome (SF-12) that is skewed.  The sf-12 scores are skewed - negatively - in each group of the primary predictor.  This can be unavoidable if your mean score is above national norms.  The residuals are similarly skewed (in each predictor category) but slightly less so.
>
> The skew results from a series of outliers in each group, almost all of which are negative.   Without the outliers, the residual distributions are
> somewhat normal.   We have a large sample (about 1,445) and I would
> say there are about 25-30 observations with standardized residuals
> between -3 and -5.5.   With the exception of a couple of the cases
> most of these don't seem to cause much havoc.  They are interesting
> and may deserve to influence the coefficients.  Hard to tell - even
> from a clinical perspective.
>
> So, is this a question of non normality or an outlier problem or
> both.......both robust and median regression results in coefficients that are about 1 less than standard regression for one level of the dummy coded predictor, and similar for the other.  (there is also an age interaction so we report across 3 levels of age).  Transformations of the DV don't seem to add much. Looking at a variety of transformations  only the cubic seemed to make things even close to normal.   Transformations would also make this difficult to interpret.
>

This seems virtually identical to the following post/thread on
MedStats a couple of days ago (except you appear to have posted to
Statalist from your work email address as opposed to your personal
Yahoo! one)...

http://groups.google.com/group/medstats/browse_thread/thread/95e098ccc98fdd76

Have you tried the suggestions others made there?

What was unsatisfactory/lacking in those responses?

Neil

--
"Our civilization would be pitifully immature without the intellectual
revolution led by Darwin" - Motoo Kimura, The Neutral Theory of
Molecular Evolution

Email - [hidden email]
Website - http://kimura-no-ip.org/
Photos - http://www.flickr.com/photos/slackline/


RE: outliers v skew

Nick Cox
In reply to this post by Campo, Marc
Various reflections:

I don't know what SF-12 means. I don't know what nation you are referring to. (This is an international list, so you shouldn't assume that people are from your own country.) I infer a medical context, so I sprinkle my answer with words like "clinical", as if I knew something of your field.

The least important assumption that might be made about a regression model is that the outcome, conditional on the predictors, is normal. I am seeing here only information that the outcome distribution, conditional on some categories, is skewed. Clinically you might well regard your categorical predictor as the most important, but neither the regression nor Stata knows or cares about that. All the predictors matter in this respect.

Tastes vary, but thinking of your individuals as (a large group that is fairly normal, considering) PLUS (a small group that is more awkward) seems unlikely to be fruitful unless there are _independent_ clinical grounds for thinking that there is a mixture of groups.

Rather, the signals are that you have slight skewness and the question is whether it will bite you. One of the other things you can do is omit the "outliers" and see what difference that makes. A better thing to do is to use not different transformations but different link functions in a generalised linear model and to see how far the predictions and the model figures of merit vary with link functions. Ideally, you will see little variation.
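
One way to act on the "omit and compare" suggestion is sketched below in Python. It is an illustration only: the data are made up, the model has a single covariate, and "standardized residual" is taken here to mean the residual divided by the root mean squared error with no leverage adjustment, which is a simplifying assumption and not what -predict, rstandard- computes in Stata.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return ybar - b * xbar, b

def scaled_residuals(xs, ys, a, b):
    """Residuals divided by the root mean squared error (no leverage adjustment)."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    rmse = (sum(r * r for r in res) / (len(res) - 2)) ** 0.5
    return [r / rmse for r in res]

# Made-up data: a linear trend with small deterministic noise,
# plus five observations pushed far into the lower tail.
xs = [i % 20 for i in range(200)]
ys = [50 + 0.5 * x + ((i * 7) % 11 - 5) * 0.5 for i, x in enumerate(xs)]
for i in (3, 47, 88, 131, 176):
    ys[i] -= 30

# Fit on everything, flag scaled residuals at or below -3, then refit without them.
a_all, b_all = fit_line(xs, ys)
z = scaled_residuals(xs, ys, a_all, b_all)
keep = [i for i, zi in enumerate(z) if zi > -3]
a_trim, b_trim = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
n_dropped = len(xs) - len(keep)

print("observations dropped:", n_dropped)
print("slope, all data:", round(b_all, 3), " slope, trimmed:", round(b_trim, 3))
print("intercept, all data:", round(a_all, 3), " intercept, trimmed:", round(a_trim, 3))
```

If the coefficients of interest barely move between the two fits, as the slope does here, the tail observations are not driving the conclusions; the refit is a sensitivity check, not a licence to discard the observations.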

My prejudice is that most things declared outliers are not convincingly outliers. It takes more than being big (or small, as the case may be) and far out in a tail.

If it takes a cube transformation to show much difference, the implication is that your skewness is not really biting. The cube is a very strong transformation and one I would only use if there were dimensional grounds, as if I had some lengths and some volumes. (Even then, I would cube root the volumes, rather than cube the lengths!)
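
To see how the strength of a power transformation shows up in a skewness coefficient, here is a small Python sketch. The scores are invented (positive values bunched near the top with a tail toward lower values) and the moment-based skewness function is one common definition chosen for illustration; neither comes from the original analysis.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Made-up positive scores between 20 and 60, most of them close to 60.
scores = [60 - 40 * (k / 100) ** 4 for k in range(101)]

skew_raw = skewness(scores)
skew_square = skewness([s ** 2 for s in scores])
skew_cube = skewness([s ** 3 for s in scores])

print("skewness, raw:    ", round(skew_raw, 3))
print("skewness, squared:", round(skew_square, 3))
print("skewness, cubed:  ", round(skew_cube, 3))
```

On these made-up values each step up the ladder of powers moves the skewness coefficient upward from its negative starting point. How far it moves for real scores depends on their range and spread, so the same comparison on the actual outcome would show whether anything short of the cube makes a visible difference.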

Nick
[hidden email]
