Re: st: difference between "Spearman" and "pwcorr / correlate"

Mike Lacy

Stas Kolenikov <[hidden email]> wrote:

 >
 >Inference for Pearson's moment correlation relies on normality of the
 >data. Spearman rank correlation is free of any assumptions, but there
 >is no population characteristic that it estimates, which makes
 >interpretation and asymptotic inference somewhat weird. If one is
 >significant and the other is not, you are making either type I or type
 >II error somewhere.
 >
 >On 10/6/09, Ashwin Ananthakrishnan <[hidden email]> wrote:
 >> Hi,
 >>
 >> In examining the correlation between two variables, what is the
 >> difference in utility of the Spearman correlation co-efficient (Stata
 >> command 'spearman') and the Pearson correlation co-efficient (Stata
 >> command "pwcorr" or "correlate")?

In the angels on the head of a pin vein:
Of possible interest in this regard is that the Spearman coefficient
is the same as the Pearson calculated on the ranked values of the
variables (ties getting the average rank).  I would agree that this
is not a terribly interesting population parameter, but isn't this
nevertheless an estimable/testable population characteristic?
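The identity mentioned above (Spearman equals Pearson computed on average ranks) can be checked directly. What follows is only a minimal sketch in plain Python, not Stata's implementation of -spearman-; the function names and the toy data are made up for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y is a monotone but nonlinear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)     # below 1, because the relation is not linear
r_spearman = spearman(x, y)   # 1.0, because the relation is monotone
tied = average_ranks([10, 20, 20, 30])  # [1.0, 2.5, 2.5, 4.0]
```

The example also shows the practical difference the original question asked about: a monotone nonlinear relation gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1.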

Regards,
=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy
Fort Collins CO USA
(970) 491-6721 office






*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Re: st: difference between "Spearman" and "pwcorr / correlate"

Stas Kolenikov
>  >Inference for Pearson's moment correlation relies on normality of the
>  >data. Spearman rank correlation is free of any assumptions, but there
>  >is no population characteristic that it estimates, which makes
>  >interpretation and asymptotic inference somewhat weird. If one is
>  >significant and the other is not, you are making either type I or type
>  >II error somewhere.
>  In the angels on the head of a pin vein:
>  Of possible interest in this regard is that the Spearman coefficient is the
> same as the Pearson calculated on the ranked values of the variables (ties
> getting the average rank).  I would agree that this is not a terribly
> interesting population parameter, but isn't this nevertheless an
> estimable/testable population characteristic?

If you have a finite population, then of course you will have Spearman
correlation for it. Although if you want to set up any asymptotic
framework, you will be trying to hit a moving target. I don't think
there is a meaningful definition of Spearman correlation for infinite
populations/continuous variables, although I might be mistaken. On the
other hand, Kendall's tau, as Nick Cox quoted from Roger Newson, has
explicit population analogues in probabilities of concordant and
discordant pairs of observations.

The question is: if the correlation estimate is 0.5, what does it say?
For Pearson moment correlation, it means that the proportion of
explained variance in a bivariate regression is 0.25. For Kendall's
tau, it means that for every discordant pair of observations, there
are three concordant pairs (i.e., Prob[ concordant ] = 3 Prob[
discordant ] = 3/4 ). For Spearman rank correlation, you can only say
that the variables are positively associated, but not much more.
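The arithmetic behind these two readings of 0.5 can be spelled out. This is a sketch under the assumption of continuous data, so that ties have probability zero; the symbols P_c and P_d are shorthand introduced here and are not notation from the thread.

```latex
% Assumes continuous data: every pair is either concordant or discordant.
% P_c = Prob[concordant], P_d = Prob[discordant] (shorthand introduced here).
\begin{align*}
\tau &= P_c - P_d, \qquad P_c + P_d = 1
  \;\Longrightarrow\; P_c = \frac{1 + \tau}{2}, \quad P_d = \frac{1 - \tau}{2}, \\
\tau = 0.5 &\;\Longrightarrow\; P_c = \tfrac{3}{4}, \quad P_d = \tfrac{1}{4}, \quad P_c / P_d = 3, \\
r = 0.5 &\;\Longrightarrow\; R^2 = r^2 = 0.25 .
\end{align*}
```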

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

RE: st: difference between "Spearman" and "pwcorr / correlate"

Roger Newson
There IS an interpretation of the Spearman correlation for continuous variables in an infinite population. In that case, if the random variables are X and Y, then the Spearman rho(X,Y) is simply the Pearson correlation of F_X(X) and F_Y(Y), where F_X(.) and F_Y(.) are the population cumulative distribution functions of X and Y respectively. And a Pearson correlation, as always, is a measure of linearity.
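Written out, this characterisation gives a closed form. The sketch below assumes continuous marginal distributions, so that F_X(X) and F_Y(Y) are each uniform on (0, 1); U and V are shorthand introduced here for illustration.

```latex
% Assumes X and Y have continuous marginal CDFs F_X and F_Y.
% U and V are shorthand introduced here, not notation from the thread.
\begin{align*}
U &= F_X(X), \qquad V = F_Y(Y), \qquad U, V \sim \mathrm{Uniform}(0, 1), \\
\operatorname{E}[U] &= \operatorname{E}[V] = \tfrac{1}{2}, \qquad
\operatorname{Var}(U) = \operatorname{Var}(V) = \tfrac{1}{12}, \\
\rho_S(X, Y) &= \operatorname{Corr}(U, V)
  = \frac{\operatorname{E}[UV] - \tfrac{1}{4}}{\tfrac{1}{12}}
  = 12\,\operatorname{E}[UV] - 3 .
\end{align*}
```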

The two main problems with the Spearman rho are that (a) it is ONLY a measure of linearity between 2 cumulative distribution functions (with no interpretation as a difference between concordance and discordance probabilities), and that (b) the Central Limit Theorem works a lot less quickly for the sample Spearman rho than for the sample Kendall tau-a, especially under the null hypothesis of zero correlation (see Kendall and Gibbons, 1990).

Best wishes

Roger


References

Kendall, M. G., and J. D. Gibbons. 1990. Rank Correlation Methods. 5th ed. Oxford, UK: Oxford University Press.


Roger B Newson BSc MSc DPhil
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton Campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: [hidden email]
Web page: http://www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:
http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/

Opinions expressed are those of the author, not of the institution.


RE: st: difference between "Spearman" and "pwcorr / correlate"

Nick Cox
(a) is on all fours with "the problem with pink is that it isn't blue".
That is, (a) amounts to saying that the problem with Spearman's rank is
that it's not Kendall's tau. True, but the reverse is equally true.

That aside, I think most users of rank correlation would be happy to
acknowledge advantages and disadvantages of each such measure, and
indeed to note that they should give similar results in practice. For
example, given the property emphasised earlier in the thread that

Spearman(x, y) = Pearson(rank(x), rank(y))

one of many possibilities for Spearman correlations is that they offer a
route to a robustified PCA. (You can be sure that the eigenproperties
are OK.)

Nick
[hidden email]


RE: st: difference between "Spearman" and "pwcorr / correlate"

Nick Cox
In reply to this post by Stas Kolenikov
There's a tacit criterion here, that techniques must have simple verbal
interpretations. I am as much in favour of simple verbal interpretations
as the next person -- nay, on average, more so -- but while they're a
bonus when available, insisting on them would deprive you of much that
is indispensable.

What's the simple verbal interpretation of (say) eigenvectors or an SVD?


Nick
[hidden email]


Re: st: difference between "Spearman" and "pwcorr / correlate"

Stas Kolenikov
On Thu, Oct 8, 2009 at 11:33 AM, Nick Cox <[hidden email]> wrote:
> There's a tacit criterion here, that techniques must have simple verbal
> interpretations. I am as much in favour of simple verbal interpretations
> as the next person -- nay, on average, more so -- but while they're a
> bonus when available insisting on them would deprive you of much that is
> indispensable.
>
> What's the simple verbal interpretation of (say) eigenvectors or an SVD?

The eigenproblems are very visual. The eigenvectors are the specific
directions in which the matrix acts as pure scaling: a vector along
one of them stretches without any rotation, and the corresponding
eigenvalue tells you by how much its length changes. If we talk about
an eigenproblem for a covariance matrix, then the square roots of the
eigenvalues are the "radii" of the rugby ball/American football formed
by the points in multivariate space, and the eigenvectors are again
the directions that give the orientation of that rugby ball relative
to the "official" axes. SVDs can be explained by the -biplot-s,
although with greater effort.
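The "stretch without rotation" picture can be checked on a small case. This is only an illustrative sketch in plain Python: the 2x2 covariance matrix is a made-up example, the function names are hypothetical, and the closed form used applies only to symmetric 2x2 matrices. Taking square roots of the eigenvalues as axis lengths follows the usual convention for a one-standard-deviation ellipse.

```python
from math import atan2, cos, sin, sqrt


def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2
    half_gap = sqrt(((a - d) / 2) ** 2 + b ** 2)
    lam1, lam2 = mean + half_gap, mean - half_gap
    # Angle of the major axis relative to the first coordinate axis.
    theta = 0.5 * atan2(2 * b, a - d)
    v1 = (cos(theta), sin(theta))    # direction belonging to lam1
    v2 = (-sin(theta), cos(theta))   # perpendicular direction, belonging to lam2
    return (lam1, v1), (lam2, v2)


def apply(a, b, d, v):
    """Multiply the matrix [[a, b], [b, d]] by the vector v."""
    return (a * v[0] + b * v[1], b * v[0] + d * v[1])


# Hypothetical covariance matrix: variances 2 and 2, covariance 1.
a, b, d = 2.0, 1.0, 2.0
(lam1, v1), (lam2, v2) = eig_sym_2x2(a, b, d)

# The image of an eigenvector is the same vector scaled by its eigenvalue.
image1 = apply(a, b, d, v1)
image2 = apply(a, b, d, v2)

# Semi-axes of the one-standard-deviation ellipse of the point cloud.
semi_axes = (sqrt(lam1), sqrt(lam2))
```

Here the long axis of the ellipse points along the 45-degree line, which is what one would expect when both variables have the same variance and are positively correlated.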

I usually want to know what I am estimating. Then I can eyeball
something along the lines of "the difference between the unknown
population distribution function and the sample distribution is such
and such, and hence by an appropriate version of the influence
function expansions and/or the delta-method, the difference between
the unknown parameter and the estimate at hand must be of such and
such order." Thanks to Roger, I now have a better clue of what I am
estimating with Spearman correlation. And there are probably a dozen
other rank-type correlations that would make at least as much sense as
(linear) correlation of the cdfs.

One other comparison can be made regarding the computational
requirements. Spearman's rho is O( n log(n) ) due to sorting, while
Kendall's tau is O( n^2 ) for the pairwise comparisons. Of course
Pearson's moment correlation is O( n ), it's just manipulation of
sums. One would only see differences in timing of Pearson and Spearman
with the sample sizes such that -sort- takes a noticeable amount of
time, while Kendall's tau is slow with more than 100 observations.
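The quadratic cost comes from visiting every pair of observations once. Below is a minimal sketch of that direct pairwise computation of Kendall's tau-a in plain Python; it is not the code behind Stata's -ktau-, and the function name and data are made up for illustration. Sort-based algorithms that compute tau in O( n log(n) ) also exist, but this sketch does not attempt one.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a by direct comparison of all n(n-1)/2 pairs: O(n^2)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1      # the pair is ordered the same way on x and y
        elif sign < 0:
            discordant += 1      # the pair is ordered oppositely on x and y
        # sign == 0 means a tie on x or y; tau-a counts it as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs, concordant, discordant


# Hypothetical data: 8 observations, so 28 pairs in total.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [8, 1, 2, 3, 4, 5, 6, 7]
tau, n_conc, n_disc = kendall_tau_a(x, y)
# tau is 0.5 with 21 concordant and 7 discordant pairs:
# three concordant pairs for every discordant one.
```

The data were chosen so that the result reproduces the "three concordant pairs for every discordant pair" reading of tau = 0.5 given earlier in the thread.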

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

RE: st: difference between "Spearman" and "pwcorr / correlate"

Nick Cox
My point needs rephrasing. I draw a distinction between verbal
definitions or characterisations on the one hand and verbal analogies on
the other. The difference lies in whether you can take the verbal
statements and reconstruct the formula or method from them; with mere
analogies you can't do that. However, Pearson correlations are pretty
much defined by their square being the fraction of variance explained by
the corresponding regression, modulo sign of course. In contrast, if I
explain Spearman correlation in verbal terms as a measure of
monotonicity that does not imply the particular formula used.

Nick
[hidden email]
