Dear Statalist,

Although it's not a particularly Stata-specific question, I am hoping to get advice on the following (basic?) question.

I am using the following commands to get a correlation matrix:

    quietly estpost correlate `vars', matrix
    esttab using correlations.csv, not unstack compress noobs star(* 0.10 ** 0.05 *** 0.01) long b(%9.2f) replace

`vars' contains a battery of mostly metric variables. Besides the metric variables, there are also three dummy variables.

I am wondering now whether the reported (relatively high) correlation coefficients among the dummy variables, and between some of the metric variables and the dummy variables, are actually meaningful. How should I interpret them, and which correlation test should I use?

Thanks a lot,
Christian
Try Nick's post: http://www.stata.com/statalist/archive/2008-11/msg00933.html

Martin
Hi Martin,

Thanks a lot. I already had a look at this discussion and Nick's article. Unfortunately, I can't find a clear answer to my question. First, the discussion / article is about the correlation between a continuous and a discrete variable; I am wondering, though, how to deal with a dummy (0 / 1) variable (perhaps the same way?). Second, in his mail Nick says, if I understood him right, that the correlation between a discrete and a continuous variable can make sense, but that it depends (on what?). Still, I find it surprising that there is no comment / option in the correlate commands of Stata, as I figure that this is a frequently occurring issue (as Nick mentions). After all, might it be the usual (and easiest) way to exclude dummy variables from a correlation matrix?

Best,
Christian
"First, the discussion / article is about the correlation between a continuous and a discrete variable; I am wondering, though, how to deal with a dummy (0 / 1) variable (perhaps the same way?)."

Is a dummy not a special case of a discrete variable? So the methods applicable to all discrete variables should be applicable to the dummy case as well...

Martin
The correlation of a dichotomous variable and a continuous (normal?) variable is closely related to the t-test. If you look at the p-value, it is exactly the same as the p-value from an (equal-variance) two-sample t-test for the continuous variable using the dichotomous variable as the 'by' variable. Its meaning depends on whether the continuous variable is close to normal.
Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001
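
Tony's point can be checked directly. What follows is only a minimal sketch, assuming the auto dataset shipped with Stata as a stand-in for Christian's data; price and foreign are variables from that dataset, not from the original question. The significance level that -pwcorr- reports for a continuous variable and a 0/1 dummy should agree with the two-sided p-value from the equal-variance two-sample t-test.

```stata
* Stand-in data: the auto dataset shipped with Stata (an assumption for illustration)
sysuse auto, clear

* Pearson correlation between a continuous variable and a 0/1 dummy,
* together with its significance level
pwcorr price foreign, sig

* Two-sample t-test (equal variances is the default) with the dummy as the by() variable
ttest price, by(foreign)

* Rebuild the t statistic and p-value from r and N, then compare with -ttest-
quietly correlate price foreign
scalar t_from_r = r(rho) * sqrt((r(N) - 2) / (1 - r(rho)^2))
scalar p_from_r = 2 * ttail(r(N) - 2, abs(scalar(t_from_r)))
quietly ttest price, by(foreign)
display "p from correlation: " %8.6f scalar(p_from_r) "   p from t-test: " %8.6f r(p)
```

The sign of the t statistic may differ from the sign of r, because -ttest- takes the difference as first group minus second group; the two-sided p-values are what should match.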
In psychometrics, there are the concepts of polychoric and polyserial correlations. The first is between two ordinal variables, and the second is between an ordinal variable and a continuous variable. If your variables are truly nominal (like gender or geography), then the correlations are likely meaningless, although you can meaningfully ask whether the distributions of the continuous variables differ between the values of the discrete variable (answered by ANOVA, the Kruskal-Wallis test and such). I wrote the -polychoric- package some while ago; it computes these correlations.

On Mon, Sep 21, 2009 at 9:45 AM, Christian Weiß <[hidden email]> wrote:
> Dear Statalist,
>
> although it's not a particularly Stata specific question, I am hoping
> to get advice on the following (basic?) question:
>
> I am using the following command to get a correlation matrix
>
> quietly estpost correlate `vars', matrix
> esttab using correlations.csv, not unstack compress noobs star(* 0.10
> ** 0.05 *** 0.01) long b(%9.2f) replace
>
> `vars' contains a battery of mostly metric variables. Besides the
> metric variables, there are also three dummy variables.
>
> I am wondering now if the reported (relatively high) correlation
> coefficients among the dummy variables and between some of the metric
> variables and the dummy variables are actually meaningful. How to
> interpret them / which correlation test to use?
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
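
For the nominal case Stas describes, the question "do the distributions differ between groups?" can be put to Stata roughly as follows. This is a sketch under stated assumptions: it uses the shipped auto dataset rather than Christian's data, and it treats rep78 as if it were a nominal grouping variable purely for illustration.

```stata
* Stand-in data (an assumption for illustration, not the data from the question)
sysuse auto, clear

* rep78 is treated here as a nominal grouping variable with several categories.
* One-way ANOVA: do the group means of price differ?
oneway price rep78, tabulate

* Kruskal-Wallis rank test: a rank-based alternative that does not rely on normality
kwallis price, by(rep78)
```

Both commands drop observations with missing rep78, so the sample is slightly smaller than the full dataset.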
Even the extensive Stata manuals do not take on the task of trying to explain all the statistical possibilities and pitfalls associated with every technique implemented. Why should they?

The literature -- in this case make that literatures -- is hopelessly divided on whether

1. Correlation makes full sense only for (approximately) continuous variables.

2. You need different ideas of correlation when at least one variable is not (approximately) continuous.

Even with two dummies or indicators, there is a range of possibilities.

It is better to think out what makes sense for your project than to be in fear that what you are doing is incorrect according to some authorities or experts.

Nick
[hidden email]
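
To make "a range of possibilities" for two indicators concrete, here is a small sketch of three of them, again assuming the shipped auto dataset. The indicator highmpg and its cutoff of 25 are hypothetical, created only so that there is a second 0/1 variable to work with. Which measure is appropriate remains a judgement about the project, as Nick says.

```stata
* Stand-in data (an assumption for illustration)
sysuse auto, clear

* Second 0/1 indicator, created only for this example (hypothetical cutoff)
generate byte highmpg = mpg >= 25 if !missing(mpg)

* Pearson correlation of two 0/1 variables (the phi coefficient)
correlate foreign highmpg

* 2x2 table with Pearson chi-squared and Cramer's V;
* for a 2x2 table V has the same magnitude as the Pearson correlation
tabulate foreign highmpg, chi2 V

* Tetrachoric correlation: assumes each indicator dichotomizes
* a latent normally distributed variable
tetrachoric foreign highmpg
```

The tetrachoric correlation will generally be larger in magnitude than the Pearson correlation, which is the price of the latent-normality assumption.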
The point-biserial correlation is a correlation between a dichotomous variable and a continuous one: http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient

Peter Lachenbruch noted that this ends up being the same math as the t-test. Unfortunately, skew in the dichotomous variable (an uneven split between the two categories) tends to reduce correlations. Thus methods such as the biserial correlation (a special case of the polyserial that Stas mentioned) "fix up" the correlation at the cost of making some assumptions about the dichotomous variable that may or may not be true in practice. In essence, if you are willing to assume that the dichotomous variable comes from an underlying normal distribution, you can boost the correlation. However, if you are wrong and it does not, you may end up coming to the wrong conclusion.

You can certainly define measures of dependence between a nominal and a continuous variable, but this is going to get tricky because a nominal variable isn't really a single variable (instead it is K-1 indicator variables, where K is the number of categories).

Jay
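
The "boost" Jay describes can be seen by computing both coefficients by hand. This is a sketch only, assuming the shipped auto dataset and the textbook relation between the two coefficients, r_b = r_pb * sqrt(p*q) / y, where p is the share of ones, q = 1 - p, and y is the standard normal density at the threshold that cuts off a share p. The scalar names are made up for this example.

```stata
* Stand-in data (an assumption for illustration)
sysuse auto, clear

* Point-biserial correlation: ordinary Pearson r with a 0/1 variable
quietly correlate mpg foreign
scalar rpb = r(rho)

* Share of observations with foreign == 1
quietly summarize foreign
scalar share1 = r(mean)

* Ordinate of the standard normal density at the threshold that splits
* a latent normal variable into shares (1 - share1) and share1
scalar ord = normalden(invnormal(scalar(share1)))

* Biserial correlation: rescales the point-biserial r under latent normality
scalar rbis = scalar(rpb) * sqrt(scalar(share1) * (1 - scalar(share1))) / scalar(ord)

display "point-biserial r = " %6.3f scalar(rpb)
display "biserial r       = " %6.3f scalar(rbis)
```

The rescaling factor sqrt(p*q)/y is always greater than 1 and grows as the split becomes more uneven, so the biserial value is larger in magnitude. It is only trustworthy if the latent-normality assumption is defensible for the dummy in question.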
