st: problem with dividing dataset into equally sized groups


Gisella Young
I am trying to divide my dataset into equally sized groups on the basis of an income variable (e.g. 100 groups from lowest to highest income). I have tried several methods, but the groups are not equally sized. For example,

-xtile cat=income, n(100)-
 (similarly with -pctile-)
and
-sumdist income, n(100) qgp(cat)-

Each produces the desired number of groups, but they are not equally sized (which I can see from the frequencies when I run -tab cat- afterwards). The differences are not small: some groups are many times larger than others. This is not because of weighting, as I get the same result without weights. Nor does it depend on the number of groups I ask for. I wonder whether it might be because of clustering of incomes around certain values (e.g. 10 000, 15 000), with all of the observations at those values being lumped into the same categories.
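
For what it's worth, a quick way to check that hunch (a minimal sketch, using the variable names income and cat from above):

* how many observations share each income value?
duplicates report income
* which groups are largest, and do they sit on tied round values?
tab cat, sort
tabstat income, by(cat) stat(n min max)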

Can anyone suggest a way to partition the sample into equally sized groups?


This actually stems from an earlier thread (no need to read that for the above) about plotting a chart of the income distribution showing the occupational composition of each percentile. Austin's suggestion (below) comes close to that. However, even with his code the groups are not equally sized; they come out the same sizes as with the -sumdist- and -xtile- commands mentioned above.

best,
Gisella

--- On Mon, 12/1/08, Austin Nichols <[hidden email]> wrote:

> From: Austin Nichols <[hidden email]>
> Subject: Re: st: how to make an area graph showing distribution?
> To: [hidden email]
> Date: Monday, December 1, 2008, 2:02 AM
> Gisella Young <[hidden email]>:
> It may be that you are looking for a simple stacked bar graph over
> income quintiles or deciles or the like, as opposed to a parametric
> smooth over income quantiles.  If so, you might want to adapt one of
> this pair of example graphs to your needs:
>
> clear all
> sysuse nlsw88
> ren industry i
> tab i, g(ind)
> g w=round(uniform()*20)
> la var w "fake survey weight"
> _pctile wage [pw=w], nq(5)
> g q=1 if wage<=r(r1)
> forv i=2/5 {
>  replace q=`i' if wage>r(r`=`i'-1') & wage<=r(r`i')
>  }
> loc y
> forv i=1/12 {
>  loc l "`=substr("`: var la ind`i''",4,.)'"
>  loc y `"`y' lab(`i' "`l'")"'
>  loc lv`i' `"la var ind`i' "`l'" "'
>  }
> gr bar ind* [pw=w], stack over(q) name(b) leg(`y')
> collapse ind* [pw=w], by(q)
> forv i=2/12 {
>  replace ind`i'=ind`i'+ind`=`i'-1'
>  }
> loc v
> forv i=1/12 {
>  `lv`i''
>  loc v "ind`i' `v'"
>  }
> tw bar `v' q, name(tw)
>
> Note that the commands above destroy the data in memory, so make sure
> you -preserve- or -save- first as appropriate.  Also note that there
> is no guarantee that the distributions of income by occupation, or
> occupation by income category, display any sort of stochastic
> dominance that would allow easy ranking of occupations.
>
> See also
> http://www.stata.com/capabilities/graphexamples.html
>
>
> On Sun, Nov 30, 2008 at 10:37 AM, Maarten buis <[hidden email]> wrote:
> > --- Gisella Young <[hidden email]> wrote:
> >> On Maarten Buis's suggestion, I am not sure why I would really need
> >> a regression - I get from his email that this is basically for
> >> smoothing?
> >
> > Yes, as income in the example dataset (and I assume in your dataset
> > as well) is a continuous variable, there just aren't enough cases
> > for each income value to estimate the proportions.
> >
> >> Since I actually want to plot the actual data (but realise
> >> that this needs smoothing),

st: RE: problem with dividing dataset into equally sized groups

Martin Weiss-5
Line for the server...

Try -egen, cut()- with the -group(#)- option.
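
A minimal sketch of that call, assuming the income variable is named income and 100 groups are wanted as in the original post:

* group(100) asks for 100 groups of as-equal-as-possible frequency, numbered 0-99
* (ties in income can still leave the groups unequal; see David Elliott's post below)
egen cat = cut(income), group(100)
tab cat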


HTH
Martin



st: RE: problem with dividing dataset into equally sized groups

Nick Cox
In reply to this post by Gisella Young
Exactly equal-sized groups are only guaranteed if

1. the number of observations is an exact multiple of the number of
groups (which usually bites minutely)

2. there are no problems with ties (which often bites substantially).

Your problem is evidently #2.

You can only force equal-sized groups if you assign the same value to
different groups in at least some cases. You can always force that by
perturbing your data with random noise before passing them to -xtile-,
but that's hardly a satisfactory approach.

But the whole approach is pretty unsatisfactory anyway: this kind of
subdivision throws away information which is not obviously dispensable.

I've not been following this thread carefully but my impression is that
you've had some excellent advice from Maarten Buis that you've chosen to
ignore. That's your prerogative, but you'll get diminishing returns from
asking small variants on the same question. A modern approach to this
uses some kind of smoothing to try to get over the granularity in your
data, which you can do in a controlled way.
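
For instance, a rough sketch of that smoothing idea (one possible implementation, my assumption rather than anyone's exact recipe), using the same nlsw88 data as Austin's example:

* smoothed share of the first industry category as a function of wage
sysuse nlsw88, clear
tab industry, g(ind)
lowess ind1 wage, nograph generate(sh1)
line sh1 wage, sort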

Nick
[hidden email]


Re: st: RE: problem with dividing dataset into equally sized groups

David Elliott
Ranking with unique ranks before cutting will force the groups to be as
equal in size as possible, while simply using -egen, cut()- will not.

E.g.:
sysuse auto, clear
* Per Martin's suggestion
egen group1 = cut(mpg), group(4)
lab var group1 "Just use cut(mpg)"
tab group1
* Alternate using ranking first
egen rank = rank(mpg), unique
egen group2 = cut(rank), group(4)
lab var group2 "Use rank(mpg), then cut(rank)"
tab group2
table group2 group1, stubw(15) row col

DCE

--
David Elliott MD, MSc
Nova Scotia Dept. of Health

RE: st: RE: problem with dividing dataset into equally sized groups

Nick Cox
Yes indeed; but this is still arbitrary and (I believe) not
reproducible. Inside -egen, rank()- there is a -sort- that cannot be
made stable. Normally this does not bite, but with "unique" ranks it
could.

Fuzzing with random noise as suggested is at least reproducible in that
you can set the seed.
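
A minimal sketch of that (the variable name and the noise scale are assumptions; the jitter should be tiny relative to the income units):

set seed 20081202
* break ties with a vanishingly small random perturbation, then cut as before
gen double income_j = income + 1e-6*uniform()
xtile cat = income_j, n(100)
tab cat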

Nick
[hidden email]
