# st: ranking with weights

9 messages

## st: ranking with weights

Hello,

I am trying to find a way to rank weighted data (since the egen function -rank- does not work with weights). A simple way would be to order the data by the variable I am interested in (monthly expenditure) and then create a new variable like -g rank1=sum(weight)-. But there is a problem. Some of my observations are "tied", as they have the same level of expenditure. With the simple method I mention, some observations are ranked above others even though they have the same level of expenditure. This is a problem because the weights are large, so two observations can be ranked with a big gap between them even though they have the same expenditure. It is an even bigger problem because the weights might be correlated with other variables I am interested in (like region, since some regions are less sampled than others). I also tried multiplying the expenditure ranking by the weight, but this gives wrong results (for example, they do not add up to the weighted total).

Can anyone help? In other words, I would like all observations with the same expenditure to have the same rank, which I assume would be some average over all the weighted observations having that same expenditure. I include a sample dataset below:

| expenditure | weighting | rank | rank1 | weighted_rank |
|-------------|-----------|------|-------|---------------|
| 10          | 341       | 1    | 341   | 341           |
| 12          | 1065      | 2.5  | 1406  | ???           |
| 12          | 98        | 2.5  | 1504  |               |
| 15          | 254       | 4    | 1758  |               |
| .......     |           |      |       |               |

thanks,

Cindy

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

## Re: st: ranking with weights

Cindy,

What are the analytic units (people? regions?)? What are the "weights"? What is "expenditure", and how is it measured? What do you mean that some regions are "less sampled" than others? It's not clear, for example, whether this is a sample and, if so, of what. So, please describe the study design in detail. Last question: what is the purpose of the ranking?

-Steve

On Dec 2, 2008, at 12:54 PM, Cindy Gao wrote:

> I am trying to find a way to rank weighted data (since the egen function -rank- does not work with weights). [...]

## Re: st: ranking with weights

Thanks for your reply.

The observations (analytic units) are households. Expenditure is the monthly expenditure of the household. This is household survey data. The weights are frequency weights, used to weight the sample up to the whole country. The weights are likely to vary across, for example, regions, to compensate for oversampling or undersampling.

Basically, I need to rank all households according to their expenditure, from lowest to highest. But I must take account of the weightings. If, for example, there are 2 households with the same expenditure, they must be ranked the same, and this rank must take account of the weightings. If there were no ties (households with the same expenditure), I could do this by generating a variable "rank", like -g rank=sum(weight)-. The problem comes because of ties. If I could -expand- my dataset using the weights, then I could simply say -egen rank = rank(expenditure)-; the problem is that the dataset is too large for this.

thanks,

Cindy

----- Original Message ----
From: Steven Samuels <[hidden email]>
To: [hidden email]
Sent: Tuesday, 2 December, 2008 18:53:40
Subject: Re: st: ranking with weights

Cindy, What are the analytic units (people? regions?). What are the "weights"? [...]
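The -expand- idea is impractical on the full survey, but it can serve as a check on the four sample rows from the first post, where the expanded data has only 1,758 observations. A minimal sketch, assuming the variable names `expenditure` and `weight` (which need not match the real file) and whole-number weights:

```stata
* Toy data: the four sample rows from the original question
clear
input expenditure weight
10  341
12 1065
12   98
15  254
end

* One observation per population household (feasible only for small totals)
expand weight

* egen's rank() assigns tied observations the mean of their ranks by default
egen double rank = rank(expenditure)

* Collapse back to the distinct rows for inspection
duplicates drop
sort expenditure weight
list, sepby(expenditure)
```

By arithmetic, the tied households at expenditure 12 occupy ranks 342 to 1504 in the expanded data, so their shared rank should be (342 + 1504)/2 = 923. The same target can be used to check any method that avoids -expand-.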

## RE: st: ranking with weights

The following example code with a toy dataset may help:

    . list expenditure frequency

         +---------------------+
         | expend~e   freque~y |
         |---------------------|
      1. |     1000       8000 |
      2. |     1000      10000 |
      3. |     2000       6000 |
      4. |     2000       9000 |
      5. |     3000       8000 |
         |---------------------|
      6. |     3000       4000 |
      7. |     4000       7000 |
      8. |     4000       6000 |
      9. |     5000       6000 |
     10. |     6000       5000 |
         |---------------------|
     11. |     7000       4000 |
     12. |     8000       3000 |
     13. |     9000       2000 |
     14. |    10000       1000 |
         +---------------------+

    . bysort expend : gen totalfreq = sum(frequency)

    . by expend : replace totalfreq = totalfreq[_N]
    (4 real changes made)

    . by expend : gen first = _n == 1

    . gen rank = sum(totalfreq * first)

    . replace rank = rank - 0.5 * totalfreq
    (14 real changes made)

    . list

         +------------------------------------------------+
         | expend~e   freque~y   totalf~q   first    rank |
         |------------------------------------------------|
      1. |     1000       8000      18000       1    9000 |
      2. |     1000      10000      18000       0    9000 |
      3. |     2000       6000      15000       1   25500 |
      4. |     2000       9000      15000       0   25500 |
      5. |     3000       8000      12000       1   39000 |
         |------------------------------------------------|
      6. |     3000       4000      12000       0   39000 |
      7. |     4000       7000      13000       1   51500 |
      8. |     4000       6000      13000       0   51500 |
      9. |     5000       6000       6000       1   61000 |
     10. |     6000       5000       5000       1   66500 |
         |------------------------------------------------|
     11. |     7000       4000       4000       1   71000 |
     12. |     8000       3000       3000       1   74500 |
     13. |     9000       2000       2000       1   77000 |
     14. |    10000       1000       1000       1   78500 |
         +------------------------------------------------+

There is a little inaccuracy there: the average of ranks 1...18000 is strictly 9000.5, not 9000, so you may want to make the appropriate corrections.

Nick
[hidden email]

Cindy Gao

> Basically I need to rank all households according to their expenditure, from lowest to highest. But, I must take account of the weightings. [...]
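Nick's steps, with the half-unit correction he mentions folded in, might look like the following on the four sample rows from the original question. This is a sketch rather than his code: the names `weight`, `tiew`, `cumw` and `wrank` are made up for illustration, and it assumes whole-number weights that count population households.

```stata
* Toy data: the four sample rows from the original question
clear
input expenditure weight
10  341
12 1065
12   98
15  254
end

* Total weight within each tied expenditure level
bysort expenditure: gen double tiew = sum(weight)
by expenditure: replace tiew = tiew[_N]

* Flag one observation per level so each level's weight is counted once
by expenditure: gen byte first = (_n == 1)

* Cumulative weight up to and including each level (data are sorted by expenditure)
gen double cumw = sum(tiew * first)

* Midrank: mean of the ranks cumw - tiew + 1, ..., cumw
gen double wrank = cumw - 0.5 * tiew + 0.5

list expenditure weight tiew cumw wrank, sepby(expenditure)
```

For a tie group with total weight W preceded by cumulative weight C, the ranks in the expanded data would run from C + 1 to C + W, whose mean is C + (W + 1)/2; that is what the last -gen- computes. For the households at expenditure 12, C = 341 and W = 1163, giving 923.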

## Re: st: ranking with weights

Cindy--

The weights are not likely to be frequency weights (fweights); they are probability weights (pweights), possibly post-stratified. If they are whole numbers, then someone has rounded them. You still haven't answered the question: why do you want to rank the households? Quantities calculated in samples are estimates of population quantities. What population quantities are you trying to estimate with the ranks? If you are trying to estimate percentiles, the -pctile- command will take pweights.

-Steve

On Dec 2, 2008, at 2:16 PM, Cindy Gao wrote:

> The observations (analytic units) are households. Expenditure is the monthly expenditure of household. This is household survey data. The weights are frequency weights, to weight the sample to the whole country. [...]
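If percentiles are the real goal, Steve's -pctile- suggestion could be tried along these lines. This is only a sketch on the four sample rows from the original question: the variable names `expenditure`, `weight`, `cut` and `quartile` are assumptions, and the use of -xtile- to assign groups is an addition not mentioned in the thread.

```stata
* Toy data: the four sample rows from the original question
clear
input expenditure weight
10  341
12 1065
12   98
15  254
end

* Weighted quartile cut points, stored in the first three observations of cut
pctile double cut = expenditure [pweight = weight], nquantiles(4)

* Weighted quartile group for each household
xtile quartile = expenditure [pweight = weight], nquantiles(4)

list expenditure weight cut quartile
```

Households with the same expenditure always land in the same group, so ties need no special handling here, although heavy ties can make the groups unequal in weighted size.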

## Re: st: ranking with weights
