

Hello
I am trying to find a way to rank weighted data (since the egen function rank does not work with weights). A simple way would be to order the data by the variable I am interested in (monthly expenditure) and then create a new variable like g rank1=sum(weight). But there is a problem. Some of my observations are "tied", as they have the same level of expenditure. Using the simple method I mention means that some observations are ranked above others even though they have the same level of expenditure. This is a problem because the weights are large, so two observations can be ranked with a big gap between them even though they have the same level of expenditure. It is an even bigger problem because the weights might be correlated with some other variables I am interested in (like region, since some regions are less sampled than others). I also tried multiplying the expenditure ranking by the weight, but this gives wrong results (for example, they do not add up to the weighted total).
Can anyone help? In other words, I would like all observations with the same expenditure to have the same rank, which I assume would be some average over all the weighted observations having that same expenditure. I include a sample dataset below:
expenditure  weighting  rank  rank1  weighted_rank
         10        341   1      341            341
         12       1065   2.5   1406            ???
         12         98   2.5   1504
         15        254   4     1758
...
thanks,
Cindy
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/


Cindy, what are the analytic units (people? regions?)? What are the
"weights"? What is "expenditure", and how is it measured? What do you
mean when you say that some regions are "less sampled" than others?
It's not clear, for example, whether this is a sample and, if so, of
what. So please describe the study design in detail. Last question:
what is the purpose of the ranking?
Steve
On Dec 2, 2008, at 12:54 PM, Cindy Gao wrote:
> I am trying to find a way to rank weighted data (since the egen
> function rank does not work with weights). [...]


Thanks for your reply.
The observations (analytic units) are households. Expenditure is the monthly expenditure of the household. This is household survey data. The weights are frequency weights, to weight the sample up to the whole country. The weights are likely to vary across, for example, regions, to compensate for oversampling or undersampling.
Basically I need to rank all households according to their expenditure, from lowest to highest, but I must take account of the weights. If, for example, there are 2 households with the same expenditure, they must be ranked the same, and this rank must take account of the weights. If there were no ties (households with the same expenditure), I could do this by generating a variable "rank", like g rank=sum(weight). The problem comes because of ties. If I could expand my dataset using the weights, then I could simply say egen rank = rank(expenditure); the problem is that the dataset is too large for this.
thanks,
Cindy
----- Original Message -----
From: Steven Samuels <[hidden email]>
To: [hidden email]
Sent: Tuesday, 2 December, 2008 18:53:40
Subject: Re: st: ranking with weights


The following example code with a toy dataset may help:
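. list expenditure frequency

     +---------------------+
     | expend~e   freque~y |
     |---------------------|
  1. |     1000       8000 |
  2. |     1000      10000 |
  3. |     2000       6000 |
  4. |     2000       9000 |
  5. |     3000       8000 |
     |---------------------|
  6. |     3000       4000 |
  7. |     4000       7000 |
  8. |     4000       6000 |
  9. |     5000       6000 |
 10. |     6000       5000 |
     |---------------------|
 11. |     7000       4000 |
 12. |     8000       3000 |
 13. |     9000       2000 |
 14. |    10000       1000 |
     +---------------------+

. bysort expend : gen totalfreq = sum(frequency)
. by expend : replace totalfreq = totalfreq[_N]
(4 real changes made)
. by expend : gen first = _n == 1
. gen rank = sum(totalfreq * first)
. replace rank = rank - 0.5 * totalfreq
(14 real changes made)
. list

     +------------------------------------------------+
     | expend~e   freque~y   totalf~q   first    rank |
     |------------------------------------------------|
  1. |     1000       8000      18000       1    9000 |
  2. |     1000      10000      18000       0    9000 |
  3. |     2000       6000      15000       1   25500 |
  4. |     2000       9000      15000       0   25500 |
  5. |     3000       8000      12000       1   39000 |
     |------------------------------------------------|
  6. |     3000       4000      12000       0   39000 |
  7. |     4000       7000      13000       1   51500 |
  8. |     4000       6000      13000       0   51500 |
  9. |     5000       6000       6000       1   61000 |
 10. |     6000       5000       5000       1   66500 |
     |------------------------------------------------|
 11. |     7000       4000       4000       1   71000 |
 12. |     8000       3000       3000       1   74500 |
 13. |     9000       2000       2000       1   77000 |
 14. |    10000       1000       1000       1   78500 |
     +------------------------------------------------+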
There is a little inaccuracy there: the average of ranks 1...18000 is strictly 9000.5 not 9000, so you may want to make the appropriate corrections.
Nick
[hidden email]



Cindy, the weights are not likely to be frequency weights (fweights);
they are probability weights (pweights), possibly poststratified.
If they are whole numbers, then someone has rounded them. You still
haven't answered the question: why do you want to rank the
households? Quantities calculated in samples are estimates of
population quantities. What population quantities are you trying to
estimate with the ranks? If you are trying to estimate percentiles,
the pctile command will take pweights.
Steve
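As a rough illustration of the pctile suggestion, here is a minimal sketch. It reuses the four sample rows from the first post; the variable names expenditure and weight are assumptions, not the names in the real survey file, and treating the weights as pweights follows Steve's guess rather than anything confirmed about the data.

```stata
clear
input expenditure weight
10  341
12 1065
12   98
15  254
end

* weighted median of expenditure; pctile accepts pweights
pctile med = expenditure [pweight = weight], nquantiles(2)
list med in 1

* split households into weighted halves (at or below / above the median)
xtile half = expenditure [pweight = weight], nquantiles(2)
list expenditure weight half
```

With more quantiles (for example nquantiles(10)) the same two commands would give weighted deciles, although heavily tied data can leave several cutpoints equal.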
On Dec 2, 2008, at 2:16 PM, Cindy Gao wrote:
> The observations (analytic units) are households. Expenditure is
> the monthly expenditure of household. [...]


Thank you very much, this helps a lot. However, I wonder if there is a small "error" or if I am just misunderstanding. Should your last line of code (replace rank = rank - 0.5 * totalfreq) not apply only to observations that are tied (i.e. that have the same expenditure as other observations)? Otherwise, for example, the first observation in your example, which is not tied, is ranked as 9000 instead of its weight of 18000. I therefore tried a small modification to your code (by expenditure: replace rank = rank - 0.5 * totalfreq if _N != 1). When I do this, the rank of the last observation (which is not tied) equals the sum of all the weights, whereas with your original code the rank of the last observation is less than the sum of all the weights (less by half the weight of the last observation). Now I am not sure whether to use my modification, or whether I am just confused and should stick with Nick's original suggestion.
many thanks,
Cindy
----- Original Message -----
From: Nick Cox <[hidden email]>
To: [hidden email]
Sent: Tuesday, 2 December, 2008 19:45:41
Subject: RE: st: ranking with weights


My bias is to believe my code to be good until you show that it isn't. You haven't done that so far as I can see.
The principle I am using is that used generally throughout statistics, that the rank applied to a bunch of tied values is (a) the same for all those tied values (b) the average of the ranks that would have been applied had those values all been distinct but otherwise still lower than all higher values and higher than all lower values. Thus, the average rank for the lowest 18000 "observations" is 9000 (or so; subject to the detail mentioned in my previous post). This is equivalent to what you would have got with your hypothetical route starting with expand.
The example dataset deliberately included cases in which particular expenditures occurred once and also cases in which other particular expenditures occurred more than once. As far as I can see, the "ranks" check out regardless.
Naturally, if you want another definition of ranks, you need different code. As Steve Samuels has I think implied from a different but not contradictory viewpoint, your use of ranks in this context is a bit iffy in the presence of (massively) tied data and you can't expect to keep all the properties of ranks that you might desire or expect.
See also the discussion of ranks in the manual entry for egen. Some years ago (~1999) when programming what became the track and field options of egen, rank() I came up with those names because I couldn't find any (statistical or other) literature discussion of alternative ranking conventions, although it was evident from sports that they exist. I still haven't seen any despite continually twitching antennae. Names apart, the manual entry does give details on various different reasonable interpretations of ranks.
Nick
[hidden email]
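The claimed equivalence with the expand route can be checked on a dataset small enough to expand. The sketch below is only an illustration under stated assumptions: the four rows are made up, the weights are treated as integer frequencies, and the tie correction uses (totalfreq - 1)/2 rather than 0.5 * totalfreq, which is the small correction Nick flagged in his first post.

```stata
clear
input expenditure frequency
1000 2
1000 3
2000 1
3000 4
end

* rank from weights: cumulative frequency up to and including each tied
* group, minus (group size - 1)/2 to land on the group's average rank
bysort expenditure: gen totalfreq = sum(frequency)
by expenditure: replace totalfreq = totalfreq[_N]
by expenditure: gen first = _n == 1
gen wrank = sum(totalfreq * first)
replace wrank = wrank - (totalfreq - 1) / 2

* brute force: one row per unit, then the default egen rank()
expand frequency
egen erank = rank(expenditure)

* the two ranks agree on every row
assert wrank == erank
```

Every unit, tied or not, gets the average of the ranks its group would occupy, so an untied group with frequency f is not ranked at the cumulative total but (f - 1)/2 below it.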


Cindy Gao
I think the only error in Nick's code is one he already flagged
himself, i.e. the average of 1 and 2 is not 1 but 1.5. So the rank
need not start at 1 and need not end at the sum of weights (except in
the special case where the first/last obs has weight one and has no
ties). Perhaps the point is more clear in this example:
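clear
input exp freq
1000 1
1000 1
2000 5999
2000 9000
3000 8000
3000 4000
3000 4000
10000 1000
end
* total frequency within each group of tied expenditures
bysort exp: gen tf = sum(freq)
by exp: replace tf = tf[_N]
by exp: gen first = _n == 1
* cumulative frequency up to and including each group
gen rank = sum(tf * first)
* average rank of the tf tied units in the group
replace rank = rank-(tf-1)/2
* rescale the rank to a proportion of the total frequency
g p=sum(freq)
replace p=rank/p[_N]
loc adj=p[1]/2
replace p=p-`adj'
li, noo clean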
The above assumes no missing values or zero weights; real data may
have missing or zero freq or missing exp requiring modification
depending on what you hope to achieve. (E.g. should a person with zero
weight get the rank of tied cases or a missing rank? What if there are
no tied cases?)
The variable p measures the rank between 0 and 1, and the oddness of
`adj' pertains to whether you want p to range from w[1]>0 to 1 or to
range from w[1]/2 to 1-w[1]/2, which some find more intuitively
appealing (also handy if you want to apply various transformations to
p that require it to be strictly between 0 and 1).
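To make the last point concrete, here is a hedged sketch of one way to get a fractional rank that stays strictly inside (0, 1). It uses a midpoint convention (weight below the tied group plus half the group's weight, divided by the total), which is an assumption chosen for illustration and not necessarily the exact adjustment in the code above; the toy data and the names cumfreq, p and z are likewise hypothetical.

```stata
clear
input expenditure frequency
1000 2
1000 3
2000 1
3000 4
end

bysort expenditure: gen totalfreq = sum(frequency)
by expenditure: replace totalfreq = totalfreq[_N]
by expenditure: gen first = _n == 1
gen cumfreq = sum(totalfreq * first)

* midpoint convention: weight below the group plus half the group's weight
gen double p = (cumfreq - totalfreq / 2) / cumfreq[_N]

* p never reaches 0 or 1, so transformations such as invnormal() are defined
gen double z = invnormal(p)
list expenditure frequency p z
```

Because every group has positive weight, the lowest group sits above 0 and the highest below 1, so transformations like invnormal(p) or logit(p) return non-missing values.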
On Tue, Dec 2, 2008 at 4:59 PM, Nick Cox <[hidden email]> wrote:
> My bias is to believe my code to be good until you show that it isn't. [...]


Thank you all - Steven, Nick and Austin - for your generous help. Finally, I get it.
best,
Cindy
----- Original Message -----
From: Austin Nichols <[hidden email]>
To: [hidden email]
Sent: Tuesday, 2 December, 2008 22:58:06
Subject: Re: st: ranking with weights

