Dear Statalister,
I have a dataset with several variables, among which is a discrete variable X that looks as follows.

-------------------
          X
obs1     60
obs2     60
obs3     60
obs4     70
obs5     71
obs6     71
obs7     71
obs8     71
obs9     71
obs10    71
-------------------

My final purpose is to treat adjacent observations for which the variable X does not change by more than 10% as the same observation. In other words, I would like to collapse the dataset by X, but whenever the distance between two or more adjacent observations in X is less than 10%, I would like to collapse by the median of X.

Before collapsing, I tried to generate a median of X whenever the difference within X is less than 10%, and then collapse by X, but I am not succeeding. Is this the right approach? Is there a way of collapsing that specifies my requirement?

Thank you in advance,
Laura

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
At 06:36 PM 12/4/2008, Laura Grigolon wrote:
>Dear Statalister,
>
>I have a dataset with several variables, among which a discrete
>variable X that looks as follows.
>
>-------------------
>          X
>obs1     60
>obs2     60
>obs3     60
>obs4     70
>obs5     71
>obs6     71
>obs7     71
>obs8     71
>obs9     71
>obs10    71
>-------------------
>
>My final purpose is to treat adjacent observations for which the
>variable X does not change by more than 10% as the same observation.
>In other words, I would like to collapse the dataset by X, but
>whenever the distance between two or more adjacent observations in X
>is less than 10%, I would like to collapse by a median of x. Before
>collapsing I tried to generate a median of X whenever the
>difference within X is less than 10%, and then collapse by X, but I
>am not succeeding. Is this the right approach? Is there a way of
>collapsing specifying my requirement?
>
>Thank you in advance,
>Laura

I don't have a solution, but I'll alert you to some potential problems that I can see.

There may be some ambiguity in how your problem is defined. Suppose you have this sequence of values:

    60, 65, 70

65 is within 10% of 60; 70 is within 10% of 65; but 70 is not within 10% of 60. So does this define a cluster of "close" values? Does the 70 get put together with 60 by virtue of being linked through a 65? If so, then the clusters of close values would be, in part, determined by the order of the data. Is that what you have in mind?

Another example:

    901, 1000 -- no, 1000 is not within 10% of 901.
    1000, 901 -- yes, 901 is within 10% of 1000.

Or generally, if a is within 10% of b, it is not always the case that b is within 10% of a. Again, the order matters.

So you need to ask, do you want the order to matter, and do you want to allow "linking" as in the 60, 65, 70 example? I believe that you do, since you mentioned "adjacent". (And maybe you want to have sorted the values first -- or maybe not, in which case there may be some existing natural order.)
If so, then you can do something like this (untested):

    gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
    gen int cluster_id = sum(w10pct == 0)

This way, cluster_id takes a new value every time a value of X occurs that is >= 10% different from the predecessor. You can then take a mean or median or whatever you want -- by cluster_id -- using egen.

If, on the other hand, you don't want the order of data to matter, then you need to find some other way to group the X values into clusters. (Maybe sort, and then apply the algorithm described above.)

HTH
--David

P.S., there is an interesting phenomenon here, with the order and linking effects, particularly if the X are sorted. You seem to want to seek a middle value of a cluster of values. And hopefully, the values will be within 10% of that middle value. On the other hand, the detection of the cluster is based on its leading value (lowest, if data are sorted).

Another possibility is that you would want to avoid linking. In that case, the clusters should be determined whenever a value differs from its predecessor by more than 10%. But then you would test how close subsequent values are to that leading value. It's getting complicated. Good luck.

--David
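For readers who want to try the suggestion, here is a minimal sketch that applies David's two lines to the example data from the first post and then takes the median per cluster. The variable names X_median, leader and cluster_nolink are made up for illustration, and the no-linking variant is only one possible reading of the P.S.; it assumes X has no missing values.

```stata
* Example data from the first post
clear
input X
60
60
60
70
71
71
71
71
71
71
end

* David's suggestion: flag values within 10% of their predecessor,
* then start a new cluster at every unflagged observation
gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen int cluster_id = sum(w10pct == 0)

* Hypothetical variant without linking: compare each value with the
* leading value of its cluster rather than with its predecessor.
* replace works through the observations in order, so leader[_n-1]
* has already been updated by the time it is used.
gen double leader = X
replace leader = leader[_n-1] if abs(X/leader[_n-1] - 1) < .1 & _n > 1
gen int cluster_nolink = sum(leader != leader[_n-1])

* Median of X within each cluster, kept next to the original data
egen X_median = median(X), by(cluster_id)
list, sepby(cluster_id)

* One row per cluster, if the other variables are not needed
collapse (median) X, by(cluster_id)
list
```

On these data both groupings agree: observations 1 to 3 form one cluster with median 60, and observations 4 to 10 form a second cluster with median 71. They differ on a sequence such as 60, 65, 70, where the linked version yields a single cluster and the no-linking variant splits off the 70.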
Hello David,
First of all, thank you for your reply. A small comment on your answer:

“Do you want the order to matter, and do you want to allow "linking" as in the 60,65,70 example?”

1 -- Yes, the order matters, as you guessed.

2 -- At first I did not think about the issue of the “linking” effect. I used the commands you suggested (which work really well) and I tested how far the middle value is from the leading value of the cluster (the lowest value). In my specific case, the linking effect is OK because the values of the cluster normally remain within 10% of the middle value.

Laura
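The kind of check Laura describes could be sketched as follows. This is an illustration rather than her actual code: it uses David's 60, 65, 70 sequence, and the names obsno, X_lead, lead_gap and max_dev are hypothetical.

```stata
* David's linking example: all three values end up in one cluster
clear
input X
60
65
70
end

gen byte w10pct = abs(X/X[_n-1] - 1) < .1 & _n > 1
gen int cluster_id = sum(w10pct == 0)

* Keep the original order so the leading value stays well defined
gen long obsno = _n

* Middle (median) and leading (first) value of each cluster
egen X_median = median(X), by(cluster_id)
bysort cluster_id (obsno): gen double X_lead = X[1]

* Relative gap between the median and the leading value, and the
* largest relative deviation of any member from the median
gen double lead_gap = abs(X_median/X_lead - 1)
egen double max_dev = max(abs(X/X_median - 1)), by(cluster_id)

list cluster_id X X_lead X_median lead_gap max_dev, sepby(cluster_id)
```

Here the median is 65, and every member lies within about 8% of it even though 70 is more than 10% above the leading value of 60. A cluster whose max_dev reaches .1 or more would be one where linking has stretched the group further than intended.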