

Dear Roy,
My question was about general advice and also the appropriate command. From what I understand from your message, you wrote a command called "distmatch" that I can add to my Stata and that does exactly what I need? Can it be applied to my case without any transformation?
After adding the plugin, I need to append the table with the waterbodies to the farms and simply write "distmatch, id(id) near(1) long(long) lat(lat) km"?
To adapt to my case, I need to write:
farm_id (waterbod_id) near (1) farm_X (waterb_X) farm_Y (waterbod_Y), km
Is that correct?
Thanks a lot!
Laura
________________________________________
From: Laura Platchkov
Sent: 10 September 2009 20:27
To: [hidden email]
Subject: exact command for distance ?
Dear Statalist users,
I was just wondering if someone could advise on how to write a small program in Stata to compute distances between 2 datasets.
I have 2 datasets (.dta files). The first, called farms.dta, contains 800 observations, each corresponding to a farm, with, among others, the 3 variables farm_ID, farm_X (longitude) and farm_Y (latitude). The second, called waterbodies.dta, contains information about the locations of the centroids of 135 waterbodies, more precisely 3 variables: waterbody_ID, wat_X, wat_Y.
I want to calculate the distance from each farm to the nearest waterbody, in kilometers.
Now, I need to write a small do-file telling Stata to calculate the distance using the following great-circle formula:
3963*acos(sin(y/57.2958)*sin(y2/57.2958)+cos(y/57.2958)*cos(y2/57.2958)*cos((x2/57.2958)-(x/57.2958))), where x and y are the coordinates of the farms, and x2 and y2 are the coordinates of the bodies of water...
...for each observation to each of the waterbodies, but I only need the smallest distance.
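(For readers outside Stata, the formula above can be checked in a few lines of Python; the London and Paris coordinates below are illustrative, not from the thread.)

```python
import math

DEG = 57.2958    # degrees per radian, as in the formula above
R_MI = 3963      # Earth's radius in miles, as in the formula above

def great_circle_miles(y, x, y2, x2):
    """Great-circle distance in miles between (lat y, lon x) and (lat y2, lon x2)."""
    return R_MI * math.acos(
        math.sin(y / DEG) * math.sin(y2 / DEG)
        + math.cos(y / DEG) * math.cos(y2 / DEG) * math.cos(x2 / DEG - x / DEG)
    )

# Illustrative check: London (51.51N, 0.13W) to Paris (48.86N, 2.35E),
# roughly 214 miles by the great-circle formula.
d = great_circle_miles(51.5074, -0.1278, 48.8566, 2.3522)
```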
I thought a loop would be best, but I have some doubts about how to write the command in Stata. In particular, I don't exactly know how to tell Stata to use 2 different datasets at the same time, and how to tell Stata to give me only the smallest distance. I want Stata to simply add the results as an additional variable in the first dataset (or to create a third dataset with farm_ID, farm_X, farm_Y and the nearest distance, if that is easier).
My idea is to write something like this in the do-file:
using farms
forvalues j =1/800
cross using waterbod
3963*acos(sin(y/57.2958)*sin(y2/57.2958)+cos(y/57.2958)*cos(y2/57.2958)*cos((x2/57.2958)-(x/57.2958))), name near_dist
sort near_dist
keep in 1
label variable dist "nearest waterbody"
list farm_ID farm_X farm_Y near_dist
save ?????
I guess there are some mistakes... Does anybody perhaps have a suggestion as to how to improve the command?
Thanks a lot!
Laura
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/


Laura
You don't actually need to download anything to solve this kind of
problem, or much harder similar problems, as illustrated by e.g.
http://www.stata.com/statalist/archive/2009-07/msg00261.html
http://www.stata.com/statalist/archive/2007-01/msg00098.html
and similar posts.
I particularly doubt the final claim in the help file for distmatch
in the paragraph "Distance matching is computationally intensive.
Observations of 3,000 may take several minutes to complete. Other
methods typically take days if not weeks and requires extensive
user involvement."
But are you using the centroids of the bodies of water? That seems
like a strange measure of the distance to water!
You could try:
use farms, clear
local nf=_N
g double mindist=.
merge using waterbodies
local R=6367.44
qui forv i=1/`nf' {
local x1=farm_Y[`i']
local y1=farm_X[`i']
local x2 wat_Y
local y2 wat_X
g double L=(`y2'-`y1')*_pi/180
replace L=(`y2'-`y1'-360)*_pi/180 if L<. & L>_pi
replace L=(`y2'-`y1'+360)*_pi/180 if L<-_pi
local t1 acos(sin(`x2'*_pi/180)*sin(`x1'*_pi/180)
g double d=`t1'+cos(`x2'*_pi/180)*cos(`x1'*_pi/180)*cos(L))*`R'
su d, meanonly
replace mindist=r(min) in `i'
drop L d
}
drop _m waterbody_ID wat_X wat_Y
la var mindist "Distance to center of nearest body of water"
or adapt as appropriate...
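(A sketch of the same merge-and-loop logic in Python, for readers who want to check the arithmetic outside Stata; the longitude wrap-around mirrors the two replace lines, and R = 6367.44 km as in the code above. The data in the usage note are made up.)

```python
import math

R_KM = 6367.44  # Earth's radius in km, matching the Stata code above
RAD = math.pi / 180

def gc_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; the longitude difference is wrapped
    into [-180, 180], mirroring the two replace lines in the Stata code."""
    dlon = lon2 - lon1
    if dlon > 180:
        dlon -= 360
    elif dlon < -180:
        dlon += 360
    c = (math.sin(lat1 * RAD) * math.sin(lat2 * RAD)
         + math.cos(lat1 * RAD) * math.cos(lat2 * RAD) * math.cos(dlon * RAD))
    return R_KM * math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding

def min_distances(farms, waters):
    """For each (lat, lon) in farms, distance to the nearest (lat, lon) in waters."""
    return [min(gc_km(fy, fx, wy, wx) for wy, wx in waters) for fy, fx in farms]
```

For example, min_distances([(51.51, -0.13)], [(48.86, 2.35), (52.52, 13.41)]) picks the nearer of the two candidate points, just as the loop's su d, meanonly keeps only r(min).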
On Fri, Sep 11, 2009 at 3:23 AM, Laura Platchkov < [hidden email]> wrote:


Laura
Just in case you thought I was being flip, let me assure you I mean
it: the distance to the nearest centroid (of bodies of water) is not a
variable you can use for anything useful. However, if you have a file
of polygons for bodies of water rather than a file of centroids, the
method I outlined still works, then finding the distance to the
nearest vertex, which is a good approximation of the distance to the
nearest water, and gets better as your polygon file gets more
detailed.
On Fri, Sep 11, 2009 at 9:28 AM, Austin Nichols < [hidden email]> wrote:
>
> But are you using the centroids of the bodies of water? That seems
> like a strange measure of the distance to water!
>


I have no problem downloading and using programs written by other people. I
have not used one of Austin's programs, but it is nice to know that it's there
if I need it. Could I write a program on regression discontinuity from scratch?
Sure. Give me two hours. But why would I? I am very grateful to those who have
shared their program with me, and I hope they find my programs useful too.
When a topic repeatedly comes up on this list, it indicates an unaddressed
problem. Distance matching is a topic of growing importance that is appearing
on this list with increasing frequency, despite the earlier assurance that
this is not a problem in search of a solution.
The solution implemented in distmatch is simple yet has never been implemented
before. It is a non-intensive solution to an intensive problem. The number of
matches that must be considered is N x N (it's actually N choose 2 repeated N
or N-1 times, depending on how you count). This is a large number.
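(As a quick arithmetic check of the scale Roy describes, in Python; N = 3,000 is the figure from the distmatch help file quoted above.)

```python
from math import comb

N = 3000
pairs = comb(N, 2)      # unordered pairs of observations: N choose 2
ordered = N * (N - 1)   # ordered comparisons, excluding self-matches
full = N * N            # the full N x N distance matrix

# Even at N = 3,000 there are about 4.5 million unordered pairs to consider.
```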
The problem with proposing a simple solution is that some people like to entertain
themselves thinking they could have done it on their own.
But it has not been done before, and the problem keeps coming back to this list,
as I said before.
How difficult is distance matching? My first stab was remarkably similar to another
program called nearest from SSC, which according to Nick Cox was not meant to
earn a good grade in any computer science course. I am guessing this is the most
obvious solution because this is also the one that Austin was suggesting.
My first program literally took several months to run with about
30,000 observations. I tried parallelizing the code (multiple computers), merge,
grid-searching, etc., before settling on the current form, which is at least 100
times faster than the first one.
This rewriting of the program occurred over the course of two years. If
someone can do this in one sitting, go ahead. Good for them.
But anyone thinking that a casual user can be shown how to do this over the
Statalist is wasting everyone's time, which was clearly the case.
The current non-Stata solution, widely used by economists, is to use ArcGIS or
ArcMap. They cost about $2000-$6000. They usually take several days if
not weeks of user work. If you are using a confidential data center (they usually
charge by the hour), that's another $2000 in expenses. Good luck using the latest
versions of these programs because they are even more difficult to use. Be grateful
if you never had to use one of these.
Roy


Roy
I also have no problem downloading others' work, and my hard drive is
cluttered with the output of Jann, Baum, Schaffer, Jenkins, Cox, and
many others. I seem to use one of Ben Jann's programs every day. And
one of my posted solutions on this topic requires downloading
vincenty (from SSC), which gives much better distance estimates,
though at a substantial time cost.
I am not even claiming that distmatch has no utility; no doubt many
will find it useful. But I'm afraid I don't see your point in this
post at all: you claim in the help file that distmatch "take several
minutes to complete" for 3000 obs, and other methods take "days if not
weeks" yet the method that I have outlined in several posts is
entirely general (i.e. it can be customized to produce any range of
statistics for any range of neighbors, which no program can claim to
do) and runs faster than distmatch in many cases, e.g. by a factor
of four or five here:
clear
tempfile p h
range pid 1 3000 3000
set seed 123
g p_lat=32+uniform()*10
g p_lon=-120+uniform()*40
compress
save `p'
clear
input h_lat h_lon
42.103 -80.065
42.103 -80.065001
39.739 -75.605
37.499 -77.470
39.464 -77.962
27.844 -82.796
39.138 -84.509
38.271 -85.694
36.143 -86.805
35.832 -86.073
36.313 -87.676
33.505 -86.801
32.288 -90.258
41.628 -93.659
44.412 -103.469
40.807 -96.625
35.608 -91.265
31.292 -92.461
31.080 -97.348
31.783 -106.474
46.612 -112.048
43.618 -116.194
41.242 -110.991
34.568 -112.456
40.698 -111.990
40.044 -111.716
46.053 -118.356
33.860 -118.149
34.093 -118.344
41.628 .
end
g id=_n
expand 100
bys id: replace id=id*100+_n
isid id
replace h_lon=h_lon+uniform()
replace h_lat=h_lat+uniform()
compress
save `h'
su
timer on 1
use `p'
local np=_N
g double dnear=.
g long idnear=.
merge using `h'
local R=6367.44
forv i=1/`np' {
local x1=p_lat[`i']
local y1=p_lon[`i']
local x2 h_lat
local y2 h_lon
qui {
g double L=(`y2'-`y1')*_pi/180
replace L=(`y2'-`y1'-360)*_pi/180 if L<. & L>_pi
replace L=(`y2'-`y1'+360)*_pi/180 if L<-_pi
local t1 acos(sin(`x2'*_pi/180)*sin(`x1'*_pi/180)
g double d=`t1'+cos(`x2'*_pi/180)*cos(`x1'*_pi/180)*cos(L))*`R'
su d, meanonly
replace dnear=r(min) in `i'
su id if d==r(min), meanonly
replace idnear=r(min) in `i'
}
drop L d
}
drop if _m==2
drop id h_lat h_lon _m
la var dnear "Distance to nearest hospital"
la var idnear "ID of nearest hospital"
timer off 1
l in 1/20, noo clean
tempfile m
save `m'
timer on 2
use `p'
ren p_lat h_lat
ren p_lon h_lon
append using `h'
distmatch, id(id) lat(h_lat) lon(h_lon) near(1) km
keep if pid<.
drop id h_lat h_lon
timer off 2
joinby pid using `m', unm(both)
ta _m
drop _m
order pid p_lat p_lon
l in 1/20, noo clean
timer list
Plus, if you are working in a confidential data center, there is no
guarantee you can download a user-written program from SSC, so it
seems worthwhile that you can get a solution with a few lines of
regular code (one unmatched merge and a loop over observations).
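(In the benchmark's loop body, the pair of su calls records both the minimum distance and the id attaining it; that pair is just an argmin. A Python sketch with made-up names, using squared planar distance in place of the great-circle formula to keep it short.)

```python
def nearest(point, candidates):
    """Return (id, distance) for the candidate nearest to point.

    point: (lat, lon); candidates: iterable of (id, lat, lon).
    Squared planar distance stands in for the great-circle formula here,
    which is fine for picking a winner at small separations.
    """
    lat, lon = point
    best_id, best_d2 = min(
        ((cid, (clat - lat) ** 2 + (clon - lon) ** 2)
         for cid, clat, clon in candidates),
        key=lambda t: t[1],
    )
    return best_id, best_d2 ** 0.5
```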
On Fri, Sep 11, 2009 at 12:29 PM, Roy Wada< [hidden email]> wrote:


Austin,
Thanks for your feedback. You seem to be contradicting yourself
on occasions, but some people do that now and then.
If vincenty is critical, then why are you now recommending code
not based on vincenty? You already know that vincenty makes no
important difference for distances of less than 100 miles.
Please do make the calculations for us and tell us how this
will impact someone's research.
I agree distmatch can be made to run faster (it should recycle
previous rankings) but not for the reason you posted.
You are forgetting to mention that your code cannot perform ranking
or complete matching. It only looks for the minimum distance.
This has been pointed out to you before.
I would post another comparison except for the fact that your code
does not work for other matchings.
You seem to be creating a moving target with ad hoc fixes, and
suggest other people do the same. If they can do this, why would
they need you?
Are we stuck in the twilight zone where people do not need help
but in fact should be made to take it when offered?
There is something funny about people who claim exclusive expertise.
Let's agree it is a very bad idea to tell other people not to use
someone else's program.
Roy
P.S. You can take your downloaded programs to the data center just
like other programs. Just put them in the current directory if you
still do not know how to do this.


Roy
We do seem to be in some sort of twilight zone, a realm of asymmetric
rules about evidence and civility, but I see no contradiction in what
I have posted: I appreciate users posting code on SSC and elsewhere,
and in Laura's problem an easy solution is available (via an unmatched
merge and a loop over observations) without downloading any code,
which is not to say that a solution using a downloadable program would
not get there in fewer lines of code (or at least lines of code
visible in an email). However, a similar approach to mine (unmatched
merge and loop over observations) works for any type of problem of
matching one dataset to another, with various different calculations
done inside the loop. vincenty provides better accuracy, but at a
cost (one has to download it, and it is slower than simpler
calculations), though the accuracy may in fact be very important for
some problems where several neighbors are a similar distance from a
point and it is crucial to find the actual nearest neighbor (this is
not an issue for Laura, who only wants minimum distance, and can
tolerate a fairly large error).
My main point about distmatch you do not seem to have answered: the
help file makes a claim about its relative speed that seems
unsupported by the evidence. I have not recommended that people not
download it, but I maintain that the help file is inaccurate, and
should be redacted. I also recommend you add some guidance for folks
looking for a solution to Laura's problem, involving a second dataset,
as the examples in the help file don't seem to be transparent to users
as they stand, at least on how to approach the two-dataset problem.
I maintain that the code below is a simple and elegant solution, using
only built-in commands and one reasonably fast call to merge (the
whole thing might be slightly faster in Mata, but at a cost of lost
transparency). The code works just as well if the second file is a
polygon file, in which case I would label the variable mindist
"Distance to nearest body of water" without mentioning it is the
nearest vertex of all polygons to which we are measuring distance; a
suitably detailed polygon file will make the distance suitably
accurate.
use farms, clear
local nf=_N
g double mindist=.
merge using waterbodies
local R=6367.44
qui forv i=1/`nf' {
local x1=farm_Y[`i']
local y1=farm_X[`i']
local x2 wat_Y
local y2 wat_X
g double L=(`y2'-`y1')*_pi/180
replace L=(`y2'-`y1'-360)*_pi/180 if L<. & L>_pi
replace L=(`y2'-`y1'+360)*_pi/180 if L<-_pi
local t1 acos(sin(`x2'*_pi/180)*sin(`x1'*_pi/180)
g double d=`t1'+cos(`x2'*_pi/180)*cos(`x1'*_pi/180)*cos(L))*`R'
su d, meanonly
replace mindist=r(min) in `i'
drop L d
}
drop _m waterbody_ID wat_X wat_Y
On Fri, Sep 11, 2009 at 5:47 PM, Roy Wada < [hidden email]> wrote:


Austin,
Distance matching requires all neighborhood points.
Your argument is like saying sum, meanonly runs faster than sum, detail.
If this is the biggest fault you can find, then it must be a pretty good
program.
Thanks for double-checking all my work. I can now tell everybody that it
has been looked over carefully by Austin Nichols.
I agree that the help file can be expanded. I can add a few words on what
it manages to do, although this should be rather obvious to anyone working
with distances.
Roy


Well, Roy, do you have any idea of how long it will take if I have one dataset with 800 points (farms) and the other with 200,000 (points of lake lines)? Stata has been running the whole afternoon now, and I am wondering whether I shouldn't maybe reduce the number of observations in the second dataset...
Or is there an amendment to the command that I could make so that Stata computes only the distances from the farms to the nearest lakes, and not from the points of the lakes to the nearest farms...?
Laura


...
To quote GH Hardy: "I am reluctant to intrude in a discussion concerning
matters of which I have no expert knowledge,..."
But, it seems to me that for each farm, you could create a quick screen
that would remove the polygon points that were unlikely to be close to
the farm.
If f_lat and f_long are the coordinates of the farm and the coordinates
for each polygon point are p_lat and p_long, then
gen fp_lat = f_lat - p_lat
gen fp_long = f_long - p_long
gen dist = sqrt(fp_lat^2 + fp_long^2)
qui sum dist, detail
keep if dist<`r(p5)'
If all your farms and lakes are in the UK, you would need to add a
constant to the longitudes before calculating the differences (modulo
180 or 360; I don't know how longitude is specified in the data).
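(Kieran's screen, sketched in Python: rank candidates by a cheap planar metric, keep roughly the closest 5% as with his r(p5) cutoff, then run the exact great-circle formula only on the survivors. All names and data here are illustrative, not from the thread.)

```python
import math

def planar2(a, b):
    """Cheap screening metric: squared lat/long difference (unitless)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def gc_km(a, b, R=6367.44):
    """Exact great-circle distance in km between (lat, lon) points a and b."""
    rad = math.pi / 180
    c = (math.sin(a[0] * rad) * math.sin(b[0] * rad)
         + math.cos(a[0] * rad) * math.cos(b[0] * rad) * math.cos((b[1] - a[1]) * rad))
    return R * math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding

def nearest_with_screen(farm, points, keep_frac=0.05):
    """Exact distance to the nearest of points, screening first by planar2."""
    ranked = sorted(points, key=lambda p: planar2(farm, p))
    survivors = ranked[:max(1, int(len(ranked) * keep_frac))]
    return min(gc_km(farm, p) for p in survivors)
```

As Kieran warns, the cheap metric can misrank candidates at high latitudes or across the date line, so the cutoff should be generous rather than tight.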
______________________________________________
Kieran McCaul MPH PhD
WA Centre for Health & Ageing (M573)
University of Western Australia
Level 6, Ainslie House
48 Murray St
Perth 6000
Phone: (08) 9224 2701
Fax: (08) 9224 8009
email: [hidden email]
http://myprofile.cos.com/mccaul
http://www.researcherid.com/rid/B-8751-2008
______________________________________________
If you live to be one hundred, you've got it made.
Very few people die past that age. - George Burns
Original Message
From: [hidden email]
[mailto: [hidden email]] On Behalf Of Laura
Platchkov
Sent: Sunday, 13 September 2009 12:59 AM
To: [hidden email]
Subject: RE: st: RE: exact command for distance ?


We can keep the discussion here if others are interested. And you are
welcome.
Yes, that is what I mean by grid search. There is no point in comparing
something that is not in the neighborhood.
The quickest way I have found is to cut the map into overlapping grids.
This is old code that was in the process of revision (about a year old),
but it will do 2 million observations in about 5 minutes assuming a random
dispersion of points.
In Laura's case, she just needs to reverse the comparison group.
Roy
clear
set mem 1g
set obs 2000000
gen id=_n
set seed 123
gen longit=uniform()*10000
gen latit=uniform()*10000
sum
cap program drop distance
program define distance
syntax [using], kernel(real) RADius(real)
gen X=round(longit,`kernel')
gen Y=round(latit,`kernel')
tempfile file0 file1
gen markX=.
save `file0'
replace markX=0 if markX==.
append using `file0'
replace markX=1 if markX==.
append using `file0'
replace markX=2 if markX==.
gen markY=.
save `file1'
replace markY=3 if markY==.
append using `file1'
replace markY=4 if markY==.
append using `file1'
replace markY=5 if markY==.
*sort id markX markY
replace X=X-`kernel' if markX==1
replace X=X+`kernel' if markX==2
replace Y=Y-`kernel' if markY==4
replace Y=Y+`kernel' if markY==5
gen mark=0 if markX==0 & markY==3
egen XY=group(X Y)
bys XY: gen count=_N
keep if count>1
* drop if the middle part (mark==0) is not there
bys XY: egen min=min(mark)
drop if min~=0
drop min
* renumber
drop XY
egen XY=group(X Y)
*** resorting is necessary
sort XY
drop count
bys XY: gen count=_N
*gen row=_n
*bys XY: gen begin=row if _n==1
*bys XY: replace begin=begin[_n-1] if begin[_n-1]~=.
*bys XY: gen end0=row if _n==_N
*bys XY: egen end=max(end0)
*drop end0
forval num=1/10 {
gen match`num'=.
gen dist`num'=.
}
egen max=max(XY)
local max=max
drop max
local place=1
while `place'=0 & `=mark[`first']'==0 {
if `dist'<=`radius' & `dist'>0 & `=mark[`first']'==0 {
di " `max' `=id[`first']' `=id[`second']' `dist'"
local num=`num'+1
qui replace match`num'=`=id[`second']' in `first'
qui replace dist`num'=`dist' in `first'
}
}
}
local place=`place'+`=count[`place']'
}
end
distance, kernel(.7) rad(.1)
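(The overlapping-grid idea can be sketched in Python: bucket points by rounded coordinates and compare each query only against the 3x3 block of surrounding cells. This is a simplified, planar stand-in for Roy's Stata program, with made-up names; the cell size must be at least the search radius for the 3x3 scan to be exhaustive.)

```python
from collections import defaultdict
from math import hypot, floor

def build_grid(points, cell):
    """Bucket (id, x, y) points by their integer cell coordinates."""
    grid = defaultdict(list)
    for pid, x, y in points:
        grid[(floor(x / cell), floor(y / cell))].append((pid, x, y))
    return grid

def neighbors_within(grid, cell, x, y, radius):
    """IDs of stored points within radius of (x, y), excluding exact hits,
    scanning only the 3x3 cells around (x, y); assumes cell >= radius."""
    cx, cy = floor(x / cell), floor(y / cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for pid, px, py in grid.get((cx + dx, cy + dy), []):
                d = hypot(px - x, py - y)
                if 0 < d <= radius:
                    found.append(pid)
    return found
```

Only points in a cell bordering the query ever get a distance computed, which is what makes the grid approach fast when points are spread out.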

