Dear all,
I have a large (simulated) dataset with 400,000 observations (from 50,000 simulations overall, each creating 8 observations). I need to perform a linear regression for each simulation separately. I noticed the following:

1) Keeping all observations in the dataset and looping through the simulations is very inefficient, i.e. it takes several hours to run, e.g.

    * first example starts; run is an ID for simulation
    gen regcoeff = .
    forval s=1/50000 {
        regress x y if run==`s'
        replace regcoeff = _b[y] if _n==`s'
    }
    * first example ends

2) Preserving and restoring is even more time-consuming.

3) I thought of creating a loop as before, but loading the data at the beginning of each iteration and then keeping only the data for the particular simulation. However, this means that the data is loaded 50,000 times (because it comes from a server with suboptimal connection speed, this is also not optimal), and it would make storage of the results a little bit difficult:

    * second example starts
    gen regcoeff = .
    save sim.dta, replace
    local coeff = 0 // dummy for first run of loop
    local p = 1 // dummy for first run of loop
    forval s=1/50000 {
        use sim.dta, clear
        replace regcoeff = `coeff' if _n==`p'
        save sim.dta, replace
        keep if run==`s'
        regress x y
        local coeff = _b[y]
        local p=`s'
    }
    use sim.dta, clear
    replace regcoeff = `coeff' if _n==`p'
    save sim.dta, replace
    * second example ends

I am sure there is a better way of doing this. If anybody has better ideas, I would appreciate any suggestions/help.

All the best
Sven

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Depending on how much physical RAM you have and whether you're using a 32-bit or 64-bit OS (preferably the latter, to allow access to more than 1 GB of RAM), you might consider using a tempfile to hold your data and reload from. A simplified example (not storing coefficients):

    use sim, clear
    tempfile t
    save `t', replace
    forval s=1/50000 {
        qui use `t' if run == `s', clear
        regress x y
    }

Whether it's actually quicker I've no idea, but it will certainly save on reads/writes to the disc/network drive.

Neil

"Our civilization would be pitifully immature without the intellectual revolution led by Darwin" - Motoo Kimura, The Neutral Theory of Molecular Evolution

Email - [hidden email]
Website - http://kimuranoip.org/
Photos - http://www.flickr.com/photos/slackline/
In reply to this post by Trelle Sven
 On Mon, 15/11/10, Trelle Sven wrote:
> I have a large (simulated) dataset with 400,000
> observations (from overall 50,000 simulations each creating
> 8 observations). I need to perform a linear regression for
> each simulation separately. I noticed the following:
>
> 1) keeping all observations in the dataset and looping
> through the simulations is very inefficient i.e. it takes
> several hours to run e.g.
> * first example starts; run is an ID for simulation
> gen regcoeff = .
> forval s=1/50000 {
>   regress x y if run==`s'
>   replace regcoeff = _b[y] if _n==`s'
> }
> * first example ends

An "in" condition is often quicker than an "if" condition. You need to do more work to make sure that the "in" condition is appropriate, but that is the price to pay.

> 2) preserving and restoring is even more timeconsuming

That makes sense.

> 3) I thought of creating a loop as before but load the data
> at the beginning and then keeping only the data for the
> particular simulation.

Sounds like that would be slow also.

Anyhow, before doing all this I would start with statsby, see: help statsby.

Hope this helps,
Maarten

Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
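A minimal sketch of the "in" approach suggested above, not code from the thread. It assumes that every run has exactly 8 observations and that run takes the values 1 to 50000, so after sorting, simulation s occupies rows 8*(s-1)+1 through 8*s. The local names first and last are illustrative. Unlike the original example, it stores the coefficient in every row of the run rather than in row s.

```stata
* Sketch only: assumes exactly 8 observations per run and run = 1..50000
sort run
gen double regcoeff = .
forvalues s = 1/50000 {
    local first = 8*(`s' - 1) + 1      // first row of simulation s
    local last  = 8*`s'                // last row of simulation s
    quietly regress x y in `first'/`last'
    quietly replace regcoeff = _b[y] in `first'/`last'
}
```

Because "in" addresses a row range directly, Stata does not have to evaluate run==`s' for all 400,000 observations on every pass of the loop.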
In reply to this post by Trelle Sven
Dear Sven,
50,000 regressions on an 8-observation dataset with two variables should take about 30 seconds (see below).

So don't generate the large dataset, but rather run the regressions right away when you generate your simulated data. You don't need to save the 50,000 x 8 observations you generated, as [presumably] you are also doing it with Stata, so the next time you simulate them with your do-file they will be the same (don't forget to set the random-number seed).

On the other hand, since you need only one coefficient from this trivial regression, you may ask yourself if the regress artillery is really necessary here, or whether a trivial formula, such as the one here:
http://en.wikipedia.org/wiki/Regression_analysis
would suffice (and be faster).

In any case, don't forget to specify quietly. I am almost sure you don't have any intention to review the output of the 50,000 regressions, and that speeds up the program a lot.

Best, Sergiy Radyakin.

PS: I am strongly convinced you don't need access to more than 1 GB of memory for the task of running univariate regressions on 8-observation datasets.

    . do "R:\TEMP\STD04000000.tmp"

    . set rmsg on
    r; t=0.00 10:42:16

    . sysuse auto, clear
    (1978 Automobile Data)
    r; t=0.00 10:42:16

    . keep in 1/8
    (66 observations deleted)
    r; t=0.00 10:42:16

    .
    . forvalues i=1/50000 {
      2. qui regress price weight
      3. }
    r; t=26.53 10:42:42

    .
    end of do-file
    r; t=26.53 10:42:42
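The "trivial formula" idea can be sketched as follows. This is an illustration, not code from the thread: for regress x y (x as the dependent variable, y as the single regressor), the OLS slope is the within-run sum of cross-products divided by the within-run sum of squares of y. The variable names mx, my, sxy, syy, slope and cons are hypothetical, and the sketch assumes no missing values in x or y.

```stata
* Sketch only: closed-form OLS for "regress x y" within each run
* (x is the dependent variable, y the single regressor; no missing values)
bysort run: egen double mx  = mean(x)
bysort run: egen double my  = mean(y)
bysort run: egen double sxy = total((x - mx)*(y - my))
bysort run: egen double syy = total((y - my)^2)
gen double slope = sxy/syy           // corresponds to _b[y]
gen double cons  = mx - slope*my     // corresponds to _b[_cons]
```

This makes a handful of passes over the 400,000 observations instead of running 50,000 separate regress calls, and it works on data that were simulated outside Stata.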
> From: Sergiy Radyakin
> Sent: Monday, November 15, 2010 4:54 PM
> 50000 regressions on 8observations dataset of two variables
> should take about 30 seconds (see below).

See below.

> So don't generate the large dataset, but rather run the
> regressions right away when you generate your simulated data.
> You don't need to save the 50000x8 observations you
> generated, as [presumably] you are also doing it with Stata,
> so next time you simulate them with your dofile they will
> be the same (don't forget to set the rnd seed)

No, the simulations were not done in Stata.

> On the other hand, since you need only one coefficient from
> this trivial regression, you may ask yourself if the
> regress artillery is really necessary here, or a trivial
> formula, such as the one here:
> http://en.wikipedia.org/wiki/Regression_analysis
> would suffice (and be faster).

Thanks, I will give it a try, although I am not sure whether the regression is actually the problem (see response below).

> In any case, don't forget to specify quietly. I am almost
> sure you don't have any intention to review the output of the
> 50,000 regressions, and that speeds up the program a lot.

Yes, I do it quietly in my do-file but skipped it for the example code.

> . do "R:\TEMP\STD04000000.tmp"
> . set rmsg on
> r; t=0.00 10:42:16
> . sysuse auto, clear
> (1978 Automobile Data)
> r; t=0.00 10:42:16
> . keep in 1/8
> (66 observations deleted)
> r; t=0.00 10:42:16
> .
> . forvalues i=1/50000 {
>   2. qui regress price weight
>   3. }
> r; t=26.53 10:42:42
> .
> end of dofile
> r; t=26.53 10:42:42

I have a large dataset (400,000 obs and not 8) and need to analyse a subset each time, and that's probably the issue (not the regression itself or the loop).

BW/Sven
In reply to this post by Maarten buis
> Maarten buis
> Sent: Monday, November 15, 2010 4:39 PM
> > I have a large (simulated) dataset with 400,000 observations (from
> > overall 50,000 simulations each creating
> > 8 observations). I need to perform a linear regression for each
> > simulation separately. I noticed the following:
> >
> > 1) keeping all observations in the dataset and looping through the
> > simulations is very inefficient i.e. it takes several hours to run
> > e.g.
> > * first example starts; run is an ID for simulation
> > gen regcoeff = .
> > forval s=1/50000 {
> >   regress x y if run==`s'
> >   replace regcoeff = _b[y] if _n==`s'
> > }
> > * first example ends
>
> An in condition is often quicker than an if condition.
> You need to do more work to make sure that the in condition
> is appropriate, but that is the price to pay.

I will try this. Thanks.

> Anyhow, before doing all this I would start with statsby,
> see: help statsby.

As always, I wasn't 100% precise ... The statsby command is actually much quicker (and thanks for the advice). However, I also need to predict after each regression, and apparently this is not possible with statsby.

Consequently, I will try:
1) the "in" condition instead of "if";
2) using statsby and predicting by hand from these results.

Sven
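A hedged sketch of the second option, statsby followed by prediction by hand; it is an illustration, not code posted in the thread. It assumes the full dataset is in memory with variables run, x and y, and a Stata version that supports the merge m:1 syntax (Stata 11 or later). The tempfile name coefs and the variable xhat are hypothetical; _b_y and _b_cons are the names statsby gives the collected coefficients.

```stata
* Sketch only: assumes variables run, x, y are in memory (Stata 11+)
tempfile coefs
preserve
statsby _b, by(run) clear: regress x y     // leaves one row per run
save `coefs'
restore
merge m:1 run using `coefs', nogenerate    // adds _b_y and _b_cons to every row
gen double xhat = _b_cons + _b_y*y         // linear prediction, as predict's default xb
```

Here preserve and restore are each called once for the whole job rather than once per simulation, so the overhead noted in the original question does not multiply by 50,000.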