
Splitting a dataset efficiently/run regression repeatedly in subsets

Splitting a dataset efficiently/run regression repeatedly in subsets

Trelle Sven
Dear all,
I have a large (simulated) dataset with 400,000 observations (from
overall 50,000 simulations each creating 8 observations). I need to
perform a linear regression for each simulation separately. I noticed
the following:

1) keeping all observations in the dataset and looping through the
simulations is very inefficient i.e. it takes several hours to run e.g.
* first example starts; run is an ID for simulation
gen regcoeff = .
forval s=1/50000 {
        regress x y if run==`s'
        replace regcoeff = _b[y] if _n==`s'
}
* first example ends

2) preserving and restoring is even more time-consuming

3) I thought of creating a loop as before but load the data at the
beginning and then keeping only the data for the particular simulation.
However, this means the data is loaded 50,000 times (it comes from a
server with a slow connection, so this is also not optimal), and it
would make storing the results a little difficult.
* second example starts
gen regcoeff = .
save sim.dta, replace
local coeff = 0 // dummy for first run of loop
local p = 1 // dummy for first run of loop
forval s=1/50000 {
        use sim.dta, clear
        replace regcoeff = `coeff' if _n==`p'
        save sim.dta, replace
        keep if run==`s'
        regress x y
        local coeff = _b[y]
        local p=`s'
}
use sim.dta, clear
replace regcoeff = `coeff' if _n==`p'
save sim.dta, replace
* second example ends

I am sure there is a better way of doing this.
If there is anybody who has better ideas I would appreciate any
suggestions/help.

All the best
Sven


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Re: Splitting a dataset efficiently/run regression repeatedly in subsets

nshephard
Administrator
Depending on how much physical RAM you have and whether you're using a
32-bit or 64-bit OS (preferably the latter, to allow access to more
than 1 GB of RAM), you might consider using a -tempfile- to hold your
data and reload from.  A simplified example (not storing coefficients):

use sim, clear
tempfile t
save `t', replace
forval s=1/50000 {
        qui use if run == `s' using `t', clear
        regress x y
}

Whether it's actually quicker I've no idea, but it will certainly save
on reads/writes to the network drive.
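
The example above does not store the coefficients. One way to collect them is -postfile-, sketched below under stated assumptions: the toy data (100 runs of 8 observations) merely stands in for the real sim.dta, and the names `results`, `h` and `regcoeff` are illustrative only. This is a sketch, not a benchmarked solution.

```stata
* Hypothetical toy data standing in for sim.dta: 100 runs x 8 observations
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 2*y + rnormal()

tempfile t results
save `t'

* Open a results file with one row per simulation
tempname h
postfile `h' long run double regcoeff using `results'

forvalues s = 1/100 {
    * Load only the rows of simulation `s' from the local tempfile
    quietly use if run == `s' using `t', clear
    quietly regress x y
    post `h' (`s') (_b[y])
}
postclose `h'

* The collected coefficients, one observation per run
use `results', clear
```

Because the results go to a separate file, nothing has to be written back into the main dataset inside the loop.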

Neil
--
"Our civilization would be pitifully immature without the intellectual
revolution led by Darwin" - Motoo Kimura, The Neutral Theory of
Molecular Evolution

Email - [hidden email]
Website - http://kimura-no-ip.org/
Photos - http://www.flickr.com/photos/slackline/

Re: Splitting a dataset efficiently/run regression repeatedly in subsets

Maarten buis
In reply to this post by Trelle Sven
--- On Mon, 15/11/10, Trelle Sven wrote:

> I have a large (simulated) dataset with 400,000
> observations (from overall 50,000 simulations each creating
> 8 observations). I need to perform a linear regression for
> each simulation separately. I noticed the following:
>
> 1) keeping all observations in the dataset and looping
> through the simulations is very inefficient i.e. it takes
> several hours to run e.g.
> * first example starts; run is an ID for simulation
> gen regcoeff = .
> forval s=1/50000 {
>     regress x y if run==`s'
>     replace regcoeff = _b[y] if _n==`s'
> }
> * first example ends

An -in- condition is often quicker than an -if- condition. You
need to do more work to make sure that the -in- condition is
appropriate, but that is the price to pay.
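
A minimal sketch of the -in- approach, assuming every run has exactly 8 observations and the data are sorted by run so that each run occupies a contiguous block of rows. The toy data below is hypothetical and only stands in for the real dataset.

```stata
* Hypothetical toy data: 100 runs x 8 observations
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 2*y + rnormal()

* -in- only works if each run sits in a known, contiguous row range
sort run
gen regcoeff = .
forvalues s = 1/100 {
    local first = 8*(`s' - 1) + 1
    local last  = 8*`s'
    quietly regress x y in `first'/`last'
    * Store the slope of run `s' in row `s', as in the original example
    quietly replace regcoeff = _b[y] in `s'
}
```

If the runs had unequal sizes, the row ranges would have to be computed from the data first (the extra work mentioned above) rather than from the fixed block length of 8.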

> 2) preserving and restoring is even more time-consuming

that makes sense

> 3) I thought of creating a loop as before but load the data
> at the beginning and then keeping only the data for the
> particular simulation.

Sounds like that would be slow also.

Anyhow, before doing all this I would start with -statsby-,
see: -help statsby-.
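
For illustration, a -statsby- call for this problem might look like the sketch below. The toy data is hypothetical, and `regcoeff` is just the name chosen here for the collected slope.

```stata
* Hypothetical toy data: 100 runs x 8 observations
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 2*y + rnormal()

* One regression per run; the data in memory are replaced by
* one observation per run holding the slope on y
statsby regcoeff = _b[y], by(run) clear nodots: regress x y
```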

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------


     

Re: Splitting a dataset efficiently/run regression repeatedly in subsets

Sergiy Radyakin
In reply to this post by Trelle Sven
Dear Sven,

Running 50,000 regressions on an 8-observation dataset of two variables
should take about 30 seconds (see below).
So don't generate the large dataset; instead, run each regression
right away as you generate the simulated data.
You don't need to save the 50,000 x 8 observations you generated:
[presumably] you are also simulating them in Stata, so the next time
you run your do-file they will be the same
(don't forget to set the random-number seed).

On the other hand, since you need only one coefficient from this
trivial regression, you may ask yourself if the -regress-
artillery is really necessary here, or a trivial formula, such as the one here:
http://en.wikipedia.org/wiki/Regression_analysis
would suffice (and be faster).
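
As a sketch of that idea: for -regress x y- the slope on y is sum((y - ybar)*(x - xbar)) / sum((y - ybar)^2), which can be computed for all runs at once with -egen- and no loop. The toy data and the helper variable names (`ybar`, `xbar`, `sxy`, `syy`, `num`, `den`) are assumptions made for illustration.

```stata
* Hypothetical toy data: 100 runs x 8 observations
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 2*y + rnormal()

* Within-run means of regressor (y) and outcome (x)
bysort run: egen double ybar = mean(y)
bysort run: egen double xbar = mean(x)

* Cross-products and squared deviations, summed within each run
gen double sxy = (y - ybar)*(x - xbar)
gen double syy = (y - ybar)^2
bysort run: egen double num = total(sxy)
bysort run: egen double den = total(syy)

* OLS slope of x on y, repeated on every row of the run
gen double regcoeff = num/den
```

The intercept follows as xbar - regcoeff*ybar if it is needed as well.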

In any case, don't forget to specify -quietly-. I am almost sure you
don't have any intention to review the output of the
50,000 regressions, and that speeds up the program a lot.

Best,
Sergiy Radyakin.

PS: I am strongly convinced you don't need access to more than 1 GB of
memory for the task of running univariate regressions on
8-observation datasets.


. do "R:\TEMP\STD04000000.tmp"
. set rmsg on
r; t=0.00 10:42:16
. sysuse auto, clear
(1978 Automobile Data)
r; t=0.00 10:42:16
. keep in 1/8
(66 observations deleted)
r; t=0.00 10:42:16
.
. forvalues i=1/50000 {
  2.    qui regress price weight
  3. }
r; t=26.53 10:42:42
.
end of do-file
r; t=26.53 10:42:42






RE: Splitting a dataset efficiently/run regression repeatedly in subsets

Trelle Sven

> From: Sergiy Radyakin
> Sent: Monday, November 15, 2010 4:54 PM

> 50000 regressions on 8-observations dataset of two variables
> should take about 30 seconds (see below).

See below

> So don't generate the large dataset, but rather run the
> regressions right away when you generate your simulated data.
> You don't need to save the 50000x8 observations you
> generated, as [presumably] you are also doing it with Stata,
> so next time you simulate them with your do-file - they will
> be the same (don't forget to set the rnd seed)

No, the simulations were not done in Stata

> On the other hand, since you need only one coefficient from
> this trivial regression, you may ask yourself if the
> -regress- artillery is really necessary here, or a trivial
> formula, such as the one here:
> http://en.wikipedia.org/wiki/Regression_analysis
> would suffice (and be faster).

Thanks, I will give it a try although I am not sure whether the
regression is actually the problem (see response below)
 
> In any case, don't forget to specify -quietly-. I am almost
> sure you don't have any intention to review the output of the
> 50,000 regressions, and that speeds up the program a lot.

Yes, I do it quietly in my do-file but skipped it for the example code.

> . do "R:\TEMP\STD04000000.tmp"
> . set rmsg on
> r; t=0.00 10:42:16
> . sysuse auto, clear
> (1978 Automobile Data)
> r; t=0.00 10:42:16
> . keep in 1/8
> (66 observations deleted)
> r; t=0.00 10:42:16
> .
> . forvalues i=1/50000 {
>   2.    qui regress price weight
>   3. }
> r; t=26.53 10:42:42
> .
> end of do-file
> r; t=26.53 10:42:42

I have a large dataset (400,000 obs and not 8) and need to analyse a
subset and that's probably the issue (not the regression itself or the
loop).

BW/Sven
 


RE: Splitting a dataset efficiently/run regression repeatedly in subsets

Trelle Sven
In reply to this post by Maarten buis
> Maarten buis
> Sent: Monday, November 15, 2010 4:39 PM

> > I have a large (simulated) dataset with 400,000 observations (from
> > overall 50,000 simulations each creating
> > 8 observations). I need to perform a linear regression for each
> > simulation separately. I noticed the following:
> >
> > 1) keeping all observations in the dataset and looping through the
> > simulations is very inefficient i.e. it takes several hours to run
> > e.g.
> > * first example starts; run is an ID for simulation
> > gen regcoeff = .
> > forval s=1/50000 {
> >     regress x y if run==`s'
> >     replace regcoeff = _b[y] if _n==`s'
> > }
> > * first example ends
>
> An -in- condition is often quicker than an -if- condition.
> You need to do more work to make sure that the -in- condition
> is appropriate, but that is the price to pay.

I will try this. Thanks.
 
 
> Anyhow, before doing all this I would start with -statsby-,
> see: -help statsby-.

As always, I wasn't 100% precise ...
The statsby command is actually much quicker (and thanks for the advice). However, I also need to predict after each regression and apparently this is not possible with statsby.

Consequently, I will try

1) the "in" condition instead of "if".
2) Use statsby and predict by hand using these results
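
Option 2 could be sketched as follows, assuming Stata 11 or later for the -merge m:1- syntax. The toy data and the names `coefs`, `b_y`, `b_cons` and `xhat` are hypothetical.

```stata
* Hypothetical toy data: 100 runs x 8 observations
clear
set seed 12345
set obs 800
gen long run = ceil(_n/8)
gen y = rnormal()
gen x = 2*y + rnormal()

* Save slope and intercept per run to a file; data in memory stay as they are
tempfile coefs
statsby b_y = _b[y] b_cons = _b[_cons], by(run) ///
    saving(`coefs', replace) nodots: regress x y

* Attach each run's coefficients to its 8 observations
merge m:1 run using `coefs', nogenerate

* Linear prediction "by hand", equivalent to -predict- after each regression
gen double xhat = b_cons + b_y*y
```

Residuals would then be x - xhat; other -predict- statistics (e.g. standard errors of the prediction) would need additional results collected by -statsby-.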

Sven

