Quantcast

st: Working with large data set like a database

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

st: Working with large data set like a database

David Souther-3
I have a question about working with  very large data sets (combined
sizes ~ 40 gig) to run analysis when only 6 gig of memory is
available.
A second complicating factor is that I need to join some of these data
sets together based on a date range or similar join rule.  In Oracle,
I could query out only the columns I need and then join them to other
files using a rule- such as the dates being within "x" number of days
of each other.  I cannot get ""merge"" in stata to accept these kinds
of date ranges.  Here are an example of two datasets to join

***subdataset***
date1 var1 extravar extravar1
10/22/2008 3 44 44
02/01/2001 5 44 44
05/24/2005 9 44 44
12/12/2012 99 44 44
12/29/2012 100 44 44

***big dataset***
 date1 var2 extravar extravar1
10/20/2008 500 44 44
02/07/2001 500 44 44
05/20/2005 900 44 44
12/12/2015 990 44 44
01/01/1999 1000 44 44
01/01/1970 2000 44 44
01/01/1970 2222 44 44
12/01/2012 7777 44 55


I need to join by ""date1"" and load up a data set for analysis with
ONLY ""date1"", ""var1"", and ""extravar1"".  Thanks for helping.

DFS
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: st: Working with large data set like a database

Eric A. Booth
<>


On Jan 1, 2010, at 10:37 AM, David Souther wrote:
> I have a question about working with  very large data sets (combined
> sizes ~ 40 gig) to run analysis when only 6 gig of memory is
> available.

.What format are the data stored in?  If they are already in Stata format (.dta), then you can use the -use- command and specify the variables and observations you want to use, e.g.

use price mpg for in 1/25 if for<1 using auto.dta, clear

This would "query" (as you put it) out only those 3 variables in rows 1/25 where foreign<1.

However, if you were starting from a very large comma/tab-delimited file (and you weren't able to convert the file to Stata format via Stat Transfer & the file was too large to simply -insheet- ), you could use -chunky- (from SSC) to bring in the vars/rows you needed.  The help file for -chunky- has a loop for this process already spelled out.

> I cannot get ""merge"" in stata to accept these kinds
> of date ranges.

What commands did you attempt?  
You can use -nearmrg- (from SSC) to do some date range matches.  Here's an example of how to get it to work with your example data; however, please note a few things:  
(1) I am assuming you want to "update" (see the -merge- help file about this option) extravar1 since its value changes in the "big" dataset;
(2) my example matches the "sub" dataset record that is closest to & greater than the nearest match in the "big" dataset, then I remove records where the "sub" date is > 90 days from the "big" date

**************start
clear
**"sub" dataset**
input str10 date1 var1 extravar extravar1
"10/22/2008" 3 44 44
"02/01/2001" 5 44 44
"05/24/2005" 9 44 44
"12/12/2012" 99 44 44
"12/29/2012" 100 44 44
end
gen date2  = date(date1, "MDY")
sort  date2
drop date1 extravar  
compress
save "using.dta", replace

********
**"big" dataset**
clear
input str10 date1 var2 extravar extravar1
"10/20/2008" 500 44 44
"02/07/2001" 500 44 44
"05/20/2005" 900 44 44
"12/12/2015" 990 44 44
"01/01/1999" 1000 44 44
"01/01/1970" 2000 44 44
"01/01/1970" 2222 44 44
"12/01/2012" 7777 44 55
end
gen date2  = date(date1, "MDY")
sort  date2
drop date1
compress
save "master.dta", replace

********
**merge**
cap ssc install nearmrg
**using only what's needed**
clear
use date2  extravar1 using "master.dta"
nearmrg  using "using.dta", upper nearvar(date2) genmatch(match)  _merge(merge) update nokeep
compress
rename date2 date1
format date1 %td
format match %td
li date1 match var* extravar1
**keep only within certain date range of "x"=90 days**
g diff = match-date1
li date1 diff
keep if diff<=90     //-matching on 90 days from "big" date
drop diff merge
li *
save "combined.dta", replace
**your analysis here**
reg var1 ext

**************end

Note:   From your post it sounds like you do not want to load all of the 40G of data into memory at one time for analysis (that is, you only want some of the variables loaded in), but depending on the number of observations, your dataset may still be quite large, so you may want to consider the OS, physical memory, and virtual memory on your machine.  See these links:  
http://www.stata.com/statalist/archive/2007-05/msg00326.html
http://www.stata.com/statalist/archive/2009-12/msg00980.html



Good luck,

Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[hidden email]
Office: +979.845.6754


On Jan 1, 2010, at 10:37 AM, David Souther wrote:

> I have a question about working with  very large data sets (combined
> sizes ~ 40 gig) to run analysis when only 6 gig of memory is
> available.
> A second complicating factor is that I need to join some of these data
> sets together based on a date range or similar join rule.  In Oracle,
> I could query out only the columns I need and then join them to other
> files using a rule- such as the dates being within "x" number of days
> of each other.  I cannot get ""merge"" in stata to accept these kinds
> of date ranges.  Here are an example of two datasets to join
>
> ***subdataset***
> date1 var1 extravar extravar1
> 10/22/2008 3 44 44
> 02/01/2001 5 44 44
> 05/24/2005 9 44 44
> 12/12/2012 99 44 44
> 12/29/2012 100 44 44
>
> ***big dataset***
> date1 var2 extravar extravar1
> 10/20/2008 500 44 44
> 02/07/2001 500 44 44
> 05/20/2005 900 44 44
> 12/12/2015 990 44 44
> 01/01/1999 1000 44 44
> 01/01/1970 2000 44 44
> 01/01/1970 2222 44 44
> 12/01/2012 7777 44 55
>
>
> I need to join by ""date1"" and load up a data set for analysis with
> ONLY ""date1"", ""var1"", and ""extravar1"".  Thanks for helping.
>
> DFS
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/




*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

st: Chow Test for IV estimators in STATA

Sungbok Lee
Dear all,

Is there anyone that have experienced the same question with mine in STATA?
Please let me know how you did it.

My question is to test the equality of the two IV estimators. But I have no
idea of how to define an unrestricted regression expression in STATA.

*restricted model: all coefficients are equal between the two groups
ivreg y (x1 x2 = z1 z2)

*unrestircted model: the coefficient for x1 is different but the coefficient
for x2 is equal


javanfish.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: st: Chow Test for IV estimators in STATA

JOHN ANTONAKIS
Hi:

I suspect you could test this with a structural equation modeling
program (using Maximum Likelihood estimation) like glamm (I use Mplus
for those purposes because it is much faster with ML estimation than
glamm is; however, I don't know glamm well enough to make a definitive
statement on this). Basically, in the case of a structural-equation
modeling program you would define a stacked (multisample or multiple
group model) and the test for the equality of coefficient/2 across
groups (don't forget to allow the disturbances of y x1 and x2 to
correlate, otherwise you will get OLS estimates).

HTH,
John.

____________________________________________________

Prof. John Antonakis, Associate Dean
Faculty of Business and Economics
Department of Organizational Behavior
University of Lausanne
Internef #618
CH-1015 Lausanne-Dorigny
Switzerland

Tel ++41 (0)21 692-3438
Fax ++41 (0)21 692-3305

Faculty page:
http://www.hec.unil.ch/people/jantonakis

Personal page:
http://www.hec.unil.ch/jantonakis
____________________________________________________



On 01.01.2010 23:33, Sungbok Lee wrote:
 > Dear all,
 >
 > Is there anyone that have experienced the same question with mine in
STATA?
 > Please let me know how you did it.
 >
 > My question is to test the equality of the two IV estimators. But I
have no
 > idea of how to define an unrestricted regression expression in STATA.
 >
 > *restricted model: all coefficients are equal between the two groups
 > ivreg y (x1 x2 = z1 z2)
 >
 > *unrestircted model: the coefficient for x1 is different but the
coefficient
 > for x2 is equal
 >
 >
 > javanfish.
 >
 > *
 > *   For searches and help try:
 > *   http://www.stata.com/help.cgi?search
 > *   http://www.stata.com/support/statalist/faq
 > *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Loading...