Quantcast

merging datasets

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

merging datasets

Michael Eisenberg
Colleagues,

I have a database of about 20K men that I'd like to merge with another
database.  I have names (first, middle, and last) as well as date of
birth and social security number for most men.  Unfortunately, the
original database has some missing data on birthdate and social
security numbers.  The new database has most of the birthdate info as
well as the geographic information that I need.

Some men do have the same name.

Is there anyway to merge based on name if it doesn't uniquely identify
men?  I'd like to somehow match all men and then let me manually
compare based on visit dates to decide if it's likely the match is
correct.  If not, any suggestions?

Thanks for you help.

Mike
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: merging datasets

nshephard
Administrator
On Tue, Nov 9, 2010 at 4:50 AM, Michael Eisenberg
<[hidden email]> wrote:

> Colleagues,
>
> I have a database of about 20K men that I'd like to merge with another
> database.  I have names (first, middle, and last) as well as date of
> birth and social security number for most men.  Unfortunately, the
> original database has some missing data on birthdate and social
> security numbers.  The new database has most of the birthdate info as
> well as the geographic information that I need.
>
> Some men do have the same name.
>
> Is there anyway to merge based on name if it doesn't uniquely identify
> men?

You could investigate the utility of -nearmrg- by Michael Blasnik &
Katherine Smith available from the SSC archives..

ssc describe nearmrg
ssc install nearmrg

To read about it and then install it.

Neil




--
"Our civilization would be pitifully immature without the intellectual
revolution led by Darwin" - Motoo Kimura, The Neutral Theory of
Molecular Evolution

Email - [hidden email]
Website - http://kimura-no-ip.org/
Photos - http://www.flickr.com/photos/slackline/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: merging datasets

Michael Blasnik
In reply to this post by Michael Eisenberg
Rather than -nearmrg-, I think you'd be better off using joinby and
then using a set of logical conditions to identify the "good" matches.
 It may be tricky if you have people with the same names and missing
data on the auxiliary variables.

Michael



On Mon, Nov 8, 2010 at 11:50 PM, Michael Eisenberg
<[hidden email]> wrote:

> Colleagues,
>
> I have a database of about 20K men that I'd like to merge with another
> database.  I have names (first, middle, and last) as well as date of
> birth and social security number for most men.  Unfortunately, the
> original database has some missing data on birthdate and social
> security numbers.  The new database has most of the birthdate info as
> well as the geographic information that I need.
>
> Some men do have the same name.
>
> Is there anyway to merge based on name if it doesn't uniquely identify
> men?  I'd like to somehow match all men and then let me manually
> compare based on visit dates to decide if it's likely the match is
> correct.  If not, any suggestions?
>
> Thanks for you help.
>
> Mike
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: merging datasets

Michael N. Mitchell
In reply to this post by Michael Eisenberg
Dear Mike

   I don't know if this is helpful, but when I have encountered this kind of data, this is
the strategy that I have used. Call the two databases A and B. I would start by matching A
and B based on all of the variables (first, middle, last, dob, ssn). Some observations
will match on all criteria. Call those matches observations that met matched based on
criteria one. Take the remaining unmatched observations and then try and match them on a
looser criteria, for example everything but middle name. Call those matches criteria two
matches. Take the unmatched observations and try matching again on a looser criteria.
Repeat this process continuing to loosen up the matching criteria. At the end, I might be
matching based on a criteria that is too loose for my comfort (such as, last name only).
You can then do a frequency count, among the matching records, of how many matched at each
criteria level (including the criteria that is too loose for comfort). You can then weigh
the number of matches against the criteria to decide the optimal balance between matches
and quality of the match criteria.

I hope this helps,

Michael N. Mitchell
Data Management Using Stata      - http://www.stata.com/bookstore/dmus.html
A Visual Guide to Stata Graphics - http://www.stata.com/bookstore/vgsg.html
Stata tidbit of the week         - http://www.MichaelNormanMitchell.com



On 2010-11-08 8.50 PM, Michael Eisenberg wrote:

> Colleagues,
>
> I have a database of about 20K men that I'd like to merge with another
> database.  I have names (first, middle, and last) as well as date of
> birth and social security number for most men.  Unfortunately, the
> original database has some missing data on birthdate and social
> security numbers.  The new database has most of the birthdate info as
> well as the geographic information that I need.
>
> Some men do have the same name.
>
> Is there anyway to merge based on name if it doesn't uniquely identify
> men?  I'd like to somehow match all men and then let me manually
> compare based on visit dates to decide if it's likely the match is
> correct.  If not, any suggestions?
>
> Thanks for you help.
>
> Mike
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: merging datasets

Will Hauser
In reply to this post by Michael Eisenberg
Mike,
I have worked extensively with matching string and numeric variables
across several data sets.  It is a pain.  Michael Mitchell has given you
good advice I would build on it by noting the following:
(sorry, this will be lengthy)
Let me summarize by saying - do NOT use the update option, watch for
duplicates, and remove matches from each step.

1.  With large data sets such as yours (and mine) the joinby command
creates a mess.  I don't trust myself to sort it out with logical
commands.  The best approach is to first make matches that are sure
things (i.e. matches on all variables or those that really matter -
first name, last name, and birthday).  Hopefully the number of matches
remaining that were not 'perfect' is a manageable number so that you can
do some hand checking with the help of some dummy variables to identify
what info does not match.

2. You should know the merge command, at least in stata 10, will
duplicate entries in the master file if there are multiple matches in
the using data set that match the case in the master according to the
match criteria you set.  Stata does this *even though* the master file
is "held inviolate."  So, if you match based on last name only you'll
wind up with several duplicate entries for "Smith" where each Smith in
the master is match to every other "Smith" in the using data set.  
You'll want a unique identifier for each case in the master so you can
see what names end up being duplicated (command:  duplicates list
"idvar").  Also do a count before the merge and make sure you end up
back at that number at the end of the process.

3. Do the matching process across multiple 'passes'.  First pass uses
the most match variables and last pass uses the least.  Choose the match
variables carefully so that earlier passes are more trustworthy than
later passes (e.g. the pass based on last name and birthday should
probably come before the pass based on first initial and last name).  
When you have a match, remove it from the master (and in your case
probably the using file as well).  This is because subsequent matches
will use less info and there is no reason to think that they will be a
better match than a match from an earlier pass where more of the
variables matched.  When you are done with each merge just save the file
to a temporary file name, keep only the matches, and then save the file
as "first pass matches" or some equivalent.  Then re-open the temp file,
keep only the unmatched (i.e. _merge==1) and save as the "master data"
or some equivalent.  When you are done just append all the saved matches
data sets together.
*To keep things orderly I like to include the word "master" in the
filename for the master dataset and "using" for the filename for the
using dataset.  The syntax can become cumbersome and it's nice to
quickly see that I'm using the right datasets in the right places.

4. Clean your string variables carefully and thoroughly.  Look out for
individuals with hyphenated names, capitalizations (i.e. McCory), or
spaces (i.e. St Joseph).  Be sure suffixes aren't contaminating the last
name field.  Look out for extra spaces (leading, trailing, or internal)
and make the capitalization uniform (all caps or all lowercase).  Use
the string functions, in particular regexm, trim, itrim, and the user
written code 'strip'.  You can't clean your string variables enough!

5. In my experience first names tend to vary - one database may have
nicknames while the other may have formal names.  Resolve this with a
variable for first initial.  First match using the full first name along
with the other variables then the next pass should use the first initial
instead along with the other variables.  You may want to do this with
birth dates as well - first use the actual date and then just use the
year (or month and year).  Obviously make sure you convert dates from
string format into stata's numeric date format.

6. Name and label your _merge variable from each pass so you know where
the match came from, how trustworthy it is, what variables would've
matched, etc.  These are extremely handy when hand checking or writing
logical statements to drop bad matches.

7.  With a large data set the merge variables probably won't uniquely
identify the cases.  This same problem applies to the sort command.  For
this reason when you sort your data sets use the 'stable' option! If you
are like me and like to be able to go back and re-run the do file to
make changes or corrections in light of errors you find later on you may
have a real mess on you hands if you don't use the stable option (I
found out the hard way).  Have a unique sort order at the start (i.e.
sort by all the variables) and then always specify the stable option
with subsequent sorts.  I use numeric ids to refer to each case (i.e.
when dropping mismatches) and these ids would change each time I ran the
do file because the order of the dataset was not constant when I created
the ids after the match process.

8. Keep track of duplicates in the master dataset.  You can do this with
the egen 'group' command. Generate groups for each set of variables that
you match on.  For any group consisting of more then one tag it as
duplicated on varlist (your march variables) using a dummy variable.  
This is because some matches are not necessarily bad matches but they
may be uncertain matches.  In short, they may match based on the
variables you have data for but, because there are duplicates on those
variables, you can't say for sure if the match is correct or not.  For
example if you have 2 John Smiths who are both missing a birthday and
you have 2 John Smiths in the using data set who each have a birthday
then a merge will link these up (and generate 2 duplicates) but in the
end you have to throw these matches out because you don't know which
case in the master goes with which case in the using.  Having a variable
that identifies cases that have duplicates based on first and last name
will help you write a logical command to drop these uncertain matches.

9.  Last and finally, do NOT use the update command!  Let's say you have
a Franklin Smith in the master file, he is missing his birthday.  Now
you match based on last name and first initial and get a match for F
Smith in the using data set (he has a birthday listed).  The data in the
master file is not overwritten so only the missing data is carried over
into the master and your Franklin Smith now has a birthday.  Sounds
great but what you cannot see is that Franklin "F Smith" matched to
Frederick "F Smith" in the using file - almost certainly not the same
character but you cannot discern this info from the mater dataset.  For
this reason **the update command is a baaaad idea unless you are 100%
confident the match is and can only be correct.**

When a variable in the master has missing data you can get that data
into the master from the using by opening the using data set before the
merge, and duplicating that variable with a slightly different name
(e.g. varname_using).  Then, once you've eliminated bad matches you can
just copy that data from varname_using to varname if varname=="".  
***Because the merge will bring varname_using into the master file make
sure you drop it for the *unmatched cases* before the next match (you
can do this when you open the master to sort it before the next match).

One more tip:  depending on the complexity of the task, the directory
structure, and your cleverness you may be able to use a foreach loop to
do the merges.

If you have additional questions about anything I've written here please
feel free to contact me personally.  I'm not a stata guru but have used
it extensively for precisely the sort of task that you are undertaking
with some very large data sets (e.g. voter rolls).    If you'd like to
see some of my do files I've used for this purpose I'd be happy to
share.  Caveat emptor.

Will

On 11/8/2010 11:50 PM, Michael Eisenberg wrote:

> Colleagues,
>
> I have a database of about 20K men that I'd like to merge with another
> database.  I have names (first, middle, and last) as well as date of
> birth and social security number for most men.  Unfortunately, the
> original database has some missing data on birthdate and social
> security numbers.  The new database has most of the birthdate info as
> well as the geographic information that I need.
>
> Some men do have the same name.
>
> Is there anyway to merge based on name if it doesn't uniquely identify
> men?  I'd like to somehow match all men and then let me manually
> compare based on visit dates to decide if it's likely the match is
> correct.  If not, any suggestions?
>
> Thanks for you help.
>
> Mike
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Loading...