Predict in version 11

Predict in version 11

Marnix Zoutenbier
Dear all,

I see a difference in the way -predict- works between Stata 10 and Stata 11.

Consider the following example data:

x1   testset    y
 1         1   12
 2         1   13
 3         1   14
 4         2    .

and the commands

anova y x1 if testset==1
predict yhat
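
For anyone who wants to reproduce this, a minimal sketch of the example as a do-file follows. The data values and the two estimation commands come from the post; the -clear-, -input- and -list- steps are added here for convenience. Which value the last observation receives depends on the Stata version in use.

```stata
* Enter the four example observations (values taken from the post)
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end

* Fit on the training set only, then predict for every observation
anova y x1 if testset == 1
predict yhat

* Inspect which observations received a prediction
list x1 testset y yhat
```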

In version 11 this gives

x1   testset    y   yhat
 1         1   12     12
 2         1   13     13
 3         1   14     14
 4         2    .     12

whereas in version 10 it gives

x1   testset    y   yhat
 1         1   12     12
 2         1   13     13
 3         1   14     14
 4         2    .      .

I prefer the version 10 behaviour, because it lets me identify observations
in the test set (testset==2) whose value of x1 does not occur in the
training set (testset==1).

Is it possible to obtain the same result in version 11 as in version 10,
other than by switching with the version command before and after predict?

Thank you for your consideration,

Marnix Zoutenbier

______________________

Drs. Marnix Zoutenbier MTD CIRM
Senior Consultant

T: +31 (0)40 750 23 25
F: +31 (0)40 750 16 99
E: [hidden email]

CQM B.V.
PO Box 414, 5600 AK Eindhoven, The Netherlands
Vonderweg 16, 5616 RM Eindhoven, The Netherlands
KvK 17076484
I: www.cqm.nl

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Re: Predict in version 11

nshephard
Administrator
On Wed, Dec 8, 2010 at 9:58 AM, Marnix Zoutenbier
<[hidden email]> wrote:

> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?


Yes, see the -man predict- page
(http://www.stata.com/help.cgi?predict), items 6 and 7 in the
Description section near the top...

    predict can be used to make in-sample or out-of-sample predictions:

        6.  predict calculates the requested statistic for all possible
            observations, whether they were used in fitting the model or
            not.  predict does this for the standard options 1 through 3
            and generally does this for estimator-specific options 4.

        7.  predict newvar if e(sample), ...  restricts the prediction to
            the estimation subsample.


So in your above example under Stata 11 you should use...

predict yhat if(e(sample))
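
As a sketch of how that restriction could be used in practice (the variable names `insample` and `yhat_in` are made up for illustration): e(sample) can also be copied into an ordinary variable right after estimation, so the estimation sample stays identifiable after later estimation commands replace the stored results.

```stata
anova y x1 if testset == 1

* Keep a permanent 0/1 marker of the estimation sample
generate byte insample = e(sample)

* Predict only where the model was actually fitted
predict yhat_in if e(sample)

* Observations outside the estimation sample are left missing
count if missing(yhat_in) & !insample
```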


Neil



Re: Predict in version 11

Marnix Zoutenbier
Dear all,

Neil's reply is correct. However, it shows that I did not formulate my
problem accurately, because it is not a solution that works for me.

Let me extend the example with one extra observation to make myself
clearer:

x1   testset    y
 1         1   12
 2         1   13
 3         1   14
 4         2    .
 3         2    .

So the last observation has the same value of x1 as the third
observation. The test set (testset==2) consists of 2 observations. Of
these, the observation with x1=3 can be predicted from the training set
(testset==1), but the observation with x1=4 cannot be predicted, because
x1=4 does not occur in the training set.

First in version 11:

version 11
anova y x1 if testset==1
predict yhat

This gives

x1   testset    y   yhat
 1         1   12     12
 2         1   13     13
 3         1   14     14
 4         2    .     12
 3         2    .     14

Now in version 10:

version 10
anova y x1 if testset==1
predict yhat

This gives

x1   testset    y   yhat
 1         1   12     12
 2         1   13     13
 3         1   14     14
 4         2    .      .
 3         2    .     14

The 'e(sample)' suggestion does not fix this, because I do want to predict
in the test set (outside e(sample)); however, I only want predictions for
values of x1 that occur in the training set (testset==1).

Thank you for your consideration,

Best regards,

Marnix



RE: Predict in version 11

Nick Cox
The solution appears to be just a twist away from that already given.

... if testset == 1

Otherwise put, -predict- allows -if- (and -in-), so just specify whatever restrictions you want.

Nick
[hidden email]

RE: Predict in version 11

Marnix Zoutenbier
Dear Nick,

Thank you for your response. However, your solution is not what I mean. I
want to predict for both testset==1 and testset==2, but I want Stata to
predict a missing value when x1=4 in testset==2, because x1=4 does not
appear in testset==1.

However, in version 11 Stata also predicts in testset==2 for values of x1
that do not appear in testset==1 (the training set). Stata uses the constant
to predict, which I think is very confusing in large datasets. In version 10,
Stata predicts a missing value in those cases, which is, in my opinion, the
proper way to proceed.

Thank you in advance for your consideration,

Best regards,

Marnix


RE: Predict in version 11

Nick Cox
OK. Looking at your example again, I think I can see your point. Why does -predict- think it can go ahead when x1 == 4? The key point is that -x1- is treated as categorical, so there's no information on that category in the data used to fit. The use of a baseline category instead, if that is what is happening, may be a fair default, but statistically it seems arbitrary.

(I was tacitly thinking in regression terms and finding no difficulty in the idea of a prediction for values of the predictors that don't occur elsewhere. But this is ANOVA with a categorical predictor.)
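
To illustrate the contrast, here is a sketch using the example data from earlier in the thread (the variable name `yhat_lin` is made up). With x1 treated as a continuous predictor in -regress-, extrapolating to x1 == 4 raises no difficulty: the three training points lie exactly on the line y = 11 + x1, so the prediction for x1 == 4 should come out as 15. With x1 treated as categorical, there is no such line to extend.

```stata
* x1 as a continuous predictor: a straight line through the training data
regress y x1 if testset == 1
predict yhat_lin

* x1 == 4 is outside the fitted range, yet gets an extrapolated value
list x1 testset y yhat_lin
```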

Nick
[hidden email]

Re: Predict in version 11

Jeff Pitblado, StataCorp LP
Marnix Zoutenbier <[hidden email]> is using -predict- after -anova-
and noticed that Stata 11 will now produce a non-missing value in
out-of-sample observations where a factor variable takes on values not
observed within the estimation sample:

> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?

Short reply:

Except under version control, as noted above by Marnix, there is no option of
-predict- to get it to behave like it did in Stata 10.  As with out-of-sample
predictions involving continuous predictors, Stata 11 relies on the data
analyst to judge which predictions are meaningful or even valid.
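
A sketch of the version-control route, assuming the -version #:- prefix form is acceptable in place of separate -version- commands before and after (the variable name `yhat_v10` is made up). The prefix applies only to the command it precedes, so the rest of a do-file keeps running under the current version. In Marnix's example both commands were run under version 10; whether prefixing -predict- alone would be enough is not established in this thread.

```stata
* Fit and predict under version 10 interpretation, one command at a time
version 10: anova y x1 if testset == 1
version 10: predict yhat_v10

* Back under the current version without any further -version- command
list x1 testset y yhat_v10
```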

Both Neil Shephard <[hidden email]> and Nick Cox <[hidden email]>
point out that -predict- allows -if- and -in- restrictions, giving the data
analyst control over which observations receive predictions.
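
One way to build such an -if- restriction is sketched here, under the assumption that a single factor `x1` is in the model (the names `insample`, `seen` and `yhat_seen` are made up for illustration). The idea is to flag every level of x1 that occurs in the estimation sample, then predict only for observations at flagged levels. Test-set observations at unseen levels are left missing, which is the outcome Marnix described for Stata 10.

```stata
anova y x1 if testset == 1

* 1 for observations used in the fit, 0 otherwise
generate byte insample = e(sample)

* 1 for every observation whose x1 level occurs in the estimation sample
egen byte seen = max(insample), by(x1)

* Predict in and out of sample, but only at levels the model has seen
predict yhat_seen if seen
```

With the five-observation example from earlier in the thread, the test-set observation with x1 == 3 would be predicted and the one with x1 == 4 would stay missing.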

Longer reply:

Prior to Stata 11, -anova- and -manova- were the only estimation commands that
possessed logic to handle categorical variables, but even they had some
limitations we intended to address with the new factor variables notation.
For example, -anova- and -manova- did not allow you to control the base level
or to restrict the levels without generating modified copies of the
factor variables.

The new factor variables notation also replaced and expanded on the features
of the -xi- prefix, which produced indicator variables for categorical
variables and some two-way interactions.

One of our goals for the new factor variables notation was to get all of
Stata's official estimation commands to support categorical variables and
their interactions consistently.  Thus -anova- and -manova- were updated to
possess the same features as their linear-model counterparts, -regress- and
-mvreg-.

The new factor variables notation allows you to specify which levels to
include in a model fit.  Using Marnix's data, let's fit an ANOVA model where
we only care about the effect of x1=1 compared to all the other levels.  In
Stata 11 we simply type

***** BEGIN:
. anova y 1.x1

                           Number of obs =       3     R-squared     =  0.7500
                           Root MSE      = .707107     Adj R-squared =  0.5000

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |         1.5     1         1.5       3.00     0.3333
                         |
                      x1 |         1.5     1         1.5       3.00     0.3333
                         |
                Residual |          .5     1          .5  
              -----------+----------------------------------------------------
                   Total |           2     2           1  

. mat li e(b)

e(b)[1,2]
        1.      
       x1  _cons
y1   -1.5   13.5
***** END:

We see that -anova- used all observations where 'x1' and 'y' were not missing,
fitting an intercept '_cons' and a coefficient on '1.x1'.

        '1.x1' is factor variables notation for an implied variable that
        indicates when 'x1' is equal to 1.

Here are the linear predictions:

***** BEGIN:
. predict yhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +---------------------------+
     | x1   testset    y   yhat1 |
     |---------------------------|
  1. |  1         1   12      12 |
  2. |  2         1   13    13.5 |
  3. |  3         1   14    13.5 |
  4. |  4         2    .       . |
     +---------------------------+
***** END:

Notice that -predict- treated levels 2 and 3 the same, so we get their average
response back as the linear prediction.  This is in accordance with a linear
regression model with a single indicator variable that identifies when 'x1' is
equal to 1.

Here are the commands to reproduce the above using -regress-, but without
factor variables notation:

***** BEGIN:
. gen x1is1 = x1==1

. regress y x1is1

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         1.5     1         1.5           Prob > F      =  0.3333
    Residual |          .5     1          .5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           2     2           1           Root MSE      =  .70711

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       x1is1 |       -1.5   .8660254    -1.73   0.333     -12.5039    9.503896
       _cons |       13.5         .5    27.00   0.024     7.146898     19.8531
------------------------------------------------------------------------------

. predict ryhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +--------------------------------------------+
     | x1   testset    y   yhat1   x1is1   ryhat1 |
     |--------------------------------------------|
  1. |  1         1   12      12       1       12 |
  2. |  2         1   13    13.5       0     13.5 |
  3. |  3         1   14    13.5       0     13.5 |
  4. |  4         2    .       .       0        . |
     +--------------------------------------------+
***** END:

Since we did not use factor variables notation, we can reproduce the result in
Stata 10 or Stata 11; we can even use -anova- instead of -regress-.
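
For comparison, a sketch of the same fit written with factor variables notation in -regress- (Stata 11 or later only). Here `1.x1` plays the role of the hand-made indicator `x1is1`, and the `ib3.` prefix illustrates choosing a base level, one of the controls mentioned earlier in this reply. The variable name `fyhat1` is made up.

```stata
* Same model as -regress y x1is1-, with the indicator implied by 1.x1
regress y 1.x1
predict fyhat1 if e(sample)

* All levels of x1 as indicators, with level 3 as the base
* instead of the default lowest level
regress y ib3.x1
```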

--Jeff --Ken
[hidden email] [hidden email]
Re: Predict in version 11

Marnix Zoutenbier
Dear Jeff, Nick, Neil,

** Short reply:
Thank you very much for your help with respect to -predict- after -anova-
when values of x in the test set are outside the domain of the training set.
I understand the way Stata 11 works and why it was chosen to differ
from Stata 10.

** Some extra background for those who are interested
In our project we were dealing with a training set of 500k observations and a
test set of 50k observations whose measurements were hidden from us.
Our model consisted of many categorical regressors, some of them with
10-20 categories, and also included 2-, 3-, and 4-factor
interactions. We assumed, based on our experience with Stata 10, that
combinations of the regressors in the test set that were not in the
training set would be predicted as missing. The feedback we obtained
in terms of overall RMSE in the test set was much worse than we expected
based on the training-set results. The reason is now clear to us:
-predict- returns the base value when a combination of regressors was not
estimated in the training set. We had not realized that, and it increased
the RMSE in the test set considerably. I am very happy that we found the
reason and are able to fix it.
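
A sketch of a check that would flag this situation, assuming two hypothetical categorical regressors `x1` and `x2`, the `testset` coding used in this thread, and an existing prediction variable `yhat` (the names `intrain` and `cell_seen` are made up). It counts test-set observations whose combination of factor levels never occurs in the training set. This is a cell-level check; for a model with main effects only, or with only some interactions, checking the variables of each term separately may be more appropriate.

```stata
* 1 for training observations, 0 for test observations
generate byte intrain = (testset == 1)

* 1 if this combination of x1 and x2 occurs anywhere in the training set
egen byte cell_seen = max(intrain), by(x1 x2)

* How many test-set observations fall in cells the model never saw?
count if testset == 2 & !cell_seen

* Set those predictions to missing, assuming yhat holds the -predict- result
replace yhat = . if !cell_seen
```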

Thank you very much for your help in this process,

Best regards,

Marnix



______________________

Drs. Marnix Zoutenbier MTD CIRM
Senior Consultant

T: +31 (0)40 750 23 25
F: +31 (0)40 750 16 99
E: [hidden email]

CQM B.V.
PO Box 414, 5600 AK Eindhoven, The Netherlands
Vonderweg 16, 5616 RM Eindhoven, The Netherlands
KvK 17076484
I: www.cqm.nl



From: [hidden email] (Jeff Pitblado, StataCorp LP)
To: [hidden email]
Date: 08-12-2010 20:11
Subject: Re: st: Predict in version 11
Sent by: [hidden email]



Marnix Zoutenbier <[hidden email]> is using -predict- after
-anova-
and noticed that Stata 11 will now produce a non-missing value in
out-of-sample observations where a factor variable takes on values not
observed within the estimation sample:

> I see a difference in the way predict works between Stata10 and 11.
>
> Consider the following example
> x1 testset y
> 1 1 12
> 2 1 13
> 3 1 14
> 4 2 .
>
> And the commands
> anova y x1 if testset==1
> predict yhat
>
> The following is the result in version 11
> x1 testset y yhat
> 1 1 12 12
> 2 1 13 13
> 3 1 14 14
> 4 2 . 12
>
> While in version 10 the following dataset results
> x1 testset y yhat
> 1 1 12 12
> 2 1 13 13
> 3 1 14 14
> 4 2 . .
>
> I prefer the version 10 way-of-working, because it gives me the
opportunity
> to identify observations that are in the testset (testset==2) and not in
> the trainingset (testset==1).
>
> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?
>
> Thank you for your consideration,

Short reply:

Except under version control, as noted above by Marnix, there is no option
of
-predict- to get it to behave like it did in Stata 10.  As with
out-of-sample
predictions involving continuous predictors, Stata 11 relies on the data
analyst to judge which predictions are meaningful or even valid.

Both Neil Shephard <[hidden email]> and Nick Cox
<[hidden email]>
point out that -predict- allows -if- and -in- restrictions, giving the data
analyst the control to identify which observations to compute the
predictions.

Longer reply:

Prior to Stata 11, -anova- and -manova- were the only estimation commands that
possessed logic to handle categorical variables, but even they had some
limitations we intended to address with the new factor variables notation.
For example, -anova- and -manova- did not allow you to control the base level
or restrict the levels without generating modified copies of the factor
variables.
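
To illustrate what controlling the base level and restricting levels look like
in factor variables notation, here is a short sketch.  It assumes the 'auto'
dataset that ships with Stata (loaded with -sysuse-) and its variables 'price'
and 'rep78'; neither comes from this thread, and -regress- is used only to
keep the example simple.

```stata
* Example dataset shipped with Stata (an assumption, not from this thread)
sysuse auto, clear

* Default: the lowest level of rep78 is the base level
regress price i.rep78

* Choose level 3 as the base level instead
regress price ib3.rep78

* Restrict the fit to indicators for levels 2, 3, and 4 only
regress price i(2 3 4).rep78
```

Each specification works directly on 'rep78'; no modified copy of the variable
is generated.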

The new factor variables notation also replaced and expanded on the features
of the -xi- prefix, which produced indicator variables for categorical
variables and some two-way interactions.

One of our goals for the new factor variables notation was to get all of
Stata's official estimation commands to support categorical variables and
their interactions consistently.  Thus -anova- and -manova- were updated to
possess the same features as their linear models counterparts, -regress- and
-mvreg-.

The new factor variables notation allows you to specify which levels to
include in a model fit.  Using Marnix's data, let's fit an ANOVA model where
we only care about the effect of x1=1 compared to all the other levels.  In
Stata 11 we simply type

***** BEGIN:
. anova y 1.x1

                           Number of obs =       3     R-squared     =  0.7500
                           Root MSE      = .707107     Adj R-squared =  0.5000

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |         1.5     1         1.5       3.00     0.3333
                         |
                      x1 |         1.5     1         1.5       3.00     0.3333
                         |
                Residual |          .5     1          .5
              -----------+----------------------------------------------------
                   Total |           2     2           1

. mat li e(b)

e(b)[1,2]
        1.
       x1  _cons
y1   -1.5   13.5
***** END:

We see that -anova- used all observations where 'x1' and 'y' were not missing,
fitting an intercept '_cons' and a coefficient on '1.x1'.

        '1.x1' is factor variables notation for an implied variable that
        indicates when 'x1' is equal to 1.

Here are the linear predictions:

***** BEGIN:
. predict yhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +---------------------------+
     | x1   testset    y   yhat1 |
     |---------------------------|
  1. |  1         1   12      12 |
  2. |  2         1   13    13.5 |
  3. |  3         1   14    13.5 |
  4. |  4         2    .       . |
     +---------------------------+
***** END:

Notice that -predict- treated levels 2 and 3 the same, so we get their average
response back as the linear prediction.  This is in accordance with a linear
regression model with a single indicator variable that identifies when 'x1' is
equal to 1.

Here are the commands to reproduce the above using -regress-, but without
factor variables notation:

***** BEGIN:
. gen x1is1 = x1==1

. regress y x1is1

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         1.5     1         1.5           Prob > F      =  0.3333
    Residual |          .5     1          .5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           2     2           1           Root MSE      =  .70711

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       x1is1 |       -1.5   .8660254    -1.73   0.333     -12.5039    9.503896
       _cons |       13.5         .5    27.00   0.024     7.146898     19.8531
------------------------------------------------------------------------------


. predict ryhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +--------------------------------------------+
     | x1   testset    y   yhat1   x1is1   ryhat1 |
     |--------------------------------------------|
  1. |  1         1   12      12       1       12 |
  2. |  2         1   13    13.5       0     13.5 |
  3. |  3         1   14    13.5       0     13.5 |
  4. |  4         2    .       .       0        . |
     +--------------------------------------------+
***** END:

Since we did not use factor variables notation, we can reproduce the result in
Stata 10 or Stata 11; we can even use -anova- instead of -regress-.
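
A sketch of that last remark, using -anova- on the hand-made indicator rather
than -regress-.  The data are re-entered so the example stands alone, and the
variable name 'ayhat1' is only an illustrative choice:

```stata
* Marnix's data again
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end

* Indicator built by hand, so no factor variables notation is needed
gen x1is1 = x1==1

* -anova- treats x1is1 as a two-level categorical variable
anova y x1is1

* Linear predictions for the estimation sample only
predict ayhat1 if e(sample)

list
```

The predictions for the three estimation-sample observations should match
'yhat1' and 'ryhat1' above.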

--Jeff --Ken
[hidden email] [hidden email]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
