Dear Jeff, Nick, Neil,
** Short reply:
Thank you very much for your help with predict after anova when values
of x in the testset fall outside the domain of the trainingset.
I now understand how Stata 11 works and why it was chosen to behave
differently from Stata 10.
** Some extra background for those who are interested
In our project we were dealing with a trainingset of 500k observations and a
testset of 50k observations whose measurements were hidden from us.
Our model consisted of many different categorical regressors, some of
them with 10-20 categories, and it also included 2-, 3-, and 4-factor
interactions. We assumed, based on our experience with Stata 10, that
combinations of the regressors in the testset that did not occur in the
trainingset would be predicted as missing values. The feedback we obtained
in terms of overall RMSE in the testset was much worse than we expected
based on the trainingset results. The reason is now clear to us:
when a combination of regressors is not estimated in the trainingset,
predict returns the base value rather than a missing value. We had not
realized this, and it increased the RMSE in the testset considerably.
I am very happy that we found the reason and are able to fix it.
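For readers who want to flag such observations before predicting, a minimal
sketch in Stata 11 syntax follows. The factors a and b, the variables cell
and cellseen, and the toy data are all hypothetical; the idea is to number
every combination of regressors with egen's group() function and to mark the
combinations that occur in the trainingset.
***** BEGIN:
* sketch with made-up data; a, b, cell, and cellseen are hypothetical names
clear
input a b testset y
1 1 1 10
1 1 1 11
1 2 1 12
1 2 1 13
2 2 1 15
2 2 1 16
1 2 2 .
2 1 2 .
end
* cell numbers each observed combination of a and b
egen long cell = group(a b)
* cellseen is 1 when the combination occurs in the trainingset
egen byte cellseen = max(testset==1), by(cell)
anova y a b a#b if testset==1
* predict only for combinations that were present during estimation
predict yhat if cellseen
* the combination a==2, b==1 occurs only in the testset
assert missing(yhat) if a==2 & b==1
count if testset==2 & !cellseen
***** END:
The final count reports how many testset observations have no counterpart in
the trainingset, so they can be handled separately.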
Thank you very much for your help in this process,
Best regards,
Marnix
______________________
Drs. Marnix Zoutenbier MTD CIRM
Senior Consultant
T: +31 (0)40 750 23 25
F: +31 (0)40 750 16 99
E: [hidden email]
CQM B.V.
PO Box 414, 5600 AK Eindhoven, The Netherlands
Vonderweg 16, 5616 RM Eindhoven, The Netherlands
KvK 17076484
I: www.cqm.nl
From: [hidden email] (Jeff Pitblado, StataCorp LP)
To: [hidden email]
Date: 08-12-2010 20:11
Subject: Re: st: Predict in version 11
Sent by: [hidden email]
Marnix Zoutenbier <[hidden email]> is using predict after anova
and noticed that Stata 11 will now produce a nonmissing value in
out-of-sample observations where a factor variable takes on values not
observed within the estimation sample:
> I see a difference in the way predict works between Stata10 and 11.
>
> Consider the following example
> x1 testset y
> 1 1 12
> 2 1 13
> 3 1 14
> 4 2 .
>
> And the commands
> anova y x1 if testset==1
> predict yhat
>
> The following is the result in version 11
> x1 testset y yhat
> 1 1 12 12
> 2 1 13 13
> 3 1 14 14
> 4 2 . 12
>
> While in version 10 the following dataset results
> x1 testset y yhat
> 1 1 12 12
> 2 1 13 13
> 3 1 14 14
> 4 2 . .
>
> I prefer the version 10 way of working, because it gives me the
> opportunity to identify observations that are in the testset (testset==2)
> and not in the trainingset (testset==1).
>
> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?
>
> Thank you for your consideration,
Short reply:
Except under version control, as noted above by Marnix, there is no option
of predict to get it to behave like it did in Stata 10. As with
out-of-sample predictions involving continuous predictors, Stata 11 relies
on the data analyst to judge which predictions are meaningful or even valid.
Both Neil Shephard <[hidden email]> and Nick Cox <[hidden email]> point out
that predict allows if and in restrictions, giving the data analyst
control over which observations predictions are computed for.
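One way to apply such a restriction to Marnix's example is sketched below.
It is only an illustration, not an official recipe: the variable name seen
is made up here, and egen's max() function with the by() option is used to
mark the levels of x1 that occur in the training observations. Without the
restriction, Stata 11 predicts 12 for observation 4, which is the fitted
value at the base level x1==1, because the indicators for levels 2 and 3
are both zero there.
***** BEGIN:
* sketch; 'seen' is a hypothetical variable name
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end
anova y x1 if testset==1
* seen is 1 for every level of x1 that occurs in the trainingset
egen byte seen = max(testset==1), by(x1)
* predictions are computed only where the level of x1 was estimated
predict yhat if seen
assert missing(yhat) if x1==4
list
***** END: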
Longer reply:
Prior to Stata 11, anova and manova were the only estimation commands that
possessed logic to handle categorical variables, but even they had some
limitations we intended to address with the new factor variables notation.
For example, controlling the base level and imposing level restrictions were
not possible with anova and manova without generating modified copies of the
factor variables.
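As an illustration of base-level control, the sketch below uses Marnix's
data with regress and the ib3. operator of the factor variables notation.
Choosing level 3 as the base is an arbitrary assumption made for this
example only.
***** BEGIN:
* sketch; level 3 is an arbitrary choice of base level
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end
* ib3. makes x1==3 the base level, so _cons is the fitted value at x1==3
regress y ib3.x1
matrix list e(b)
assert abs(_b[_cons] - 14) < 1e-6
***** END: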
The new factor variables notation also replaced and expanded on the features
of the xi prefix, which produced indicator variables for categorical
variables and some two-way interactions.
One of our goals for the new factor variables notation was to get all of
Stata's official estimation commands to support categorical variables and
their interactions consistently. Thus anova and manova were updated to
possess the same features as their linear-model counterparts, regress and
mvreg.
The new factor variables notation allows you to specify which levels to
include in a model fit. Using Marnix's data, let's fit an ANOVA model where
we only care about the effect of x1=1 compared to all the other levels. In
Stata 11 we simply type
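***** BEGIN:
. anova y 1.x1

                         Number of obs =       3     R-squared     =  0.7500
                         Root MSE      = .707107     Adj R-squared =  0.5000

                Source |  Partial SS    df       MS           F     Prob > F
            -----------+----------------------------------------------------
                 Model |         1.5     1         1.5       3.00     0.3333
                       |
                    x1 |         1.5     1         1.5       3.00     0.3333
                       |
              Residual |          .5     1          .5
            -----------+----------------------------------------------------
                 Total |           2     2           1

. mat li e(b)

e(b)[1,2]
            1.
           x1  _cons
y1       -1.5   13.5
***** END: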
We see that anova used all observations where 'x1' and 'y' were not missing,
fitting an intercept '_cons' and a coefficient on '1.x1'.
'1.x1' is factor variables notation for an implied variable that
indicates when 'x1' is equal to 1.
Here are the linear predictions:
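***** BEGIN:
. predict yhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +---------------------------+
     | x1   testset    y   yhat1 |
     |---------------------------|
  1. |  1         1   12      12 |
  2. |  2         1   13    13.5 |
  3. |  3         1   14    13.5 |
  4. |  4         2    .       . |
     +---------------------------+
***** END: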
Notice that predict treated levels 2 and 3 the same, so we get their average
response back as the linear prediction. This is in accordance with a linear
regression model with a single indicator variable that identifies when 'x1'
is equal to 1.
Here are the commands to reproduce the above using regress, but without
factor variables notation:
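***** BEGIN:
. gen x1is1 = x1==1

. regress y x1is1

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         1.5     1         1.5           Prob > F      =  0.3333
    Residual |          .5     1          .5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           2     2           1           Root MSE      =  .70711

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       x1is1 |       -1.5   .8660254    -1.73   0.333     -12.5039    9.503896
       _cons |       13.5         .5    27.00   0.024     7.146898     19.8531
------------------------------------------------------------------------------

. predict ryhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +--------------------------------------------+
     | x1   testset    y   yhat1   x1is1   ryhat1 |
     |--------------------------------------------|
  1. |  1         1   12      12       1       12 |
  2. |  2         1   13    13.5       0     13.5 |
  3. |  3         1   14    13.5       0     13.5 |
  4. |  4         2    .       .       0        . |
     +--------------------------------------------+
***** END: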
Since we did not use factor variables notation, we can reproduce the result
in Stata 10 or Stata 11; we can even use anova instead of regress.
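As a sketch of that last remark, the anova version might look like the
following. The name ayhat1 is made up for illustration; anova treats x1is1
as categorical by default, so the fitted values should agree with those
from regress.
***** BEGIN:
* sketch; 'ayhat1' is a hypothetical variable name
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end
* indicator for x1 equal to 1, built without factor variables notation
gen x1is1 = x1==1
anova y x1is1
predict ayhat1 if e(sample)
list
***** END: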
Jeff Ken
[hidden email] [hidden email]
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/