Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Design matrix: linear trend, sine term, and quadratic term
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
# Observed response: true mean plus Gaussian noise
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1220.
Date:                Thu, 10 May 2018   Prob (F-statistic):           7.83e-44
Time:                        08:37:38   Log-Likelihood:                 6.6982
No. Observations:                  50   AIC:                            -5.396
Df Residuals:                      46   BIC:                             2.252
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8844      0.075     64.948      0.000       4.733       5.036
x1             0.5188      0.012     44.731      0.000       0.495       0.542
x2             0.4916      0.046     10.783      0.000       0.400       0.583
x3            -0.0213      0.001    -20.925      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        4.143   Durbin-Watson:                   2.432
Prob(Omnibus):                  0.126   Jarque-Bera (JB):                3.916
Skew:                          -0.228   Prob(JB):                        0.141
Kurtosis:                       4.293   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.35167955   4.84200381   5.29316687   5.67837496   5.98050408
   6.19491334   6.33020751   6.40682325   6.45367143   6.50338715
   6.58696789   6.72868074   6.94207532   7.22775768   7.5732907
   7.95523748   8.34301193   8.70390402   9.00845394   9.23529221
   9.3746507    9.42996746   9.41732185   9.3627928    9.29817406
   9.25575059   9.26299426   9.33805027   9.48675595   9.70168172
   9.96335174  10.2434437   10.50944217  10.72998073  10.8799954
  10.9448426   10.92270502  10.82489023  10.67397285  10.50008561
  10.33596866  10.2115904   10.1492238   10.15978699  10.24104948
  10.37799856  10.5453059   10.7114889   10.8440832   10.91497648]
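
As a quick sanity check, these in-sample predictions coincide with the fitted values that the results object already stores:

np.allclose(ypred, olsres.fittedvalues)  # should be True: predict(X) reproduces the fitted values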

Create a new sample of explanatory variables, Xnew, then predict and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.8906665   10.73571414  10.46939953  10.13565961   9.79233087
   9.49698893   9.29285206   9.19819979   9.20189748   9.26612246]
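
Point predictions alone don't convey uncertainty. One way to get confidence and prediction intervals for the new observations is the results object's get_prediction method; a minimal sketch:

pred = olsres.get_prediction(Xnew)
# summary_frame holds the point prediction, its standard error, and both the
# mean (confidence) and observation (prediction) intervals at the given alpha
print(pred.summary_frame(alpha=0.05))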

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
# stitch the in-sample and out-of-sample predictions into one continuous curve
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
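
To make the extrapolation region easier to spot, one could also mark where the estimation sample ends (a matplotlib sketch, not part of the original example):

ax.axvline(x1[-1], color='k', linestyle='--', alpha=0.5)  # boundary between in- and out-of-sample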

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the use of the identity transform. That is, we don't want any expansion magic from using **2.
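
To see exactly which columns patsy builds from this formula, one can inspect the design matrix directly; a minimal sketch using patsy, the library behind statsmodels' formula interface:

from patsy import dmatrix
# the column names include the identity-transformed quadratic term
dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)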

In [8]:
res.params
Out[8]:
Intercept           4.884392
x1                  0.518808
np.sin(x1)          0.491636
I((x1 - 5) ** 2)   -0.021309
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.890666
1    10.735714
2    10.469400
3    10.135660
4     9.792331
5     9.496989
6     9.292852
7     9.198200
8     9.201897
9     9.266122
dtype: float64
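
Since the formula-based model fits the same regression as the array-based one above, its out-of-sample predictions should agree with ynewpred; a quick check:

np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)  # should be True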