Lecture 17: Pitfalls of Feature Engineering

Jupyter Printout

  • Problems of Feature Engineering
  • Overfitting
import numpy as np
from numpy.linalg import solve

def fit(X, Y):
    # Solve the normal equations (X.T @ X) @ theta = X.T @ Y for theta
    return solve(X.T @ X, X.T @ Y)

def add_ones_column(data):
    # Prepend a column of ones so the model includes an intercept term
    n, _ = data.shape
    return np.hstack([np.ones((n, 1)), data])

# data is a pandas DataFrame (loaded earlier in the notebook)
X = data[['X']].to_numpy()
Y = data[['Y']].to_numpy()
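
As a quick check, these helpers can fit a line directly (a minimal sketch; theta holds the intercept and slope):

theta = fit(add_ones_column(X), Y)
print(theta)  # theta[0] is the intercept, theta[1] the slope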

...

class LinearModel:
    def __init__(self, phi):
        self.phi = phi  # featurization function: maps raw X to a feature matrix

    def fit(self, X, Y):
        # Least squares via the normal equations on the featurized data
        Phi = self.phi(X)
        self.theta_hat = solve(Phi.T @ Phi, Phi.T @ Y)
        return self.theta_hat

    def predict(self, X):
        Phi = self.phi(X)
        return Phi @ self.theta_hat

    def loss(self, X, Y):
        # Mean squared error on the given data
        return np.mean((Y - self.predict(X))**2)
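
The definition of phi_line is elided in this printout; a stand-in consistent with the helpers above (an assumption, not the original definition) is the featurization that just prepends the intercept column:

def phi_line(X):
    # Assumed definition (elided above): a line only needs features [1, x]
    return add_ones_column(X)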

model_line = LinearModel(phi_line)
model_line.fit(X, Y)
model_line.loss(X, Y)

Redundant Features

If you duplicate a feature (copy an existing column into the feature matrix), solve raises a singular matrix error (numpy.linalg.LinAlgError).

Phi is not full rank: the duplicated column is redundant, the columns are linearly dependent, and Phi.T @ Phi is not invertible, so the normal equations have no unique solution.
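
A minimal sketch of the failure, using the LinearModel class above (phi_duplicate is a hypothetical name):

from numpy.linalg import LinAlgError

def phi_duplicate(X):
    # Features [1, x, x]: the last two columns are identical, so Phi is rank deficient
    return np.hstack([add_ones_column(X), X])

model_dup = LinearModel(phi_duplicate)
try:
    model_dup.fit(X, Y)
except LinAlgError as err:
    print(err)  # Singular matrix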

Too Many Features

With more features than data points, the least-squares solution is underdetermined: infinitely many parameter vectors fit the training data equally well!

  • Example: radial basis function (RBF) features
  • Combine the RBF features with the linear features
  • Fit 20 bumps on only 9 data points?
    • Phi has rank at most 9, so Phi.T @ Phi is singular (see the sketch below)
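
A sketch of the rank problem with Gaussian bumps (gaussian_rbf, phi_rbf, and the width value are illustrative choices, not from the lecture):

def gaussian_rbf(u, center, width=1.0):
    # One Gaussian "bump" feature centered at `center`
    return np.exp(-((u - center) / width) ** 2)

def phi_rbf(X, centers):
    # One bump per center, stacked as feature columns
    return np.hstack([gaussian_rbf(X, c) for c in centers])

centers = np.linspace(X.min(), X.max(), 20)  # 20 bumps
Phi = phi_rbf(X, centers)                    # shape (9, 20) with 9 data points
print(np.linalg.matrix_rank(Phi))            # at most 9, so Phi.T @ Phi is singular

Fitting this with solve fails for the same singular-matrix reason; np.linalg.lstsq(Phi, Y, rcond=None) would still return an answer, but only by picking one (the minimum-norm one) of the infinitely many solutions.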

Overfitting

  • Evaluate on held-out test data points, not just the training data

  • As model complexity grows, training error keeps decreasing, but test error is terrible!

  • This is the bias-variance tradeoff: the best model complexity is the point that balances the two (see the sketch below)
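
A sketch of the symptom, reusing phi_rbf from above and assuming held-out arrays X_test, Y_test (hypothetical names for points never used in fitting):

model_bumps = LinearModel(lambda X_: phi_rbf(X_, np.linspace(X.min(), X.max(), 8)))
model_bumps.fit(X, Y)
print("train MSE:", model_bumps.loss(X, Y))            # tiny: 8 bumps nearly interpolate 9 points
print("test MSE: ", model_bumps.loss(X_test, Y_test))  # typically much larger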