# Lecture 17: Pitfalls of Feature Engineering

Jupyter Printout

• Problems of Feature Engineering
• Overfitting
```python
import numpy as np
from numpy.linalg import solve

def fit(X, Y):
    # Solve the normal equations: (X.T @ X) theta = X.T @ Y
    return solve(X.T @ X, X.T @ Y)

def phi_line(data):
    # Feature map for a line: prepend a column of ones for the intercept.
    n, _ = data.shape
    return np.hstack([np.ones((n, 1)), data])
```

```python
X = data[['X']].to_numpy()
Y = data[['Y']].to_numpy()
```

...

```python
class LinearModel:
    def __init__(self, phi):
        # phi is the feature map applied to the raw inputs.
        self.phi = phi

    def fit(self, X, Y):
        # Least squares in feature space via the normal equations.
        Phi = self.phi(X)
        self.theta_hat = solve(Phi.T @ Phi, Phi.T @ Y)
        return self.theta_hat

    def predict(self, X):
        Phi = self.phi(X)
        return Phi @ self.theta_hat

    def loss(self, X, Y):
        # Mean squared error.
        return np.mean((Y - self.predict(X))**2)
```

```python
model_line = LinearModel(phi_line)
model_line.fit(X, Y)
model_line.loss(X, Y)
```


## Redundant Features

If you duplicate a feature (include the same column twice), `solve` fails with a singular matrix error (`LinAlgError`).

The matrix Φᵀ Φ isn't full rank: the duplicated feature makes the columns of Φ linearly dependent, so the normal equations have no unique solution.
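A minimal sketch of this failure on synthetic data (the names and numbers here are illustrative, not the lecture's exact setup): duplicating a column makes Φᵀ Φ singular, and `solve` raises `LinAlgError`.

```python
import numpy as np
from numpy.linalg import solve, LinAlgError

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1))
Y = 2 * X + rng.normal(scale=0.1, size=(10, 1))

# Duplicating the feature column makes Phi's columns linearly dependent,
# so Phi.T @ Phi is singular.
Phi = np.hstack([np.ones((10, 1)), X, X])

try:
    theta = solve(Phi.T @ Phi, Phi.T @ Y)
except LinAlgError as err:
    print("solve failed:", err)
```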

## Too Many Features

With more features than data points, the least-squares solution is underdetermined: infinitely many parameter vectors fit the training data equally well.

• Using RBF (radial basis function) features
• Fitting 20 bumps to only 9 data points?
• Φᵀ Φ is a 20×20 matrix of rank at most 9, so it is singular
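A quick numerical check of that rank claim, with hypothetical RBF centers and bandwidth (not the lecture's exact values): with 20 Gaussian bumps evaluated at 9 data points, the feature matrix is 9×20 and can have rank at most 9.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(9, 1))   # 9 data points
centers = np.linspace(0, 10, 20)      # 20 RBF centers (illustrative choice)

def phi_rbf(x, centers, width=1.0):
    # One Gaussian bump per center: exp(-(x - c)^2 / width^2)
    return np.exp(-((x - centers) ** 2) / width**2)

Phi = phi_rbf(x, centers)
print(Phi.shape)                           # (9, 20)
print(np.linalg.matrix_rank(Phi))          # at most 9
print(np.linalg.matrix_rank(Phi.T @ Phi))  # 20x20 matrix, but rank at most 9: singular
```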

## Overfitting

• Hold out test data points that are never used during fitting.

• As we add features, training error keeps decreasing, but test error becomes terrible!

• This is the bias-variance tradeoff: the best model complexity sits between underfitting and overfitting.
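A sketch of the train/test comparison using polynomial features on synthetic data (the data generator and degrees are my own choices, and I use `lstsq` rather than `solve` to sidestep the poor conditioning of high-degree Vandermonde matrices):

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of a smooth function (illustrative choice).
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(100)

results = {}
for degree in [1, 3, 9, 14]:
    # Polynomial feature map: columns x^degree, ..., x, 1.
    Phi_train = np.vander(x_train, degree + 1)
    Phi_test = np.vander(x_test, degree + 1)
    theta, *_ = lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((y_train - Phi_train @ theta) ** 2)
    test_mse = np.mean((y_test - Phi_test @ theta) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train {train_mse:.4f}  test {test_mse:.4f}")
```

Training error shrinks monotonically as the degree grows (degree 14 can interpolate all 15 training points), while test error eventually blows up: the overfitting picture in numbers.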