Lecture 11: Optimization

Slides Link

From Last Time

  • Graph of averages
  • Minimizing average loss
  • Minimizing average loss makes no assumption about the shape of the data.
  • There is always a line of best fit, but it may not be a good fit (see the Anscombe Quartet).
  • Linear models are easy to interpret.

Log transformation

$$ y=a^x,\quad \log(y)=x\log(a) $$ $$ y=ax^k,\quad \log(y)=\log(a)+k\log(x) $$
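The second identity means a power law becomes a straight line on log-log axes, so an ordinary linear fit on the transformed data recovers the exponent. A minimal sketch with hypothetical data (the values `a = 2`, `k = 3` are made up for illustration):

```python
import numpy as np

# Hypothetical data following a power law y = a * x**k with a=2, k=3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x ** 3

# Taking logs linearizes the relationship: log(y) = log(a) + k*log(x),
# so a straight-line fit on (log x, log y) recovers k as the slope
# and log(a) as the intercept.
k, log_a = np.polyfit(np.log(x), np.log(y), deg=1)

print(k)            # slope, should recover k = 3
print(np.exp(log_a))  # back-transformed intercept, should recover a = 2
```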

Simple Linear Regression: Interpreting the Slope

$$ \text{slope} = r\,\frac{\sigma_y}{\sigma_x} $$

Regression describes association, not causation.

For a slope of 0.09 inches per pound, we say 0.09 is the estimated difference in height between two people whose weights are one pound apart.
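The slope formula above can be checked numerically: computing \( r\,\sigma_y/\sigma_x \) gives the same slope as a direct least-squares fit. The weight/height numbers below are made-up illustration data:

```python
import numpy as np

# Hypothetical weight (lb) and height (in) data.
weight = np.array([120.0, 140.0, 155.0, 170.0, 185.0, 200.0])
height = np.array([62.0, 64.0, 66.0, 67.0, 69.0, 71.0])

r = np.corrcoef(weight, height)[0, 1]
slope = r * height.std() / weight.std()   # slope = r * sigma_y / sigma_x

# The same slope falls out of an ordinary least-squares fit.
slope_lstsq = np.polyfit(weight, height, deg=1)[0]
print(slope, slope_lstsq)  # the two slopes agree
```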

Recap on Modeling

  • For engineers, the goal is making predictions (accuracy).
  • For scientists, the goal is interpretability.
    • Parameters like \( F = ma \)

Steps for Modeling

Squared Loss vs. Absolute Loss

Squared loss (\( L^2 \)) has nice optimization properties (it is differentiable everywhere) but is sensitive to outliers; absolute loss (\( L^1 \)) is robust to outliers but not differentiable at 0.
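The outlier sensitivity is easy to see for a constant model: the minimizer of average squared loss is the mean (pulled toward outliers), while the minimizer of average absolute loss is the median. A small grid-search sketch over made-up data with one outlier:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

# Evaluate each average loss over a grid of candidate constants theta.
thetas = np.linspace(0.0, 110.0, 11001)
sq_loss = np.array([np.mean((data - t) ** 2) for t in thetas])
abs_loss = np.array([np.mean(np.abs(data - t)) for t in thetas])

best_sq = thetas[np.argmin(sq_loss)]    # minimizer of squared loss
best_abs = thetas[np.argmin(abs_loss)]  # minimizer of absolute loss

print(best_sq)   # near the mean (22.0): squared loss is pulled by the outlier
print(best_abs)  # near the median (3.0): absolute loss is robust
```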

Calculus for Loss Minimization

$$ h(x)=f(g(x)) $$

$$ \frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$

The \( \partial g \) terms cancel, as if the derivatives were fractions.

Derivative of outside times derivative of inside. Repeat!
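The rule can be sanity-checked numerically. A sketch with a hypothetical composition \( f(u) = \sin(u) \), \( g(x) = x^2 \), comparing the chain-rule derivative against a central finite difference:

```python
import math

# Hypothetical composition: h(x) = f(g(x)) with f(u) = sin(u), g(x) = x**2.
g = lambda x: x ** 2
h = lambda x: math.sin(g(x))

def h_prime(x):
    # Chain rule: derivative of the outside, f'(g(x)) = cos(x**2),
    # times derivative of the inside, g'(x) = 2x.
    return math.cos(g(x)) * 2 * x

# Central finite-difference approximation of h'(x0).
x0, eps = 1.3, 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
print(abs(h_prime(x0) - numeric))  # tiny: the two derivatives agree
```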

Numerical Optimization

  • First order: gradients, the slope of our loss landscape.
  • Second order: the Hessian; computing it is slow.


The loss function \( f \) takes a vector and returns a scalar.

The gradient \( \nabla f \) takes that same vector and returns a vector (of partial derivatives).

Gradient descent is like falling down a well (a 1-dimensional example).

Update the weights \( \theta \): \( \theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\,\nabla_\theta f(\theta^{(\tau)}) \), where \( \rho(\tau) \) is the learning rate.
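The update rule can be sketched in a few lines for a 1-D loss. The quadratic \( f(\theta) = (\theta - 4)^2 \) and the constant learning rate are illustrative choices (the notes allow \( \rho \) to vary with the step \( \tau \)):

```python
# Minimal gradient descent sketch on f(theta) = (theta - 4)**2,
# whose minimum is at theta = 4.

def grad(theta):
    return 2 * (theta - 4)  # f'(theta)

theta = 0.0
rho = 0.1                   # constant learning rate rho(tau) = 0.1
for tau in range(100):
    theta = theta - rho * grad(theta)  # theta <- theta - rho * gradient

print(theta)  # converges toward 4.0
```

With a fixed step size the error shrinks by a factor of \( |1 - 2\rho| = 0.8 \) each iteration, which is why 100 steps land essentially at the minimum.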