# Lecture 11: Optimization

## From Last Time

- Graph of averages
- Minimizing average loss
- It makes no assumptions about the shape of the data.
- There is always a line of best fit, but it might not be a good fit.

**Anscombe Quartet**: four datasets with the same line of best fit but very different shapes.

### Transformation

- Linear models are easy to interpret.

#### Log transformation

$$ y = a^x, \quad \log(y) = x \log(a) $$

$$ y = ax^k, \quad \log(y) = \log(a) + k \log(x) $$
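As a quick sanity check (a hypothetical NumPy sketch, not from the lecture): taking logs of exponential data produces a straight line whose slope is \( \log(a) \).

```python
import numpy as np

# Hypothetical data from y = a^x with a = 2
x = np.arange(1, 6, dtype=float)
a = 2.0
y = a ** x

# After the log transform, log(y) = x * log(a) is linear in x,
# so a degree-1 fit recovers log(a) as the slope
slope, intercept = np.polyfit(x, np.log(y), deg=1)

print(round(slope, 4))  # log(2) ≈ 0.6931
```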

### Simple Linear Regression: Interpreting the Slope

$$ \text{slope} = r\,\frac{\sigma_y}{\sigma_x} $$

Regression measures association, not causation.

For a slope of 0.09 inches per pound, we say **0.09 is the estimated difference in height between two people whose weights are one pound apart.**
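The slope formula above can be checked numerically; this is a hypothetical sketch with made-up weight/height data, not the lecture's dataset.

```python
import numpy as np

# Made-up weight (lbs) and height (in) data
weight = np.array([120., 140., 155., 170., 190., 210.])
height = np.array([62., 65., 66., 68., 70., 73.])

# slope = r * sigma_y / sigma_x
r = np.corrcoef(weight, height)[0, 1]
slope = r * height.std() / weight.std()

# It agrees with the least-squares slope
ls_slope = np.polyfit(weight, height, deg=1)[0]
print(np.isclose(slope, ls_slope))  # True
```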

## Recap on Modeling

- For engineers, the goal is to make predictions - **accuracy**.
- For scientists, it's **interpretability** - parameters have meaning, as in \( F = ma \).

## Steps for Modeling

### Squared Loss vs. Absolute Loss

\( L^2 \) loss has nice optimization properties (it is differentiable everywhere) but is sensitive to outliers; \( L^1 \) loss is robust to outliers but not differentiable at its minimum.
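A small numerical sketch of the outlier sensitivity (hypothetical data, assuming NumPy): the mean minimizes average squared loss, the median minimizes average absolute loss, and only the mean gets dragged by the outlier.

```python
import numpy as np

# Hypothetical 1-D data with one large outlier
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def avg_sq_loss(theta):   # L2: average squared loss
    return np.mean((data - theta) ** 2)

def avg_abs_loss(theta):  # L1: average absolute loss
    return np.mean(np.abs(data - theta))

# Minimize both losses over a grid of candidate thetas
grid = np.linspace(0.0, 100.0, 10001)
l2_min = grid[np.argmin([avg_sq_loss(t) for t in grid])]
l1_min = grid[np.argmin([avg_abs_loss(t) for t in grid])]

print(round(l2_min, 2), round(l1_min, 2))  # 22.0 (the mean) vs 3.0 (the median)
```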

### Calculus for Loss Minimization

$$ h(x) = f(g(x)) $$

$$ \frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$

Informally, the \( \partial g \) terms cancel out.

Take the derivative of the outside times the derivative of the inside. Repeat!
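A quick numeric check of the chain rule, with a hypothetical choice of \( f = \sin \) and \( g(x) = x^2 \):

```python
import math

# h(x) = f(g(x)) with f = sin, g(x) = x^2
def g(x): return x * x
def h(x): return math.sin(g(x))

# Chain rule: dh/dx = (df/dg) * (dg/dx) = cos(x^2) * 2x
def h_prime(x):
    return math.cos(g(x)) * 2 * x

# Compare against a centered finite difference
x, eps = 1.3, 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(abs(numeric - h_prime(x)) < 1e-6)  # True
```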

### Numerical Optimization

- First order: gradients, the slope of our loss landscape.
- Second order: the Hessian, which captures curvature but is slow to compute.

### Gradients

A loss function \( f \) takes a vector and returns a scalar.

Its gradient \( \nabla f \) takes a vector and returns a vector.
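Concretely (a hypothetical example, assuming NumPy): for \( f(v) = \sum_i v_i^2 \), the gradient is the vector \( \nabla f(v) = 2v \).

```python
import numpy as np

def f(v):
    # vector in, scalar out
    return float(np.sum(v ** 2))

def grad_f(v):
    # vector in, vector out (same shape as v)
    return 2 * v

v = np.array([1.0, -2.0, 3.0])
print(f(v))        # 14.0
print(grad_f(v))   # the vector (2, -4, 6)
```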

Gradient descent is like falling down a well (a 1-dimensional example).

Update the weights \( \theta \) in the direction of the negative gradient; \( \rho(\tau) \) is the learning rate at step \( \tau \):

$$ \theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\, \nabla_\theta L(\theta^{(\tau)}) $$
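The update rule can be sketched for the simplest case (a hypothetical setup: a constant learning rate and average squared loss for a single parameter), where gradient descent converges to the mean of the data:

```python
import numpy as np

# Hypothetical data; L(theta) = mean((y - theta)^2)
y = np.array([2.0, 4.0, 6.0, 8.0])

def gradient(theta):
    # dL/dtheta = -2 * mean(y - theta)
    return -2.0 * np.mean(y - theta)

theta = 0.0   # initial guess
rho = 0.1     # constant learning rate (rho(tau) held fixed for simplicity)
for tau in range(100):
    theta = theta - rho * gradient(theta)

print(round(theta, 6))  # converges to the minimizer of L, the mean 5.0
```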