# Lecture 12: Optimization Cont. (SGD and Auto Diff)

• Gradient descent: compute the slope (the gradient) of the loss at the current point

• The update step (drawn in red on the slide) is computed from that gradient

• Start from an initial vector of weights
• Scale each update by the learning rate
• Converge when the gradient is (near) zero, or stop early
• We can do better! Computing the gradient is slow (a minimal sketch of the loop follows)
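
A minimal sketch of the update loop described above, in plain NumPy; the quadratic example and all names here are illustrative, not from the lecture:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_steps=100, tol=1e-8):
    """Vanilla gradient descent: repeatedly step against the gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        g = grad(theta)                  # compute the slope at the current point
        if np.linalg.norm(g) < tol:      # converged: gradient is ~0
            break
        theta = theta - lr * g           # update, scaled by the learning rate
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_hat = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
print(theta_hat)  # approximately [3.]
```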

• Computing the gradient over the whole population (the full dataset) is expensive!!!

• Sample instead! The sample is called a batch (the $$B$$ term in the update)
• This assumes the loss is decomposable

• Decomposable loss: the total loss must be expressible as a sum of per-example losses
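
In symbols (notation assumed here: $$\theta$$ the weights, $$\alpha$$ the learning rate, $$B$$ the sampled batch):

$$
L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta;\, x_i, y_i)
\qquad\Rightarrow\qquad
\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{1}{|B|}\sum_{i \in B} \nabla_\theta\, \ell\big(\theta^{(t)};\, x_i, y_i\big)
$$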

• PyTorch

### Comparing Gradient Descent vs SGD

• SGD is faster per step and noisier, but the batch gradient is correct on average (an unbiased estimate of the full gradient), so it still converges!!!
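
A minimal mini-batch SGD sketch under the same assumptions; the least-squares example and function names are my own illustration:

```python
import numpy as np

def sgd(per_example_grad, X, y, theta0, lr=0.1, batch_size=32, n_steps=500, seed=0):
    """Mini-batch SGD: each step averages the gradient over a random batch B."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a batch
        g = np.mean([per_example_grad(theta, X[i], y[i]) for i in idx], axis=0)
        theta = theta - lr * g   # noisy step; correct in expectation
    return theta

# Example: least squares. Per-example loss (y_i - theta @ x_i)^2 has
# gradient -2 * (y_i - theta @ x_i) * x_i with respect to theta.
X = np.column_stack([np.ones(200), np.linspace(0, 1, 200)])
y = 2.0 + 3.0 * X[:, 1] + np.random.default_rng(1).normal(0, 0.1, 200)
grad_i = lambda th, xi, yi: -2 * (yi - th @ xi) * xi
print(sgd(grad_i, X, y, theta0=np.zeros(2)))  # approximately [2., 3.]
```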

### PyTorch

• PyTorch records the forward pass, then calculates the gradient in a backward pass
• The chain rule applied to the individual calculus operations, organized as a computation graph

• A graph of each individual operation
• Backward (reverse-mode) differentiation applies the chain rule through that graph!!

• Can use GPUs, and autodiff (a tiny example follows)
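
A tiny autograd example, illustrative rather than from the demo: the forward pass builds the graph, and `backward()` runs the chain rule through it:

```python
import torch

# Build a tiny computation graph: y = x^2 + 3x
x = torch.tensor(2.0, requires_grad=True)  # track operations on x
y = x ** 2 + 3 * x                         # forward pass records the graph

y.backward()   # reverse-mode autodiff: chain rule from output back to inputs
print(x.grad)  # dy/dx = 2x + 3 = 7 at x = 2
```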

## Demo

• Line of best fit and residuals

• Mean squared error (MSE) loss surface: smooth everywhere

• $$L^{1}$$ loss surface for comparison!
• Sharp creases at the bottom: the surface is not differentiable where a residual is exactly zero
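
For the fitted line $$\hat{y} = a x + b$$ (notation mine), the two surfaces in the demo are:

$$
L_{\mathrm{MSE}}(a, b) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (a x_i + b)\big)^2
\qquad
L_{1}(a, b) = \frac{1}{n}\sum_{i=1}^{n} \big|y_i - (a x_i + b)\big|
$$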

• PyTorch nn.Module: subclass it and register learnable parameters (nn.Parameter)
• Only a forward function is needed! (sketch below)
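
A sketch of what such a module might look like; the demo's actual SimpleLinearModel may differ in details:

```python
import torch
import torch.nn as nn

class SimpleLinearModel(nn.Module):
    """y_hat = a * x + b, with a and b as learnable parameters."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.randn(1))  # slope
        self.b = nn.Parameter(torch.randn(1))  # intercept

    def forward(self, x):           # only the forward pass is written;
        return self.a * x + self.b  # autograd derives the backward pass

model = SimpleLinearModel()
print(list(model.parameters()))  # a and b, both with requires_grad=True
```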

• Take N steps: compute the loss, then call loss.backward() to fill in the gradients
• Apply the parameter updates inside with torch.no_grad(): so autograd doesn't track them (loop sketched below)
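
A sketch of that manual training loop, assuming the SimpleLinearModel sketch above; the synthetic data, step count, and learning rate are my own choices:

```python
import torch

# Synthetic data (illustrative): y = 2 + 3x plus noise
x = torch.linspace(0, 1, 100)
y = 2.0 + 3.0 * x + 0.1 * torch.randn(100)

model = SimpleLinearModel()  # from the sketch above
loss_fn = torch.nn.MSELoss()
lr = 0.1

for step in range(500):              # N steps
    loss = loss_fn(model(x), y)      # forward pass: get the loss
    loss.backward()                  # backward pass: compute gradients
    with torch.no_grad():            # don't track the update itself
        for p in model.parameters():
            p -= lr * p.grad         # gradient-descent step
            p.grad.zero_()           # reset gradients for the next step
```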

• Visualization of the fitted SimpleLinearModel

• The fitted model is drawn in green!

### Make it a Polynomial

• nepochs sets how many times we walk through the full dataset; the loader controls the batch size (sketch below)
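
A sketch of a polynomial version trained with a DataLoader; the PolynomialModel class, degree, data, and use of torch.optim.SGD here are my assumptions, not necessarily the demo's code:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

class PolynomialModel(nn.Module):
    """y_hat = w0 + w1*x + ... + wd*x^d for a chosen degree d."""
    def __init__(self, degree=2):
        super().__init__()
        self.w = nn.Parameter(torch.randn(degree + 1))

    def forward(self, x):
        # Stack [1, x, x^2, ...] as features, then take a dot product with w
        powers = torch.stack([x ** k for k in range(len(self.w))], dim=-1)
        return powers @ self.w

# Synthetic data (illustrative): a noisy quadratic
x = torch.linspace(-1, 1, 200)
y = 1 + 2 * x - 3 * x ** 2 + 0.1 * torch.randn(200)

loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)  # batch size
model = PolynomialModel(degree=2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

nepochs = 50                       # full passes through the data
for epoch in range(nepochs):
    for xb, yb in loader:          # the loader yields one batch at a time
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
```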