# Lecture 12: Optimization Cont. (SGD and Auto Diff)

Webcast

Slides

• Start from an initial vector (the weights)
• Compute the slope (gradient) at that point
• Compute the update from the gradient (drawn in red on the slides)
• Scale the update by the learning rate
• Converge when the gradient is 0, or stop early (update rule sketched after this list)
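A minimal sketch of the update rule these bullets describe, assuming standard gradient descent with learning rate $\alpha$ (the symbol names are assumptions, not necessarily the lecture's notation):

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta} L\big(\theta^{(t)}\big)$$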
• We can do better! Calculating the gradient is slow
• Computing the gradient on the whole population is expensive!!!
• Sample! The sample is called a batch: the $B$ term (sketched after this list)
• This assumes the loss is decomposable
• Decomposable loss: must be able to be written as a sum of per-observation losses
• PyTorch
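A sketch of what decomposability buys us, under assumed notation: if the loss is a sum over observations, then the gradient on a random batch $B$ is an estimate of the full gradient:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_\theta(x_i)\big), \qquad \nabla_{\theta} L(\theta) \approx \frac{1}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell\big(y_i, f_\theta(x_i)\big)$$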

### Comparing Gradient Descent vs SGD

• SGD is faster per step; its gradient estimate is noisy, but on average it is correct (the mean of the batch gradients is the true gradient), and it converges!!! (Toy comparison sketched below.)
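A toy sketch of that comparison: full-batch GD versus SGD on a made-up least-squares problem. The data, learning rate, and batch size here are all assumptions for illustration, not from the lecture.

```python
# Toy comparison of full-batch GD vs SGD on least squares (NumPy).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the (mini-)batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, steps, batch = 0.1, 200, 32
w_gd, w_sgd = np.zeros(d), np.zeros(d)
for _ in range(steps):
    w_gd -= lr * grad(w_gd, X, y)              # full batch: exact, touches all n rows
    idx = rng.choice(n, size=batch, replace=False)
    w_sgd -= lr * grad(w_sgd, X[idx], y[idx])  # SGD: noisy but much cheaper per step

print("GD: ", w_gd)    # both end up near true_w
print("SGD:", w_sgd)
```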

### PyTorch

• The forward pass records the operations needed to calculate the gradient
• Chain rule over individual calculus operations: the computation graph, a graph of each individual operation
• Backward differentiation uses the chain rule!!
• Can use GPUs, and autodiff (minimal example below)
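A minimal sketch of what that looks like in PyTorch's autograd: the forward pass builds the computation graph, and `.backward()` applies the chain rule through it.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2 * x   # forward pass: graph of individual ops (pow, mul, add)
y.backward()       # backward pass: chain rule through the graph
print(x.grad)      # dy/dx = 2x + 2 = 8.0
```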

## Demo

• Line of best fit and residuals
• Mean squared error loss surface
• $$L^{1}$$ loss surface for comparison!
• Sharp at the minimum (unlike the smooth MSE surface)
• PyTorch nn.Module, can add parameters
• Only need a forward function!
• Run N steps: get the loss, then do loss.backward()
• Wrap the weight update in with torch.no_grad():
• Visualization of SimpleLinearModel: the green model! (Training-loop sketch after this list.)
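A sketch of the demo's training pattern, assuming a SimpleLinearModel along the lines of the one shown in lecture (the exact class body and the toy data here are assumptions):

```python
import torch
import torch.nn as nn

class SimpleLinearModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered parameters: autograd tracks gradients for these.
        self.w = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Only a forward function is needed; backward comes from autograd.
        return self.w * x + self.b

model = SimpleLinearModel()
x = torch.linspace(0, 1, 50)
y = 3 * x + 1 + 0.1 * torch.randn(50)    # toy data (assumption)
lr = 0.1

for step in range(100):                  # N steps
    loss = ((model(x) - y) ** 2).mean()  # MSE loss
    loss.backward()                      # compute gradients
    with torch.no_grad():                # no graph tracking while updating
        for p in model.parameters():
            p -= lr * p.grad
            p.grad.zero_()
```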

### Make it a Polynomial

• nepochs controls how many times we walk through the data, and the loader controls the batch size (sketch below)
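A sketch of the polynomial version with mini-batches. The degree, the toy data, and the class name PolynomialModel are assumptions; nepochs and the loader follow the notes above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PolynomialModel(nn.Module):
    def __init__(self, degree=2):
        super().__init__()
        # One weight per power of x, plus a bias.
        self.linear = nn.Linear(degree, 1)
        self.degree = degree

    def forward(self, x):
        # Build features [x, x^2, ..., x^degree], then combine linearly.
        feats = torch.cat([x**k for k in range(1, self.degree + 1)], dim=1)
        return self.linear(feats)

x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = 2 * x**2 - x + 0.1 * torch.randn_like(x)   # toy data (assumption)

loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)  # batch size
model = PolynomialModel(degree=2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

nepochs = 20                      # how many passes through the data
for epoch in range(nepochs):
    for xb, yb in loader:         # loader yields one batch at a time
        loss = ((model(xb) - yb) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```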