Lecture 12: Optimization Cont. (SGD and Auto Diff)

Gradient Descent Algorithm

  • Start from an initial weight vector.
  • Compute the slope (gradient) of the loss at the current weights.
  • Take an update step calculated from the gradient (the red vector on the slide), scaled by the learning rate.
  • Stop when converged (the gradient is approximately zero), or stop early.
  • We can do better! Calculating the gradient is slow.

  • Computing the gradient over the entire population of data is expensive!!!
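
In symbols, one standard way to write the update (the notation \( \theta \) for the weights, \( \alpha \) for the learning rate, and \( \ell \) for the per-example loss is an assumption, not necessarily the lecture's):

\[
\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta} L\big(\theta^{(t)}\big),
\qquad
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(\theta;\, x_i, y_i\big)
\]

Each step needs the gradient of \( L \), i.e. a sum over all \( n \) data points.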

Stochastic Gradient Descent

  • Sample! Use a random subset of the data, called a batch (the \( B \) term in the update).
  • This assumes the loss is decomposable.

  • Decomposable loss: the overall loss must be writable as a sum (or average) of per-example losses.

  • Variants: momentum, Adam.
  • All available as optimizers in PyTorch.
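
With a decomposable loss, a hedged sketch of the SGD update (same assumed notation as above, with \( B \) a randomly sampled batch):

\[
\theta^{(t+1)} = \theta^{(t)} - \alpha \, \frac{1}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell\big(\theta^{(t)};\, x_i, y_i\big)
\]

The batch average of per-example gradients stands in for the full-data gradient, which is exactly why decomposability matters.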

Comparing Gradient Descent vs SGD

  • Each SGD step is much cheaper, but on average the batch gradient matches the full gradient (the mean is correct), and it converges!!!
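
A minimal numerical sketch of that claim, using made-up data and a plain squared-error loss (none of this is from the lecture): averaging many random-batch gradients lands close to the full-data gradient.

```python
import torch

torch.manual_seed(0)

# Toy data: y = 3x + noise (made up for illustration)
x = torch.randn(1000)
y = 3 * x + 0.1 * torch.randn(1000)

def full_gradient(w):
    # d/dw of the mean squared error (1/n) * sum_i (w*x_i - y_i)^2
    return (2 * (w * x - y) * x).mean()

def batch_gradient(w, batch_size=32):
    # Same derivative, but averaged over a random batch
    idx = torch.randint(0, len(x), (batch_size,))
    return (2 * (w * x[idx] - y[idx]) * x[idx]).mean()

w = torch.tensor(0.0)
print("full gradient:     ", full_gradient(w).item())
print("avg batch gradient:", torch.stack([batch_gradient(w) for _ in range(2000)]).mean().item())
```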

PyTorch

  • PyTorch records the forward pass and uses it to calculate gradients.
    • Each individual calculus operation, chained together by the chain rule, forms a computation graph.

  • The graph has a node for each individual operation.
  • Backward differentiation walks that graph in reverse, applying the chain rule!!

  • Computation can run on GPUs, and gradients come from automatic differentiation (autodiff).
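
A tiny autograd sketch (toy numbers, not from the lecture) showing the forward pass building the graph and `.backward()` applying the chain rule:

```python
import torch

# Leaf tensors with gradient tracking: every operation on them is
# recorded in a computation graph during the forward pass.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

# Forward pass: y = (w * x)^2
y = (w * x) ** 2

# Backward pass: autograd walks the graph in reverse with the chain rule.
y.backward()

print(x.grad)  # dy/dx = 2 * w^2 * x = 36
print(w.grad)  # dy/dw = 2 * w * x^2 = 24
```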

Demo

  • Line of best fit and residuals

  • Mean Square Error Loss Surface

  • \( L^{1} \) loss surface for comparison!
  • Sharp (non-differentiable) at the minimum, unlike the smooth MSE surface.

  • PyTorch nn.Module: subclass it and register learnable parameters.
    • Only need to define a forward() function!
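
A minimal sketch of such a module (the name SimpleLinearModel matches the demo, but the exact initialization here is an assumption):

```python
import torch
import torch.nn as nn

class SimpleLinearModel(nn.Module):
    """y_hat = w * x + b, with w and b as learnable parameters."""

    def __init__(self):
        super().__init__()
        # nn.Parameter registers these tensors so the module (and any
        # optimizer) can find them through .parameters().
        self.w = nn.Parameter(torch.randn(1))
        self.b = nn.Parameter(torch.randn(1))

    def forward(self, x):
        # Only forward() is required; backward is handled by autograd.
        return self.w * x + self.b
```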

Implement Basic Gradient Descent

  • Loop for N steps: compute the loss, then call loss.backward() (sketch after this list).
    • Do the parameter update inside with torch.no_grad():

  • Visualize the fitted SimpleLinearModel.

  • The fitted model is drawn in green!
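
A minimal sketch of that training loop, assuming the SimpleLinearModel above, made-up data, and an MSE loss:

```python
import torch

torch.manual_seed(0)
x = torch.randn(100)
y = 3 * x + 1 + 0.1 * torch.randn(100)    # made-up data

model = SimpleLinearModel()               # from the sketch above
loss_fn = torch.nn.MSELoss()
lr = 0.1

for step in range(200):                   # N steps
    loss = loss_fn(model(x), y)           # forward pass: compute the loss
    loss.backward()                       # backward pass: fill in .grad

    with torch.no_grad():                 # don't record the update itself
        for p in model.parameters():
            p -= lr * p.grad
            p.grad.zero_()                # reset gradients for the next step

print(model.w.item(), model.b.item(), loss.item())
```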

Make it a Polynomial
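
One way to extend the model to a polynomial (a hedged sketch; the class name, degree, and feature construction are assumptions):

```python
import torch
import torch.nn as nn

class PolynomialModel(nn.Module):
    """y_hat = sum_k w_k * x^k for k = 0..degree."""

    def __init__(self, degree=3):
        super().__init__()
        self.w = nn.Parameter(torch.randn(degree + 1))

    def forward(self, x):
        # Features [1, x, x^2, ..., x^degree], then a weighted sum.
        powers = torch.stack([x ** k for k in range(len(self.w))], dim=-1)
        return powers @ self.w
```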

Implement Stochastic Gradient Descent

  • nepochs controls how many passes (epochs) over the data; the DataLoader controls the batch size (see the sketch at the end of this section).

  • Overfitting! The flexible polynomial model starts fitting the noise.
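
A minimal SGD sketch (made-up data; it assumes the PolynomialModel above and uses torch.optim.SGD rather than a hand-written update):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
x = torch.randn(200)
y = x ** 3 - x + 0.3 * torch.randn(200)    # made-up data

loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)  # batch size
model = PolynomialModel(degree=3)           # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

nepochs = 50                                # passes over the data
for epoch in range(nepochs):
    for xb, yb in loader:                   # each iteration yields one batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```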