Grab the library we will need:
from fastai.basics import *
Below you will find the exact imports for everything we use today
import torch
from torch import nn
import numpy as np
import matplotlib.pyplot as plt
from fastai.torch_core import tensor
Stochastic Gradient Descent (SGD):
- Optimization technique (optimizer)
- Commonly used in neural networks
- Example with linear regression
n = 100
Generate our data
x = torch.ones(n,2)
len(x), x[:5]
Randomize the first column with a uniform distribution from -1 to 1 (the second column stays as ones)
x[:,0].uniform_(-1., 1)
x[:5], x.shape
- Any linear model is y = mx + b
- m, x, and b are matrices (tensors)
- We already have x, so let's define m:
m = tensor(3.,2); m, m.shape
b is a random bias:
b = torch.rand(n); b[:5], b.shape
Now we can make our y
- Matrix multiplication is denoted with @
y = x@m + b
We'll know right away if we got a size wrong; for example, flipping the order raises a size-mismatch error:
m@x + b
Plot our results
plt.scatter(x[:,0], y)
As in the last lesson, we want weights that minimize the distance between the points and our line.
- Mean squared error: take the distance between pred and y, square it, then average
def mse(y_hat, y): return ((y_hat-y)**2).mean()
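As a quick sanity check on that definition (the numbers are made up for illustration): predictions of (2, 4) against targets of (1, 3) are each off by 1, so the mean of the squared errors is 1.
mse(tensor(2., 4.), tensor(1., 3.))    # tensor(1.)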
When we run our model, we are trying to predict m. For example, say a = (0.5, 0.75):
- Make a prediction
- Calculate the error
a = tensor(.5, .75)
Make prediction
y_pred = x@a
Calculate error
mse(y_pred, y)
What does that mean? Let's plot it
plt.scatter(x[:,0],y)
plt.scatter(x[:,0],y_pred)
The model doesn't seem to quite fit. What's next? Optimization.
Walking down Gradient Descent
- Goal: Minimize the loss function (mse)
- Gradient descent:
  - Starts with some initial parameters
  - Moves towards new parameters that minimize the function
  - Takes steps in the negative direction of the gradient
- In other words, each step updates a as a = a - lr * gradient, where lr is the learning rate
First let's turn a into a trainable parameter
a = nn.Parameter(a); a
Next let's create an update function that computes the loss for the current a, backpropagates, and moves a a little closer. We'll print the loss every 10 iterations to see how we are doing.
def update():
    y_hat = x@a                      # prediction with the current parameters
    loss = mse(y, y_hat)             # how far off are we?
    if i % 10 == 0: print(loss)      # i comes from the training loop below
    loss.backward()                  # compute the gradient of the loss with respect to a
    with torch.no_grad():
        a.sub_(lr * a.grad)          # step in the negative direction of the gradient
        a.grad.zero_()               # reset the gradient for the next iteration
- torch.no_grad: disables gradient tracking while we update the weights, so the update itself isn't recorded
- sub_: subtracts a value in place (lr * our gradient)
- grad.zero_: zeros our gradients so they don't accumulate between iterations (see the small sketch below)
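To see why that zeroing matters, here is a minimal sketch (the parameter w and the toy expression are made up for illustration, not part of the lesson): PyTorch accumulates gradients across backward() calls unless we clear them.
w = nn.Parameter(tensor(1.))
(w * 2).backward()
print(w.grad)    # tensor(2.)
(w * 2).backward()
print(w.grad)    # tensor(4.) -- accumulated, which is why update() calls a.grad.zero_()
With that in mind, let's pick a learning rate and train: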
lr = 1e-1
for i in range(100): update()
Now let's see how this new a compares.
- detach removes the tensor from the gradient-tracking graph so matplotlib can plot it
plt.scatter(x[:,0],y)
plt.scatter(x[:,0], (x@a).detach())
plt.scatter(x[:,0],y_pred)
We fit our line much better here
from matplotlib import animation, rc
rc('animation', html='jshtml')
Let's redo the process and animate our predictions closing in on y
a = nn.Parameter(tensor(0.5, 0.75)); a
Each frame, we'll set the line's y-data to the current x@a
def animate(i):
    update()                            # take one gradient-descent step
    line.set_ydata((x@a).detach())      # redraw the line with the updated parameters
    return line,
Let's create a base figure
fig = plt.figure()
plt.scatter(x[:,0], y, c='orange')
line, = plt.plot(x[:,0], (x@a).detach())
plt.close()
And animate!
animation.FuncAnimation(fig, animate, np.arange(0,100), interval=20)
Ideally we'd split the data up into batches, fit on one batch at a time, and then work through all of those batches (otherwise we'd run out of memory!).
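As a rough sketch of how that could look for this toy problem (the batch size of 25 and the shuffle/epoch loop are my own choices for illustration, not something defined above), each update would only see a slice of x and y:
bs = 25                                   # mini-batch size (chosen arbitrarily here)
def update_batch(xb, yb):
    y_hat = xb@a
    loss = mse(yb, y_hat)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()

for epoch in range(100):
    idx = torch.randperm(n)               # shuffle the data each epoch
    for s in range(0, n, bs):             # step through it one mini-batch at a time
        batch = idx[s:s+bs]
        update_batch(x[batch], y[batch])
Each step now only needs one batch in memory, and the shuffling is what puts the "stochastic" in stochastic gradient descent.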
If this were a classification problem, we would want to use cross entropy loss, which penalizes confident incorrect predictions as well as unconfident correct ones. It's also called negative log likelihood.
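For reference, here is a minimal sketch of that loss in PyTorch (the logits and targets are made-up numbers, just to show the call): F.cross_entropy combines a log-softmax with the negative log likelihood loss.
import torch.nn.functional as F
logits = tensor([[2.0, 0.5], [0.2, 1.5]])   # raw scores for 2 samples over 2 classes
targets = tensor([0, 1])                    # the correct class for each sample
F.cross_entropy(logits, targets)            # same as F.nll_loss(F.log_softmax(logits, dim=1), targets)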