Introduction

Link

Taught by Ani Adhikari and Joey Gonzalez

Lectures

Lecture 11: Optimization

Slides Link

From Last Time

  • Graph of averages
  • Minimizing average loss
  • It makes no assumption about the shape of the data.
  • There is always a line of best fit, but it might not be appropriate (see Anscombe's Quartet).

Transformation

  • Linear easy to interpret

Log transformation

$$ y=a^x, \quad \log(y)=x\log(a) $$ $$ y=ax^k, \quad \log(y)=\log(a)+k\log(x) $$

Simple Linear Regression: Interpreting the Slope

$$ slope = r~\frac{\sigma_y}{\sigma_x} $$

Regression is associative. Not causation.

For a slope of 0.09 inches per pound, we say 0.09 is the estimated difference in height between two people whose weights are one pound apart.
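
As a quick arithmetic check (the numbers here are assumed for illustration, not taken from the lecture): with \( r = 0.6 \), \( \sigma_y = 3 \) inches, and \( \sigma_x = 20 \) pounds,

$$ slope = 0.6 \times \frac{3}{20} = 0.09 \text{ inches per pound} $$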

Recap on Modeling

  • For engineers the goal is making predictions: accuracy.
  • For scientists it's interpretability.
    • Parameters like \( F = ma \)

Steps for Modeling

Squared Loss vs. Absolute Loss

\( L^2 \) has nice optimization properties (it is differentiable) but is sensitive to outliers.

Calculus for Loss Minimization

$$ h(x)=f(g(x)) $$

$$ \frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$

The \( \partial g \) terms cancel out.

Derivative of outside times derivative of inside. Repeat!
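
A quick sketch on a concrete composition (function chosen for illustration): with \( h(x) = \sin(x^2) \), the inside is \( g(x) = x^2 \), the outside is \( f(u) = \sin(u) \), so \( h'(x) = \cos(x^2) \cdot 2x \). A finite-difference check in Python:

import numpy as np

x = 1.5
analytic = np.cos(x**2) * 2 * x   # derivative of outside times derivative of inside
eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)   # central difference
print(analytic, numeric)          # the two agree to several decimal places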

Numerical Optimization

  • First Order: Gradients, the slope of our loss landscape.
  • Second order is Hessian, computation is slow.

Gradients

The loss function \( f \) takes a vector (the parameters) and returns a scalar.

The gradient of that scalar-valued function is a vector: one partial derivative per parameter.

Like falling down a well. 1 Dimension example.

Update the weights \( \theta \). \( \rho(\tau) \) is the learning rate.

Lecture 12: Optimization Cont. (SGD and Auto Diff)

Webcast

Slides

Gradient Descent Algorithm

  • Compute slope at that point

  • Update step (the step computed from the gradient is shown in red)

  • Initial vector (weights?)
  • Update by learning rate
  • Converge, gradient is 0 or stop early
  • We can do better! Calculating the gradient is slow

  • Computing the gradient over the entire dataset is expensive!!!
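
A minimal sketch of the update rule described above, assuming a generic gradient function and learning rate (the names here are illustrative, not the lecture's code):

import numpy as np

def gradient_descent(grad_fn, theta0, rho=0.1, n_steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - rho * grad_fn(theta)   # update the weights by the learning rate
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0]))   # converges near 3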

Stochastic Gradient Descent

  • Sample a subset of the data, called a batch (the \( B \) term)
  • Assumes the loss is decomposable

  • Decomposable loss: the total loss can be written as a sum of per-datapoint losses

  • Momentum, ADAM
  • PyTorch

Comparing Gradient Descent vs SGD

  • Each SGD step is much faster but noisier; the batch gradient is correct on average (unbiased), so it still converges!!!
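
A hedged sketch of the batching idea for a decomposable loss; the helper names are assumptions, not the lecture's code:

import numpy as np

def sgd(grad_fn, X, Y, theta0, rho=0.1, batch_size=32, n_steps=1000):
    """grad_fn(theta, Xb, Yb) returns the average gradient on one batch."""
    theta = np.array(theta0, dtype=float)
    n = len(X)
    for _ in range(n_steps):
        idx = np.random.choice(n, size=batch_size, replace=False)   # sample a batch B
        theta = theta - rho * grad_fn(theta, X[idx], Y[idx])         # noisy but unbiased step
    return theta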

PyTorch

  • The forward pass records each individual operation in a computation graph
    • the gradient is then computed by applying the chain rule to those individual operations

  • Graph of each individual operation
  • Backward Differentiation uses the chain rule!!

  • Can use GPUs, and auto_diff
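
A small example of the forward pass building a computation graph and backward differentiation applying the chain rule (values chosen for illustration):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x ** 2)   # forward pass: each operation is recorded in the computation graph
y.backward()            # backward pass: gradients flow through the graph via the chain rule
print(x.grad)           # equals cos(x^2) * 2x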

Demo

  • Line of best fit and residuals

  • Mean Square Error Loss Surface

  • \( L^{1} \) loss surface for comparison!
  • Sharp (non-differentiable) at the minimum

  • PyTorch nn.Module, can add parameters
    • Only need a forward function!

Implement Basic Gradient Descent

  • N steps: compute the loss, call loss.backward()
    • update inside with torch.no_grad(): (see the sketch after this list)

  • Visualization SimpleLinearModel

  • Green model!
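
A hedged sketch of that loop, assuming a model with a forward function and mean squared error loss (the names are placeholders, not the demo's exact code):

import torch

def train(model, X, Y, lr=0.01, n_steps=500):
    for _ in range(n_steps):
        loss = torch.mean((model(X) - Y) ** 2)   # get the loss
        loss.backward()                          # compute gradients with autodiff
        with torch.no_grad():                    # update without tracking gradients
            for p in model.parameters():
                p -= lr * p.grad
                p.grad.zero_()                   # reset gradients for the next step
    return model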

Make it a Polynomial

Implement Stochastic Gradient Descent

  • nepochs is how many times we walk through the data; the loader controls the batch size (see the sketch after this list)

  • Overfitting LMAO
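
A minimal sketch of the epoch/loader structure, assuming feature and label tensors and a model like the one above (not the notebook's exact code):

import torch
from torch.utils.data import TensorDataset, DataLoader

def train_sgd(model, X, Y, lr=0.01, nepochs=20, batch_size=32):
    loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(nepochs):            # walk through the data nepochs times
        for xb, yb in loader:           # the loader yields one batch at a time
            opt.zero_grad()
            loss = torch.mean((model(xb) - yb) ** 2)
            loss.backward()
            opt.step()
    return model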

Lecture 13: Review Modeling and Optimization, Intro to Regression

Human Contexts and Ethics

Imagine you are a Data Scientist on Twitter's "Trust and Safety" team.

  1. Question/Prob Formation
    • Fake News is a problem
    • Doesn't have to be an Engineering-Focused problem!
  2. Data Acquisition and Cleaning
    • What data do we have and need to collect
    • President's Tweet
  3. Exploratory Data Analysis
    • Example classify tweets as healthy or unhealthy
    • Think about the context of your problem
    • Note biases and anomalies
  4. Predictions and Inference
    • What is the story, social good
    • Think of who is listening what kind of power do you have

Think about your social context.

Review Modeling and Optimization

Models

  • A model is a function \( f \) that maps from \( X \) to \( Y \).

  • Parametric Models
    • Have parameters, often represented as a vector
    • Linear Models

  • Non-Parametric Models?
    • Nearest Neighbor
    • copy the prediction from the closest datapoint
    • Really big! Grows with the size of the data

  • Kernel Density Estimator has a param, but it's more like a hyper param

Tradeoffs in Modeling

  • Can predict midterm grades from homeworks
    • Simple model interpretable, summarize data
    • Complex model

Loss Functions

  • Loss measures how close our model's prediction is to the actual value

  • Average Loss

  • Solve it with optimization, find the \( \theta \) (param) that min loss

  • \( f_\theta(x) \) is our model. \( L(\theta) \) is the loss func

  • F.l1_loss (torch.nn.functional.l1_loss) is an equivalent built-in
  • Keep everything as tensors so autograd can track gradients

  • When building a model, do class ExponentialModel(nn.Module)
  • Weights self.w = nn.Parameter(torch.ones(2, 1))
    • The initial weights are a 2x1 tensor of ones, [1, 1]
  • forward is how to make a prediction
  • to evaluate:
m = ExponentialModel()
m(0) # returns tensor of 2
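
A sketch of that nn.Module pattern; the forward expression used here (\( w_0 e^{w_1 x} \)) is an assumption for illustration and may not match the demo's exact model:

import torch
import torch.nn as nn

class ExponentialModel(nn.Module):
    def __init__(self):
        super().__init__()
        # initial weights: a 2x1 tensor of ones
        self.w = nn.Parameter(torch.ones(2, 1))

    def forward(self, x):
        # forward defines how to make a prediction (exponential form assumed here)
        return self.w[0] * torch.exp(self.w[1] * x)

m = ExponentialModel()
print(m(torch.tensor(0.0)))   # evaluate the model at x = 0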

  • In the 3d plot, have w0, w1, and loss. Find point that minimizes loss.

  • Example of the orange vs. yellow line and its location on the loss landscape

Optimization of the Model

  • You know your loss, compute the gradient (how to improve our loss)
    • grad: the scalar loss gives a vector; each index is the derivative with respect to one parameter

  • Take deriv and evaluate it

  • Auto Diff reuses gradients

Lecture 14: Review Models and Loss

  • Response var you want to estimate

  • Model, summarizes with parameter w

  • w is an estimator

Loss

  • \( L(w)=\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, w) \)
  • a sum of the loss at each \( y_i \)

  • We compare the red value vs the purple value. The green value is the best w, it minimizes loss.

Minimizing Loss

  • L(w*) ?

  • What is your 1. data, 2. model, 3. parameters, 4. loss, (also optimization method)

  • From 1-d (best avg) to 2-d, best func

  • \( w^* \) is the \( w \) that minimizes \( L(w) \). The best estimator is \( \hat{y}(w^*) \)

  • Can generalize our optimization to 3d! (3 weight params)
  • Can't plot the loss surface for 3 weights (it would need 4 dimensions: 3 weight dimensions plus the loss dimension)

  • One option is calculus set deriv of loss to 0
  • Other is grad descent/SGD

Gradient Descent

  • Can actually brute force it: grid the weights with np.linspace and try every combination, which is \( O(N^2) \) in the grid resolution for two weights

  • Find optimal weights of our sin model

  • Derivatives

  • Gradients point uphill; walk in the opposite direction

  • The gradient descent algorithm visualized

Lecture 15: Least Squares Linear Regression

  • Linear Model

  • Vector form \( x ^ T \theta \)

  • Matrix is for each prediction

  • note the column of ones on the left, and \( \hat{Y} = X \theta \)

  • Linear model of carat, depth, and table
# adds one to left. Horizontal stack
X = np.hstack([np.ones((n,1)), data[['carat', 'depth', 'table']].to_numpy()])
X

def linear_model(theta, xt): 
    return xt @ theta # The @ symbol is matrix multiply

Least Squares Linear Regression

  • Loss
    • By Calculus or Geometric Reasoning

  • The predictions live in the span of the columns of \( X \)
    • Make the residual perpendicular (orthogonal) to that span

  • Find the \( \theta \) that minimizes the residual
    • Orthogonality gives the normal equations
    • \( X^T X \) is a square matrix; the solution is unique when it is full rank (invertible)
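
Writing the orthogonality condition out gives the normal equations and, when \( X^T X \) is invertible, the closed-form solution used in the code below:

$$ X^T(Y - X\hat{\theta}) = 0 \implies X^T X \hat{\theta} = X^T Y \implies \hat{\theta} = (X^T X)^{-1} X^T Y $$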

  • Our Loss is the average squared loss
def squared_loss(theta):
    return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]

from numpy.linalg import inv

theta_hat = inv(X.T @ X) @ X.T @ Y   # closed-form least squares solution
theta_hat

Y_hat = X @ theta_hat

Geometry of Least Squares

Lecture 16: Least Squares Regression in SKLearn

  • Our data!
# Grid of test points
# one column of just 1
def add_ones_column(X):
    return np.hstack([np.ones((X.shape[0],1)), X])
add_ones_column(X)

theta_hat = least_squares_by_solve(add_ones_column(X),Y)
def model_append_ones(X):
    return add_ones_column(X) @ theta_hat

def plot_plane(f, X, grid_points = 30):
    u = np.linspace(X[:,0].min(),X[:,0].max(), grid_points)
    v = np.linspace(X[:,1].min(),X[:,1].max(), grid_points)
    xu, xv = np.meshgrid(u,v)
    X = np.vstack((xu.flatten(),xv.flatten())).transpose()
    z = f(X)
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape),opacity=0.8)

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_append_ones, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Scikit Learn

# The API
model = SuperCoolModelType(args)

# train
model.fit(df[['X1', 'X2']], df[['Y']])

# predict!
model.predict(df2[['X1', 'X2']])

## Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True) # fit an intercept so the plane need not pass through the origin
model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

# predict
synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
synth_data

Looks good!

Hyper-Parameters

Let's go through Kernel Regression

from sklearn.kernel_ridge import KernelRidge
super_model = KernelRidge(kernel="rbf")
super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(super_model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0), 
                  height=600)

Curvy Dude!

Feature Functions

  • P features (mappings)

  • Map non-linear relationships into ones that are linear in the parameters
  • Feature Engineering
    • capture non-linearity
    • encode categorical variables
    • build the covariate (design) matrix

One-Hot Encoding

  • Give each category its own indicator column instead of Alabama = 1, ..., Hawaii = 50, because numeric codes imply an order

  • Bag-of-words with n-grams
  • high dimensional and sparse

  • Ordering (2-gram) "book well", "well enjoy"
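
A small sketch of one-hot encoding with scikit-learn (the toy state column is made up for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

states = pd.DataFrame({"state": ["Alabama", "Hawaii", "Alabama"]})
enc = OneHotEncoder()
print(enc.fit_transform(states).toarray())   # one indicator column per state, no implied order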

Domain Knowledge

  • Knowing the season (e.g., an isWinter feature) can capture a seasonal spike over time

Constant Feature

  • Add the 1 column, bias param

  • Feature functions, they have no params
#stack features
def phi_periodic(X):
    return np.hstack([
        X,
        np.sin(X),
        np.sin(10*X),
        np.sin(20*X),
        np.sin(X + 1),
        np.sin(10*X + 1),
        np.sin(20*X + 1)
    ])
    

  • Some features used, some not at all

Lecture 17: Pitfalls of Feature Engineering

Jupyter Printout

  • Problems of Feature Engineering
  • Overfitting
from numpy.linalg import solve

def fit(X, Y):
    return solve(X.T @ X, X.T @ Y)

def add_ones_column(data):
    n,_ = data.shape
    return np.hstack([np.ones((n,1)), data])

X = data[['X']].to_numpy()
Y = data[['Y']].to_numpy()

...

class LinearModel:
    def __init__(self, phi):
        self.phi = phi
    def fit(self, X, Y):
        Phi = self.phi(X)
        self.theta_hat = solve(Phi.T @ Phi, Phi.T @ Y)
        return self.theta_hat
    def predict(self, X):
        Phi = self.phi(X)
        return Phi @ self.theta_hat
    def loss(self, X, Y):
        return np.mean((Y - self.predict(X))**2)

model_line = LinearModel(phi_line)
model_line.fit(X,Y)
model_line.loss(X,Y)

Redundant Features

If you duplicate a feature column, solving the normal equations raises a singular matrix error.

\( X^T X \) isn't full rank: we have redundancy, and the columns are not linearly independent.

Too Many Features

With too many, the optimal solution is underdetermined!

  • Using RBF
  • add RBF with linear
  • Do 20 bumps on only 9 data points?
    • The feature matrix has rank at most 9, so the solution isn't unique

Overfitting

  • Test data points

  • Training error decreases but test error is terrible!

  • The bias variance tradeoff. The best fit point.

Lecture 18: Cross Validation

  • We overfit the data

  • We fit by minimizing training error
  • Test error estimates generalization error

  • 5-fold cross validation
  • like bootstrap

Cross Val

# shuffle
shuffled_data = data.sample(frac=1.) # all data
shuffled_data

split_point = int(shuffled_data.shape[0]*0.95)
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]

len(tr) + len(te) == len(data)

from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size=0.1, random_state=83)

Don't repeatedly evaluate models on the test set; save it for the final check.

SKLearn Pipelines

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
]) 

model['SelectColumns']

model.fit(tr, tr['mpg'])
# model is a pipeline

Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))

models = {"c+d": model}
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("LinearModel", LinearRegression())
])

from sklearn.impute import SimpleImputer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])

Cross Validation

from sklearn.model_selection import KFold
from sklearn.base import clone

def cross_validate_rmse(model):
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        model.fit(tr.iloc[tr_ind,:], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind], model.predict(tr.iloc[va_ind,:])))
    return np.mean(rmse_values)

cross_validate_rmse(model)

def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x = names, y = training_rmse, name="Training RMSE"),
        go.Bar(x = names, y = validation_rmse, name="CV RMSE"),
        go.Bar(x = names, y = test_rmse, name="Test RMSE", opacity=.3)])
    return fig

  • Train and CV RMSE

  • An example of overfitting

Lecture 19: Regularization

Link

models = {}

quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]


for i in range(len(quantitative_features)):
    # The features to include in the ith model
    features = quantitative_features[:(i+1)]
    # The name we are giving to the ith model
    name = ",".join([name[0] for name in features])
    # The pipeline for the ith model
    model = Pipeline([
        ("SelectColumns", ColumnTransformer([
            ("keep", "passthrough", features),
        ])),
        ("Imputation", SimpleImputer()),
        ("LinearModel", LinearRegression())
    ])
    # Fit the pipeline
    model.fit(tr, tr['mpg']);
    # Saving the ith model
    models[name] = model

K-fold Cross Validation

from sklearn.model_selection import cross_val_score

Overfitting

  • The blue is really small

Regularization

  • penalize overfit models
  • use less complexity

  • complexity

  • Best solution

  • L1 norm
  • LASSO: tends to stick to the corners, so some weights become exactly zero (sparsity)

  • L2 norm
  • Ridge: doesn't really stick to the corners, so weights shrink but rarely hit exactly zero

  • Different regularization

  • The penalty weight \( \lambda \) plays the same role as a complexity constraint, via the Lagrangian

  • Standardize your features first so the penalty treats them on the same scale
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])

alphas = np.linspace(0.5, 20, 30)
cv_values = []
train_values = []
test_values = []
for alpha in alphas:
    ridge_model.set_params(LinearModel__alpha=alpha)
    cv_values.append(np.mean(cross_val_score(ridge_model, tr, tr['mpg'], scoring=rmse_score, cv=5)))
    ridge_model.fit(tr, tr['mpg'])
    train_values.append(rmse_score(ridge_model, tr, tr['mpg']))
    test_values.append(rmse_score(ridge_model, te, te['mpg']))

Cross-Validate to Tune the Regularization Parameter

fig = go.Figure()
fig.add_trace(go.Scatter(x = alphas, y = train_values, mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x = alphas, y = cv_values, mode="lines+markers", name="CV"))
fig.add_trace(go.Scatter(x = alphas, y = test_values, mode="lines+markers", name="Test"))
fig.update_layout(xaxis_title=r"$\alpha$", yaxis_title="CV RMSE")

Ridge with CV

from sklearn.linear_model import RidgeCV

alphas = np.linspace(0.5, 3, 30)

ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", RidgeCV(alphas=alphas))
])

  • The red CV is shrinking

Lasso CV

from sklearn.linear_model import Lasso, LassoCV

lasso_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LassoCV(cv=3))
])

lasso_model.fit(tr, tr['mpg'])
models["LassoCV"] = lasso_model
compare_models(models)

Lecture 20: Random Variables, Sampling Variability

  1. The model should fit our training data well
  2. The model should also generalize to new data (the new athletes)

Random Variables {21, 21} -> 21

  • X is a function

    • argument is a sample: element of the domain
    • returns number: element of the range
  • Random Variables: X_1, X_2, X_3

  • P(X=x)

    • X: random Variable: function
    • x: what the function may return: number
    • chance X returns x

  • added up all of the probabilities
  • in a discrete probability we think of it as area
    • P(a <= X <= b)

  • in the continuous case, probability is the area under the density

  • Bernoulli(p)
    • indicator variable I has value 1 if event happens and 0 if not
    • P(I = 1) = p
    • P(I = 0) = 1-p
  • Binomial(n, p) $$ P(X = k) ~ = ~ \binom{n}{k} p^k(1-p)^{n-k}, ~~~~ 0 \le k \le n $$
# with scipy
# chance of 50 heads in 100 tosses of a fair coin
stats.binom.pmf(50, 100, 0.5)
  • Uniform
unif_density = stats.uniform.pdf(x)    # uniform (0, 1) density
  • Normal $$ f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\big{(}\frac{x-\mu}{\sigma}\big{)}^2}, ~~~ -\infty < x < \infty $$
norm_density = stats.norm.pdf(x, 50, 5)      # normal density with mean 50 and SD 5

  • the normal CDF has no elementary closed form

Expectation

  • weighted average of possible values
  • weights: probabilities

one sample at a time:

  • \( E[X] = \sum_{s} X(s) P(s) \) (sum over all samples \( s \))
  • \( E[X] = \sum_{x} x \, P(X=x) \) (sum over all possible values \( x \))

  • samples, P(s) and X(s)

  • instead P(X=x)

Properties

  • E[X+Y] = E[X] + E[Y]
  • S = X + Y
  • S(s) = X(s) + Y(s)
  • S(s) * P(s) = X(s)P(s) + Y(s)P(s)
  • do a sum S(s) * P(s) = X(s)P(s) + Y(s)P(s)

  • \( E[X]-E[5]=2-5=-3 \)
  • \( E[(X-5)(X-5)]=E[X^2]-10E[X]+E[25]=13-20+25=18 \)

Variance and SD

  • \( Var[X]=E[(X-E[X])^2] \)

  • pull out the term

  • D_s = S - mu_S

  • D_s = D_x + D_y

  • Var[X] + Var[Y] +2E[D_x D_y]

  • \( E(D_x D_y) = E((X- \mu x)(y-\mu Y)) \)

  • covariance

  • \( Var[X+Y] = Var[X] + Var[Y] \) only if the covariance is zero; independence implies zero covariance
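
A quick simulation (random variables made up for illustration) checking \( Var[X+Y] = Var[X] + Var[Y] + 2\,E[D_X D_Y] \):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = 0.5 * X + rng.normal(size=100_000)   # correlated with X, so the covariance is nonzero
lhs = np.var(X + Y, ddof=1)
rhs = np.var(X, ddof=1) + np.var(Y, ddof=1) + 2 * np.cov(X, Y)[0, 1]
print(lhs, rhs)                          # approximately equal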


Random Variable

  • A random variable is a function mapping outcomes in the sample space to real numbers
    • \( X: \Omega \to \mathbb{R} \)

Lecture 21: Bias Variance Tradeoff

  • relation between x and y
  • we observe the random error \( \epsilon \)
  • we only see \( Y \), the data

  • prediction is \( \hat{Y} \)

Prediction Error

  • g is the right model, \( \epsilon \) is random error
  • red Y hat is our prediction

Model Risk

  • expectation of the squared difference

    • take a sample and get the mean
  • Chance Error

    • random
  • Bias

    • when our model is bad

Observation Variance

  • \( \epsilon \) is random, expectation is zero and variance is \( \sigma^2 \)
    • so for the variance of Y: g(x) is constant, so the variance comes only from \( \epsilon \)
    • called observation error
    • measurement error, missing information
    • irreducible error

Chance Error

  • vary a little
  • from a random sample

Model Variance

  • average of prediction

  • can overfit into the data
  • reduce model complexity
  • don't fit the noise
  • bias

Our Model Vs the Truth

  • green is true, red is fixed

Model Bias

  • the average model prediction minus the true \( g \), at a fixed \( x \)
  • not random
  • underfitting, e.g., from missing domain knowledge
  • overfitting

  • compare the average prediction to the actual value

Decomposition of Error and Risk

  • model risk is the expected squared difference
  • it splits into the observation variance (error), the square of the bias, and the model variance

Bias Variance Decomposition
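
In the notation of the bullets above (observation noise \( \sigma^2 \), true function \( g \), prediction \( \hat{Y}(x) \)), the decomposition is:

$$ E\big[(Y - \hat{Y}(x))^2\big] = \sigma^2 + \big(g(x) - E[\hat{Y}(x)]\big)^2 + Var\big[\hat{Y}(x)\big] $$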

Predicting by a Function with Parameters

  • f is just y

Lecture 22: Residuals, Multicollinearity, Inference

Least Squares Regression

  • definition of orthogonality
    • \( X^T(Y - X\hat{\theta}) = 0 \): the residual is orthogonal to the columns of \( X \)
    • invert \( X^T X \) (if it is full rank) for the solution

A Regression Model

  • X is a design matrix, first column is 1
  • theta is params

Residuals

  • difference Y and Y hat (estimate)

Separating Signal and Noise

  • true signal + noise and prediction + residual

Residuals Sum to Zero

  • Why? See the derivation below.

  • The average of the fitted values is equal to the average of the observed responses

  • The fitted values are orthogonal to the residuals
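
Filling in the reasoning: when the model includes an intercept, the first column of \( X \) is all ones, and the normal equations make the residual vector orthogonal to every column of \( X \), so in particular

$$ X^T(Y - \hat{Y}) = 0 \implies \mathbf{1}^T(Y - \hat{Y}) = \sum_i (y_i - \hat{y}_i) = 0 $$

which is why the residuals sum to zero and the fitted values have the same mean as the observed responses.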

Multiple R^2, Observed Responses, and Fitted Values

  • Multiple R^2

  • Coefficient of Determination
    • \( R^2 = \frac{\text{variance of the fitted values}}{\text{variance of the observed responses}} \)
    • "percent of variance explained by the model"

Collinearity and the Meaning of Slope

  • Change in y per unit change in x_1 given all other variables held constant

  • collinearity: when a covariate can be predicted by a linear function of the others

Inference and Assumptions of Randomness

  • our model can be expressed as intercept, weight of features, and error

  • We have to estimate the weights (slopes)

  • how do we test theta_1 is 0?

Confidence Intervals for True Slope

  • could bootstrap the slope estimate to build a confidence interval

Lecture 23: Logistic Regression

  • machine learning
  • when labeled you have supervised learning
    • when quantitative do regression
    • when categorical do classification
  • when unlabeled you have unsupervised learning
    • dimensionality reduction
    • clustering
  • finally reinforcement learning

Kinds of Classification

  • binary two classes
  • multiclass [cat, dog, car]
  • structured prediction ?

  • try least squares regression

  • two classes

  • truncated least square

Estimating the chance of success

  • two different coins, case by case

  • single expression \( p^y(1-p)^{1-y} \)

  • to estimate the probability, find the value of \( p \) that maximizes this function
  • take a log

  • as a sum

  • equivalently, you can minimize the average negative log-likelihood

  • what is the function in the two hard cases?

  • as a loss function there is a penalty

Logistic Function

  • linear functions are not good for probabilities

  • \( t \) ranges over the whole real line, from negative to positive infinity
  • exponentiate both sides

  • model probability on the real line
  • sigmoid, defined on whole line, smooth, increasing, elongated S

  • derivative
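
A minimal sketch of the sigmoid and its derivative, \( \sigma'(t) = \sigma(t)(1 - \sigma(t)) \):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))         # defined on the whole line, smooth, increasing

t = np.linspace(-5, 5, 11)
print(sigmoid(t))                       # values squashed into (0, 1)
print(sigmoid(t) * (1 - sigmoid(t)))    # the derivative at each t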

Log Odds as a Linear Function

  • features
  • linear combo of features

  • the log odds are a linear function of the features
  • the probability is the sigmoid of the log odds: \( p = \sigma(x^T \theta) \)

The Steps of the Model

  • generalized linear model

  • linear regression continuous
  • categorical (probability Y is 1)

  • increase x by one unit

  • linearly separable data

  • you need a little bit of uncertainty

  • with regularization term

Logistic Loss Function

Gradient Descent

Logistic Regression in Scikit Learn

Log Loss

  • or cross-entropy loss

  • log loss is convex!

Lecture 24: Logistic Regression Part 2

  • using a constant model, good baseline

  • partitioned into 7(?) intervals and calculated the proportion in each interval

Bonus K-nearest Neighbor

  • average stored in a heap?

  • kind of bumpy

Logistic Regression

  • \( \sigma(t) = \frac{1}{1+\exp(-t)} \)
  • \( t=\sum_{k=0}^d \theta_k x_k \)

One Dimensional Logistic Regression Model

  • different coefficients

  • slope and intercept (a lower intercept moves the curve right)

Loss

  • cross entropy loss
    • \( -\frac{1}{n}\sum_{i} \big[ y_i \log f(x_i) + (1-y_i)\log(1-f(x_i)) \big] \)
  • have to use an iterative method (no closed form)

  • the code (see the sketch after this list)
  • forward has the model
  • with cross entropy loss
  • zero_grad() resets the gradients so we can take them again
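
A hedged sketch of that training loop; the model, data, and optimizer choices here are assumptions, not the notebook's exact code:

import torch
import torch.nn as nn

model = nn.Linear(1, 1)                           # one-dimensional logistic regression
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()                  # cross-entropy loss applied to the linear output

X = torch.randn(100, 1)                           # placeholder features
Y = (X > 0).float()                               # placeholder 0/1 labels

for _ in range(200):
    opt.zero_grad()                               # zero the gradients so we can take them again
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()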

  • it's sexy
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='lbfgs')

lr_model.coef_

lr_model.intercept_

lr_model.predict # vs lr_model.predict_proba

  • if theta goes to infinity the model becomes certain; instead we want some regularization
  • e.g., add a penalty like 1.0e-5 * theta ** 2

The Decision Rule

  • predict 1 if P(Y=1 | x) > .5
  • but can choose a value other than .5
  • accuracy: fraction of correct predictions out of all predictions

Confusion Matrix

  • False-Positive when it is 0 (false) but the algorithm predicts 1 (true)
  • False-Negative when it is 1 (true) but the algorithm predicts 0 (false)
  • Precision: true positives over true positive + false positives
    • how many selected items are relevant
  • Recall: true positives over true positives + false negatives
    • how many relevant items are selected

  • say we want to ensure 95% of malignant tumors are classified as malignant
  • np.argmin(recall > .95) - 1

  • the pathologist would still have to verify a large fraction of the samples
  • falsely diagnosing 5% of malignant tumors as benign is unacceptable in practice!
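
A small check of these definitions with scikit-learn (toy labels made up for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision
print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall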

Lecture 25: Decision Trees and Random Forests

  • how do we classify these flowers into three species from their petal measurements?

Decision Tree

  • petal_width < .75 and petal_length < 2
    • two dimensions

Scikit-learn Decision Tree

from sklearn import tree
decision_tree_model = tree.DecisionTreeClassifier()   # then fit on the features and labels


# Better visualizer with graphviz

  • a fully grown tree will have perfect training accuracy unless the data has identical feature values with different classes

  • using more features now it is 4d

  • this would overfit

Decision Tree Generation

Node Entropy

  • \( p_0, p_1 \): the proportion of each class at a node
  • entropy \( S = -\sum_C p_C \log_2 p_C \)
    • how unpredictable a node is

Entropy

  • when all the data at a node is one class, the entropy is zero
  • an even two-class split has entropy 1
  • for C evenly split classes the entropy is \( \log_2 C \)

  • entropy of the left and entropy of the right
  • iteratively choose a split value
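
A small sketch of the node entropy computation (the proportions are made up):

import numpy as np

def entropy(proportions):
    p = np.array(proportions)
    p = p[p > 0]                      # treat 0 log 0 as 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0]))            # pure node: entropy 0
print(entropy([0.5, 0.5]))            # even two-class split: entropy 1
print(entropy([1/3, 1/3, 1/3]))       # C evenly split classes: log2(C)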

Overfitting

  • just don't let it overgrow
  • greedy algorithm

  • pruning
  • create a validation set

Random Forest

  • individual trees have low bias (they can capture the training data exactly) but high variance
  • combine many trees and let them vote

Bagging

  • Bootstrap AGGregatING
  • resample
  • final model
  • Berkeley Stats 1994!

  • at each split, pick a random subset of m features

  • heuristic
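
A minimal scikit-learn sketch of the bagging-plus-random-feature-subset idea (the dataset here is just a placeholder):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# each tree is fit on a bootstrap resample and considers a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.score(X, y))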

Why Random Forest

Lecture 26: Dimensionality Reduction

High Dimensional Data

  • 2d dataset
  • hard to plot 3d

  • would be redundant rank 3
sns.pairplot

Matrices are Linear Operations

  • on data

Matrices are Coordinate Transformation

Singular Value Decomposition

Orthonormality

  • each vector has unit length (norm 1)
  • all vectors are mutually orthogonal
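
A quick check of orthonormality using the singular vectors from np.linalg.svd (random matrix for illustration):

import numpy as np

A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U.T @ U, np.eye(3)))    # columns of U are unit length and mutually orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # rows of Vt likewise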

Lecture 27: Principal Component Analysis

  • width and length are independent from each other
  • what if it is noisy?

  • last column is not quite zero
  • rank 3 approximation of rank 4 data

  • rank 2 approximation

Principal Component

  • need to subtract the mean

  • PC1 places each country on a line

  • PC2 measures how far above or below that line each country falls

Variance (Singular Value Interpretation)

  • how many principal components to use?
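
A hedged sketch of the center-then-SVD recipe and the variance-explained numbers used to answer that question (the data matrix is a placeholder):

import numpy as np

X = np.random.randn(100, 4)               # placeholder data matrix
Xc = X - X.mean(axis=0)                   # subtract the mean of each column first
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt.T                           # principal components: data projected onto the right singular vectors
variance_explained = s**2 / np.sum(s**2)  # fraction of variance captured by each component
print(variance_explained)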

Homeworks

Homework 4

Part 2

Question 2

Regex

Pattern example:

import re
matches = re.findall(r'\[(\d+)/(\w+)/(\d+):', text)  # text is the string being searched

Question 3

Question 4

Question 5

  • For 5c, reading from a csv without a header, tab separated, setting the index to the zeroth column, and selecting only the first column:
pd.read_csv('file.txt', index_col=0, header=None, sep='\t').rename(columns={1: 'col_name'})[['col_name']]

5f

# this makes each split piece its own column!
df['col_name'].str.split(' ', expand=True)

5g

pd.merge(left, right, how='left') # left, right, outer, inner
# on=None, left_on=None, right_on=None, 
# use left_index, right_index

Question 7:

A regex "or" (alternation) can be grouped with parentheses:

(a|b)

Labs

Lab 3

Regex

Tool

\W: non-word character
*: 0 or more
+: 1 or more
?: 0 or 1

Lab 4

Link

Part 1: Scaling

Distribution of values of female adult literacy rate as well as gross national income per capita.

We create a series with what we want, and drop null values.

  1. a. Want to build a histogram.

sns.countplot is used more for categorical variables; notice that the value 5 is missing from the plot.

  1. b. sns.distplot: reference

    d. sns.scatterplot: reference

Part 2: Kernel Density Estimation

The kernel density estimate (KDE) is a sum of copies of the kernel, each centered on one of our data points. The default kernel is the Gaussian kernel:

$$\Large K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right) $$

Discussions

Discussion 7

Ans

Visualizing Gradients

  • Take the derivative with respect to each element

Discussion 8

Search

Search StackOverflow

Search Data 100

Search Python Libraries

Search Pandas

Search Matplotlib

Search Scikit Learn

Search Numpy

Exams

Midterm 1/Take Home Checkpoint Assignment

Review Slides

Questions in Scope

Material

  • Data Science
  • Probability
  • SQL
  • Pandas
  • Regex
  • Visualization
  • Modeling and Estimation
  • Optimization
  • PyTorch
  • Regression

Midterm 1 Fall 2019

Link

Reference Sheet

  • 1.a.
    • Ans: name
    • Cor:
    • Review: granularity

  • b.
    • Ans: cereal['low_calorie'] = cereal['calories'] <= 100
    • Cor:
    • Review: Pandas Series

  • Primary Key: unique

    • In Purchases.csv, (OrderNum, ProdID) is unique
    • In Customers.csv CustID
  • Foreign Key: not unique

    • In Orders.csv CustID is a foreign key
    • "The order associated with Customers"
  • c.

    • Ans: fiber: continuous, type: discrete, low_calorie: discrete
    • Cor: type is nominal, low_calorie is ordinal
    • Review: types of vars, nominal vs ordinal

  • d.
    • Ans: groupby 'manufacturer' max sugars sort_values ascending=False
    • Cor: groupby 'manufacturer' agg max sort_values ascending=False
    • Review: agg functions
  • e.
    • Ans:
    • Cor: cereal.groupby('manufacturer').filter(lambda x: sum(x['type'] == 'hot') == 0)['manufacturer']
    • Review: groupby class, then filter x is group of one class that is reduced to a single value

  • f.

    • Review: pivot table
    • data is the dataframe, index, and column, aggfunc reduces to a single val
  • g. Review: what to do with NaN data

  • 2.a. 25, mean

  • b. 125

    • cor: c=25
  • c. 25

    • cor: c=30
  • d. 25

    • cor: c=30
    • review: take derivative? what kind of loss are they using
    • average squared loss: \( L(c) = \frac{1}{n} \sum_{i = 1}^{n} (x_i - c)^2 \)
  • 3.a.

    • ans: regex = r'^(.*): (.*)$'
    • cor: regex = r'([\w\s]+):\s+(\d+)'
    • review:
      • \w: alphanumerics a-zA-Z0-9_
      • \s whitespace
      • \d 0-9 - street number!
      • + one or more times
      • * zero or more times
  • b.i. ans: [45]\d{15}

    • review: [45] matches 4 or 5
      • \d{15} matches a decimal 15 times
  • ii.

    • ans: \d+\.?\d{2}
    • cor: \$(\d+\.?\d*) Needs the dollar match at the front? parenthesis () so we only get the inside
  • EDA 4.a.

    • log(y) and e**x inverse operations
    • review: operation on y allowed
  • b.i. 30

  • ii. skew right

    • cor: unimodal
    • review: skew, unimodal, EDA
  • iii. impossible to tell

  • 5.a.
    • Ans: yes; use a bar graph instead of plotting the distribution. You are unable to see the values.
    • Cor: increase the bandwidth for a smoother density estimate
    • Review: density estimation functions
  • b.
    • Ans: yes, it could show the numbers
    • Cor: rescale the y axis
    • Review: rescaling
  • c.
    • Ans: idk
    • Cor: density curve - compare distributions
    • Review: density curve/distributions
  • d.
    • Ans: no
    • Cor:
    • Review:
  • c. impossible to tell
    • review: box and whiskers, impossible to tell frequency
  • 6.a.
    • Ans: People in CS W186
    • Cor: and not in Data 100
    • Review:
  • ii.
    • Ans: people who don't go to office hours
    • Cor:
    • Review:
  • 6.b.i.
    • Ans:
    • Cor: \( P(X_5 = 1) = \frac{500}{1000}\)
    • Review: simple random sample, and other one

  • Random sample with replacement, just multiply
  • Simple random sample, when doing an AND prob, multiply together

  • ii

    • Ans:
    • Cor: \( P(X_5 = 1, X_{50} = 1) = \frac{500}{1000} \times \frac{499}{999} \)
    • Review:
  • iii.

    • Ans:
    • Cor: $$ \frac{N-n}{N-1} np(1-p)$$

    $$=\frac{1000-50}{1000-1} 50 \frac{500}{1000}(1-\frac{500}{1000}) $$

    • Review: variance
  • iii.

    • Ans:
    • Cor: 0

Review

  • Binomial Probabilities

    • Probability of picking 4 blue and 3 not blue, has a certain probability
    • Multiply by the number of ways to pick it
  • NaN: see if there is any skew or bias if NaN is removed

  • Variance and STD

    • \( \sigma^2=Var(X)=\sum_{i = 1}^{n} p_i \cdot (x_i - \mu)^2 \)
    • Standard Deviation: \( \sigma=\sqrt{\sigma^2}=\sqrt{Var(X)} \)

Midterm 1 Review

Review Slides

Sampling

  • Simple Random Sample (SRS) sample uniformly, without replacement 1/50 * 1/49 * 1/48
  • Sample people chosen from the Sampling Frame people we could have chosen from the Target Population where we want to generalize to.
  • Probability Sample must be random.
  • Random sample with replacement, just multiply 1/50 * 1/50 * 1/50

Probability

  • List the distinct ways
  • Sometimes the complement is easier 1-P
  • Are you drawing one at a time? 1/50*1/49

SQL

Pandas

  • slice dataframe df.loc[['Mary', 'Anna'], :]
    • row index (on the left) of Mary and Anna, all column indices

Regex

Visualization

  • Types of Data
    • Quantitative Data
      • Continuous
        • weight, temperature
      • Discrete
        • finite, years of education, num sib
    • Categorical Data
      • Nominal
        • no ordering
        • Hair color
      • Ordinal
        • does have ordering!
        • Olympic Medals Gold > Silver > Bronze
  • Types of Plots
    • Quantitative Data
      • Histograms, Box Plots, Rug Plot, KDE - Kernel Density Estimation
      • Look at spread, shape, modes, outliers, unreasonable values
    • Nominal & Ordinal Data
      • Bar Plots (comparison)
      • Skew frequency, rare categories, invalid categories

  • Histogram and KDE

  • Box Plot

  • Not a Distribution

  • Is a Distribution

Kernel Density Estimators - KDE

  • visualize shape/structure not individual observations

  • Put a gaussian over each of the three points (tiny blue arrow), scale by 1/3, and sum

$$ K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right) $$

\( \alpha \) is a smoothing param to change
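
A small sketch of that recipe in numpy (the three data points are made up):

import numpy as np

def gaussian_kernel(alpha, x, z):
    return 1 / np.sqrt(2 * np.pi * alpha**2) * np.exp(-(x - z)**2 / (2 * alpha**2))

data = np.array([2.0, 4.0, 9.0])                        # three observed points
xs = np.linspace(0, 12, 200)
# place one kernel on each data point, scale each by 1/n, and sum
kde = np.mean([gaussian_kernel(1.0, xs, z) for z in data], axis=0)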

Transformation

  • If your data looks like \( y \approx x^2 \), to make it linear plot \( \sqrt{y} \)
  • If your data looks like \( y \approx \sqrt{x} \), to make it linear plot \( y^2 \)
  • Transformations can be applied to either axis

Linear Regression & Loss Functions

  • Model Response Variable (y) with Explanatory Var/Predictors (x)
  • Loss functions measure error of our model
  • Squared loss L2 \( l(c) = (x-c)^2 \)
    • more sensitive to outliers than L1, whose gradients are constant rather than vanishing near the minimum
  • Absolute loss L1 \( l(c) = |x-c| \)
  • Minimize Loss.
    • Option A: Take the derivative and set to zero (convex)
    • Option B: Use Gradient Descent or SGD

  • Average Squared Loss, takes the mean of each loss per data point

Gradient Descent

  • The gradient is the vector of partial derivatives of a function with respect to each variable
  • grad: a scalar-valued function of several variables gives a vector

  • Gradient descent updates the parameters by stepping in the negative gradient direction, scaled by the learning rate
  • Convexity is usually wanted (it guarantees the global minimum), but the method works without it

Midterm 1 Review Questions

Link

Loss Function & Gradient Descent

Spring 2018

  • Q1.p3 Pg4 (Loss Functions)
    • a.: True b. True c. False d. second loss with infinite
  • Q8 Pg15 (Loss Minimization)
    • a. final one
    • review: weight theta is a constant. \( \frac{1}{n} \sum_{i = 1}^{n} \theta = \theta \)
  • Q10 Pg21-22 (Gradient Descent)
    • Random sample on each loop iteration
    • review np.random.choice returns a np.array sample. It can index a slice to give a np.array with X[ind,:]
    • final ans
    • correct ans: grad = (grad_function(theta, xbatch, ybatch) + 2 * theta * lam)
      • this keeps it a vector

Modeling/Regression

  • Summer 2019 Final
    • Q4
      • review \( \log(\theta e ^{-\theta x_i})=\log(\theta) - \theta x_i \)
      • \( \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \)
    • Q5
      • logistic regression/random forest, linear regression/random forest, linear regression/random forest, logistic regression/random forest
    • Q6
      • a.
        • positive weight
        • positive positive
        • positive
        • negative to beef (why?)
        • negative to chicken (why?)
        • Not Enough Info
      • b.-d.
        • 10
        • 25
        • 26
      • e.
        • low bias high variance
        • high variance
        • decrease variance
        • increase bias decrease variance
      • f.
        • training: c
        • validation: b
      • g.
        • high regularization, high mse
        • want in between regularization
    • Q7
      • a.
        • \( \theta^{(t + 1)} = \theta^{(t)} + \alpha (y - \sigma(\theta^{(t)} - 2) - \theta^{(t)}) \)
        • \( \theta^{(t + 1)} = y - \sigma(\theta^{(t)} - 2) \)
      • b.
        • scalar, scalar, len-p vector, len-p vector
        • cor read len-n vs len-p!
          • y is a single outcome
      • c.
        • if have all values
        • lambda doesn't affect second term, what you control
        • since theta is real equals

Regex

  • Summer 2019 Q2
  • go! and garbs!
    • garbs selects g then a then r then b, finally s
  • Fall 2018 Q6
      • 2 letters, third letter
      • + is one or more
      • * is zero or more
      • dogdog burritodog dogburrito
    • 9
      • 3 to 11