Introduction

Taught by Ani Adhikari and Joey Gonzalez

Lecture 11: Optimization

From Last Time

• Graph of averages
• Minimizing average loss
• It makes no assumption about the shape of the data.
• There is always a line of best fit, but it might not be the right model (see Anscombe's Quartet)

Transformation

• Linear models are easy to interpret

Log transformation

$$y=a^x \implies \log(y)=x\log(a)$$ $$y=ax^k \implies \log(y)=\log(a)+k\log(x)$$

Simple Linear Regression: Interpreting the Slope

$$slope = r~\frac{\sigma_y}{\sigma_x}$$

Regression shows association, not causation.

For a slope of 0.09 inches per pound, we say 0.09 is the estimated difference in height between two people whose weights are one pound apart.

Recap on Modeling

• For engineers the goal is making predictions - accuracy
• For scientists it's interpretability.
• Parameters like $$F = ma$$

Steps for Modeling

Squared Loss vs. Absolute Loss

$$L^2$$ loss has nice optimization properties (differentiable) but is sensitive to outliers.

Calculus for Loss Minimization

$$h(x)=f(g(x))$$

$$\frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$$

The $$\partial g$$ terms cancel out.

Derivative of outside times derivative of inside. Repeat!

Numerical Optimization

• First Order: Gradients, the slope of our loss landscape.
• Second Order: the Hessian; computing it is slow.

The loss function $$f$$ takes a vector and returns a scalar.

The gradient takes that vector and returns a vector of partial derivatives.

Like falling down a well. 1 Dimension example.

Update the weights $$\theta$$. $$\rho(\tau)$$ is the learning rate.
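Written out in this notation, the gradient descent update is:

$$\theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\,\nabla_\theta L\big(\theta^{(\tau)}\big)$$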

Lecture 12: Optimization Cont. (SGD and Auto Diff)

Webcast

Slides

• Compute slope at that point

• Update (the step computed from the gradient is shown in red)

• Initial vector (weights?)
• Update by learning rate
• Converge, gradient is 0 or stop early
• We can do better! Calculating the gradient is slow

• Computing on the population is expensive!!!

• Sample! Called a batch. B term
• Assume loss is decomposable?

• Decomposable loss: the total loss must be expressible as a sum of per-example losses

• PyTorch

• SGD is faster, but on average, the mean is correct and it converges!!!
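A minimal numpy sketch of this idea; the data, model, and names are made up for illustration, not the lecture's code:

import numpy as np

# gradient of the average squared loss for a linear model, over one batch
def grad_loss(theta, Xb, Yb):
    return -2 * Xb.T @ (Yb - Xb @ theta) / len(Yb)

X = np.hstack([np.ones((100, 1)), np.random.randn(100, 1)])
Y = X @ np.array([1.0, 3.0]) + 0.1 * np.random.randn(100)

theta = np.zeros(2)
for step in range(1000):
    idx = np.random.choice(len(Y), 32)            # sample a batch B
    theta -= 0.05 * grad_loss(theta, X[idx], Y[idx])
print(theta)    # close to [1, 3]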

PyTorch

• The forward pass builds a computation graph; the backward pass uses it to calculate gradients
• chain rule over the individual calculus operations - the computation graph

• Graph of each individual operation
• Backward Differentiation uses the chain rule!!

• Can use GPUs, and auto_diff
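A tiny illustration of what these bullets describe, with made-up numbers:

import torch

theta = torch.tensor([1.0, 2.0], requires_grad=True)
x, y = torch.tensor([3.0, 4.0]), torch.tensor(10.0)

loss = (y - theta @ x) ** 2      # forward pass builds the computation graph
loss.backward()                  # backward differentiation via the chain rule
print(theta.grad)                # d(loss)/d(theta), computed automatically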

Demo

• Line of best fit and residuals

• Mean Square Error Loss Surface

• $$L^{1}$$ loss surface for comparison!
• Sharp at the end

• PyTorch nn.Module, can add parameters
• Only need a forward function!

• N steps: get the loss, call loss.backward()
• Update weights inside with torch.no_grad(): (see the sketch below)
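A sketch of that loop, using a placeholder linear model and made-up data rather than the demo's exact code:

import torch
import torch.nn as nn

X = torch.randn(100, 1)
Y = 3 * X + 1 + 0.1 * torch.randn(100, 1)
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
learning_rate = 0.05

for step in range(200):                  # N steps
    loss = loss_fn(model(X), Y)          # forward pass, get the loss
    loss.backward()                      # compute gradients
    with torch.no_grad():                # update weights without tracking gradients
        for p in model.parameters():
            p -= learning_rate * p.grad
            p.grad.zero_()               # reset gradients for the next step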

• Visualization SimpleLinearModel

• Green model!

Make it a Polynomial

• n_epochs is how many times we walk through the data; the DataLoader sets the batch size

• Overfitting LMAO

Lecture 13: Review Modeling and Optimization, Intro to Regression

Human Contexts and Ethics

Imagine you are a Data Scientist on Twitter's "Trust and Safety" team.

1. Question/Prob Formation
• Fake News is a problem
• Doesn't have to be an Engineering-Focused problem!
2. Data Acquisition and Cleaning
• What data do we have and need to collect
• President's Tweet
3. Exploratory Data Analysis
• Example classify tweets as healthy or unhealthy
• Note biases and anomalies
4. Predictions and Inference
• What is the story, social good
• Think of who is listening what kind of power do you have

Review Modeling and Optimization

Models

• Models are a function f that map from X to Y.

• Parametric Models
• Have parameters, often represented as a vector
• Linear Models

• Non-Parametric Models?
• Nearest Neighbor
• copy the prediction from the closest datapoint
• Really big! Grows with the size of the data

• Kernel Density Estimator has a parameter, but it's more like a hyperparameter

• Can predict midterm grades from homeworks
• Simple model interpretable, summarize data
• Complex model

Loss Functions

• Loss measures how close our model's prediction is to the actual value

• Average Loss

• Solve it with optimization: find the $$\theta$$ (parameters) that minimize the loss

• $$f_\theta(x)$$ is our model. $$L(\theta)$$ is the loss func

• F.l1_loss is the built-in equivalent
• Keep values as tensors so autograd can be used

• When building a model, do class ExponentialModel(nn.Module)
• Weights self.w = nn.Parameter(torch.ones(2, 1))
• The initial weights are a 2x1 tensor of ones, i.e. [1, 1]
• forward is how to make a prediction
• to evaluate:
m = ExponentialModel()
m(0) # returns tensor of 2
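A minimal sketch of such a module; the exact functional form isn't recorded above, so $$w_0 + w_1 e^{x}$$ is assumed here (it matches m(0) returning 2 with weights [1, 1]):

import torch
import torch.nn as nn

class ExponentialModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2, 1))   # initial weights: 2x1 tensor of ones

    def forward(self, x):
        # assumed form: w0 + w1 * exp(x)
        return self.w[0] + self.w[1] * torch.exp(x)

m = ExponentialModel()
print(m(torch.tensor(0.)))   # tensor([2.], grad_fn=...)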


• In the 3d plot, have w0, w1, and loss. Find point that minimizes loss.

• Example of the orange vs. yellow line and its location on the loss landscape

Optimization of the Model

• You know your loss; compute the gradient (how to improve the loss)
• grad: takes the parameter vector and returns a vector; each entry is the derivative with respect to one parameter

• Take deriv and evaluate it

Lecture 14: Review Models and Loss

• Response var you want to estimate

• Model, summarizes with parameter w

• w is an estimator

Loss

• $$L(w)=\frac{1}{n}\sum_{i=1}^{n}(y_i - w)^2$$
• a sum over each $$y_i$$

• We compare the red value vs the purple value. The green value is the best w, it minimizes loss.

Minimizing Loss

• L(w*) ?

• What is your 1. data, 2. model, 3. parameters, 4. loss, (also optimization method)

• From 1-d (best avg) to 2-d, best func

• $$w^*$$ is the w that minimizes L(w). The best estimator is $$\hat{y}(w^*)$$

• Can generalize our optimization to 3d! (3 weight params)
• Can't plot the loss for 3 weights (it would be 4-d: 3 weight dimensions plus the loss dimension)

• One option is calculus set deriv of loss to 0

• Can actually do brute force. do np.linspace and try all values. O(N^2)?
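A brute-force sketch for a hypothetical two-parameter linear model (data and names are made up):

import numpy as np

# made-up data roughly following y = 0 + 2x
x = np.array([1., 2., 3., 4.])
y = np.array([2., 4., 6., 8.])

def avg_loss(w0, w1):
    return np.mean((y - (w0 + w1 * x)) ** 2)

w0s = np.linspace(-5, 5, 101)
w1s = np.linspace(-5, 5, 101)
# try every combination: an O(N^2) grid for two parameters
losses = np.array([[avg_loss(w0, w1) for w1 in w1s] for w0 in w0s])
i, j = np.unravel_index(np.argmin(losses), losses.shape)
print(w0s[i], w1s[j])   # close to (0, 2)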

• Find optimal weights of our sin model

• Derivatives

• Gradients: the opposite of which way to walk (we step along the negative gradient)

• The gradient descent algorithm visualized

Lecture 15: Least Squares Linear Regression

• Linear Model

• Vector form $$x ^ T \theta$$

• The matrix form gives every prediction at once

• note the column of ones on the left; $$\hat{Y} = X \theta$$

• Linear model of carat, depth, and table
# adds one to left. Horizontal stack
X = np.hstack([np.ones((n,1)), data[['carat', 'depth', 'table']].to_numpy()])
X

def linear_model(theta, xt):
    return xt @ theta  # The @ symbol is matrix multiply


Least Squares Linear Regression

• Loss
• By Calculus or Geometric Reasoning

• You have a span
• Make the residual perpendicular (orthogonal)

• Find Theta that minimizes residual
• Orthogonal is that equation
• $$X^T X$$ is a square matrix; the optimal $$\hat{\theta}$$ is unique when it is full rank (invertible), as written out below
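Written out, the orthogonality condition and its solution (when $$X^T X$$ is invertible) are:

$$X^T(Y - X\hat{\theta}) = 0 \quad\Longrightarrow\quad \hat{\theta} = (X^T X)^{-1} X^T Y$$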

• Our Loss is the average squared loss
def squared_loss(theta):
    return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]


from numpy.linalg import inv  # needed for the closed-form solution

theta_hat = inv(X.T @ X) @ X.T @ Y
theta_hat

Y_hat = X @ theta_hat


Lecture 16: Least Squares Regression in SKLearn

• Our data!
# one column of just 1
def model_append_ones(X):
    return np.hstack([np.ones((X.shape[0],1)), X])

# Grid of test points
def plot_plane(f, X, grid_points=30):
    u = np.linspace(X[:,0].min(), X[:,0].max(), grid_points)
    v = np.linspace(X[:,1].min(), X[:,1].max(), grid_points)
    xu, xv = np.meshgrid(u, v)
    X = np.vstack((xu.flatten(), xv.flatten())).transpose()
    z = f(X)
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape), opacity=0.8)

fig = go.Figure()
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
                  height=600)


Scikit Learn

# The API
model = SuperCoolModelType(args)

# train
model.fit(df[['X1', 'X2']], df[['Y']])

# predict!
model.predict(df2[['X1', 'X2']])

## Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)  # fit an intercept so the fit need not pass through the origin
model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

# predict
synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
synth_data



Looks good!

Hyper-Parameters

Let's go through Kernel Regression

from sklearn.kernel_ridge import KernelRidge
super_model = KernelRidge(kernel="rbf")
super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])

fig = go.Figure()
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
height=600)


Curvy Dude!

Feature Functions

• P features (mappings)

• Map non-linear into linear
• Feature Engineering
• non-linear
• change from categorical
• Covariant Matrix?

One-Hot Encoding

• Build a matrix of indicator columns. Instead of Alabama = 1, ..., Hawaii = 50, because integer codes imply an order (sketch below)
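A small sketch of the idea with pandas (the state column and values are made up; sklearn's OneHotEncoder does the same job):

import pandas as pd

df = pd.DataFrame({'state': ['Alabama', 'Hawaii', 'Alabama']})
# one indicator column per category, instead of a single integer code
print(pd.get_dummies(df['state']))   # columns 'Alabama' and 'Hawaii' with 1/0 (or True/False) entries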

• Bag-of-words with n-grams
• high dimensional and sparse

• Ordering (2-gram) "book well", "well enjoy"

Domain Knowledge

• Knowing isWinter lets the model capture a known spike over time

Constant Feature

• Add the 1 column, bias param

• Feature functions, they have no params
# stack features
def phi_periodic(X):
    return np.hstack([
        X,
        np.sin(X),
        np.sin(10*X),
        np.sin(20*X),
        np.sin(X + 1),
        np.sin(10*X + 1),
        np.sin(20*X + 1)
    ])



• Some features used, some not at all

Lecture 17: Pitfalls of Feature Engineering

Jupyter Printout

• Problems of Feature Engineering
• Overfitting
from numpy.linalg import solve

def fit(X, Y):
    return solve(X.T @ X, X.T @ Y)

def phi_line(data):
    n, _ = data.shape
    return np.hstack([np.ones((n,1)), data])

X = data[['X']].to_numpy()
Y = data[['Y']].to_numpy()

...

class LinearModel:
    def __init__(self, phi):
        self.phi = phi

    def fit(self, X, Y):
        Phi = self.phi(X)
        self.theta_hat = solve(Phi.T @ Phi, Phi.T @ Y)
        return self.theta_hat

    def predict(self, X):
        Phi = self.phi(X)
        return Phi @ self.theta_hat

    def loss(self, X, Y):
        return np.mean((Y - self.predict(X))**2)

model_line = LinearModel(phi_line)
model_line.fit(X,Y)
model_line.loss(X,Y)


Redundant Features

If you copy a feature column, solving the normal equations raises a Singular Matrix error.

The matrix isn't full rank: with the redundancy, the columns are not linearly independent.
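A quick demonstration with a made-up design matrix:

import numpy as np
from numpy.linalg import solve

X = np.array([[1., 2.], [1., 3.], [1., 4.]])
X_dup = np.hstack([X, X[:, [1]]])   # copy a column: still only rank 2
Y = np.array([[1.], [2.], [3.]])

try:
    solve(X_dup.T @ X_dup, X_dup.T @ Y)
except np.linalg.LinAlgError as e:
    print(e)   # Singular matrix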

Too Many Features

With too many, the optimal solution is underdetermined!

• Using RBF
• Do 20 bumps, on 9 data points?
• Rank 9 matrix

Overfitting

• Test data points

• Training error decreases but test error is terrible!

• The bias variance tradeoff. The best fit point.

Lecture 18: Cross Validation

• We overfit the data

• We try to fit minimize training error
• Test error to see generalization error

• 5-fold cross validation
• like bootstrap

Cross Val

# shuffle
shuffled_data = data.sample(frac=1.) # all data
shuffled_data

split_point = int(shuffled_data.shape[0]*0.95)
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]

len(tr) + len(te) == len(data)

from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size=0.1, random_state=83)


Don't evaluate the model on test error during model selection (use validation error instead)?

SKLearn Pipelines

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])

model['SelectColumns']

model.fit(tr, tr['mpg'])
# model is a pipeline

Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))

models = {"c+d": model}
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("LinearModel", LinearRegression())
])

from sklearn.impute import SimpleImputer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])


Cross Validation

from sklearn.model_selection import KFold
from sklearn.base import clone

def cross_validate_rmse(model):
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        model.fit(tr.iloc[tr_ind,:], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind], model.predict(tr.iloc[va_ind,:])))
    return np.mean(rmse_values)

cross_validate_rmse(model)

def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x = names, y = training_rmse, name="Training RMSE"),
        go.Bar(x = names, y = validation_rmse, name="CV RMSE"),
        go.Bar(x = names, y = test_rmse, name="Test RMSE", opacity=.3)])
    return fig


• Train and CV RMSE

• An example of overfitting

Lecture 19: Regularization

models = {}

quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]

for i in range(len(quantitative_features)):
    # The features to include in the ith model
    features = quantitative_features[:(i+1)]
    # The name we are giving to the ith model
    name = ",".join([name[0] for name in features])
    # The pipeline for the ith model
    model = Pipeline([
        ("SelectColumns", ColumnTransformer([
            ("keep", "passthrough", features),
        ])),
        ("Imputation", SimpleImputer()),
        ("LinearModel", LinearRegression())
    ])
    # Fit the pipeline
    model.fit(tr, tr['mpg']);
    # Saving the ith model
    models[name] = model


K-fold Cross Validation

from sklearn.model_selection import cross_val_score



Overfitting

• The blue is really small

Regularization

• penalize overfit models
• use less complexity

• complexity

• Best solution

• L1 norm
• LASSO

• L2 norm
• the L2 ball has no corners, so solutions don't stick to the corners (weights shrink but rarely hit exactly zero)

• Different regularization

• The penalty $$\lambda$$ corresponds to a complexity constraint via the Lagrangian (larger $$\lambda$$ means less allowed complexity)
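The penalized objective (standard ridge form), with $$\lambda$$ trading off fit against complexity:

$$\hat{\theta} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f_\theta(x_i)\big)^2 + \lambda \sum_{k=1}^{d}\theta_k^2$$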

ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])

alphas = np.linspace(0.5, 20, 30)
cv_values = []
train_values = []
test_values = []
for alpha in alphas:
    ridge_model.set_params(LinearModel__alpha=alpha)
    cv_values.append(np.mean(cross_val_score(ridge_model, tr, tr['mpg'], scoring=rmse_score, cv=5)))
    ridge_model.fit(tr, tr['mpg'])
    train_values.append(rmse_score(ridge_model, tr, tr['mpg']))
    test_values.append(rmse_score(ridge_model, te, te['mpg']))


Cross Validate Tune Regularization Param

fig = go.Figure()
fig.add_trace(go.Scatter(x = alphas, y = train_values, mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x = alphas, y = cv_values, mode="lines+markers", name="CV"))
fig.add_trace(go.Scatter(x = alphas, y = test_values, mode="lines+markers", name="Test"))
fig.update_layout(xaxis_title=r"$\alpha$", yaxis_title="CV RMSE")


Ridge with CV

from sklearn.linear_model import RidgeCV

alphas = np.linspace(0.5, 3, 30)

ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", RidgeCV(alphas=alphas))
])


• The red CV is shrinking

Lasso CV

from sklearn.linear_model import Lasso, LassoCV

lasso_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LassoCV(cv=3))
])

lasso_model.fit(tr, tr['mpg'])
models["LassoCV"] = lasso_model
compare_models(models)


Lecture 20: Random Variables, Sampling Variablility

1. The model should fit our training data well
2. The model should generalize to new athletes

Random Variables {21, 21} -> 21

• X is a function

• argument is a sample: element of the domain
• returns number: element of the range
• Random Variables: X_1, X_2, X_3

• P(X=x)

• X: random Variable: function
• x: what the function may return: number
• chance X returns x

• add up all of the probabilities
• in a discrete distribution we can think of probability as area (of the histogram)
• $$P(a \le X \le b)$$

• in a continuous distribution, probability is area under the density

• Bernoulli(p)
• indicator variable I has value 1 if event happens and 0 if not
• P(I = 1) = p
• P(I = 0) = 1-p
• Binomial(n, p) $$P(X = k) ~ = ~ \binom{n}{k} p^k(1-p)^{n-k}, ~~~~ 0 \le k \le n$$
# with scipy
# chance of 50 heads in 100 tosses of a fair coin
stats.binom.pmf(50, 100, 0.5)

• Uniform
unif_density = stats.uniform.pdf(x)    # uniform (0, 1) density

• Normal $$f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\big{(}\frac{x-\mu}{\sigma}\big{)}^2}, ~~~ -\infty < x < \infty$$
norm_density = stats.norm.pdf(x, 50, 5)      # normal density with mean 50, SD 5


• there is no elementary closed form (for the normal CDF)

Expectation

• weighted average of possible values
• weights: probabilities

one sample at a time:

• E[X] = sum over all samples X(s) * P(s)
• E[X] = sum over all x, x * P(X=x)

• samples, P(s) and X(s)

Properties

• E[X+Y] = E[X] + E[Y]
• S = X + Y
• S(s) = X(s) + Y(s)
• S(s) * P(s) = X(s)P(s) + Y(s)P(s)
• sum over all samples: $$\sum_s S(s)P(s) = \sum_s X(s)P(s) + \sum_s Y(s)P(s)$$

• E[X - 5] = E[X] - E[5] = 2 - 5 = -3
• E[(X-5)(X-5)] = E[X^2] - E[10X] + E[25] = 13 - 20 + 25 = 18

Variance and SD

• $$Var[X]=E[(X-E[X])^2]$$

• pull out the term

• $$D_S = S - \mu_S$$

• $$D_S = D_X + D_Y$$

• $$Var[S] = Var[X] + Var[Y] + 2E[D_X D_Y]$$

• $$E(D_X D_Y) = E((X - \mu_X)(Y - \mu_Y))$$

• covariance

• Var[X+Y] = Var[X] + Var[Y] only if the covariance is zero (e.g. when X and Y are independent)

Random Variable

• A random variable is a function mapping events to real numbers
• $$X: \Omega \to \mathbb{R}$$

Lecture 21: Bias Variance Tradeoff

• relation between x and y
• we observe the random error $$\epsilon$$
• we only see $$Y$$, the data

• prediction is $$\hat{Y}$$

Prediction Error

• g is the right model, $$\epsilon$$ is random error
• red Y hat is our prediction

Model Risk

• expectation of the squared difference

• take a sample and get the mean
• Chance Error

• random
• Bias

• when our model is bad

Observation Variance

• $$\epsilon$$ is random, expectation is zero and variance is $$\sigma^2$$
• so for Var(Y): g(x) is constant, so the variance comes only from $$\epsilon$$
• called observation error
• measuring error, missing information
• irreducible error

Chance Error

• vary a little
• from a random sample

Model Variance

• average of prediction

• can overfit into the data
• reduce model complexity
• don't fit the noise
• bias

Our Model Vs the Truth

• green is true, red is fixed

Model Bias

• model prediction minus true g based on fixed x
• not random
• underfitting, often from lacking domain knowledge
• overfitting

• compare the average prediction to the actual value

Decomposition of Error and Risk

• expected squared diff
• Expectation in error (variance of observation), square of the bias, model variance

• f is just y
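The standard decomposition, in the notation above ($$g$$ is the true function, $$\hat{Y}(x)$$ our prediction at a fixed x):

$$E\big[(Y - \hat{Y}(x))^2\big] = \sigma^2 + \big(E[\hat{Y}(x)] - g(x)\big)^2 + \mathrm{Var}\big(\hat{Y}(x)\big)$$

i.e. observation variance + (model bias)^2 + model variance.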

Lecture 22: Residuals, Multicollinearity, Inference

Least Squares Regression

• definition of orthogonal
• the residual $$Y - X\hat{\theta}$$ is orthogonal to the columns of X: $$X^T(Y - X\hat{\theta}) = 0$$
• invert the matrix if full rank for solution

A Regression Model

• X is a design matrix, first column is 1
• theta is params

Residuals

• difference Y and Y hat (estimate)

Separating Signal and Noise

• true signal + noise and prediction + residual

Residuals Sum to Zero

• ?

• The average of the fitted values is equal to average of the observed responses

• orthogonal to the residuals?

Multiple R^2 and Observed Response vs. Fitted Values

• Multiple R^2

• Coefficient of Determination
• variance of the fitted values
• variance of observed responses
• "percent of variance explained by the model"
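In symbols:

$$R^2 = \frac{\mathrm{Var}(\hat{y})}{\mathrm{Var}(y)} = \frac{\text{variance of fitted values}}{\text{variance of observed responses}}$$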

Colinearity and the Meaning of Slope

• Change in y per unit change in x_1 given all other variables held constant

• collinearity: when a covariate can be predicted by a linear function of the others

Inference and Assumptions of Randomness

• our model can be expressed as intercept, weight of features, and error

• We have to estimate the weights (slopes)

• how do we test theta_1 is 0?

Confidence Intervals for True Slope

• could bootstrap, could build a confidence interval (?)

Lecture 23: Logistic Regression

• machine learning
• when labeled you have supervised learning
• when quantitative do regression
• when categorical do classification
• when unlabeled you have unsupervised learning
• dimensionality reduction
• clustering
• finally reinforcement learning

Kinds of Classification

• binary two classes
• multiclass [cat, dog, car]
• structured prediction ?

• try least squares regression

• two classes

• truncated least square

Estimating the chance of success

• two different coins, handled case by case

• single expression $$p^y(1-p)^{1-y}$$

• to estimate the probability, find the value that maximizes the function
• take a log

• as a sum

• you can minimize the average?
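Putting those steps together, the average negative log likelihood to minimize is:

$$-\frac{1}{n}\sum_{i=1}^{n}\Big(y_i\log p + (1-y_i)\log(1-p)\Big)$$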

• what is the function in the two hard cases?

• as a loss function there is a penalty

Logistic Function

• linear functions are not good for probabilities

• t can be infinity and negative infinity
• take e of both sides

• model probability on the real line
• sigmoid, defined on whole line, smooth, increasing, elongated S

• derivative
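The sigmoid and its derivative (a standard identity):

$$\sigma(t) = \frac{1}{1+e^{-t}}, \qquad \sigma'(t) = \sigma(t)\big(1-\sigma(t)\big)$$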

Log Odds as a Linear Function

• features
• linear combo of features

• log odds
• the probability is the sigmoid of the log odds (see below)
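So the model says the log odds are linear in the features, and inverting gives the probability:

$$\log\frac{p}{1-p} = x^T\theta \quad\Longleftrightarrow\quad p = \sigma(x^T\theta) = \frac{1}{1+e^{-x^T\theta}}$$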

The Steps of the Model

• generalized linear model

• linear regression continuous
• categorical (probability Y is 1)

• increase x by one unit

• linearly separable data

• you need a little bit of uncertainty

• with regularization term

Log Loss

• or cross-entropy loss

• log loss is convex!

Lecture 24: Logistic Regression Part 2

• using a constant model, good baseline

• partitioned into 7 intervals(?) and calculated the proportion in each interval

Bonus K-nearest Neighbor

• average stored in a heap?

• kind of bumpy

Logistic Regression

• $$\frac{1}{1+exp(-t)}$$
• $$t=\sum_{k=0}^d \theta_k x_k$$

One Dimensional Logistic Regression Model

• different co-efficients

• slope and intercept (lower intercept move right)

Loss

• cross entropy loss
• sum of $$y \log f_\theta(x) + (1-y)\log(1-f_\theta(x))$$, negated and averaged
• have to use an iterative method

• the code
• forward has the model
• with cross entropy loss
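A generic sketch of what such a module could look like (not the lecture's exact code):

import torch
import torch.nn as nn

class LogisticModel(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(d))

    def forward(self, X):
        # model: sigmoid of a linear function of the features
        return torch.sigmoid(X @ self.theta)

model = LogisticModel(d=2)
loss_fn = nn.BCELoss()                   # cross entropy loss for 0/1 labels
X = torch.randn(5, 2)
Y = torch.randint(0, 2, (5,)).float()
loss = loss_fn(model(X), Y)
loss.backward()                          # gradients for an iterative optimizer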

• it's sexy
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='lbfgs')

lr_model.coef_

lr_model.intercept_

lr_model.predict  # vs lr_model.predict_proba


• if theta goes to infinity the model becomes completely certain; instead we add some regularization
• 1.0e-5 * theta ** 2

The Decision Rule

• predict 1 if P(Y=1 | x) > .5
• but can choose a value other than .5
• accuracy: fraction of correct predictions out of all predictions

Confusion Matrix

• False-Positive when it is 0 (false) but the algorithm predicts 1 (true)
• False-Negative when it is 1 (true) but the algorithm predicts 0 (false)
• Precision: true positives over true positive + false positives
• how many selected items are relevant
• Recall: true positives over true positives + false negatives
• how many relevant items are selected
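In symbols:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$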

• say we want to ensure 95% of malignant tumors are classified as malignant
• np.argmin(recall > .95) - 1

• the pathologist would have to verify 61% of the samples.
• false diagnoses of 5% as benign when malignant is unacceptable in practice!

Lecture 25: Decision Trees and Random Forests

• how to classify this into three petals?

Decision Tree

• petal_width < .75 and petal_length < 2
• two dimensions

Scikit-learn Decision Tree

from sklearn import tree
decision_tree_model = tree.DecisionTreeClassifier()

# Better visualizer with graphviz


• they will have perfect accuracy unless data has the same value with different classes

• using more features now it is 4d

• this would overfit

Node Entropy

• p_0, p_1 are the class proportions within a node
• entropy: $$-\sum_C p_C \log_2 p_C$$
• measures how unpredictable a node is

Entropy

• when all data is one class, it's zero entropy
• evenly split = 1
• for C classes the maximum entropy is $$\log_2 C$$

• entropy of the left and entropy of the right
• iteratively choose a split value
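A small generic sketch of a node's entropy and the weighted entropy of a candidate split (not the lecture's code):

import numpy as np

def entropy(labels):
    # -sum p log2 p over the class proportions in a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_split_entropy(left_labels, right_labels):
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * entropy(left_labels) \
         + (len(right_labels) / n) * entropy(right_labels)

print(entropy(np.array([0, 0, 1, 1])))   # evenly split between 2 classes -> 1.0
print(entropy(np.array([1, 1, 1, 1])))   # all one class -> 0 (printed as -0.0)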

Overfitting

• just don't let it overgrow
• greedy algorithm

• pruning
• create a validation set

Random Forest

• a full tree has low bias (it can capture the training data exactly) but high variance
• so train many trees, then just weight and vote

Bagging

• Bootstrap AGGregatING
• resample
• final model
• Berkeley Stats 1994!

• at each split, pick a random subset of m features

• heuristic

Lecture 26: Dimensionality Reduction

High Dimensional Data

• 2d dataset
• hard to plot 3d

• would be redundant rank 3
sns.pairplot


• on data

Orthonormality

• vectors are scaled to unit length (norm 1, i.e. squared entries sum to 1)
• all vectors are orthogonal

Lecture 27: Principal Component Analysis

• width and length are independent from each other
• what if it is noisy?

• last column is not quite zero
• rank 3 approximation of rank 4 data

• rank 2 approximation

Principal Component

• need to subtract the mean

• pc1 country on a line

• how above or below the PC2

Variance (Singular Value Interpretation)

• how many principal components to use?

Homework 4

Part 2

Question 2

Regex

Pattern example:

pattern = re.findall(r'\[(\d+)/(\w+)/(\d+):', str)


Question 5

• For 5c, reading from csv without header, tab separated, setting index to the zeroth column, and selecting only the first column:
pd.read_csv('file.txt', index_col=0, header=None, sep='\t').rename(columns={1: 'col_name'})[['col_name']]


5f

# this makes each split string a column!
df['col_name'].str.split(' ', expand=True)


5g

pd.merge(left, right, how='left') # left, right, outer, inner
# on=None, left_on=None, right_on=None,
# use left_index, right_index


Question 7:

regex alternation ("or") can be grouped with parentheses

(a|b)


Lab 3

Regex

Tool

\W: non-word
*: 0 or more
+: 1 or more
?: 0 or 1


Lab 4

Part 1: Scaling

Distribution of values of female adult literacy rate as well as gross national income per capita.

We create a series with what we want, and drop null values.

1. a. Want to build a histogram.

sns.countplot is used more for categorical variables. See 5 is missing.

1. b. sns.diplot: reference

d. sns.scatterplot: reference

Part 2: Kernel Density Estimation

The kernel density estimate (KDE) is a sum of a bunch of copies of the kernel, each centered on one of our data points. A default kernel is the Gaussian kernel:

$$\Large K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right)$$

Discussion 7

Ans

• Take the derivative with respect to each element of the vector

Midterm 1/Take Home Checkpoint Assignment

Review Slides

Questions in Scope

Material

• Data Science
• Probability
• SQL
• Pandas
• Regex
• Visualization
• Modeling and Estimation
• Optimization
• PyTorch
• Regression

Midterm 1 Fall 2019

Reference Sheet

• 1.a.
• Ans: name
• Cor:
• Review: granularity

• b.
• Ans: 'low_calorie' = cereal['calories'] <= 100
• Cor:
• Review: Pandas Series

• Primary Key: unique

• In Purchases.csv, (OrderNum, ProdID) is unique
• In Customers.csv CustID
• Foreign Key: not unique

• In Orders.csv CustID is a foreign key
• "The order associated with Customers"
• c.

• Ans: fiber: continuous, type: discrete, low_calorie: discrete
• Cor: type is nominal, low_calorie is ordinal
• Review: types of vars, nominal vs ordinal

• d.
• Ans: groupby 'manufacturer' max sugars sort_values ascending=False
• Cor: groupby 'manufacturer' agg max sort_values ascending=False
• Review: agg functions
• e.
• Ans:
• Cor: cereal.groupby('manufacturer').filter(lambda x: sum(x['type'] == 'hot') == 0)['manufacturer']
• Review: groupby class, then filter x is group of one class that is reduced to a single value

• f.

• Review: pivot table
• data is the dataframe, index, and column, aggfunc reduces to a single val
• g. Review: what to do with NaN data

• 2.a. 25, mean

• b. 125

• cor: c=25
• c. 25

• cor: c=30
• d. 25

• cor: c=30
• review: take derivative? what kind of loss are they using
• average square loss: $$L(c) = \frac{1}{n} \sum_{i = 1}^{n} (x_i - c)^2$$
• 3.a.

• ans: regex = r'^(.*): (.*)$'
• cor: regex = r'([\w\s]+):\s+(\d+)'
• review:
• \w: alphanumerics a-zA-Z0-9_
• \s whitespace
• \d 0-9 - street number!
• + one or more times
• * zero or more times
• b.i. ans: [45]\d{15}
• review: [45] matches 4 or 5
• \d{15} matches a digit 15 times
• ii.
• ans: \d+\.?\d{2}
• cor: \$(\d+\.?\d*) Needs the dollar match at the front? parentheses () so we only capture the inside
• EDA 4.a.

• log(y) and e**x inverse operations
• review: operation on y allowed
• b.i. 30

• ii. skew right

• cor: unimodal
• review: skew, unimodal, EDA
• iii. impossible to tell

• 5.a. Ans: yes. Use a bar graph instead of plotting the distribution. You are unable to see the values.
• cor: increase bandwidth for a smoother density estimate
• review: density estimation functions
• b. Ans: yes, it could show the numbers
• cor: rescale y axis
• review: rescaling
• c. Ans: idk
• cor: density curve - compare distributions
• review: density curve/distributions
• d.
• Ans: no
• Cor:
• Review:
• c. impossible to tell
• review: box and whiskers, impossible to tell frequency
• 6.a.
• Ans: People in CS W186
• Cor: and not in Data 100
• Review:
• ii.
• Ans: people who don't go to office hours
• Cor:
• Review:
• 6.b.i.
• Ans:
• Cor: $$P(X_5 = 1) = \frac{500}{1000}$$
• Review: simple random sample, and other one

• Random sample with replacement, just multiply
• Simple random sample, when doing an AND prob, multiply together

• ii

• Ans:
• Cor: $$P(X_5 = 1, X_{50} = 1) = \frac{500}{1000} \times \frac{499}{999}$$
• Review:
• iii.

• Ans:
• Cor: $$\frac{N-n}{N-1} np(1-p)$$

$$=\frac{1000-50}{1000-1} 50 \frac{500}{1000}(1-\frac{500}{1000})$$

• Review: variance
• iii.

• Ans:
• Cor: 0

Review

• Binomial Probabilities

• Probability of picking 4 blue and 3 not blue, has a certain probability
• Multiply by the number of ways to pick it
• NaN: see if there is any skew or bias if NaN is removed

• Variance and STD

• $$\sigma^2=Var(X)=\sum_{i = 1}^{n} p_i \cdot (x_i - \mu)^2$$
• Standard Deviation: $$\sigma=\sqrt{\sigma^2}=\sqrt{Var(X)}$$

Midterm 1 Review

Review Slides

Sampling

• Simple Random Sample (SRS) sample uniformly, without replacement 1/50 * 1/49 * 1/48
• Sample people chosen from the Sampling Frame people we could have chosen from the Target Population where we want to generalize to.
• Probability Sample must be random.
• Random sample with replacement, just multiply 1/50 * 1/50 * 1/50

Probability

• List the distinct ways
• Sometimes the complement is easier 1-P
• Are you drawing one at a time? 1/50*1/49

Pandas

• slice dataframe df.loc[['Mary', 'Anna'], :]
• row index (on the left) of Mary and Anna, all column indices

Visualization

• Types of Data
• Quantitative Data
• Continuous
• weight, temperature
• Discrete
• finite, years of education, num sib
• Categorical Data
• Nominal
• no ordering
• Hair color
• Ordinal
• does have ordering!
• Olympic Medals Gold > Silver > Bronze
• Types of Plots
• Quantitative Data
• Histograms, Box Plots, Rug Plot, KDE - Kernel Density Estimation
• Look at spread, shape, modes, outliers, unreasonable values
• Nominal & Ordinal Data
• Bar Plots (comparison)
• Skew frequency, rare categories, invalid categories

• Histogram and KDE

• Box Plot

• Not a Distribution

• Is a Distribution

Kernel Density Estimators - KDE

• visualize shape/structure not individual observations

• Put a gaussian over each of the three points (tiny blue arrow), scale by 1/3, and sum

$$K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right)$$

$$\alpha$$ is a smoothing param to change
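A small sketch of that sum-of-Gaussians picture, using three made-up points and the formula above:

import numpy as np

def gaussian_kernel(alpha, x, z):
    return np.exp(-((x - z) ** 2) / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

def kde(xs, data, alpha=1.0):
    # one kernel per data point, scaled by 1/n, summed
    return np.mean([gaussian_kernel(alpha, xs, z) for z in data], axis=0)

points = np.array([2.0, 4.0, 9.0])        # three made-up observations
xs = np.linspace(0, 12, 100)
density = kde(xs, points, alpha=1.0)      # estimated density over a grid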

Transformation

• If your data is like $$x^2$$, to make it linear, make it $$\sqrt{f(x)}$$
• If your data is like $$\sqrt{x}$$, to make it linear, make it $$f(x)^2$$
• Can also shift y! $$f(y)^2$$

Linear Regression & Loss Functions

• Model Response Variable (y) with Explanatory Var/Predictors (x)
• Loss functions measure error of our model
• Squared loss L2 $$l(c) = (x-c)^2$$
• more sensitive to outliers than L1; L1 has constant (non-vanishing) gradients
• Absolute loss L1 $$l(c) = |x-c|$$
• Minimize Loss.
• Option A: Take the derivative and set to zero (convex)
• Option B: Use Gradient Descent or SGD

• Average Squared Loss, takes the mean of each loss per data point

• Gradient is a vector of partial derivative of a function to each variable

• Gradient Descent: update the parameters by a small step in the direction of the negative gradient
• Usually want convexity but it works without

Midterm 1 Review Questions

Spring 2018

• Q1.p3 Pg4 (Loss Functions)
• a.: True b. True c. False d. second loss with infinite
• Q8 Pg15 (Loss Minimization)
• a. final one
• review: weight theta is a constant. $$\frac{1}{n} \sum_{i = 1}^{n} \theta = \theta$$
• Random sample on each loop iteration
• review: np.random.choice returns an np.array of sampled indices, which can slice X into an np.array via X[ind,:]
• final ans
• correct ans: grad = (grad_function(theta, xbatch, ybatch) + 2 * theta * lam)
• this keeps it a vector

Modeling/Regression

• Summer 2019 Final
• Q4
• review $$\log(\theta e ^{-\theta x_i})=\log(\theta) - \theta x_i$$
• $$\frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$
• Q5
• logistic regression/random forest, linear regression/random forest, linear regression/random forest, logistic regression/random forest
• Q6
• a.
• positive weight
• positive positive
• positive
• negative to beef (why?)
• negative to chicken (why?)
• Not Enough Info
• b.-d.
• 10
• 25
• 26
• e.
• low bias high variance
• high variance
• decrease variance
• increase bias decrease variance
• f.
• training: c
• validation: b
• g.
• high regularization, high mse
• want in between regularization
• Q7
• a.
• $$\theta^{(t + 1)} = \theta^{(t)} + \alpha (y - \sigma(\theta^{(t)} - 2) - \theta^{(t)})$$
• $$\theta^{(t + 1)} = y - \sigma(\theta^{(t)} - 2)$$
• b.
• scalar, scalar, len-p vector, len-p vector
• cor read len-n vs len-p!
• y is a single outcome
• c.
• if have all values
• lambda doesn't affect second term, what you control
• since theta is real equals

Regex

• Summer 2019 Q2
• go! and garbs!
• garbs selects g then a then r then b, finally s
• Fall 2018 Q6
• 2 letters, third letter
• + is one or more
• * is zero or more
• dogdog burritodog dogburrito
• 9
• 3 to 11