Introduction
Taught by Ani Adhikari and Joey Gonzalez
Lectures
Lecture 11: Optimization
From Last Time
 Graph of averages
 Minimizing average loss
 Makes no assumption about the shape of the data.
 There is always a line of best fit, but it might not be appropriate (see Anscombe's Quartet)
Transformation
 Linear models are easy to interpret
Log transformation
$$ y=a^x \implies \log(y)=x\log(a) $$ $$ y=ax^k \implies \log(y)=\log(a)+k\log(x) $$
Simple Linear Regression: Interpreting the Slope
$$ slope = r~\frac{\sigma_y}{\sigma_x} $$
 Regression shows association, not causation.
 For a slope of 0.09 inches per pound, we say 0.09 is the estimated difference in height between two people whose weights are one pound apart.
Recap on Modeling
 For engineers, the goal is making predictions: accuracy
 For scientists, it's interpretability
 Parameters like \( F = ma \)
Steps for Modeling
Squared Loss vs. Absolute Loss
 \( L^2 \) has nice optimization properties (differentiable) but is sensitive to outliers.
Calculus for Loss Minimization
$$ h(x)=f(g(x)) $$
$$ \frac{\partial h}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} $$
The \( \partial g \) terms cancel, as if they were fractions.
Derivative of outside times derivative of inside. Repeat!
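A quick worked example (toy function, for illustration):
$$ h(x) = (3x+2)^2, \qquad \frac{\partial h}{\partial x} = \underbrace{2(3x+2)}_{\partial f / \partial g} \cdot \underbrace{3}_{\partial g / \partial x} $$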
Numerical Optimization
 First Order: Gradients, the slope of our loss landscape.
 Second order: the Hessian; its computation is slow.
Gradients
loss function \( f \) takes a vector and returns a scalar
The gradient of that scalar-valued function is a vector: one partial derivative per parameter.
Like falling down a well. 1 Dimension example.
Update the weights \( \theta \). \( \rho(\tau) \) is the learning rate.
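In symbols, the update rule with the notation above:
$$ \theta^{(\tau+1)} = \theta^{(\tau)} - \rho(\tau)\, \nabla_\theta L\big(\theta^{(\tau)}\big) $$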
Lecture 12: Optimization Cont. (SGD and Auto Diff)
Gradient Descent Algorithm
 Compute the slope at that point
 Update (the red step in the plot is calculated from the gradient)
 Initial vector (the weights)
 Update by the learning rate
 Converge when the gradient is 0, or stop early
 We can do better! Calculating the gradient over the whole population is expensive (see the sketch below)
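A minimal numpy sketch of these steps, assuming a constant model and average squared loss (toy data; the gradient works out to \( -2 \cdot \mathrm{mean}(y - \theta) \)):
import numpy as np
y = np.array([1.0, 3.0, 5.0])   # toy data (assumed)
theta = 0.0                     # initial weight
rho = 0.1                       # learning rate
for t in range(100):
    grad = -2 * np.mean(y - theta)  # gradient of the average squared loss
    theta = theta - rho * grad      # step opposite the gradient
theta  # converges toward np.mean(y) == 3.0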
Stochastic Gradient Descent
 Sample! Called a batch: the B term
 Assumes the loss is decomposable
 Decomposable loss: must be able to be written as a sum of per-datapoint losses
 Momentum, ADAM
 PyTorch
Comparing Gradient Descent vs SGD
 SGD is faster; each step is noisy, but on average the gradient estimate is correct (unbiased) and it converges! (sketch below)
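A minimal SGD sketch under the same assumptions as the gradient descent sketch above; the only change is estimating each gradient from a random batch of size B:
import numpy as np
rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=10_000)   # toy data (assumed)
theta, rho, B = 0.0, 0.1, 32
for t in range(500):
    batch = rng.choice(y, size=B)        # sample a batch
    grad = -2 * np.mean(batch - theta)   # unbiased estimate of the full gradient
    theta -= rho * grad
theta  # noisy steps, but hovers near np.mean(y)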
PyTorch
 During the forward pass, builds a computation graph of each individual calculus operation
 Backward differentiation then applies the chain rule through that graph to calculate gradients!!
 Can use GPUs, and autodiff
Demo
 Line of best fit and residuals
 Mean Square Error Loss Surface
 \( L^{1} \) loss surface for comparison!
 Sharp at the bottom
 PyTorch
nn.Module: can add parameters; only need a forward function!
Implement Basic Gradient Descent
 N steps: get the loss, do loss.backward()
 Update the weights inside with torch.no_grad(): so autograd doesn't track the update (sketch below)
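A minimal sketch of that loop, assuming a toy linear model with a single parameter:
import torch
theta = torch.zeros(1, requires_grad=True)
x = torch.linspace(0, 1, 50)
y = 3 * x + 0.1 * torch.randn(50)   # toy data (assumed)
for step in range(100):
    loss = torch.mean((y - theta * x) ** 2)
    loss.backward()               # autograd fills in theta.grad
    with torch.no_grad():         # don't track the update itself
        theta -= 0.5 * theta.grad
    theta.grad.zero_()            # reset the accumulated gradient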
 Visualization SimpleLinearModel
 Green model!
Make it a Polynomial
Implement Stochastic Gradient Descent
 nepochs: how many times to walk through the data; the loader sets the batch size
 Overfitting LMAO
Lecture 13: Review Modeling and Optimization, Intro to Regression
Human Contexts and Ethics
Imagine you are a Data Scientist on Twitter's "Trust and Safety" team.
 Question/Prob Formation
 Fake News is a problem
 Doesn't have to be an engineering-focused problem!
 Data Acquisition and Cleaning
 What data do we have and need to collect
 President's Tweet
 Exploratory Data Analysis
 Example: classify tweets as healthy or unhealthy
 Think about the context of your problem
 Note biases and anomalies
 Predictions and Inference
 What is the story, social good
 Think about who is listening and what kind of power you have
Think about your social context.
Review Modeling and Optimization
Models
 Models are a function \( f \) that maps from X to Y.
 Parametric Models
 Have parameters, often represented as a vector
 Linear Models
 Non-Parametric Models?
 Nearest Neighbor
 copy the prediction from the closest datapoint
 Really big! Grows with the size of the data
 Kernel Density Estimator has a param, but it's more like a hyper param
Tradeoffs in Modeling
 Can predict midterm grades from homeworks
 Simple model: interpretable, summarizes the data
 Complex model
Loss Functions
 Loss: how close is our model's prediction to the actual value
 Average Loss
 Solve with optimization: find the \( \theta \) (params) that minimizes loss
 \( f_\theta(x) \) is our model. \( L(\theta) \) is the loss func
F.l1_loss is equivalent; keep everything as tensors so autograd works
 When building a model, define class ExponentialModel(nn.Module)
 Weights: self.w = nn.Parameter(torch.ones(2, 1))
 The initial weights are a 2x1 tensor of ones [1, 1]
 forward is how the model makes a prediction; to evaluate:
m = ExponentialModel()
m(0) # returns a tensor of 2
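A plausible reconstruction of the demo class (the exact forward function is an assumption; f(x) = w0 + w1 * exp(x) matches m(0) returning 2 with initial weights [1, 1]):
import torch
import torch.nn as nn
class ExponentialModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2, 1))  # initial weights: 2x1 tensor of ones
    def forward(self, x):
        x = torch.as_tensor(x, dtype=torch.float32)
        return self.w[0] + self.w[1] * torch.exp(x)  # assumed functional form
m = ExponentialModel()
m(0)  # tensor([2.], grad_fn=...)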
 In the 3d plot we have w0, w1, and loss. Find the point that minimizes loss.
 Example of the orange vs yellow line and its location on the loss landscape
Optimization of the Model
 You know your loss, compute the gradient (how to improve our loss)
 grad: scalar-valued function → vector; each index is the derivative with respect to one param
 Take deriv and evaluate it
 Auto Diff reuses gradients
Lecture 14: Review Models and Loss

 Response var: what you want to estimate
 Model: summarizes the data with parameter w
 w is an estimator
Loss
 \( L(w)=\frac{1}{n}\sum_{i=1}^{n}(y_i - w)^2 \): a sum over each \( y_i \)
 We compare the red value vs the purple value. The green value is the best w: it minimizes loss.
Minimizing Loss
 \( L(w^*) \)?
 What is your 1. data, 2. model, 3. parameters, 4. loss, (also optimization method)
 From 1d (best avg) to 2d, best func
 \( w^* \) is the w that minimizes L(w). The best estimator is \( \hat{y}(w^*) \)
 Can generalize our optimization to 3d! (3 weight params)
 Can't plot our loss for 3 weights (it would be 4d: 3 weight dims plus the loss dim)
 One option is calculus: set the derivative of the loss to 0
 The other is gradient descent/SGD
Gradient Descent
 Can actually do brute force: np.linspace and try all values. O(N^2)?
 Find optimal weights of our sin model
 Derivatives
 Gradients point opposite to the way we want to walk
 The gradient descent algorithm visualized
Lecture 15: Least Squares Linear Regression
 Linear Model
 Vector form \( x^T \theta \)
 Matrix form is for all predictions at once
 note the column of ones on the left, and \( Y = X \theta \)
 Linear model of carat, depth, and table
# adds a column of ones on the left. Horizontal stack
X = np.hstack([np.ones((n,1)), data[['carat', 'depth', 'table']].to_numpy()])
X
def linear_model(theta, xt):
    return xt @ theta  # The @ symbol is matrix multiply
Least Squares Linear Regression
 Loss
 By Calculus or Geometric Reasoning
 You have a span (the column space of X)
 Make the residual perpendicular (orthogonal) to it
 Find the theta that minimizes the residual
 Orthogonality gives the normal equation (below)
 \( X^T X \) is a square matrix; the optimum is unique when it is full rank
 Our loss is the average squared loss
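Written out, the orthogonality condition and the resulting normal equation (assuming \( X^T X \) is invertible):
$$ X^T(Y - X\hat{\theta}) = 0 \quad \Longrightarrow \quad \hat{\theta} = (X^T X)^{-1} X^T Y $$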
def squared_loss(theta):
    return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]
theta_hat = inv(X.T @ X) @ X.T @ Y
theta_hat
Y_hat = X @ theta_hat
Geometry of Least Squares
Lecture 16: Least Squares Regression in SKLearn
 Our data!
# Grid of test points
# one column of just 1s
def add_ones_column(X):
    return np.hstack([np.ones((X.shape[0],1)), X])
add_ones_column(X)
theta_hat = least_squares_by_solve(add_ones_column(X), Y)
def model_append_ones(X):
    return add_ones_column(X) @ theta_hat
def plot_plane(f, X, grid_points=30):
    u = np.linspace(X[:,0].min(), X[:,0].max(), grid_points)
    v = np.linspace(X[:,1].min(), X[:,1].max(), grid_points)
    xu, xv = np.meshgrid(u, v)
    X = np.vstack((xu.flatten(), xv.flatten())).transpose()
    z = f(X)
    return go.Surface(x=xu, y=xv, z=z.reshape(xu.shape), opacity=0.8)
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(model_append_ones, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
                  height=600)
Scikit Learn
# The API
model = SuperCoolModelType(args)
# train
model.fit(df[['X1', 'X2']], df[['Y']])
# predict!
model.predict(df2[['X1', 'X2']])
## Linear Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True) # fit an intercept, so the model doesn't have to pass through the origin
model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
# predict
synth_data['Y_hat'] = model.predict(synth_data[["X1", "X2"]])
synth_data
Looks good!
HyperParameters
Let's go through Kernel Regression
from sklearn.kernel_ridge import KernelRidge
super_model = KernelRidge(kernel="rbf")
super_model.fit(synth_data[["X1", "X2"]], synth_data[["Y"]])
fig = go.Figure()
fig.add_trace(data_scatter)
fig.add_trace(plot_plane(super_model.predict, X))
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0),
                  height=600)
Curvy Dude!
Feature Functions
 P features (mappings)
 Map nonlinear relationships into a linear model
 Feature Engineering
 nonlinear
 change from categorical
 Covariate matrix?
OneHot Encoding
 Use a matrix of indicator columns instead of Alabama = 1 ... Hawaii = 50, because numbering implies an order (see the sketch below)
 Bagofwords with ngrams
 high dimensional and sparse
 Ordering via 2-grams: "book well", "well enjoy"
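A minimal one-hot sketch with scikit-learn (toy state column assumed):
import numpy as np
from sklearn.preprocessing import OneHotEncoder
states = np.array([["Alabama"], ["Hawaii"], ["Alabama"]])
enc = OneHotEncoder()
enc.fit_transform(states).toarray()
# array([[1., 0.],
#        [0., 1.],
#        [1., 0.]])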
Domain Knowledge
 Knowing isWinter lets the model capture a known spike in time
Constant Feature
 Add the 1s column: the bias param
 Feature functions themselves have no params
# stack features
def phi_periodic(X):
    return np.hstack([
        X,
        np.sin(X),
        np.sin(10*X),
        np.sin(20*X),
        np.sin(X + 1),
        np.sin(10*X + 1),
        np.sin(20*X + 1)
    ])
 Some features used, some not at all
Lecture 17: Pitfalls of Feature Engineering
 Problems of Feature Engineering
 Overfitting
from numpy.linalg import solve
def fit(X, Y):
    return solve(X.T @ X, X.T @ Y)
def add_ones_column(data):
    n, _ = data.shape
    return np.hstack([np.ones((n,1)), data])
X = data[['X']].to_numpy()
Y = data[['Y']].to_numpy()
...
class LinearModel:
    def __init__(self, phi):
        self.phi = phi
    def fit(self, X, Y):
        Phi = self.phi(X)
        self.theta_hat = solve(Phi.T @ Phi, Phi.T @ Y)
        return self.theta_hat
    def predict(self, X):
        Phi = self.phi(X)
        return Phi @ self.theta_hat
    def loss(self, X, Y):
        return np.mean((Y - self.predict(X))**2)
model_line = LinearModel(phi_line)
model_line.fit(X, Y)
model_line.loss(X, Y)
Redundant Features
If you duplicate a column and try to solve, you get a Singular Matrix Error.
The matrix isn't full rank: we have redundancy, so the columns are not linearly independent.
Too Many Features
With too many, the optimal solution is underdetermined!
 Using RBF
 add RBF with linear
 Do 20 bumps, on 9 data points?
 Rank 9 matrix
Overfitting
 Test data points
 Training error decreases but test error is terrible!
 The bias variance tradeoff. The best fit point.
Lecture 18: Cross Validation
 We overfit the data
 We try to fit minimize training error
 Test error to see generalization error
 5-fold cross validation
 like the bootstrap
Cross Val
# shuffle
shuffled_data = data.sample(frac=1.) # all data
shuffled_data
split_point = int(shuffled_data.shape[0]*0.95)
tr = shuffled_data.iloc[:split_point]
te = shuffled_data.iloc[split_point:]
len(tr) + len(te) == len(data)
from sklearn.model_selection import train_test_split
tr, te = train_test_split(data, test_size=0.1, random_state=83)
Don't evaluate (tune) the model on test error!
SKLearn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", ["cylinders", "displacement"])])),
    ("LinearModel", LinearRegression())
])
model['SelectColumns']
model.fit(tr, tr['mpg'])
# model is a pipeline
Y_hat = model.predict(tr)
Y = tr['mpg']
print("Training Error (RMSE):", rmse(Y, Y_hat))
models = {"c+d": model}
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("LinearModel", LinearRegression())
])
from sklearn.impute import SimpleImputer
model = Pipeline([
    ("SelectColumns", ColumnTransformer([("keep", "passthrough", quantitative_features)])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LinearRegression())
])
Cross Validation
from sklearn.model_selection import KFold
from sklearn.base import clone
def cross_validate_rmse(model):
    model = clone(model)
    five_fold = KFold(n_splits=5)
    rmse_values = []
    for tr_ind, va_ind in five_fold.split(tr):
        model.fit(tr.iloc[tr_ind,:], tr['mpg'].iloc[tr_ind])
        rmse_values.append(rmse(tr['mpg'].iloc[va_ind], model.predict(tr.iloc[va_ind,:])))
    return np.mean(rmse_values)
cross_validate_rmse(model)
def compare_models(models):
    # Compute the training error for each model
    training_rmse = [rmse(tr['mpg'], model.predict(tr)) for model in models.values()]
    # Compute the cross validation error for each model
    validation_rmse = [cross_validate_rmse(model) for model in models.values()]
    # Compute the test error for each model (don't do this!)
    test_rmse = [rmse(te['mpg'], model.predict(te)) for model in models.values()]
    names = list(models.keys())
    fig = go.Figure([
        go.Bar(x=names, y=training_rmse, name="Training RMSE"),
        go.Bar(x=names, y=validation_rmse, name="CV RMSE"),
        go.Bar(x=names, y=test_rmse, name="Test RMSE", opacity=.3)])
    return fig
 Train and CV RMSE
 An example of overfitting
Lecture 19: Regularization
models = {}
quantitative_features = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
for i in range(len(quantitative_features)):
    # The features to include in the ith model
    features = quantitative_features[:(i+1)]
    # The name we are giving to the ith model
    name = ",".join([name[0] for name in features])
    # The pipeline for the ith model
    model = Pipeline([
        ("SelectColumns", ColumnTransformer([
            ("keep", "passthrough", features),
        ])),
        ("Imputation", SimpleImputer()),
        ("LinearModel", LinearRegression())
    ])
    # Fit the pipeline
    model.fit(tr, tr['mpg'])
    # Saving the ith model
    models[name] = model
K-fold Cross Validation
from sklearn.model_selection import cross_val_score
Overfitting
 The blue is really small
Regularization
 penalize overfit models
 use less complexity
 complexity
 Best solution
 L1 norm
 LASSO: solutions stick to the corners (sparse weights)
 L2 norm
 doesn't really stick to the corners
 Different regularization norms give different solutions
 Lambda corresponds to a complexity constraint, via the Lagrangian
 standardize all your features first
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", Ridge(alpha=10))
])
alphas = np.linspace(0.5, 20, 30)
cv_values = []
train_values = []
test_values = []
for alpha in alphas:
    ridge_model.set_params(LinearModel__alpha=alpha)
    cv_values.append(np.mean(cross_val_score(ridge_model, tr, tr['mpg'], scoring=rmse_score, cv=5)))
    ridge_model.fit(tr, tr['mpg'])
    train_values.append(rmse_score(ridge_model, tr, tr['mpg']))
    test_values.append(rmse_score(ridge_model, te, te['mpg']))
Cross Validate Tune Regularization Param
fig = go.Figure()
fig.add_trace(go.Scatter(x = alphas, y = train_values, mode="lines+markers", name="Train"))
fig.add_trace(go.Scatter(x = alphas, y = cv_values, mode="lines+markers", name="CV"))
fig.add_trace(go.Scatter(x = alphas, y = test_values, mode="lines+markers", name="Test"))
fig.update_layout(xaxis_title=r"$\alpha$", yaxis_title="CV RMSE")
Ridge with CV
from sklearn.linear_model import RidgeCV
alphas = np.linspace(0.5, 3, 30)
ridge_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", RidgeCV(alphas=alphas))
])
 The red CV is shrinking
Lasso CV
from sklearn.linear_model import Lasso, LassoCV
lasso_model = Pipeline([
    ("SelectColumns", ColumnTransformer([
        ("keep", StandardScaler(), quantitative_features),
        ("origin_encoder", OneHotEncoder(), ["origin"]),
        ("text", CountVectorizer(), "name")
    ])),
    ("Imputation", SimpleImputer()),
    ("LinearModel", LassoCV(cv=3))
])
lasso_model.fit(tr, tr['mpg'])
models["LassoCV"] = lasso_model
compare_models(models)
Lecture 20: Random Variables, Sampling Variability
 The model should fit our training data well
 The new athletes
Random Variables: e.g., {21, 21} → 21
 X is a function
 argument is a sample: an element of the domain
 returns a number: an element of the range
 Random Variables: X_1, X_2, X_3

P(X=x)
 X: random variable: a function
 x: what the function may return: a number
 chance X returns x
 all of the probabilities add up to 1
 P(a <= X <= b)
 in a continuous distribution, probability is area under the density
 Bernoulli(p)
 indicator variable I has value 1 if event happens and 0 if not
 P(I = 1) = p
 P(I = 0) = 1 - p
 Binomial(n, p) $$ P(X = k) ~ = ~ \binom{n}{k} p^k(1-p)^{n-k}, ~~~~ 0 \le k \le n $$
# with scipy
from scipy import stats
# chance of 50 heads in 100 tosses of a fair coin
stats.binom.pmf(50, 100, 0.5)
 Uniform
unif_density = stats.uniform.pdf(x) # uniform (0, 1) density
 Normal $$ f(x) ~ = ~ \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\big{(}\frac{x-\mu}{\sigma}\big{)}^2}, ~~~ -\infty < x < \infty $$
norm_density = stats.norm.pdf(x, 50, 5)  # normal with mean 50, SD 5
 the normal CDF has no elementary closed form
Expectation
 weighted average of possible values
 weights: probabilities
one sample at a time:
 E[X] = sum over all samples s of X(s) * P(s)
 or E[X] = sum over all values x of x * P(X=x)
 the first uses the samples, with P(s) and X(s)
 the second uses P(X=x) instead
Properties
 E[X+Y] = E[X] + E[Y]
 S = X + Y
 S(s) = X(s) + Y(s)
 S(s) * P(s) = X(s)P(s) + Y(s)P(s)
 sum both sides over all samples: S(s) * P(s) = X(s)P(s) + Y(s)P(s)
 E[X] - E[5] = 2 - 5 = -3
 E[(X-5)(X-5)] = E[X^2] - E[10X] + E[25] = 13 - 20 + 25 = 18
Variance and SD
 Var[X]=E[(X-E[X])^2]

can pull out constant terms
 D_S = S - \mu_S
 D_S = D_X + D_Y
 Var[S] = Var[X] + Var[Y] + 2E[D_X D_Y]
 \( E[D_X D_Y] = E[(X - \mu_X)(Y - \mu_Y)] \)
 this is the covariance
 Var[S] = Var[X] + Var[Y] only if the covariance is zero (e.g., when X and Y are independent)
Random Variable
 A random variable is a function mapping samples (outcomes) to real numbers
 \( X: \Omega \to \mathbb{R} \)
Lecture 21: Bias Variance Tradeoff
 relation between x and y
 we observe the random error \( \epsilon \)
 we only see \( Y \), the data
 prediction is \( \hat{Y} \)
Prediction Error
 g is the true model, \( \epsilon \) is random error
 red Y hat is our prediction
Model Risk

 expectation of the squared difference between Y and our prediction
 take a sample and get the mean
Chance Error
 random
Bias
 when our model is systematically off
Observation Variance
 \( \epsilon \) is random, expectation is zero and variance is \( \sigma^2 \)
 so for the variance of Y: g(x) is constant, so the variance comes only from epsilon
 called observation error
 measurement error, missing information
 irreducible error
Chance Error
 vary a little
 from a random sample
Model Variance
 variance of our prediction around its average
 can overfit to the data
 reduce model complexity
 don't fit the noise
 bias
Our Model Vs the Truth
 green is true, red is fixed
Model Bias
 model bias: average prediction minus the true g, at a fixed x
 not random
 underfitting, e.g., from lacking domain knowledge
 overfitting
 average prediction vs actual value
Decomposition of Error and Risk
 expected squared difference
 decomposes into the observation variance, the square of the model bias, and the model variance
Bias Variance Decomposition
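In symbols, with g the true function and \( \sigma^2 \) the noise variance as above:
$$ E\big[(Y - \hat{Y}(x))^2\big] = \underbrace{\sigma^2}_{\text{observation variance}} + \underbrace{\big(g(x) - E[\hat{Y}(x)]\big)^2}_{(\text{model bias})^2} + \underbrace{\mathrm{Var}\big[\hat{Y}(x)\big]}_{\text{model variance}} $$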
Predicting by a Function with Parameters
 f is just y
Lecture 22: Residuals, Multicollinearity, Inference
Least Squares Regression
 definition of orthogonality
 the residual \( Y - X\hat{\theta} \) is orthogonal to the columns of X: \( X^T(Y - X\hat{\theta}) = 0 \)
 invert the matrix (if \( X^TX \) is full rank) for the solution
A Regression Model
 X is a design matrix, first column is 1
 theta is params
Residuals
 difference Y and Y hat (estimate)
Separating Signal and Noise
 true signal + noise and prediction + residual
Residuals Sum to Zero
 (assuming the model includes an intercept)
 The average of the fitted values is equal to the average of the observed responses
 because the all-ones column is orthogonal to the residuals
Multiple R^2 and Observed Response and Fitted Values
 Multiple R^2
 Coefficient of Determination
 variance of the fitted values over the variance of the observed responses
 "percent of variance explained by the model"
Collinearity and the Meaning of Slope
 Change in y per unit change in x_1 given all other variables held constant
 collinearity: when a covariate can be predicted by a linear function of the others
Inference and Assumptions of Randomness
 our model can be expressed as intercept, weight of features, and error
 We have to estimate the weights (slopes)
 how do we test theta_1 is 0?
Confidence Intervals for True Slope
 could bootstrap to build a confidence interval
Lecture 23: Logistic Regression
 machine learning
 when labeled you have supervised learning
 when quantitative do regression
 when categorical do classification
 when unlabeled you have unsupervised learning
 dimensionality reduction
 clustering
 finally reinforcement learning
Kinds of Classification
 binary two classes
 multiclass [cat, dog, car]
 structured prediction ?
 try least squares regression
 two classes
 truncated least square
Estimating the chance of success
 two different coins, case by case
 single expression \( p^y(1-p)^{1-y} \)
 to estimate the probability, find the value that maximizes the function
 take a log
 write it as a sum
 equivalently, minimize the average negative log likelihood
 what is the function in the two hard cases?
 as a loss function there is a penalty
Logistic Function
 linear functions are not good for probabilities
 t can range from negative infinity to infinity
 exponentiate both sides
 model the probability through a quantity that lives on the whole real line
 sigmoid: defined on the whole line, smooth, increasing, an elongated S
 derivative (below)
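In symbols:
$$ \sigma(t) = \frac{1}{1+e^{-t}}, \qquad \sigma'(t) = \sigma(t)\big(1 - \sigma(t)\big) $$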
Log Odds as a Linear Function
 features
 linear combo of features
 log odds
 the probability is the sigmoid of the log odds
The Steps of the Model
 generalized linear model
 linear regression continuous
 categorical (probability Y is 1)
 increase x by one unit
 linearly separable data
 you need a little bit of uncertainty
 with regularization term
Logistic Loss Function
Gradient Descent
Logistic Regression in Scikit Learn
Log Loss
 or crossentropy loss
 log loss is convex!
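A minimal numpy sketch of the average log loss (toy labels and predicted probabilities assumed):
import numpy as np
def log_loss(y, p):
    # confident wrong answers are punished heavily: -log(p) blows up as p -> 0
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8]))  # ~0.18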
Lecture 24: Logistic Regression Part 2
 using a constant model, good baseline
 partitioned into 7(?) intervals and calculated the proportion in each interval
Bonus Knearest Neighbor
 average stored in a heap?
 kind of bumpy
Logistic Regression
 \( \sigma(t) = \frac{1}{1+e^{-t}} \)
 \( t=\sum_{k=0}^d \theta_k x_k \)
One Dimensional Logistic Regression Model
 different coefficients
 slope and intercept (a lower intercept moves the curve right)
Loss
 cross entropy loss
 \( -\frac{1}{n}\sum_i \left[ y_i \log f_\theta(x_i) + (1 - y_i) \log\big(1 - f_\theta(x_i)\big) \right] \)
 have to use an iterative method
 the code
 forward has the model
 with cross entropy loss
 zero_grad() so gradients don't accumulate and we can take the gradient again
 it's sexy
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(solver='lbfgs')
lr_model.coef_
lr_model.intercept_
lr_model.predict # vs lr_model.predict_proba
 if theta is infinite, the model is certain; instead we want some regularization
 1.0e5 * theta ** 2
The Decision Rule
 predict 1 if P(Y=1 | x) > 0.5
 but we can choose a threshold other than 0.5
 accuracy: fraction of correct predictions out of all predictions
Confusion Matrix
 FalsePositive when it is 0 (false) but the algorithm predicts 1 (true)
 FalseNegative when it is 1 (true) but the algorithm predicts 0 (false)
 Precision: true positives over true positive + false positives
 how many selected items are relevant
 Recall: true positives over true positives + false negatives
 how many relevant items are selected
 say we want to ensure 95% of malignant tumors are classified as malignant
 np.argmin(recall > .95) - 1
 the pathologist would have to verify 6-11% of the samples
 falsely diagnosing 5% as benign when malignant would be unacceptable in practice!
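A minimal sketch of these metrics computed from scratch (toy labels assumed):
import numpy as np
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)  # of everything we flagged, how much was right? -> 0.75
recall = tp / (tp + fn)     # of everything relevant, how much did we flag? -> 0.75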
Lecture 25: Decision Trees and Random Forests
 how to classify this into three species by petal measurements?
Decision Tree
 petal_width < .75 and petal_length < 2
 two dimensions
Scikitlearn Decision Tree
from sklearn import tree
decision_tree_model = tree.DecisionTreeClassifier()
# Better visualizer with graphviz
 they will have perfect training accuracy unless the data has identical points with different classes
 using more features now it is 4d
 this would overfit
Decision Tree Generation
Node Entropy
 p_0, p_1: the proportions of each class in a node
 entropy \( S = -\sum p \log_2 p \)
 how unpredictable a node is
Entropy
 when all data in a node is one class, entropy is zero
 evenly split (two classes) = 1
 for C classes the max entropy is \( \log_2 C \)
 score a split by the entropy of the left and the entropy of the right (weighted; see below)
 iteratively choose a split value
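In symbols (\( p_C \) is the proportion of class C in a node; a split is scored by the weighted entropy of the two children):
$$ S = -\sum_{C} p_C \log_2 p_C, \qquad S_{\text{split}} = \frac{n_L S_L + n_R S_R}{n_L + n_R} $$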
Overfitting
 just don't let it overgrow
 greedy algorithm
 pruning
 create a validation set
Random Forest
 low bias (captures the data in the dataset), high variance
 train many trees and let them vote
Bagging
 Bootstrap AGGregatING
 resample
 final model
 Berkeley Stats 1994!
 pick a random subset of m features at each split
 a heuristic
Why Random Forest
Lecture 26: Dimensionality Reduction
High Dimensional Data
 2d dataset
 hard to plot in 3d
 a redundant column would make the data rank 3
sns.pairplot
Matrices are Linear Operations
 on data
Matrices are Coordinate Transformation
Singular Value Decomposition
Orthonormality
 each vector has unit length (norm 1)
 all vectors are mutually orthogonal
Lecture 27: Principal Component Analysis
 width and length are independent from each other
 what if it is noisy?
 last column is not quite zero
 rank 3 approximation of rank 4 data
 rank 2 approximation
Principal Component
 need to subtract the mean (center the data)
 PC1 places each country on a line
 PC2: how far above or below that line
Variance (Singular Value Interpretation)
 how many principal components to use?
Homeworks
Homework 4
Part 2
Question 2
Pattern example:
pattern = re.findall(r'\[(\d+)/(\w+)/(\d+):', s)  # s: the log string; avoid shadowing the builtin str
 Pandas str.replace
 Replace tags with the empty string
Question 3
 Seaborn distplot
 Multiple Distributions
 Just use multiple sns.distplot calls
 Legend
Question 4
Question 5
 For 5c, reading from csv without header, tab separated, setting index to zeroth column, and selecting only the first column:
pd.read_csv('file.txt', index_col=0, header=None, sep='\t').rename(columns={1: 'col_name'})[['col_name']]
5f
# this makes each split piece its own column!
df['col_name'].str.split(' ', expand=True)
5g
pd.merge(left, right, how='left') # left, right, outer, inner
# on=None, left_on=None, right_on=None,
# use left_index, right_index
Question 7:
regex OR (alternation) can include parentheses: (a|b)
Labs
Lab 3
Regex
\W: non-word character
*: 0 or more
+: 1 or more
?: 0 or 1
Lab 4
Part 1: Scaling
Distribution of values of female adult literacy rate as well as gross national income per capita.
We create a series with what we want, and drop null values.
 a. Want to build a histogram; sns.countplot is used more for categorical variables. See 5.
Part 2: Kernel Density Estimation
The kernel density estimate (KDE) is a sum of copies of the kernel, each centered on one of our data points. A default kernel is the Gaussian kernel:
$$\Large K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right) $$
Discussions
Discussion 7
Visualizing Gradients
 The gradient takes the derivative with respect to each element
Discussion 8
Exams
Midterm 1/Take Home Checkpoint Assignment
Material
 Data Science
 Probability
 SQL
 Pandas
 Regex
 Visualization
 Modeling and Estimation
 Optimization
 PyTorch
 Regression
Midterm 1 Fall 2019
 1.a.
 Ans: name
 Cor:
 Review: granularity
 b.
 Ans: 'low_calorie', cereal['calories'] <= 100
 Cor:
 Review: Pandas Series
 Ans:
 Primary Key: unique
 In Purchases.csv, (OrderNum, ProdID) is unique
 In Customers.csv, CustID
 Foreign Key: not unique
 In Orders.csv, CustID is a foreign key: "The order associated with Customers"
c.
 Ans: fiber: continuous, type: discrete, low_calorie: discrete
 Cor: type is nominal, low_calorie is ordinal
 Review: types of vars, nominal vs ordinal
 d.
 Ans: groupby 'manufacturer', max of sugars, sort_values ascending=False
 Cor: groupby 'manufacturer', agg max, sort_values ascending=False
 Review: agg functions
 e.
 Ans:
 Cor: cereal.groupby('manufacturer').filter(lambda x: sum(x['type'] == 'hot') == 0)['manufacturer']
 Review: groupby, then filter; x is one group, reduced to a single boolean

f.
 Review: pivot table
 data is the dataframe; index and columns; aggfunc reduces each group to a single val
g.
 Review: what to do with NaN data
2.a. 25, mean

b. 125
 cor: c=25

c. 25
 cor: c=30

d. 25
 cor: c=30
 review:
take derivative
? what kind of loss are they using  average square loss: \( L(w) = \frac{1}{n} \sum_{i = 1}^{n} (x_i  c)\)

3.a.
 ans: regex = r'^(.*): (.*)$'
 cor: regex = r'([\w\s]+):\s+(\d+)'
 review:
 \w: alphanumerics [a-zA-Z0-9_]
 \s: whitespace
 \d: [0-9], the street number!
 +: one or more times
 *: zero or more times
b.i. ans: [45]\d{15}
 review: [45] matches 4 or 5; \d{15} matches a digit 15 times
ii.
 ans: \d+\.?\d{2}
 cor: \$(\d+\.?\d*)
 Needs the dollar match at the front; parentheses () capture only the inside
EDA 4.a.
 log(y) and e**x are inverse operations
 review: operations on y are allowed
b.i. 30
ii. skew right
 cor: unimodal
 review: skew, unimodal, EDA
iii. impossible to tell
 a. yes: use a bar graph instead of plotting the distribution; you are unable to see the values
 cor: increase bandwidth for a smoother density estimate
 review: density estimation functions
 b. yes, it could show the numbers
 cor: rescale the y axis
 review: rescaling
 c. idk
 cor: density curve, to compare distributions
 review: density curves/distributions
 d.
 Ans: no
 Cor:
 Review:
 c. impossible to tell
 review: box and whiskers; impossible to tell frequency
 6.a.
 Ans: People in CS W186
 Cor: and not in Data 100
 Review:
 ii.
 Ans: people who don't go to office hours
 Cor:
 Review:
 6.b.i.
 Ans:
 Cor: \( P(X_5 = 1) = \frac{500}{1000}\)
 Review: simple random sample, and sampling with replacement
 Random sample with replacement: just multiply
 Simple random sample: when doing an AND prob, multiply the conditional chances together

ii.
 Ans:
 Cor: \( P(X_5 = 1, X_{50} = 1) = \frac{500}{1000} \times \frac{499}{999} \)
 Review:
iii.
 Ans:
 Cor: $$ \frac{N-n}{N-1}\, np(1-p) = \frac{1000-50}{1000-1} \cdot 50 \cdot \frac{500}{1000}\left(1-\frac{500}{1000}\right) $$
 Review: variance
iii.
 Ans:
 Cor: 0
 Review:

Binomial Probabilities
 Probability of picking 4 blue and 3 not blue, has a certain probability
 Multiply by the number of ways to pick it

NaN: see if there is any skew or bias if NaN is removed
Variance and STD
 \( \sigma^2=Var(X)=\sum_{i = 1}^{n} p_i \cdot (x_i  \mu)^2 \)
 Standard Deviation: \( \sigma=\sqrt{\sigma^2}=\sqrt{Var(X)} \)
Midterm 1 Review
Sampling
 Simple Random Sample (SRS): sample uniformly, without replacement: 1/50 * 1/49 * 1/48
 Sample people chosen from the Sampling Frame people we could have chosen from the Target Population where we want to generalize to.
 Probability Sample must be random.
 Random sample with replacement, just multiply 1/50 * 1/50 * 1/50
Probability
 List the distinct ways
 Sometimes the complement is easier: 1 - P
 Are you drawing one at a time? 1/50*1/49
SQL
Pandas
 slice dataframe
df.loc[['Mary', 'Anna'], :]
 row index (on the left) of Mary and Anna, all column indices
Regex
Visualization
 Types of Data
 Quantitative Data
 Continuous
 weight, temperature
 Discrete
 finite: years of education, num siblings
 Categorical Data
 Nominal
 no ordering
 Hair color
 Ordinal
 does have ordering!
 Olympic Medals Gold > Silver > Bronze
 Types of Plots
 Quantitative Data
 Histograms, Box Plots, Rug Plot, KDE (Kernel Density Estimation)
 Look at spread, shape, modes, outliers, unreasonable values
 Nominal & Ordinal Data
 Bar Plots (comparison)
 Skewed frequencies, rare categories, invalid categories
 Histogram and KDE: show a distribution
 Box Plot: not a distribution
Kernel Density Estimators  KDE
 visualize shape/structure not individual observations
 Put a gaussian over each of the three points (tiny blue arrow), scale by 1/3, and sum
$$ K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right) $$
\( \alpha \) is a smoothing param to change
Transformation
 If your data is like \( x^2 \), to make it linear, make it \( \sqrt{f(x)} \)
 If your data is like \( \sqrt{x} \), to make it linear, make it \( f(x)^2 \)
 Can also shift y! \( f(y)^2 \)
Linear Regression & Loss Functions
 Model Response Variable (y) with Explanatory Var/Predictors (x)
 Loss functions measure error of our model
 Squared loss (L2): \( l(c) = (x-c)^2 \)
 sensitive to outliers compared to L1; L1 doesn't have vanishing gradients (constant gradients)
 Absolute loss (L1): \( l(c) = |x-c| \)
 Minimize Loss.
 Option A: Take the derivative and set to zero (convex)
 Option B: Use Gradient Descent or SGD
 Average Squared Loss, takes the mean of each loss per data point
Gradient Descent
 Gradient is a vector of partial derivative of a function to each variable
 grad: scalar-valued function → vector of partials
 Gradient Descent: update the weights by a small step in the negative gradient direction
 Usually want convexity but it works without
Midterm 1 Review Questions
Loss Function & Gradient Descent
Spring 2018
 Q1.p3 Pg4 (Loss Functions)
 a.: True b. True c. False d. second loss with infinite
 Q8 Pg15 (Loss Minimization)
 a. final one
 review:
weight theta is a constant.
\( \frac{1}{n} \sum_{i = 1}^{n} \theta = \theta \)
 Q10 Pg2122 (Gradient Descent)
 Random sample on each loop iteration
 review: np.random.choice returns an np.array sample; it can index a slice to give an np.array via X[ind,:]
 final ans
 correct ans:
grad = (grad_function(theta, xbatch, ybatch) + 2 * theta * lam)
 this keeps it a vector
Modeling/Regression
 Summer 2019 Final
 Q4
 review \( \log(\theta e ^{-\theta x_i})=\log(\theta) - \theta x_i \)
 \( \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \)
 Q5
 logistic regression/random forest, linear regression/random forest, linear regression/random forest, logistic regression/random forest
 Q6
 a.
 positive weight
 positive positive
 positive
 negative to beef (why?)
 negative to chicken (why?)
 Not Enough Info
 b.d.
 10
 25
 26
 e.
 low bias high variance
 high variance
 decrease variance
 increase bias decrease variance
 f.
 training: c
 validation: b
 g.
 high regularization, high mse
 want in between regularization
 a.
 Q7
 a.
 \( \theta^{(t + 1)} = \theta^{(t)} + \alpha (y - \sigma(\theta^{(t)} - 2) - \theta^{(t)}) \)
 \( \theta^{(t + 1)} = y - \sigma(\theta^{(t)} - 2) \) (with \( \alpha = 1 \))
 b.
 scalar, scalar, len-p vector, len-p vector
 cor: read len-n vs len-p! y is a single outcome
 c.
 if we have all the values
 lambda doesn't affect the second term, which is what you control
 since theta is real, they are equal
 a.
 Q4
Regex
 Summer 2019 Q2
go! and garbs!
 garbs selects g, then a, then r, then b, finally s
 Fall 2018 Q6
 2 letters, then a third letter
 + is one or more; * is zero or more
 dogdog, burritodog, dogburrito
 9
 3 to 11