Lecture 25: Decision Trees and Random Forests

  • how do we classify these flowers into three species from their petal measurements?

Decision Tree

  • e.g., split on petal_width < 0.75 and petal_length < 2
    • a decision rule in two dimensions (two features)
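The two thresholds above can be sketched as plain if/else logic. This is only an illustration: the function name and the claim that the rule isolates one species are assumptions, not from the slide.

```python
def is_setosa(petal_length, petal_width):
    """Toy two-feature rule using the lecture's thresholds.

    On the classic iris data either threshold alone already separates
    setosa from the other two species, so requiring both is conservative.
    """
    return petal_width < 0.75 and petal_length < 2.0
```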

Scikit-learn Decision Tree

from sklearn import tree
decision_tree_model = tree.DecisionTreeClassifier()
decision_tree_model.fit(X, y)  # X: feature matrix, y: class labels


# Better visualizer with graphviz
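One way to get the graphviz rendering mentioned above is `export_graphviz`, which returns DOT source when `out_file=None`. The model and depth here are a minimal stand-in, not the lecture's exact tree.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Fit a small tree on the iris data so there is something to draw.
iris = load_iris()
model = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_graphviz emits DOT source; render it with
# graphviz.Source(dot) if the graphviz package is installed.
dot = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
)
```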

  • a fully grown tree reaches perfect training accuracy, unless two points have identical feature values but different classes

  • using all four features, the decision boundary now lives in 4-D

  • an unconstrained tree like this will overfit

Decision Tree Generation

Node Entropy

  • p_C = proportion of points in a node that belong to class C
  • entropy S = −Σ_C p_C log₂ p_C
    • measures how unpredictable (impure) a node is
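The entropy formula above is a few lines of Python (a sketch; the function name is mine, and 0·log 0 is taken to be 0 by convention):

```python
import math

def entropy(proportions):
    """Node entropy -sum p log2 p over class proportions (0 log 0 := 0)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)
```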

Entropy

  • when all data in a node is one class, entropy is 0
  • an even two-class split gives entropy 1
  • the maximum for C evenly split classes is log₂ C

  • compute the weighted average of the left child's and right child's entropies
  • iteratively choose the split value that minimizes this weighted entropy
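The split-selection loop can be sketched for a single feature as follows (a toy 1-D example of my own, not the lecture's data):

```python
import math

def entropy(labels):
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in props if p > 0)

def best_split(xs, ys):
    """Scan candidate thresholds; return the one minimizing the
    size-weighted entropy of the left and right children."""
    n = len(xs)
    best_score, best_t = float("inf"), None
    for t in sorted(set(xs))[1:]:          # thresholds between data values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best_score:
            best_score, best_t = weighted, t
    return best_t

# Splitting at 2.5 cleanly separates the two classes below
# (weighted entropy 0), so best_split returns 2.5.
best_split([1.0, 1.5, 2.5, 3.0], [0, 0, 1, 1])
```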

Overfitting

  • simplest fix: just don't let the tree overgrow (cap its depth)
  • tree building is greedy, so stopping early can miss splits that only pay off later

  • alternative: grow the tree fully, then prune branches back
  • use a held-out validation set to decide which branches to prune
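Both remedies are available in scikit-learn: `max_depth` caps growth, and cost-complexity pruning (`ccp_alpha`) can be tuned on a validation set. The dataset, split, and parameter values below are illustrative choices, not the lecture's.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Remedy 1: don't let the tree overgrow.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow.fit(X_train, y_train)

# Remedy 2: grow fully, then prune; pick the ccp_alpha that scores
# best on the held-out validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
        .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
pruned.fit(X_train, y_train)
```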

Random Forest

  • fully grown trees have low bias (they capture everything in the dataset) but high variance
  • fix: train many such trees and combine them with an equal-weight vote

Bagging

  • Bootstrap AGGregatING
  • resample the training set with replacement (bootstrap samples)
  • fit one model per sample; the final model averages or majority-votes their predictions
  • invented by Leo Breiman, Berkeley Stats, 1994!
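The resample-fit-vote recipe can be sketched by hand in a few lines (my own minimal version; scikit-learn's `BaggingClassifier` packages the same idea):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Resample with replacement and fit one full tree per bootstrap sample.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Final model: majority vote across the trees.
def predict(x):
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return np.bincount(votes).argmax()
```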

  • at each split, consider only a random subset of m features

  • common heuristic: m = √p, where p is the total number of features
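In scikit-learn the per-split feature subsampling is the `max_features` parameter of `RandomForestClassifier`; `"sqrt"` implements the √p heuristic. The dataset and tree count here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# At each split, each tree considers only sqrt(p) randomly chosen
# features (2 of iris's 4), which decorrelates the trees.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
```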

Why Random Forest