I will go through the following topics in this post...

(i)   Decision Tree Basics

(ii)  Feature selection

Decision Tree Basics

Decision Tree is one of the classical machine learning algorithms, used for both supervised classification and regression.

Decision Tree

If you observe the above image, you can answer the following questions...

1. Should I accept a new offer or not?

2. How can machine decide it?

We humans can easily make rational decisions, but a machine can't.

Researchers have developed a lot of algorithms for machine decision-making. Decision Tree is one of them.

Geometrically, Decision Trees are a set of axis-parallel hyperplanes which divide the data into different segments, as shown below...

If you observe the above image, we have two classes, Pink and Green. A linear classifier such as Logistic Regression or a linear SVM can't classify these two classes correctly, because it tries to find one single line or plane to segregate the data. A decision tree can classify them easily, because it uses lines parallel to the axes to divide the data into different regions.
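
Below is a minimal sketch (assuming scikit-learn, which the later Python examples in this post also assume) on a made-up XOR-style dataset where the class depends on the quadrant: no single straight line separates it, but axis-parallel splits do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up XOR-style data: the class depends on the quadrant of the point
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# A single line can't separate the quadrants; axis-parallel splits can
print("Logistic Regression:", LogisticRegression().fit(X, y).score(X, y))   # around 0.5
print("Decision Tree      :", DecisionTreeClassifier().fit(X, y).score(X, y))  # around 1.0
```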

Let us understand the algorithm by going through an example.

We need to understand the following terminology first...

(i)   Root Node

(ii)  Leaf Node

(iii) Entropy

(iv) Gini impurity

(v) Information Gain

Root Node

If you observe the above image, we have a tree-like structure. The node from which the tree starts splitting is called the root node. It's the topmost node in the tree structure.

Leaf Node

A leaf node is a terminal node in the tree structure; after a leaf node we can't split the tree any further.

Note: In a decision tree, each internal node represents a feature (a column in the dataset), each branch represents a decision and each leaf represents an outcome.

If you observe the above image, we have a dataset with 5 features and 14 data points.

Objective: We need to build a decision tree that predicts whether tennis will be played on a given day or not.

We have 5 features, of which 1 is the target or outcome. So, using the other 4 features, we need to predict whether tennis can be played on a specific day.

This is a classical supervised binary classification problem.

We can use Decision Tree for this classification problem.

To build a decision tree we need to choose a Root node.

The root node can be any of the features (Outlook, Temp, Humidity, Windy).

We can decide the Root node based on Entropy or Gini impurity.

Entropy

First, we calculate each feature's entropy or Gini impurity.

In the above formula, Entropy = -Σ pᵢ * log₂(pᵢ), where pᵢ is the probability of class i (yes/no). The alternative, Gini impurity, is defined as 1 - Σ pᵢ².

In our example, the feature "Outlook" has three sub-categories: Sunny, Overcast and Rain.

1. First, we find the probability of each class (yes/no) within each category.

2. After finding the probabilities, we compute each category's entropy by summing -p(i) * log₂(p(i)) over the classes.

We repeat the above two steps for each feature and each category.
After finding the entropies, we need to find the information gain.
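
Before moving to information gain, here is a minimal sketch of the entropy step for the Outlook feature, assuming the standard Play Tennis counts (9 "yes" / 5 "no" overall; Sunny 2 yes / 3 no, Overcast 4 yes, Rain 3 yes / 2 no). The numbers in your own table may differ.

```python
import math

def entropy(labels):
    """Entropy = -sum over classes of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# Assumed Play Tennis counts for the Outlook feature
play     = ["yes"] * 9 + ["no"] * 5   # parent (whole dataset)
sunny    = ["yes"] * 2 + ["no"] * 3
overcast = ["yes"] * 4
rain     = ["yes"] * 3 + ["no"] * 2

print(round(entropy(play), 3))      # ~0.940 (parent entropy)
print(round(entropy(sunny), 3))     # ~0.971
print(round(entropy(overcast), 3))  # 0.0 (pure node)
print(round(entropy(rain), 3))      # ~0.971
```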

If you observe the above two images, Outlook is the parent node and Sunny, Overcast, Rain are its child nodes.

For information gain we need to subtract the weighted child entropy from the parent entropy: Gain = Entropy(parent) - Σ (n_child / n_parent) * Entropy(child).

Check out the above snapshot to learn how to calculate information gain.
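
As a sketch of that calculation (reusing the assumed Play Tennis counts from above so the snippet runs on its own), the helper below subtracts the weighted child entropy from the parent entropy for the Outlook split.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Gain = Entropy(parent) - sum of (|child| / |parent|) * Entropy(child)."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent   = ["yes"] * 9 + ["no"] * 5
children = [["yes"] * 2 + ["no"] * 3,   # Sunny
            ["yes"] * 4,                # Overcast
            ["yes"] * 3 + ["no"] * 2]   # Rain

print(round(information_gain(parent, children), 3))  # ~0.247
```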

After calculating the information gain for each feature, we need to choose the root node.

Root Node: Pick the feature which has the highest information gain. That is the root node.
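
For completeness, here is a minimal end-to-end sketch with scikit-learn, assuming the standard 14-row Play Tennis table (the values in the article's image may differ slightly). Because scikit-learn trees need numeric inputs, the categorical features are one-hot encoded, so the tree splits on dummy columns; with criterion="entropy" the top split lands on an Outlook dummy column, consistent with Outlook having the highest information gain.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed standard Play Tennis dataset (14 days)
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the 4 features
tree = DecisionTreeClassifier(criterion="entropy").fit(X, data["Play"])
print(export_text(tree, feature_names=list(X.columns)))
```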

Note 1: Splitting categorical features is easy, but splitting numerical features is harder.

For splitting a numerical feature:

(i)  Sort the feature values in ascending order.

(ii) Evaluate all the possible splitting thresholds (one between each pair of consecutive sorted values) and find the maximum information gain; the feature and threshold with the maximum gain are chosen for the split, as sketched below.
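
A minimal sketch of that procedure, with made-up temperature values and labels. The helper name best_numeric_split is just for illustration, and the entropy/information-gain functions from earlier are repeated so the snippet runs on its own.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

def best_numeric_split(values, labels):
    """Sort the values, try a threshold between every pair of consecutive
    values, and keep the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        if not left or not right:      # skip degenerate splits (duplicate values)
            continue
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

# Made-up numerical temperatures and play / no-play labels
temps  = [64, 65, 68, 69, 70, 71, 72, 75, 80, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "no"]
print(best_numeric_split(temps, labels))
```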

Feature Selection

Information gain can also be used for feature selection: features that give higher gain are more informative, as illustrated below.
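
As an illustration on scikit-learn's built-in iris data (rather than the tennis example), a fitted tree exposes impurity-based feature importances; with criterion="entropy" they measure the total information gain each feature contributes, which can be used to rank and select features.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(data.data, data.target)

# Rank features by their impurity-based importance (total entropy reduction)
ranking = sorted(zip(data.feature_names, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```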

Imbalanced dataset: Decision trees are affected by imbalanced datasets, so we need to balance the dataset before applying the algorithm.

Large dimensionality: For high-dimensional data, training time increases.

Overfitting: As the depth of the tree increases, the risk of overfitting increases.
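
A minimal sketch of limiting that risk with the max_depth hyperparameter (the dataset and the depth value are chosen only for illustration): the unrestricted tree memorizes the training set, while the shallow tree usually generalizes at least as well.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep    = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)              # unrestricted depth
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # depth-limited

print("deep    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```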

Note: Decision trees are very easy to interpret.

Decision Tree Regression

For classification we use information gain, whereas for Decision Tree Regression we use Mean Squared Error (MSE). Check out this link for Decision Tree Regression hyperparameter tuning in Python.

We calculate the mean squared error at the parent node and at each candidate child node.

We choose the split whose children have the least weighted mean squared error, i.e. the largest reduction from the parent's MSE. The feature that gives this best split becomes the root node.
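
A minimal sketch of that scoring with made-up target values: the candidate split whose children have the lowest weighted MSE (the largest drop from the parent's MSE) is preferred.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_child_mse(children):
    n = sum(len(child) for child in children)
    return sum(len(child) / n * mse(child) for child in children)

parent = [10, 12, 11, 30, 32, 31]        # made-up target values at the parent node
split  = [[10, 12, 11], [30, 32, 31]]    # one candidate split into two children

print(round(mse(parent), 2))                # parent MSE (~100.67)
print(round(weighted_child_mse(split), 2))  # much lower (~0.67) -> a good split
```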