- Article Source: Classification using Decision Trees in R
- Authors: Morgun Ivan

# Classification using Decision Trees in R

## Introduction

*Tree based* learning algorithms are considered to be one of the best
and mostly used supervised learning
methods
(having a pre-defined target variable).

Unlike other ML
algorithms
based on statistical techniques, *decision tree* is a non-parametric
model,
having no underlying assumptions for the model.

**Decision tree** are powerful *non-linear classifiers*, which utilize a
tree structure to model the relationships among the features and the
potential outcomes. A *decision tree classifier* uses a structure of
branching decisions, which channel examples into a final predicted class
value.

This machine-learning approach is used to classify data into classes and
to represent the results in a flowchart, such as a tree structure. This
model classifies data in a dataset by flowing through a query structure
from the root until it reaches the leaf, which represents one class. The
root represents the attribute that plays a main role in classification,
and the leaf represents the class. The *decision* tree model follows the
steps outlined below in classifying data:

- It puts all training examples to a root.
- It divides training examples based on selected attributes.
- It selects attributes by using some
*statistical measures*. - Recursive partitioning continues until no training example remains, or until no attribute remains, or the remaining training examples belong to the same class.

Types of *decision tree* is based on the type of target variable we
have. It can be of two types:

**Classification Trees**: where the target variable is categorical and the tree is used to identify the*class*within which a target variable would likely fall into. For example, the target variable has two value YES or NO.**Regression Trees**: where the target variable is continuous and tree is used to predict it’s value.

During the process of analysis, multiple trees may be created. There are
several techniques used to create trees. The techniques are called
*ensemble methods*:

*Bagging decision trees*. The data is resampled and frequently used to obtain a prediction based on consensus*Random forest classifier*. Used to improve the classification rate.*Boosted trees*. This can be used for regression or classification problems.*Rotation forest*. Uses a technique called Principal Component Analysis (PCA).

From architecture point of view, *decision tree* is a *graph* to
represent choices and their results in form of a tree. The *nodes* in
the graph represent an *event or choice* and the *edges* of the graph
represent the *decision rules or conditions*.

To better understand how this works in practice, let’s consider the
following tree, which predicts whether a car should be purchased. A car
purchase decision begins at the *root node*, where it is then passed
through *decision nodes* that require choices to be made based on the
attributes of the car. These choices split the data across *branches*
that indicate potential outcomes of a decision, depicted here as *BUY*
or *DONT’T BUY* outcomes, though in some cases there may be more than
two possibilities. In the case a final decision can be made, the tree is
terminated by *leaf nodes* (also known as *terminal nodes*) that denote
the action to be taken as the result of the series of decisions. In the
case of a predictive model, the leaf nodes provide the expected result
given the series of events in the tree.

There are some actions on the tree: *splitting* is a process of dividing
a node into two or more sub-nodes and *pruning* (opposite process of
*splitting*) is a process when we remove sub-nodes of a decision node.

Another example of *decision tree*: Is a girl
date-worthy?.

*Decision trees* are built using a heuristic called *recursive
partitioning*. This approach is also commonly known as **divide and
conquer** because it splits the data into subsets, which are then split
repeatedly into even smaller subsets, and so on and so forth until the
process stops when the algorithm determines the data within the subsets
are sufficiently homogenous, or another stopping criterion has been met.

The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria is different for classification and regression trees.

The algorithm selection (or measuring association between attributes) is also based on type of target variables. Let’s look at the four most commonly used algorithms in decision tree:

*Gini Index*is more suitable to continuous attributes and entropy in case of discrete data. Also it works well for minimizing misclassifications.*Entropy*is slightly slower than Gini-index, as it involves logarithms.*Chi-Square**Information Gain*s a measure that quantifies the change in the entropy before and after the split. It’s an elegantly simple measure to decide the relevance of an attribute.*Reduction in Variance*

There are numerous implementations of *decision trees*. They differ in
the criterion, which decides how to split a variable, the number of
splits per step and other details.

- Classification and Regression Trees (CART)
- C4.5
- PART
- Bagging CART
- Random Forest
- Gradient Boosted Machine
- Boosted C5.0

One of the most well-known implementations is the C5.0 algorithm. The C5.0 algorithm has become the industry standard to produce decision trees, because it does well for most types of problems directly out of the box.

Strengths of the C5.0 algorithm

- An all-purpose classifier that does well on most problems
- Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
- Less data cleaning required:
- Excludes unimportant features
- Can be used on both small and large datasets
- Non parametric method (have no assumptions about the space distribution and the classifier structure)
- Results in a model that can be interpreted without a mathematical background (for relatively small trees)
- More efficient than other complex models

Weaknesses of the C5.0 algorithm

- Decision tree models are often biased toward splits on features having a large number of levels
- It is easy to overfit or underfit the model
- Can have trouble modeling some relationships due to reliance on axis-parallel splits
- Small changes in the training data can result in large changes to decision logic
- Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

## Modeling in R

There are many packages in R for modeling *decision trees*: `rpart`

,
`party`

, `RWeka`

, `ipred`

, `randomForest`

, `gbm`

, `C50`

.

The R package
rpart
implements recursive partitioning. The following example uses the *iris
data set*. This dataset contains 3 classes of 150 instances each, where
each class refers to the type of the iris plant. We’ll try to find a
tree, which can tell us if an Iris flower species belongs to one of
following classes: *setosa*, *versicolor* or *virginica*. As the
response variable is categorial, the resulting tree is called
*classification tree*. The default criterion, which is maximized in each
split is the *Gini index*.

Install and load `rpart`

package

```
install.packages("rpart")
install.packages("rpart.plot")
library("rpart")
library("rpart.plot")
data("iris")
```

You can explore *iris data set* structure by `str()`

command.

```
str(iris)
```

Now, we split out entire dataset into two parts - *the training set* and
*the testing set*. This is a very common practice in machine learning -
wherein, we train a machine learning algorithm with the training data,
and then test our model using the testing data.

```
indexes = sample(150, 110)
iris_train = iris[indexes,]
iris_test = iris[-indexes,]
```

Now, we need to build our *decision tree*. To do that, we first build a
formula which we shall be using to depict the dependencies. For this
problem, we’re trying to build a model that tries to classify
(*classification tree*) a test data point, into one of the three
`Species`

classes - i.e. *setosa*, *virginica* or *versicolor*. The
input is a tuple consisting of `Sepal.Width`

, `Petal.Width`

,
`Sepal.Length`

and `Petal.Length`

.

```
target = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
```

When you set a numerical variable as *y* value (or *target*), the tree
is a *regression tree* and you can write the function as below.

```
target = Sepal.Length ~ ., iris
```

Build and plot model

```
tree = rpart(target, data = iris_train, method = "class")
rpart.plot(tree)
```

The `method`

argument can be switched according to the type of the
response variable. It is `class`

for categorial, `anova`

for numerical,
`poisson`

for count data and `exp`

for survival data.

The result is a very short tree: if `Petal.Length`

is smaller than 2.5
we label the flower with *setosa*. Else we look at the variable
`Petal.Width`

. Is `Petal.Width`

smaller than 1.6? If so, we label the
flower *versicolor*, else *virginica*.

Now that our model is built, we need to cross-check its validity by
pitching it against our test data. So we use the `predict`

function to
predict the classes of the test data. And then create a matrix showing
the comparison between the prediction result and the actual category.

```
predictions = predict(tree, iris_test)
table(predictions, iris$Species)
```

Let’s try another package - `party`

. It has `ctree()`

function for
fitting conditional trees. Execute `?ctree`

for the help page. This
package has good plotting facilities for conditional trees.

```
install.packages("party")
library(party)
tree = ctree(Species ~ ., data = iris)
plot(tree, main="Conditional Inference Tree for Iris")
table(predict(tree), iris$Species)
```

The C5.0 method is a further extension of C4.5 and pinnacle of that line of methods. It is available in the C50 package. The following recipe demonstrates the C5.0 with boosting method applied to the iris dataset.

```
install.packages("C50")
library(C50)
# build model
tree = C5.0(Species ~ ., data = iris, trials=10)
# make predictions
table(predict(tree, newdata=iris), iris$Species)
```

## Pruning

Size of a *decision tree* often changes the accuracy of the prediction.
In general, bigger *tree* means higher accuracy, but if the *tree* is
too big, it overfits the data and results in decreased accuracy and
robustness. By overfitting, it means the *tree* may be good at analysing
the training data to a high degree of accuracy, but it may fail to
correctly predict test data because it is grown with too much features
of the training set it is not stable to any minor variations in those
features. Robustness means the consistency in the performance of the
*tree*, in this context. Hence, the objective would be to grow a *tree*
to an optimal size that is accurate and robust enough.

Pruning is a technique used in determining the size of the *tree*.
Often, you would grow a very large *tree* and starts to trim in down
while measuring the change in accuracy and robustness. `rpart()`

offers
a parameter `minsplit`

which is used in this pruning process. It refers
to minimum number of data points required before a split is made at a
node. The below example compares the effect of changing `minsplit`

parameter.

```
tree_ms3 = rpart(target, iris_train, control = rpart.control(minsplit = 3))
tree_ms10 = rpart(target, iris_train, control = rpart.control(minsplit = 10))
par(mfcol = c(1, 2))
rpart.plot(tree_ms3, main = "minsplit=3")
text(tree_ms3, cex = 0.7)
rpart.plot(tree_ms10, main = "minsplit=10")
text(tree_ms10, cex = 0.7)
```

## Are tree based models better than linear models?

Actually, you can use any algorithm. It is dependent on the type of problem you are solving. Let’s look at some key factors which will help you to decide which algorithm to use:

- If the relationship between dependent and independent variable is well approximated by a linear model, linear regression will outperform tree based model.
- If there is a high non-linearity and complex relationship between dependent and independent variables, a tree model will outperform a classical regression method.
- If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!

## Outcome

A *decision tree* is a tree like chart (tool) showing the hierarchy of
decisions and consequences. It is commonly used in decision analysis.
Methods of *decision tree* present their knowledge in the form of
logical structures that can be understood with no statistical knowledge.

Interpretation looks like: “If (x1 > 4) and (x2 < 0.5) than (y = 12)”.
This is much easier to explain to a non-statistician than a linear
model. Therefore it is a powerful tool not only for prediction, but also
to explain the relation of your response *Y* and your covariables *X* in
an easy understandable way.

I’d like to point out that a single decision tree usually won’t have much predictive power but an ensemble of varied decision trees such as random forests and boosted models can perform extremely well.