A decision tree is a human readable approximation of a function. It is built using a decision tree making algorithm, like ID3 or C4.5.

The main algorithm to produce a decision tree is called ID3. The ID3 algorithm, as originally described, can only produce a decision tree that represents a discreet valued function whose instances’ features’ are also discreet. In other words, ID3 only produces decision trees if your data’s input features are discreet, and your data’s output is also discreet. However, over the years, people have made little modifications to ID3 that allow it to work with both real valued outputs, and real valued features. When people employ these modifications, we no longer call it ID3, we call it C4.5 :man_facepalming:, lol, for some reason.

Two algorithms to build decision trees from data:

  • ID3
  • C4.5

C4.5 is basically a version of (an extension of) ID3 that allows your data to have real valued features and real valued outputs. ID3 only allows discreet valued features and discreet valued outputs.

Remember, after all, supervised machine learning is all about finding a function (hypothesis) that “fits” your training data. The different algorithms have different trade offs. One of the considerations for a trade off is the type of input/output an algorithm supports. C4.5 is less restrictive on the type of your input/output data. One could say, C4.5 is more “general”.

Let’s go over the ID3 algorithm and then we’ll briefly explain how C4.5 handles real valued features/outputs. “I don’t care about the how”, you say? Remember, the reasoning/rational behind any choice/advantage/concept/theory/discovery/etc is the most important part, the most fruitful part to understand. So man up and pay attention till the very end! :angry:

ID3 Algorithm

Recursively split at the node that results in the most information gain (aka entropy lost).

Information gain is basically how much homogeneity you gain by doing that particular split. How much homogeneity gained is the same as how much entropy is reduced, and we can quantify entropy :grin:. The plan is:

  1. Find how how much entropy you have right now.
  2. Find how how much entropy you will have if you split at particular node.
  3. Subtract the two to find out how much entropy reduced! In other words, how much information was gained (since entropy reduced = information gained).
  4. Do this for every attribute, and then actually split at the attribute that results in the most information gain
  5. Now you have several new split little chunks of data. Do the same exactly thing for each of these pieces of data (i.e. split them in the attribute that results in most information gain)
  6. Keep doing this until everything is split into homogenious chunks (i.e. all same class, i.e. entropy is 0)

How do you calculate how much entropy is in a bag of stuff? Let’s say that each instance in a bag can belong to one of a bunch of classes. Given a bag with some instances, how do you calculate its entropy?

Here’s the formula!


  • is a particular class
  • is the probability of randomly picking an instance from the bag that ends up being of class .

So in other words (if your classes are colors, and instances are marbles), you say, “what is the probability that I will pick a black marble from my bag?”. Then you plug it into the above equation. Then you ask “what is the probability that I will pick a red marble from my bag?”. You plug that into the equation too. You keep asking this question for every possible color (in your problem), then you add em all up as specified by the equation! All hail the great equation!

Seriously though, there is a very good intuitive reason (remember - reasoning is very important) for this equation, BUT :grin:, that would require this article to be much bigger and I don’t like big articles, SO, I recommend you go out and about and find yourself another source that explains the reasoning behind the entropy equation. Research “entropy equation information theory”. Don’t get it confused with the physics entropy equation!

C4.5 Algorithm

C4.5 algorithm handles real valued features by simply

discretizing them!

What ranges do we choose to discretize a continuous variable into? Here is a heuristic:

  1. Sort all your data by the real valued feature that you’re interested in discretizing
  2. Start at the left edge of your sorted data, head towards the right, draw a pipe whenever you encounter a change of class. For example if the first 3 data you encounter are all in class “red”, but the 4th one is in class “blue”, draw a pipe between data 3 and 4.
  3. Choose the ranges between the pipes!

You can consider these steps as data “pre-processing” steps really, because ultimately your algorithm still needs discreet inputs! You just take your real valued stuff, discretize them by following the above heuristic, and then simply feed the discretized data into the regular ole ID3 algorithm.


   continuous data  +-->  | heuristic |  discretized data

                                        | ID3 algorithm |

Some More Notes

Alternative Splitting Criteria

There are alternatives to using information gain as the criteria for the split. Two of them being “gini impurity” and “gain ratio”.

Handling Costed Features

There is also a way to give a “cost” to each attribute, and prefer trees that use high cost features as little as possible (i.e. for the vast majority of cases, it will be able to classify without using high cost features, but it may still require them with certain inputs). This is useful if you want to create a learner that can classify your future examples while relying as little as possible on high cost features. For example, calculating loan risk based on features such as age, height, weight, and criminal status. Criminal status might be an expensive feature to acquire for your clients, so you’d want your learner to not need this for most cases (i.e. for most people, age, height, and weight should be good enough to determine if they are a good person or not!! :).

Pruning (Reducing Overfitting)

If you continue to recursively split nodes until all your leaf nodes are homogenous (i.e. they only contain instances of one class), you may overfit your data. You may want to subsequently prune (i.e. remove) some of the nodes (i.e. reduce complexity of your tree). Rule post pruning is a very good method of doing this.