Concept of Entropy and its application in Machine learning
In this article, I will explain the concept of entropy and its application in machine learning algorithms.
Entropy Definition:
The name comes from the concept of entropy in physics, more specifically in thermodynamics where entropy is the measure of molecular disorder in the system.
- In information theory, Entropy is the degree of randomness or disorder or uncertainty in a system.
- Entropy is also the expected minimum number of bits needed by a system to transmit a message. The higher the entropy or the number of bits needed, the more complex the message is.
Explanation of entropy:
Let us try to understand and correlate the above two definitions an example-
Suppose we have a system which transmits messages.
Suppose there can be ‘n‘ distinct messages that can be encoded by the system.
Let us assume that each of those messages is a statistical event. So, the number of distinct statistical events is ‘n’.
Assuming it is a uniform distribution, the probability 𝑝𝑖 of occurrence of each of those message = 1/n.
So, 𝑝𝑖 = 1/n . Hence n = 1/ 𝑝𝑖
Minimum number of bits needed to encode the message (which can have ’n’ distinct combinations) =
Log n = Log(1/𝑝𝑖) = -Log(𝑝𝑖) (Pls. note, log is the base-2 logarithm)
Expected number of minimum bits needed encode a message in the system =
Mathematical expectation of -Log(𝑝𝑖)= E(- Log(𝑝𝑖)) = −∑𝑝𝑖log𝑝𝑖 (assuming it is a discrete probability distribution )
So, Entropy H(x) = −∑𝑝𝑖log𝑝𝑖 which the expected minimum bits needed to transmit a message or information.
- Entropy is also the expected value of Information Content in a system. Higher the information content, higher is the entropy.
- Information content is defined as I(x) = -Log(𝑝𝑖). Hence entropy H(x) = −∑𝑝𝑖I(x)
The entropy is maximised if all the events are equally likely i.e. it is a uniform distribution. That means that the uncertainty or randomness in the system is at the maximum.
A skewed/biased probability distribution will have lower entropy, since randomness will be less.
Entropy of a coin toss:
Let X be a random variable which indicates the occurrence of “Heads” or “Tails” when the coin is tossed.
Let Pr( X=1) indicated the probability of occurrence of “Heads”.
Let Pr( X=0) indicated the probability of occurrence of “Tails”.
For a fair coin, Pr( X=1) = Pr( X=0) = 0.5.
So Entropy = −∑𝑝𝑖log𝑝𝑖 = -0.5Log(0.5) - 0.5Log(0.5) = 0.5+0.5 = 1
So when both the events are equally likely i.e. the distribution has maximum randomness, the entropy is maximum.
Now, let us assume that the coin is biased
Say, probability of occurrence of Heads Pr( X=1) = 0.7 .
and probability of occurrence of Tail Pr( X=0) = 0.3.
So Entropy = −∑𝑝𝑖log𝑝𝑖 = -0.7Log(0.7) — 0.3Log(0.3) ~ .88
So the distribution has lower entropy as randomness of the distribution is less.
The graph below shows the plot between entropy H(x) and probability of occurrence of Heads Pr( X=1) for a coin toss. Entropy is maximum for a fair coin where both events are equally likely.
If we know that some events are more likely than others, we should exploit this in our coding scheme. We would like to allocate fewer bits to frequent characters than characters that are rare. This is used in Huffman coding.
- Entropy is nothing new and was defined in 1948 by Claude Shannon, an US mathematician, and ‘father of information theory’. It is therefore often called Shannon entropy.
Applications of Entropy in Machine Learning:
Let us first look at the definition of cross-entropy:
- Cross-Entropy
Cross entropy quantifies the difference between two distributions.
The cross entropy is the expected number of bits per message needed to encode events drawn from a true distribution 𝑞, if using a code for distribution 𝑝 rather than 𝑞:
𝐻𝑋(𝑝,𝑞) =−∑ 𝑞(𝑥)log𝑝(𝑥).
In the above equation, q is the true distribution.
Before going into the application of entropy, let us understand what is a loss function
- Loss function: Loss function quantifies the accuracy of the machine learning model by computing the difference between the values predicted by the model and actual values from the labelled (test/training) data set. During the training of the ML model using an optimisation algorithm ( ex-Gradient Descent ), the Loss function is minimised by tweaking the parameters(weights) of the model.
Example: Suppose we have defined a machine learning model to predict the cost of a house in a specific locality. The model has been defined as:
y = w1*x1 + w2*x2 + w0
where x1, x2 are features “area of house” and “number of rooms”. w1 and w2 are weights and w0 is called the bias.
During the training of the model from labelled data-sets( in this example labelled data indicates actual cost of house and its area and number of rooms) using an optimisation algorithm, the values of w1,w2 and w0 are fine tuned to reduce the difference between actual value and the one predicted by the model. Loss function measures the difference between the predicted value and the actual value from data-set and hence plays an important role in fine tuning the weights of the model. Basically the optimisation algorithm( like Gradient descent ) minimises the Loss function.
- Cross Entropy as a loss function in multi-class classification machine learning algorithms
Cross-entropy is commonly used in machine learning as a loss function. It is used as a loss function in classification models like logistic Regression and neural networks.
As indicated earlier, Cross entropy quantifies the difference between two distributions. Here q is the true distribution and p is the predicted distribution.
During training of a machine learning model using an optimiser, the aim will be to minimize the loss function (cross entropy) i.e., the difference between predicted and the true distribution.
- Entropy and Information Gain in Decision Trees
Now we know entropy is the measure of disorder in a system. Next we need a metric to measure the reduction of this disorder in our target systems given additional information( features/independent variables) about it.
Information Gain is the measure of the reduction in entropy or degree of disorder in a system. Mathematically it can be written as:
IG(Y, X) = E(Y) — E(Y/X)
We subtract the entropy of Y given X from the entropy of just Y to calculate the reduction of uncertainty about Y given an additional piece of information X about Y. This is called Information Gain. The greater the reduction in this uncertainty, the more information is gained about Y from X.
Higher is the Information Gain, lesser is the entropy as the system becomes more predictable.
In Decision Trees, the parameter/feature which increases the information gain and reduces the entropy to the maximum level in the system is chosen as an attribute for splitting any node.
This process is repeated till no further information gain is possible by splitting a decision tree node.