Introduction to Shannon entropy
Motivated by what I perceive to be a deep misunderstanding of the concept of entropy, I have decided to take us on a journey into the world of entropy. Recent confusion as to how to calculate entropy for mutating genes will be addressed in some detail.
I will start by referencing Shannon's seminal paper on entropy and slowly expand the discussion to include the formulas relevant for calculating the entropy of the genome.
But first, some warnings.
7 Comments
Pim van Meurs · 28 May 2004
Erik 12345 · 28 May 2004
The following points cannot be emphasized forcefully enough:
1. Every quantity to which we can associate a probability distribution has its own Shannon entropy. Every time we see the term "Shannon entropy" we must therefore ask "Shannon entropy of what, i.e. to which quantity does this Shannon entropy refer?".
2. The Shannon entropy of a quantity is only as meaningful as the probabilities used to compute it. In the case when the quantity is a gene sequence, the probabilities may represent the researcher's subjective knowledge about the gene sequence of a particular individual. Another possibility is that the probabilities represent the relative frequencies with which particular gene sequences occur in a given population. (Points 1 and 2 are illustrated in the numerical sketch after this list.)
3. There is nothing magical about weighted sums of logarithms that makes them automatically relevant to the application you are interested in! If there is to be any point in computing the Shannon entropy of a quantity, one must first figure out its significance for the particular application. For instance, communication engineers study the Shannon entropy of messages because it provides a lower bound on the number of data symbols that must be transmitted for messages to be fully recoverable at the receiving end. Proponents of Maximum Entropy Inference, on the other hand, take an interest in Shannon entropies for a completely different reason, namely because they think the right way to translate knowledge into probabilities is to choose, of all probability distributions consistent with what is known, the one that maximizes the Shannon entropy. It is unfortunate that the theoretical framework used by communication engineers, the philosophy of Maximum Entropy Inference, and other frameworks besides all lay claim to the name "Information Theory".
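To make points 1 and 2 concrete, here is a minimal Python sketch (the nucleotide frequencies are invented for illustration): the Shannon entropy is computed from a probability distribution that we supply, and it changes if we supply a different distribution for the same quantity.

    import math

    def shannon_entropy(probs):
        # Shannon entropy in bits of a probability distribution
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical relative frequencies of the four nucleotides at one
    # site in a population -- illustrative numbers, not real data.
    site = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
    print(shannon_entropy(site.values()))   # 1.75 bits

    # A different distribution for the *same* site gives a different
    # entropy: the entropy belongs to the distribution, not the site.
    print(shannon_entropy([0.25] * 4))      # 2.0 bits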
The last point is reinforced by the existence of (infinitely many!) generalizations of the expression for the Shannon entropy. One such generalization is the Tsallis entropy. The Tsallis entropy of the quantity X, with the associated probability distribution p, is defined as
S(X; q) = (1 - Σ_k p(k)^q) / (q - 1).
With the choice q = 1 we recover the Shannon entropy of X (the expression looks mathematically nasty, since we divide by zero, but the limit is well-defined and equal to the Shannon entropy). Other choices of the value of q give other "entropies". Interdisciplinary researchers should ask themselves: "Assuming for the sake of the argument that an 'entropy' is relevant to my application, why is the relevant entropy the Tsallis entropy with q = 1 (aka 'Shannon entropy') rather than, say, the Tsallis entropy with q = 2.378?"
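The q → 1 limit is easy to check numerically. A quick sketch (note that this limit recovers the Shannon entropy in nats, i.e. computed with natural logarithms):

    import math

    def shannon_nats(probs):
        # Shannon entropy with natural logarithms, for comparison below
        return -sum(p * math.log(p) for p in probs if p > 0)

    def tsallis(probs, q):
        # Tsallis entropy S(X; q) = (1 - sum_k p(k)^q) / (q - 1), q != 1
        return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

    probs = [0.5, 0.25, 0.125, 0.125]
    for q in (2.0, 1.5, 1.1, 1.01, 1.001):
        print(q, tsallis(probs, q))   # approaches the value below as q -> 1
    print("Shannon (nats):", shannon_nats(probs))   # about 1.2130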
Exercise: Investigate Schneider's and Adami's articles with respect to points 1-3 above. The reader will find that their different answers to the questions raised in 1-3 are not entirely transparent in their papers (personally, I had to read Schneider's Ev paper several times). Bonus exercise: Is there any particular reason why Schneider should make use of the Tsallis entropy with the particular choice q = 1? Same question for Adami.
Pim van Meurs · 28 May 2004
Thanks for your comments, Erik. I have to admit that I am not familiar with "Tsallis entropy", but some searching uncovered an interesting issue indeed.
Tsallis Entropy
As I understand it, the reason for choosing q = 1 is that the Tsallis entropy is only additive for q = 1.
Pim van Meurs · 28 May 2004
In addition, the distribution that maximizes the q = 1 (Shannon) entropy under a variance constraint is Gaussian, while maximizing the Tsallis entropy with q ≠ 1 yields distributions that can accommodate different, heavier tails.
I understand that a transformation exists which turns the Tsallis entropy into an additive entropy, the Rényi entropy. The Tsallis entropy itself, however, is only pseudo-additive: for two independent systems A and B we find
S(AB; q) = S(A; q) + S(B; q) + (1 - q) S(A; q) S(B; q)
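A minimal numerical check of this pseudo-additivity, with arbitrary illustrative distributions for the two independent systems:

    def tsallis(probs, q):
        return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

    q = 2.0
    pA = [0.5, 0.5]                          # made-up distribution for A
    pB = [0.7, 0.2, 0.1]                     # made-up distribution for B
    pAB = [a * b for a in pA for b in pB]    # joint distribution, A and B independent

    lhs = tsallis(pAB, q)
    rhs = tsallis(pA, q) + tsallis(pB, q) + (1 - q) * tsallis(pA, q) * tsallis(pB, q)
    print(lhs, rhs)   # both print 0.73; plain additivity holds only as q -> 1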
Erik 12345 · 28 May 2004
Pim van Meurs · 28 May 2004
Erik got me thinking: another assumption made was that the nucleotides are uncorrelated. This does not seem to be a very good approximation, and I wonder if other entropy forms would 'work better'.
Jan Kim, Thomas Martinetz and Daniel Polani, "Bioinformatic Principles Underlying the Information Content of Transcription Factor Binding Sites", Journal of Theoretical Biology 220 (2003), 529-544.
This paper addresses some additional issues with Rseq and Rfreq.
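To see how correlations between neighboring nucleotides change the picture, here is a sketch under an assumed first-order Markov model of a sequence (the transition probabilities are invented for illustration). The per-site entropy, computed as if sites were independent, overstates the entropy per symbol of the correlated sequence:

    import math

    def H(probs):
        # Shannon entropy in bits
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Made-up Markov model: each base repeats with probability 0.7 and
    # otherwise changes to each of the other three with probability 0.1.
    # The transition matrix is doubly stochastic, so the stationary
    # distribution over {A, C, G, T} is uniform.
    stationary = [0.25] * 4
    row = [0.7, 0.1, 0.1, 0.1]   # transitions out of any given base

    per_site = H(stationary)                             # ignores correlation
    entropy_rate = sum(p * H(row) for p in stationary)   # accounts for it

    print(per_site)       # 2.0 bits per nucleotide
    print(entropy_rate)   # about 1.36 bits per nucleotide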
Pim van Meurs · 28 May 2004
Erik: "The bottom line is that whether or not additivity is important depends entirely on what you are trying to do."
Your words of warning are well taken, and I appreciate your pointing out one of the many aspects of entropy I have yet to become aware of.
As far as additivity is concerned: while I can see why it may be of less interest under certain circumstances, for an uncorrelated genome it certainly seems to make sense. I wonder, though, whether the same holds for a genome where the nucleotide sequences are correlated.
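For the Shannon entropy, plain additivity holds exactly when the sites are independent; with correlations the joint entropy is strictly smaller than the sum of the per-site entropies. A small sketch with an invented joint distribution for two correlated sites:

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Made-up joint distribution for two neighboring sites (restricted to
    # two symbols for brevity); the two sites tend to agree.
    joint = [0.4, 0.1, 0.1, 0.4]   # P(AA), P(AC), P(CA), P(CC)
    pX = [0.5, 0.5]                # marginal distribution of site 1
    pY = [0.5, 0.5]                # marginal distribution of site 2

    print(H(joint))        # about 1.72 bits
    print(H(pX) + H(pY))   # 2.0 bits: additivity holds only under independence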