Information content of DNA
The information content of DNA is much harder to determine than merely taking the number of base pairs and multiplying by 2 to get the size in bits (remember that each site can take one of 4 different nucleotides, i.e. 2 bits). However, this approach does provide a zeroth-order estimate of the maximum possible information that can be stored in the sequence, which for the human genome with its 3 billion base pairs amounts to 6 billion bits or 750 Mbytes.
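As a quick sanity check, that arithmetic can be spelled out in a few lines of plain Python (the 3 billion base pair figure is just the usual round number used above):

# Zeroth-order (maximum) information capacity of the human genome,
# assuming every site is free to take any of the 4 nucleotides (2 bits each).
base_pairs = 3 * 10**9
bits_per_site = 2
max_bits = base_pairs * bits_per_site        # 6 * 10**9 bits
print(max_bits / 8 / 10**6, "Mbytes")        # 750.0 Mbytes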
However, in the sense used here, information theory shows that random sequences have the lowest information content and that well-conserved sequences contain the maximum information content. In other words, the actual information content per site ranges from zero bits for a totally random site to 2 bits for a fully conserved site.
Another way to look at this is to compress the DNA sequence using a regular archive utility: if the sequence is random, the compression will be minimal; if the sequence is fully regular, the compression will be much higher.
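As a rough illustration of the compression idea, one can feed toy sequences to a general-purpose compressor. The sketch below assumes Python's zlib and made-up sequences, so the byte counts only illustrate the trend, not real genomic values:

import zlib, random

random.seed(1)
random_seq  = "".join(random.choice("ACGT") for _ in range(10000))
regular_seq = "ACGT" * 2500   # a fully regular sequence of the same length

for name, seq in [("random", random_seq), ("regular", regular_seq)]:
    compressed = zlib.compress(seq.encode())
    print(name, len(seq), "bases ->", len(compressed), "compressed bytes")
# The random sequence stays close to 2 bits per base (a few thousand bytes),
# while the regular one collapses to a few dozen bytes.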
So how does one obtain a better estimate of the information content of DNA? By estimating the entropy per triplet (3 base pairs), which has a maximum of 6 bits; estimates give about 5.6 bits for coding regions and 5.82 bits for non-coding regions. This means that the information content is about 0.4 bits per triplet for coding regions and 0.18 bits per triplet for non-coding regions. For 3 billion base pairs, or 1 billion triplets, this gives an actual information content of 0.4 billion bits, or 50 Mbytes, in the best-case scenario that all DNA is coding, or roughly 22 Mbytes if all the DNA is non-coding.
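Spelling out that arithmetic (the 5.6 and 5.82 bits-per-triplet figures are simply the estimates quoted above):

triplets = 10**9                          # ~3 billion bp = ~1 billion triplets
h_max = 6.0                               # maximum entropy per triplet (3 sites x 2 bits)
for label, h in [("coding", 5.6), ("non-coding", 5.82)]:
    info = h_max - h                      # I = Hmax - H, in bits per triplet
    print(label, round(info, 2), "bits/triplet ->",
          round(info * triplets / 8 / 10**6, 1), "Mbytes")
# coding:     0.4  bits/triplet -> 50.0 Mbytes
# non-coding: 0.18 bits/triplet -> 22.5 Mbytes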
Now how does this compare with evolutionary theory? In a 1961 paper, "Natural selection as the process of accumulating genetic information in adaptive evolution", Kimura calculated that the amount of information added per generation is around 0.29 bits, or, since the Cambrian explosion some 500 million years ago, on the order of 10^8 bits or about 12.5 Mbytes, assuming that the geometric mean duration of one generation is about one year.
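The same back-of-the-envelope calculation in code (0.29 bits per generation and one generation per year are just the numbers quoted above):

bits_per_generation = 0.29
generations = 5 * 10**8                   # ~500 million years at roughly one generation per year
total_bits = bits_per_generation * generations
print(total_bits, "bits")                 # 145000000.0 bits, i.e. on the order of 10**8 bits
print(10**8 / 8 / 10**6, "Mbytes")        # 12.5 Mbytes for a round 10**8 bits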
As a side note, Kimura reasoned that about 10^7 or 10^8 bits of information would be necessary to specify human anatomy. (Source: Adaptation and Natural Selection by George Christopher Williams)
So is this a reliable way to determine the information content of DNA? Perhaps not; a better way is to take a large sample of DNA from different people and determine, for each base pair, how variable it is. A fully conserved site carries the maximum of 2 bits of information, while a totally random site carries zero bits of information.
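Here is a minimal sketch of that per-site approach. The five-sequence "sample" is made up purely for illustration, and a real analysis would need the sample-size corrections discussed in Schneider's work:

from math import log2
from collections import Counter

# Hypothetical sample: the same short stretch of DNA read from five individuals.
sample = ["ACGTAC",
          "ACGTAC",
          "ACTTAC",
          "ACGTCC",
          "ACGTAC"]

for site in range(len(sample[0])):
    counts = Counter(seq[site] for seq in sample)
    h = -sum(n / len(sample) * log2(n / len(sample)) for n in counts.values())
    print("site", site, "information =", round(2.0 - h, 2), "bits")
# A fully conserved site scores 2 bits; a site where all four nucleotides
# are equally common would score 0 bits.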
The problem is to understand how much information these 'bits' represent. For instance, the total number of electrons in the universe is about 10^79, and specifying one 'preferred' electron amongst these translates to about 250 bits. This means that in 1000 generations (about 290 bits at Kimura's rate), natural selection can achieve something far more improbable than this.
Update Oct 26: I have to take responsibility for not clarifying that my usage of information is based on Shannon's theory of information, according to which

I(Y) = H_max - H(Y)

where I(Y) is the amount of information, H(Y) is the entropy of the received sequence and H_max is the maximum entropy (basically the entropy of a uniformly distributed sequence).
See Shannon entropy applied where I described how Shannon entropy is applied in biology with references to the work by Chris Adami and Tom Schneider.
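For concreteness, a minimal sketch of that definition applied to a single symbol stream (the helper function and toy sequences are mine, not taken from Adami or Schneider):

from math import log2
from collections import Counter

def shannon_information(seq, alphabet_size=4):
    # I = Hmax - H(Y): maximum entropy minus the entropy of the observed
    # symbol frequencies in the sequence.
    counts = Counter(seq)
    h = -sum(n / len(seq) * log2(n / len(seq)) for n in counts.values())
    return log2(alphabet_size) - h

print(shannon_information("ACGTACGTACGT"))   # 0.0 - uniform composition, maximal entropy
print(shannon_information("AAAAAAAAAAAA"))   # 2.0 - fully biased composition, zero entropy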
94 Comments
tresmal · 22 October 2008
I don't know why you posted this. I can't imagine anyone being interested. :)
"As a side note, Kimura reasoned that about 10^7 or 10^8 bits bits of information would be necessary to specify human anatomy.(Source: Adaptation and Natural Selection By George Christopher Williams)"
I have no idea why I cut and pasted that quote.
PvM · 23 October 2008
I happened to read Kimura's paper while researching why Dembski seems to be unfamiliar with the history of the concept of information in biology and found Kimura's 1961 comments to be quite relevant.
Joe Felsenstein · 23 October 2008
Now you've set yourself up for getting a lot of criticism. Speaking as an expert on information and evolution, I can say that *everyone* who posts here is sure that they are an expert on information and evolution (which is how I know I am too). And some of them will no doubt argue vehemently that random sequences have the *most* information, not the least. Enjoy.
Bill Gascoyne · 23 October 2008
I'm going to throw my $.02 in, but I'm not sure I'm able to express this in a sufficiently coherent manner.
I submit that describing DNA in terms of information is rather like describing electrons and such in terms of waves and/or particles. An electron is what it is, and describing it as wave-like or particle-like is a human analogy that helps us understand it and does not mean that the electron is actually a wave or a particle. Similarly, DNA is what it is, and describing it in terms of information content doesn't mean that DNA consists of information that is used in the way that a computer uses information.
PvM · 23 October 2008
novparl · 23 October 2008
Last sentence of essay seems to have a word missing.
The complexity of DNA proves evolution. It must be easy for 6 billion bits to evolve over 4 billion years.
To my "friends" - what do you guys think of NOMA? Dawkins or Gould?
SteveF · 23 October 2008
Opisthokont · 23 October 2008
To some degree, of course, this is all a red herring. DNA alone does not "specify human anatomy"; a lot of anatomy is in fact epigenetic. This means, strictly speaking, that it is inherited but not encoded in DNA; one of the best-studied mechanisms for this is the interactions between cells during development. Both cells could become the same thing, but one cell's signalling molecule tells an adjacent cell to become something else, and in turn that cell may change which signalling molecules it uses. Depending on the pattern of signalling molecules, both spatially and temporally, the results of development can differ significantly. (What starts it all? one might ask. There are a number of mechanisms known for this as well, many of which are external to the embryo, often being set up by the mother.) The signals themselves are often highly evolutionarily conserved, such that the Pax6 gene homologue from a fruit fly, which (among other things) specifies eyes, can make eyes grow in places where it is injected into a developing frog. This is not to say that DNA is unimportant, of course, just that it is not the only part of the story (at least with eukaryotes).
That said, this is a nice article, and an important investigation into one of ID's primary claims.
iml8 · 23 October 2008
There are questions in here folded into questions.

First question: is there some "magic ratio" of the number of bits in a "program" to the complexity it produces? "Yes, the value of the ratio is ... 0.42!" I don't think anybody's figured out any such ratio. And even if they had, nobody's figured out how to mathematically determine the complexity of an organism to permit such a calculation.

Even comparing the same program written in different computer languages is tricky. Some languages may be able to do particular tasks in much less code than others. And even when it comes to the binary executable program, it's hard to make comparisons. For a large program, the executable for an interpreted system is much smaller than the executable for a compiled system (if much slower as well). And among compiled systems the size of the executable depends on the compiler and the processor. A specialized processor will probably need much less code for a task tailored to it than a general-purpose processor.

And that's only comparing the SAME program. Comparing different programs? Writing a little toy demo program to draw even a simple picture is a pain; a toy demo program to draw an elaborate fractal pattern is shorter, and it can produce as much fractal detail as one likes just by changing the count of the number of iterations. Incidentally, the growth of organisms seems to have fractal features, and fractal algorithms are noted for ability to generate lots of elaboration for a small amount of code.

Then ... comparing "programs" between two different systems that don't have any real resemblance to each other and don't perform the same functions is out in hyperspace.

Are there not enough bits in the human genome to encode the human body? For all we know I could insist that there's FOUR TIMES as many bits as required, and dare anyone to prove me wrong: "You see, because of the Binary Coding Efficiency [it's nice to make up impressive-sounding phrases here] of the human system its Binary Coding Ratio is vastly better than that of a personal computer ... "

But at least I would be being silly on purpose.
White Rabbit (Greg Goebel) http://www.vectorsite.net/gblog.html
Jeffrey Shallit · 23 October 2008
Actually, in the Kolmogorov theory, random sequences are highly likely to have maximum or near-maximum information content. Furthermore, compression experiments with DNA suggest that it is quite difficult to achieve significant compression, suggesting they are close to random and have very high Kolmogorov information content.
eric · 23 October 2008
Venus Mousetrap · 23 October 2008
Your post seems to be lacking a conclusion, but I find this subject interesting, because I recently had a look at a presentation of a supposedly new ID theory, in which the supposed scientist presenting it believes that 'functional information' is an indicator of intelligence.
Problem is, he defines it as 'the negative log to base 2 of, the number of ways to perform a function acceptably well divided by the total number of ways it can be performed', or I = -log2[M/N], which is a kind of equation you'll be familiar with, as it's basically the same as Dembski's.
Problem is, this doesn't even try to give an estimate of information content - rather, it is saying 'Given a list of all the ways to do something, this will give you the minimum information required to pick one from that list'.
Post is here. I'm glad to see, at least, that scientists have been doing real science on this matter long before the ID people.
iml8 · 23 October 2008
PvM · 23 October 2008
PvM · 23 October 2008
TomS · 23 October 2008
Venus Mousetrap · 23 October 2008
Venus Mousetrap · 23 October 2008
I also apparently can't tell the difference between an r and an n, so I've misspelt his name several times. Silly me.
iml8 · 23 October 2008
PvM · 23 October 2008
Venus Mousetrap · 23 October 2008
Ugh. I've just read that paper of Kirk Durston's, above, and I can't believe they're still trying the tornado-in-a-junkyard ploy (or as Kirk says, he 'assumes that evolution is a random walk'). Still, while they're not coming up with new stuff, it's easier to debunk I guess.
PvM · 23 October 2008
PvM · 23 October 2008
Venus Mousetrap · 23 October 2008
And then, like the professionals they are, they go straight to claiming that there must be a conspiracy against them which is preventing their work being accepted.
Even without the alarmingly large amount of evidence that there IS something incredibly fishy behind the ID scenes (Wedge Document, presentations to Christian groups, association with creationists, creationist arguments, creationist websites, etc)... even without all that, they won't accept that their failings are entirely their own.
Daniel Gaston · 23 October 2008
Great Discussion so far, and great comments from PvM and Joe.
I've always found this topic pretty interesting, mostly because of the absolutely horrendous abuses of math in general, and probability and information theory in particular, that guys like Dembski are doing. Do any of the more knowledgeable folks here, like Joe, have an opinion of this "Functional Information" as used by Szostak (and, I am presuming, misused by Durston)? Is it useful at all?
I think PvM is quite right about the difficulty of properly probabilistically modeling the growth of information content of a genome, in that it isn't really a random walk, although one supposes that it does have a random-walk-like element to it. Some sort of bounded random walk / Markov process would better emulate an evolutionary search pattern of that type, in my opinion.
I would also suggest that, in addition to population-level information measures of a given gene, measures across evolutionary diversity are also useful; we already use that sort of entropy score in a Shannon information sense when looking at aligned homologs from diverse taxa.
Stanton · 23 October 2008
eric · 23 October 2008
Henry J · 23 October 2008
Joe Felsenstein · 23 October 2008
Stephen · 23 October 2008
Daniel Gaston · 23 October 2008
L Zoel · 23 October 2008
This stuff really becomes interesting when you realize that DARPA is interested in it as well:
"Mathematical Challenge Fourteen: An Information Theory for Virus Evolution
Can Shannon’s theory shed light on this fundamental area of biology?"
( https://www.fbo.gov/index?s=opportunity&mode=form&id=c120bc7171c203aa5f4b3903aa08e558&tab=core&_cview=0&cck=1&au=&ck= )
Now we just have to wait and see if it's an ID advocate, a biologist or a mathematician who finally cracks this problem.
As a math major, I can't help but lean towards the latter.
iml8 · 23 October 2008
Daniel Gaston · 23 October 2008
Glen Davidson · 23 October 2008
http://tinyurl.com/2kxyc7
Henry J · 23 October 2008
Even if the mind didn't evolve, the mechanisms that mind uses to implement the designs would have to come from somewhere. Or in other words, even with a design, some method of engineering is necessary as well.
Henry
eric · 23 October 2008
Henry J · 23 October 2008
iml8 · 23 October 2008
Maxwell's Daemon · 23 October 2008
It seems to me that the crux of the "not accessible to Darwinian evolution" argument hinges on the terrain map of codon/protein space. This is the fundamental argument against Darwinian evolution given by Dembski, Durston, Behe, Axe and others: namely, that functional proteins exist as isolated islands in the codon-sequence-to-protein mapping, as in the quote from Durston above:
"Most combinations of amino acids will not produce a stable, three dimensional folded protein structure. Furthermore, the sequence space that encodes a stable folding protein tends to be surrounded by non-folding sequence space. Thus, to generate a novel protein with a stable fold, an evolutionary pathway must cross non-folding sequence space via a random walk, where natural selection will be inoperative. Thus, it requires functional information to properly specify a biological protein with a stable secondary structure."
Seems to me that for this to be true, nearly all single nucleotide/amino acid substitutions would have to render a functional, folded protein into a non-folded, functionless protein.
This is obviously absurd. A single amino acid substitution is most UNlikely to significantly affect protein folding, which depends mainly on long-sequence behavior of the amino acid chain, and is relatively insensitive to short-sequence changes.
Single substitutions in the core region of the protein, on the other hand, while not affecting the overall shape of the protein, could have a significant impact on the enzymatic function of said protein.
In other words, every functional protein, instead of being an "island" surrounded by a non-functional "sea" is in fact connected through a large number of functional links, mostly neutral to selection, to other functional nodes with different, selectable functionality.
What naively appears to be a serious impediment to Darwinian evolution, is in fact, a feature capable of being exploited by the Darwinian process to generate new functionality using very little in the way of probability resources, and offers a clear alternative to the "tornado in a junkyard" scenario so favored by the anti-evolution argument.
SMgr · 23 October 2008
...Kimura calculated that the amount of information added per generation was around 0.29 bits or since the Cambrian explosion some 500 million years ago..
One thing I'm wondering: I've heard that our genome has doubled in size several times in our history, so an average value per year may not reflect the actual way information is accumulated. A doubling of the genome, by itself, would not be much of an increase in information. However, once there are duplicates of each gene, this would allow far more neutral variation (e.g. noise) to accumulate in the genome just after the doubling occurs, since changes to the copies of these genes would be less likely to have deleterious results. It would seem then that the rate of information accumulated would tend to spike after a genome doubling event. There would also be more "surface area" to accumulate random changes in non-coding regions.
Do I have that right?
Henry J · 23 October 2008
Dale Husband · 23 October 2008
I'd like to offer these videos for discussion:
How Evolution Causes an Increase in Information, Part I
http://www.youtube.com/watch?v=I14KTshLUkg
How Evolution Causes an Increase in Information, Part II
http://www.youtube.com/watch?v=i9u50wKDb_4
Enjoy!
mr silly · 24 October 2008
random has the lowest information content? bull crap
Daniel Gaston · 24 October 2008
PvM · 24 October 2008
PvM · 24 October 2008
Henry J · 24 October 2008
But it's still just a worm!11!!!one!!
Do they always have that same exact number of cells?
tresmal · 24 October 2008
IIRC yes they do always have the same number of cells. Same number of nerve cells, muscle cells etc.. It would be hard to come up with a better model for studying development. IMO C elegans isn't as well known as it should be.
PvM · 24 October 2008
PvM · 24 October 2008
Russ · 24 October 2008
As much as I love treating DNA as a string, I think that it is difficult to actually rationalize this when we consider the enzymatic activity of RNA. For example, take Trypanosoma brucei, whose mini-chromosomes interact with DNA as it is being transcribed, allowing for an extreme variability that is slightly stochastic. I don't think Shannon's measure can be applied just to the sequence without applying it to all of the probable messages that it can express as well (taking into account probabilistic models of how often that sequence is created).
PvM · 24 October 2008
Henry J · 24 October 2008
Where Nematoda aka roundworms are on the tree of life.
Henry J · 24 October 2008
Where Nematoda aka roundworms are on the tree of life: http://tolweb.org/Nematoda/2472
DS · 24 October 2008
PvM wrote (or quoted):
"The principal cell interactions that coordinate vulval development in C. elegans involve only these three cells and an organizing cell of the gonadal primordium, i.e. four cells in total. The development of the vulva in many other nematodes also involves a small number of homologous cells. Yet despite the homology of the vulva and the cells involved among nematode species, a large number of changes have been noted in the signaling that occurs between these cells to regulate development of the adult structure."
(Begin sarcasm ...
Sure, but unless you can explain in one short sentence, without using any big sciency words, exactly how the exact shape of the vulva is determined then "Darwinism" is completely wrong - for some unknown reason that I don't have to explain. I ain't gonna read no dang papers neither and you can't make me. Sides, everybody knows that there just ain't enough information in that there itty bitty genome to make the whole critter. Must be magic. Sure ain't no computer type program.
... end sarcasm).
Seems that some developmental systems are indeed well understood at the molecular level. Who would have thunk it?
Henry J · 24 October 2008
PvM · 24 October 2008
PvM · 24 October 2008
The nematode model above is interesting as it involves approximately 400,000 interactions between 16,000 genes. So let's estimate the number of bits needed here:
log2(16,000) ≈ 14 bits, or 28 bits to describe the interaction of one gene with another. 28 * 4*10^5 is about 11 Mbits. Assume that the linkage is represented by 32 bits and we have 3*10^8 bits to describe the genetic network. Now these are just estimates of the amount of information needed to represent the phenotypic expression.
There may be a more compact representation of the model but this seems a rough estimate of the amount of information.
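A sketch of the first part of that estimate, using only the numbers quoted in this comment (the variable names are mine):

from math import ceil, log2

genes = 16000
interactions = 4 * 10**5
bits_per_gene_id = ceil(log2(genes))      # ~14 bits to name one of 16,000 genes
bits_per_pair = 2 * bits_per_gene_id      # 28 bits to name an interacting pair
print(bits_per_pair * interactions / 10**6, "Mbits")   # 11.2 Mbits for the bare wiring list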
Daniel Gaston · 25 October 2008
PvM:
As a PhD student in molecular evolution who has been following the issue for a few years, it definitely wouldn't shock me, because ID has contributed absolutely nothing non-trivial to science. :) Especially to evolutionary biology.
stevaroni · 25 October 2008
Stanton · 25 October 2008
Pimp Van Pickle · 25 October 2008
PvM · 25 October 2008
Pimp Van Pickle · 25 October 2008
Do the math. Consider a coin:
H(Y) = -SUM(i=1..n) p(y_i) * log2(p(y_i))
A graph of this function shows that as the coin becomes fair, the information conveyed in each *random* flip is maximized.
I believe the equation was developed by Shannon, but that is a minor point.
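For reference, that binary entropy curve can be computed in a few lines (plain Python, nothing beyond the formula above is assumed):

from math import log2

def coin_entropy(p):
    # Entropy in bits of a coin that lands heads with probability p.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(p, round(coin_entropy(p), 3))
# 1.0 bit at p = 0.5 (a fair coin), falling towards 0 as the coin becomes deterministic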
PvM · 25 October 2008
Pimp Van Pickle · 26 October 2008
Pierce, John R. (1961). An Introduction to Information Theory. Dover Publications, New York, NY.
Shannon, Claude E. (1948). A Mathematical Theory of Communication. The Bell System Technical Journal.
PvM · 26 October 2008
PvM · 26 October 2008
PvM · 26 October 2008
Or
Jan T. Kim, Thomas Martinetz and Daniel Polani, Bioinformatic principles underlying the information content of transcription factor binding sites, J. theor. Biol. (2003) 220, 529–544.
Pimp Van Pickle · 27 October 2008
If DNA sequences do not compress well, that by itself could indicate any of the following:
- They are completely random. (Random source)
- They have been corrupted by noise. (Random interference)
- They represent optimally (efficiently) encoded messages.
- They have been encrypted.
- The compression algorithm cannot recognize the pattern in the sequences, which can be remedied with a better and perhaps novel compression algorithm.
- Any number of combinations of the above
- ...other possibilities probably exist
Likewise, if most DNA sequences do compress well, it by itself could indicate:
- It has low information content (the genetic language is redundant).
- It has error correction and detection built into the code.
- It has been encrypted. (Code is not intended to be broken)
- It has been obfuscated
- ...other possibilities probably exist
Strictly speaking, compressibility or entropy calculations cannot tell you how much information (in the English language sense of the word) actually exists in the DNA sequence, or how much of it is "random". However, compressibility is one tool in providing a possible upper bound on the information content of genetic sequences, assuming the genetic alphabet can legitimately be thought of as an alphabet, and the sequences can be legitimately thought of as messages.
P.S. Schneider's R is a suspect application of the information rate of a binary symmetric memoryless channel (BSC). But I am not sure how that bolsters your incorrect argument that entropy and information are inversely related. Noise reduces channel capacity. So what?
PvM · 27 October 2008
PvM · 27 October 2008
Wesley R. Elsberry · 27 October 2008
Dave Lovell · 27 October 2008
Pimp Van Pickle · 27 October 2008
eric · 27 October 2008
I draw a larger point from this discussion about Shannon vs. Kolmogorov (vs. other definitions of information), entropy vs. information. Which is that the ID claim that 'evolution cannot produce new information' is at best vague and ill-defined, because there are multiple, legitimate, yet different ways to define 'information' yet IDers do not state which definition they are using.
But 'it's vague' is the kindest thing you could say. 'Nonsensical' is a more apt description of the ID argument, since multiple conflicting definitions mean there will be mutations that increase information under one definition but decrease it under another.
As if choosing between conflicting definitions weren't enough of a problem, IDers also have to contend with the follow-on issue that scientists may use whatever definition is best for solving the specific (scientific) problem at hand, without philosophically committing to any one definition as an ultimate or objective truth. I'm arguing that our mathematical descriptions of the concept 'information' are powerful tools in the toolbox, but they remain tools, not paradigms or deeply-held premises. So the bedrock assumption needed for the ID argument to even make sense - that there is only one, objective definition of information by which DNA content should be measured - is rendered false, making the entire ID claim philosophically meaningless.
PvM · 27 October 2008
PvM · 27 October 2008
A final reference, in addition to Schneider and Adami:
Chen et al, Divergence and Shannon Information in Genomes, Physical Review Letters, 94, 178103, 2005
They show that there are two perspectives on information: one is the fidelity of the transmission, the other is the information in the received message itself. Pointing out that information increases with a decrease in uncertainty (entropy), they define Shannon information as the difference between before and after. The uncertainty before is, lacking any further data, that of a uniform distribution (maximum uncertainty / entropy), and it is reduced after the message is received.
So in the case of my coin example, the toss of 10 coins all coming up heads gives a before-estimate of uncertainty of 10 bits and, after the coins are tossed, an uncertainty of zero bits; the reduction of 10 bits is the Shannon information.
Hope this clarifies two different ways of looking at information.
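In code, that before-and-after bookkeeping for the 10-coin example looks like this (a sketch of the definition only):

from math import log2

coins = 10
h_before = coins * log2(2)    # 10 bits: each coin equally likely to be heads or tails
h_after = 0.0                 # once the outcome (all heads) is observed, no uncertainty remains
print("Shannon information =", h_before - h_after, "bits")   # 10.0 bits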
Shepherd Moon · 6 November 2008
Most of this discussion is over my head when it comes to the details of calculating information content or the relevance of information content to creationist arguments.
I will say that the argument does come up on the creationist forums in which I participate. The main debates along these lines have included the following. I will sum up the position of the creationist in question. This is not to imply that I've won the debates - I'm sure someone here could do better against these arguments.
1. Differences in chimpanzee and human DNA
My opponent presents a two-pronged case, both prongs of which may in fact contradict each other.
First, he argues that the evidence for 98% similarity between chimpanzee and human DNA is not convincing. He presents some calculations that the similarity is closer to 88% or 90%.
Second, he presents calculations for the amount of information (2 bits per base) in human and chimpanzee DNA. Then he says that the difference - however many K or MB - does not explain how humans can display so much more complexity if the information difference between the species is relatively small. He can't see how all of this complexity is squeezed into 1 or 2 MB.
My counterarguments were:
A. 88% or 90% similarity is still very high. That suggests either (1) humans are less complex than he thinks or (2) chimpanzees are more complex than he thinks.
B. DNA may encode shorthand instructions or some other way of yielding complex behavior without having to encode all the resulting behavior in the DNA. For example, War and Peace can be written by a human without having to have the entire information content of War and Peace in his DNA, even though we still do need to explain how DNA can result in the creation of a complex object such as War and Peace.
C. If my opponent is arguing that the chimpanzee-human difference is greater than usually claimed, then that greater difference provides more available space to store the complexity he is claiming won't fit. So can my opponent quantify how much information content humans have *in total*, so that we can see the difference with chimpanzees more clearly and decide whether the actual information content differences really are too big?
2. Whether information has any "weight" or other tangible property.
I did not weigh in on this debate but simply noted it. The argument was basically this:
a. Take a box of fine sand.
b. Weigh the box and record the result.
c. Write a message in the box of sand with a stick.
d. Weigh the box and record the result.
e. Shake the box to erase the message.
f. Weigh the box and record the result.
If the box weighs the same before and after the message has been erased, then how can one say that information has any material existence? The responses by the other creationists all affirmed that the fact that no weight difference is detected is proof that information is not material or tangible. Thus support is perceived for the existence of the supernatural.
My counterargument (to myself, because again, I did not reply to that thread) is that it takes energy to write the message in the sand and energy to erase it. So if one had a scale sensitive enough to measure such minute changes in energy, one would probably detect differences in the results, if not in weight then perhaps in temperature.
I would also ask the author to give an example of information that is not encoded in matter or energy. For even if the purported source of information is supernatural or the information is not measurable by conventional material tools, in order for us to perceive the information it has to be presented to our senses, and our senses function with measurable data as input.
3. Whether mutations add any information to DNA
The argument in this case is based almost completely on Lee M. Spetner's book Not by Chance!. Basically, the author and my opponent take the view that the probability of a mutation that adds information is so ridiculously low (something like 2.7×10^-2,739) that Darwinian evolution has been refuted on statistical grounds.
My counterargument, such as it was, was that I suspect a probability trick or mistake whereby the author is multiplying probabilities too much - something like the case I heard about where a lawyer used 10 or 11 properties described by a witness to claim that the odds against a defendant were 10^11 to 1, when in fact the probabilities were not really independent.
I would be eager to learn if there are rebuttals to the arguments above that I could make note of and use in the future. But I wanted to post them here because I enjoyed your article and can tell you from experience that creationists are relying heavily on arguments that involve the information content of DNA. And given that the subject is so complicated, I think it would be beneficial to come up with well-illustrated and easily digestible refutations, where they exist, of the creationist arguments.
Cheers,
Shepherd Moon
Henry J · 6 November 2008
Daniel Pope · 7 November 2008
That last citation is completely right but I don't think you've understood it fully, PvM. You've stated all the definitions right but applied them wrongly.
Let's go back to the coin example, which you misstated slightly. A fair coin has a maximum entropy of 1 bit. If you flip the coin, it's randomised such that its entropy is still 1 bit. I = H_max - H = 1 - 1 = 0. So the information is 0 bits. So far so good.
But suppose I deliberately turn the coin to heads. The probability that it reads heads is 1. So its entropy is 0. So I = H_max - H = 1 - 0 = 1 bit.
What I think you've misunderstood is that you're not conveying information by just flipping coins. You convey information by deliberately trying to set coins. If I've got a line of 1000 coins, I can set them to heads or tails and convey up to 1000 bits of information. If I have a small probability of making a mistake, I have a higher received entropy. But I can still convey information, just less efficiently. If there's a 1% chance we miscommunicate (I set the coins wrong or you read the coins wrong), the receiving entropy is 0.08 bits.
Incidentally, it's not entropy "before" and "after" (something which has also been said in this thread). There's only one observation of the information - the "after". I can still convey the same information if the 1000 coins were all set to heads beforehand. The H_max is the entropy you know the message to have. If you know in advance that I'm going to be setting 90% of the coins to heads, you receive less information per coin - only 0.47 bits per coin. If we also make 1% mistakes, we can only communicate 0.38 bits per message.
To extend this to DNA is fairly easy. You state that you know that the entropy per coding triplet is 5.6 bits, but knowing with exact certainty the value of the triplet means the received entropy is 0. So the information we can read is 5.6 bits per triplet.
What we've just discussed is the information we gain by recording a natural (coding) DNA strand exactly, base-for-base. I'm not a biologist (I'm a computer scientist) so here's where I'm on shakier ground. I believe a DNA triplet has 64 possible states but those only code for 20 amino acids, and a "stop" code, right? RNA polymerase can only read those 21 symbols. Also it has no statistical information about the distribution of codons, but the coding does amount to statistical information about amino acids. Looking at the number of codons for each symbol (http://en.wikipedia.org/wiki/Codon), the entropy of DNA is computed as follows (in Python):
>>> from math import log
>>> cf = {'Cys': 2, 'Asp': 2, 'Ser': 6, 'Gln': 2, 'Lys': 2, 'Trp': 1, 'Pro': 4, 'STOP': 3, 'Thr': 4, 'Ile': 3, 'Ala': 4, 'Phe': 2, 'Gly': 4, 'His': 2, 'Leu': 6, 'Arg': 6, 'Met': 1, 'Glu': 2, 'Asn': 2, 'Tyr': 2, 'Val': 4}
>>> sum(cf.values())
64
>>> [(v/64.0)*log(v/64.0)/log(2) for v in cf.values()]
[-0.15625, -0.15625, -0.32015976555739162, -0.15625, -0.15625, -0.09375, -0.25, -0.20695488277869584, -0.25, -0.20695488277869584, -0.25, -0.15625, -0.25, -0.15625, -0.32015976555739162, -0.32015976555739162, -0.09375, -0.15625, -0.15625, -0.15625, -0.25]
>>> -sum(_)
4.2181390622295662
So the information extracted from DNA by RNA polymerase is 4.218 bits per codon.
PvM · 7 November 2008
Daniel Pope · 8 November 2008
PvM · 8 November 2008
Daniel Pope · 9 November 2008
PvM · 9 November 2008
PvM · 9 November 2008
Perhaps the following references can help you understand my viewpoint
Adami Evolution of Complexity
Schneider Evolution of Biological Information
Daniel Pope · 10 November 2008
PvM · 10 November 2008
PvM · 10 November 2008
PvM · 10 November 2008
Abdul Sattar Real · 12 December 2009
Excellent. Thank you very much
Simon B · 17 December 2009
Given the way you have addressed the information content of the human genome in this article, could one address the information content of the 32-volume 2010 edition of the Encyclopedia Britannica in the same way?
And what sort of value for storage of the EB's information content would be arrived at?