In this episode of Icons of ID I will take a quick look at how the definition of information used by ID proponents amounts to nothing more than an argument from probability. In fact, when ID proponents claim that chance and regularity cannot create complex specified information (CSI), all they are saying is that such pathways, as far as we know, are improbable. If a pathway is found that is probable, the measure of information, which is confusingly linked to probability, decreases.
In fact, I argue that intelligent designers similarly cannot generate complex specified information, since the probability of intelligent designers designing is close to 1.
Information and probability
Elsberry and Wilkins on CSI
Then again, the choice of the term “complex specified information” is itself extremely problematic, since for Dembski “complex” means neither “complicated” as in ordinary speech, nor “high Kolmogorov complexity” as understood by algorithmic information theorists. Instead, Dembski uses “complex” as a synonym for “improbable”.
So how does Dembski define information?
I(X) = -\log_2 P(X)
So in other words, information is the negative logarithm of the probability. But which probability is this? Others have shown how Dembski is unclear on this issue and often shifts between uniform probabilities and actual probabilities, whichever seems more convenient.
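To see how mechanical this measure is, here is a minimal Python sketch (my own illustration, not code from Dembski; the probabilities are hypothetical) showing that the same event can carry 100 bits or 1 bit of “information” depending purely on which probability one plugs in:

```python
import math

def dembski_information(p):
    """Dembski's 'information' I(X) = -log2 P(X): a rescaled probability."""
    return -math.log2(p)

# The same outcome evaluated under two choices of probability:
uniform_p = 2.0 ** -100  # uniform chance over all 2**100 binary strings
actual_p = 0.5           # a hypothetical 'actual' generative probability

print(dembski_information(uniform_p))  # 100.0 bits -- looks highly 'complex'
print(dembski_information(actual_p))   # 1.0 bit   -- hardly complex at all
```

Nothing about the event itself changes between the two calls; only the probability assignment does, which is exactly the ambiguity noted above.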
Dembski mentions in NFL that this measure of information is similar to Shannon information. In fact, Shannon’s entropy is the average of Dembski’s information measure. This confusion between information and entropy is not limited to Dembski’s writings, however, so let’s look at Shannon entropy and information in more detail.
Claude Shannon: A mathematical theory of communication
In 1948 Shannon published his seminal paper “A mathematical theory of communication”.
Shannon shows that the logarithm is the natural choice for expressing the concept of information. Entropy, a weighted measure of information, is basically the expected value of the information present. In other words:
If there are n messages X = {X_1, …, X_n} with probabilities p(X_1), …, p(X_n), then the Shannon entropy of this set is defined as:

H(X) = -\sum_{i=1}^{n} p(X_i) \log_2 p(X_i)
or in other words
H(X) = E(I(X))
Entropy is maximal when all outcomes are equiprobable.
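A quick sketch (my own illustration, with made-up distributions) confirms both the definition and the claim that the uniform distribution maximizes entropy:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(X_i) * log2 p(X_i), i.e. E(I(X))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, the maximum for n = 4
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: a skewed distribution
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0 bits (printed as -0.0): a certain outcome
```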
Information is defined as
I_S(X) = H_{max} - H(X)
Information in the Shannon sense is defined as the change in entropy before and after a particular event has taken place. Shannon information, also known as surprisal, is any data that is not already known. In fact, when rare events occur, they generate a lot of information.
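The following sketch (again my own illustration, with hypothetical numbers) shows I_S(X) = H_max - H(X) at work: as evidence concentrates the probabilities, entropy drops, and the difference is the information gained.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_max = entropy([0.25] * 4)                  # 2.0 bits: four equiprobable alternatives
h_after = entropy([0.85, 0.05, 0.05, 0.05])  # ~0.85 bits: one alternative now dominates
print(h_max - h_after)                       # ~1.15 bits of information gained
```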
Tom Schneider has some good resources on this.
So what we have learned so far is that Dembski’s information measure is nothing more than a rescaled probability, one that resembles Shannon’s entropy measure rather than Shannon’s information measure. The choice of the term “information” is therefore quite unfortunate.
So let’s try to understand why Dembski argues that regularity and chance cannot create CSI. The answer is simple: if such processes have a high probability of success, their Dembski information measure will be low.
But the same problem applies to intelligent designers. Given a particular ‘intelligently designed’ event, its probability is high and thus its information is low. In other words, according to Dembski’s own measure, nothing can create CSI other than pure chance.
Not much of a useful tool, then, and the poor choice of the information measure has caused much unnecessary confusion, when in fact all Dembski is doing is repeating the age-old creationist argument that evolution or abiogenesis is too improbable.
Talkorigins has some good FAQs on what’s wrong with these arguments.
It seems that ID is not only theoretically flawed in its claims but also empirically flawed, in that it has failed to be scientifically relevant. To these flaws we can add Dembski’s probability-based arguments, made all the more confusing by his use of the term ‘information’ where ‘entropy’ would have been more accurate.
It seems that the intelligent designer is as powerless to create CSI as regular processes are. Or, to put it the other way around, an intelligent designer is exactly as capable of creating CSI as regular processes.
Tom Schneider attracted Dembski’s ire for showing how the simple processes of variation and selection can actually increase the information in a genome.
Dembski’s complexity measures have many problems.
Surprisingly, various ID proponents, such as Fred Heeren, seem to have taken Dembski’s claims at face value. Heeren quotes another unsupported, and in fact falsified, claim by Dembski:
William Dembski puts it this way: “Specified complexity powerfully extends the usual mathematical theory of information, known as Shannon information. Shannon’s theory dealt only with complexity, which can be due to random processes as well as to intelligent design. The addition of specification to complexity, however, is like a vise that grabs only things due to intelligence. Indeed, all the empirical evidence confirms that the only known cause of specified complexity is intelligence.”
Careless terminology, contradictory statements and examples, and inflated claims all seem to have made the design inference ‘quite problematic’.
Understanding what “regularity,” “chance,” and “design” mean in Dembski’s framework is made more difficult by some of his examples. Dembski discusses a teacher who finds that the essays submitted by two students are nearly identical (46). One hypothesis is that the students produced their work independently; a second hypothesis asserts that there was plagiarism. Dembski treats the hypothesis of independent origination as a Chance hypothesis and the plagiarism hypothesis as an instance of Design. Yet, both describe the matching papers as issuing from intelligent agency, as Dembski points out (47). Dembski says that context influences how a hypothesis gets classified (46). How context induces the classification that Dembski suggests remains a mystery.
Elsberry and Shallit have written an excellent paper, “Information Theory, Evolutionary Computation, and Dembski’s ‘Complex Specified Information’”. They address Dembski’s fallacious reliability claims, present the differences between rarefied design and ordinary design, and examine the problems with apparent versus actual complex specified information (CSI).
Intelligent design advocate William Dembski has introduced a measure of information called “complex specified information”, or CSI. He claims that CSI is a reliable marker of design by intelligent agents. He puts forth a “Law of Conservation of Information” which states that chance and natural laws are incapable of generating CSI.
In particular, CSI cannot be generated by evolutionary computation. Dembski asserts that CSI is present in intelligent causes and in the flagellum of Escherichia coli, and concludes that neither have natural explanations. In this paper we examine Dembski’s claims, point out significant errors in his reasoning, and conclude that there is no reason to accept his assertions.
31 Comments
Les Lane · 7 July 2004
Dembski has backed off his "Law of Conservation of Information". Immunoglobulin genes (for example) are information creating machines. Dembski recognizes this and now claims that natural systems can't create "complex information". The boundary between "simple information" and "complex information" is vague. The phrase "complex specified information" returns zero (of 13 million) articles in Science Citation Index.
rick pietz · 7 July 2004
I'm really torn by your post. On the one hand, I think the very idea that anyone spends time refuting people like Dembski is a nonproductive expenditure of intellectual capital. On the other hand, if crap like this isn't refuted, it grows a life of its own. On the third hand, the people who buy into this crap in the first place aren't ever going to read or hear the argument against it, and if they do, they'll accept Dembski's explanations, and it still takes on a life of its own.
Pre-'urban legends' die even harder than the new ones.
Pim van Meurs · 7 July 2004
I agree, and I am struggling with these issues as well. But I have found from past experience that although having correct data available may not convince committed creationists, it may prompt some to investigate further. As such, I believe that presenting the arguments for why the ID approaches do not work, in an accessible manner, is important.
T. Russ · 7 July 2004
Dembski's put up another essay on his site. Just giving you guys the heads up. Enjoy!
Information as a Measure of Variation By William Dembski
http://www.designinference.com/documents/2004.07.Variational_Information.pdf
steve · 7 July 2004
Steve · 7 July 2004
It's endlessly funny to me that lots of Cold Fusion papers were worth publishing, and no ID ones are. The IDiots can't even meet the bar of basic competence Cold Fusion research met.
Les Lane · 8 July 2004
Steve-
Thanks for the tip. "Cold fusion" returns 661 references on Science Citation Index. It's quantitatively roughly 20 times "more productive" than ID.
David Wilson · 12 July 2004
Pim van Meurs · 12 July 2004
By conflating information with probability, Dembski has introduced quite a difficulty, namely that information is no longer what we commonly take it to mean. Rather than information being a measure of 'surprise', information becomes very similar to the concept of entropy. Because of his usage of probability as an information measure, he is faced with the problem that neither regularity/chance nor intelligent designers can create complex specified information.
The definition of information I chose is indeed for a uniform distribution, which is not a bad assumption for initially random distributions and which matches Shannon's usage of these concepts.
See, for instance, Randomness, Order and Replication.
But you are right, the definition can easily be generalized further. In Dembski's latest opus he is somewhat more careful in his definitions, but his usage of 'self-information', or probability, for information generates a lot of confusion and seems self-contradictory.
David Wilson · 16 July 2004
Erik 12345 · 16 July 2004
Pim van Meurs wrote in the blog entry above:
"Dembski mentions in NFL that this measure of information is similar to Shannon information. In fact Shannon's entropy is the average of Dembski's information measure."
About Dembski's definition, David Wilson commented:
"This definition is not peculiar to Dembski. I have seen several texts on information theory which refer to the quantity -log2(X) as the amount of information one obtains when one learns that the event X has occurred. It is sometimes called the self-information of X. It is true, as Dembski notes in his latest opus, that the mathematical development of information theory makes very little direct use of this quantity. In textbooks its role seems to be confined to motivating the definition of entropy, and many (perhaps most) don't even mention it."
The first statement is, for reasons to be explained below, mathematically ill-defined and the second is partially, but not completely, wrong.
Let us recall that an event is a set of outcomes. For example, when we consider a tossed die, we often choose the outcomes to be the face of the die that lands upwards, i.e. the outcomes are 1, 2, 3, 4, 5, and 6. (One could make finer distinctions and consider also the orientation of the die and even its position on the table, but this is not common. An outcome is a result of the experiment that is fully resolved to whatever precision we have chosen to consider.) An example of an event is the set of all odd outcomes, i.e. {1,3,5}. Another example of an event is the set of all outcomes smaller than 5, i.e. {1,2,3,4}.
Now, it is true that both Dembski and some information theory textbooks define "information" and "self-information"/"surprisal", respectively, in an event A as
-log(Pr(A))
The difference between the definitions is in the restrictions on the A's we are allowed to plug into the formula. This is a difference that most people probably don't think too much about, but it is a very important one nonetheless. Dembski allows us to plug in any event A. Information theory textbooks, on the other hand, require us to first partition the possible outcomes into non-overlapping events A1, A2, ..., An. The events we are allowed to plug into the formula for "self-information"/"surprisal" are then restricted to the events of this partition. This restriction is crucial, because without it, it is meaningless to speak about average "self-information"/"surprisal" (note that Dembski makes no such restriction, and it is consequently meaningless to speak about the average of his information). We can meaningfully average over all outcomes, or over the events of a partition, but not over all events.
Shannon entropy is therefore NOT the average of Dembski's information measure and Dembski's definition is NOT the same as the "self-information"/"surprisal" introduced in some presentations of information theory.
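[A small sketch, my own illustration of Erik's point using the die example above, makes the distinction concrete: averaging surprisal over the cells of a partition is well defined and yields the entropy, whereas "all events" overlap, so their probabilities do not even sum to 1 and cannot serve as weights.]

```python
import math
from itertools import chain, combinations

outcomes = [1, 2, 3, 4, 5, 6]  # a fair die; each outcome has probability 1/6

def prob(event):
    """Probability of an event, i.e. a set of outcomes."""
    return len(event) / 6

def surprisal(event):
    return -math.log2(prob(event))

# A partition: non-overlapping events that together cover all outcomes.
partition = [{1, 3, 5}, {2, 4, 6}]  # odd vs. even
print(sum(prob(a) * surprisal(a) for a in partition))  # 1.0 bit: the entropy

# 'All events' are the 63 nonempty subsets of the outcomes; they overlap,
# so their probabilities sum to far more than 1 and cannot act as weights.
events = [set(s) for s in chain.from_iterable(
    combinations(outcomes, r) for r in range(1, 7))]
print(sum(prob(a) for a in events))  # 32.0, not 1.0 -- no meaningful average
```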
David Wilson · 17 July 2004
David Wilson · 17 July 2004
Erik 12345 · 17 July 2004
Richard Wein · 17 July 2004
It seems to me the issue here is not the formula that Dembski uses but the context in which he uses it. When information theorists use the formula I = -log2(P), they do so in a context where it makes some sense to interpret this as "information". As David correctly notes, however, Dembski's introduction of the concept of "information" into what is purely a statistical argument is entirely gratuitous.
The Design Inference tells us to reject hypothesis H when
P(S|H) < alpha
where S is a "specified event" and alpha is a probability bound as proposed by Dembski.
What Dembski then does is apply the transformation I = -log2(P) to this inequality, telling us to reject H when
-log2{P(S|H)} > -log2{alpha}
or
I(S|H) > -log2{alpha}
Clearly, there is no genuine difference between the original inequality and the transformed one. In this context I is not being used as a measure of information in any useful sense. It is merely a rescaled probability.
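A trivial numerical check (my own sketch; the alpha value is Dembski's universal probability bound of 10^-150, and the event probabilities are made up) confirms that the two forms of the rejection rule can never disagree, since -log2 is strictly decreasing:

```python
import math

alpha = 1e-150  # Dembski's universal probability bound

def reject_by_probability(p):
    return p < alpha

def reject_by_information(p):
    # I = -log2 p; the inequality flips because -log2 is decreasing.
    return -math.log2(p) > -math.log2(alpha)

for p in (1e-200, 1e-150, 1e-100, 0.5):
    assert reject_by_probability(p) == reject_by_information(p)
print("Identical decisions in every case: I is just a rescaled probability.")
```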
David also correctly notes that very little use of the quantity I = -log2(P) is made in information theory. Shannon's seminal 1948 paper does not even mention it. I believe this is because it is not in fact a useful measure of information. The useful work in Shannon information theory is performed by ensemble measures such as entropy.
Richard Wein · 17 July 2004
For what it's worth, here's a link to something I wrote on the subject a while ago:
http://www.talkorigins.org/design/faqs/nfl/#shannon
I should add that I don't claim to be any expert on this subject.
Erik 12345 · 18 July 2004
Erik 12345 · 18 July 2004
This post is just an experiment with the Post-a-Comment script. The first section was previewed and modified by the preview script. The second section was written after previewing and contains exactly the same text as the first section originally contained.
Section 1 (previewed): If we denote the number of binary digits in the codeword assigned to Ak by L(Ak), then the constraints (i) & (ii) turn out to be equivalent to the constraint
(*) SUM 2^-L(Ak) = L(A1) Pr(A1) + L(A2) Pr(A2) + . . . + L(An) Pr(An).
subject to (*) ...
Section 2 (not previewed): If we denote the number of binary digits in the codeword assigned to Ak by L(Ak), then the constraints (i) & (ii) turn out to be equivalent to the constraint
(*) SUM 2^-L(Ak) <= 1.
Minimizing the average codeword length
subject to (*) ...
Section 3: Conclusion. There is an evil undocumented feature in the preview script that can completely change the meaning of a comment.
Russell · 18 July 2004
Dr. 12345: "There is an evil undocumented feature in the preview script that can completely change the meaning of a comment."
Yes. I've noticed that. I think the evil lurks in the "less than" sign. I think it works if you don't preview, but once you do, it aborts everything that follows.
David Wilson · 19 July 2004
Erik 12345 · 19 July 2004
David Wilson · 20 July 2004
Erik 12345 · 23 July 2004
Erik 12345 · 23 July 2004
Erik 12345 · 23 July 2004
Russell · 23 July 2004
Dr. 12345: "Just for the record, to aid those who evaluate my posts by weighing my authority against the authority of my opponents, I'll note that I don't have a PhD in any field."
Yes, and Dembski has two, which speaks volumes about the significance of "PhD". I'm going to continue to think of "doctor" in its etymological sense ("teacher").
I gave up trying to follow the Wilson - 12345 discussion. It's over my head. But can we summarize for the masses? As I understand it DW said that one or a few or some of the technical indictments of Dembski's work are unwarranted. Much discussion between DW and E1 later: sort of yes, sort of no.
Big picture now: are there any mathematicians reading this who find Dembski's arguments, specifically with respect to biology, compelling?
I find his understanding of biology so ludicrous that I'm not strongly motivated to educate myself on the mathematical legerdemain he uses to rationalize it.
(He reminds me of a math prof at my college who had a "mathematical proof" that all numbers are equal to 47 (our school's numerical mascot). Only, that prof knew it was a joke.)
steve · 23 July 2004
I don't think it reflects poorly on the PhD degree. I have known many science PhDs, and they are all very intelligent. Unfortunately, intelligence is not always homogeneously distributed throughout someone's range of thinking. Some people are smart in everything they do, some are a little smarter in some things than others, and some people are intelligent in some respects and crazy in others.
There's no doubt Kurt Godel was among the brightest all-time logicians. And bright at math, bright at physics. Einstein enjoyed talking physics with him. Yet he was also somewhat crazy. He died of starvation because he thought everyone was out to poison him.
Lots of people are bright at some things, crazy at others. You can have a brilliant mathematician who thinks communism's a good idea. A brilliant journalist who thinks Sun Myung Moon is the second coming. It seems like especially on religious topics, some bright people can turn off their minds and keep believing nonsense. Like Shermer said, it's not that smart people are without stupid beliefs, but they're really good at coming up with justifications.
Russell · 23 July 2004
Steve:
I'm certainly not suggesting that a PhD is negatively correlated with having worthwhile thoughts to share. I am proposing, though, that you can garner any number of degrees without ever having a worthwhile thought to share.
In the case of our ID friends, it may be that there's some Godel-like island of competence that I'm not aware of. (Well, rhetoric. I'd have to grant they're good at that.)
But worthwhile thoughts?
steve · 23 July 2004
Same difference :-)
(to use an oxymoron, like Loving God (w/r/t the biblical one))
David Wilson · 31 July 2004
David Wilson · 31 July 2004