Paul Nelson has developed a liking for ORFans, sequences of DNA that appear to code for proteins, but have (or had) no
currently detectable homology to other genes. He feels they represent a difficulty for "Darwinian" accounts of gene origin and common descent. I have previously discussed why
ORFans present no challenge to modern evolutionary theory; Dr. Nelson even showed up in the discussion.
More recently, he has been promoting ORFans again, without indicating he has learnt anything at all from our discussion. In particular, in a recent article in the
Christian Post he claims that 28% of the genes in
Mycoplasma genitalium are ORFans.
Nearly one-third of the protein-coding genes of mycoplasma, the simplest "free-living thing" up until last year, are unknown genes or ORFans.
Unfortunately for him, the actual number is zero. Yes, that's right,
zero. How did he get it so wrong?
Let me repeat that again. The number of ORFans in
Mycoplasma genitalium is zero (Siew et al., 2003). What Nelson has done is confuse two entirely different concepts, and it doesn't help his case at all.
The papers Nelson is thinking of look at the number of
essential genes in the bacterium
Mycoplasma genitalium (Glass et al., 2006; Hutchison et al., 1999).
M. genitalium has one of the smallest genomes known for a bacterium that can be grown in pure culture, with only 482 protein coding genes. What these papers did was systematically mutate these genes to find out which ones were truly essential. They found that 382 of the 482 protein coding genes were essential. However:
"One of the surprises about the essential gene set is its inclusion of 110 hypothetical proteins and proteins of unknown function. Some of these genes likely encode enzymes with activities reported in M. genitalium, such as transaldolase (21), but for which no gene has yet been annotated." (Glass et al., 2006)
That is, roughly 28% of the proteins in the essential set are proteins of unknown function. Here is where Nelson makes his mistake. He equates proteins having an unknown function with being an ORFan (having no known homology with other genes). And this is just wrong.
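To make the distinction concrete, here is a minimal Python sketch. It checks the arithmetic behind the 28% figure using the counts quoted above (110 of 382 essential genes), and it treats "unknown function" and "no detectable homologs" as two separate properties of a gene. The `Gene` class, the helper functions and the two `example_*` entries are hypothetical, made up purely for illustration; only MG125 (unknown function, many homologs) comes from the Glass et al. data discussed in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    function_known: bool  # has a function been assigned?
    has_homologs: bool    # are related sequences detectable in other genomes?


def is_unknown_function(gene: Gene) -> bool:
    """A gene of unknown function may still have plenty of relatives."""
    return not gene.function_known


def is_orfan(gene: Gene) -> bool:
    """An ORFan is defined by the absence of detectable homologs."""
    return not gene.has_homologs


# Counts quoted in the text (Glass et al., 2006).
ESSENTIAL_GENES = 382
ESSENTIAL_UNKNOWN_FUNCTION = 110
unknown_share = ESSENTIAL_UNKNOWN_FUNCTION / ESSENTIAL_GENES

genes = [
    # From the post: unknown function, but a member of a large family.
    Gene("MG125", function_known=False, has_homologs=True),
    # Hypothetical entries, for illustration only.
    Gene("example_annotated_gene", function_known=True, has_homologs=True),
    Gene("example_orfan", function_known=False, has_homologs=False),
]

unknown = [g.name for g in genes if is_unknown_function(g)]
orfans = [g.name for g in genes if is_orfan(g)]

print(f"Unknown function among essential genes: {unknown_share:.1%}")
print("Unknown function:", unknown)
print("ORFans:", orfans)
```

The two lists differ: a gene like MG125 lands in the "unknown function" list without being an ORFan, which is exactly the case Nelson's 28% figure describes.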
In fact, there are large families of proteins of unknown function that are found in nearly all organisms studied. They form perfectly good phylogenies; we just don't know what they do (Galperin & Koonin, 2004). For example, the
E. coli gene ybeM has homologs in 52 other organisms, from yeast to humans, and is a member of the nitrilase family, but we just don't know what it does. This is
diametrically opposed to what Nelson wants to claim, that there are a large number of genes that have no apparent ancestors.
"Where are the similar sequences that gave rise to these ORFan genes? Where are the necessary intermediates that must have been there? Where are the parents, if you will, of these mysterious genetic words?"
What probably got Nelson confused was that 28% of
M. genitalium putative protein coding genes
were orphans when the
M. genitalium genome was first reported. The percentage of essential protein coding genes with unknown function is also roughly 28%. The same figure, but completely different concepts.
Unfortunately for Nelson, by 2003 homologs for all the putative protein coding genes had been found (Siew et al., 2003). Those similar percentages, possibly with help from
Morton's Demon, caused Nelson to conflate the two separate concepts, genes with no apparent homologs (ORFans) and genes with unknown function*.
This is an object lesson in reading research papers carefully.
Another aspect Nelson keeps on misrepresenting is the number of ORFans. As we sample more genomes, the number of ORFans rises.
But, because we are finding many more known genes, and finding relatives for genes that were previously ORFans, the percentage of ORFans is decreasing. The figure below (taken from Siew et al., 2003) shows this. The dots represent the number of protein coding genes that we are finding, the solid line is the number of ORFans we are finding, and the dashed line is the percentage of ORFans we have found.

Clearly, as time goes on, we are finding more and more relatives for the ORFans (in 2003, the number of protein coding ORFans was around 5%#; by 2005 (Wilson et al., 2005), it was down to 1.2%). Unfortunately for Nelson, we are finding the alleged "missing parents" of the ORFans+. And this is with only a minute sampling of the diversity of microbial life (we have sampled only 0.02% of all bacterial genomes, and we are biased towards a subset of human pathogens at that).
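For anyone who finds it odd that the number of ORFans can rise while the percentage falls, a small Python sketch may help. The snapshot figures below are invented for illustration and are not the values from Siew et al. (2003) or any other paper; the only point is that the ORFan count can grow at every step while its share of all sequenced genes shrinks, provided the total grows faster.

```python
# Hypothetical snapshots: (label, total protein coding genes, ORFans).
# These numbers are made up; they are NOT taken from any paper.
snapshots = [
    ("early", 10_000, 2_000),
    ("middle", 100_000, 8_000),
    ("late", 1_000_000, 20_000),
]


def orfan_percentage(total_genes: int, orfans: int) -> float:
    """Share of sequenced genes that currently have no detectable homolog."""
    return 100.0 * orfans / total_genes


counts = [orfans for _, _, orfans in snapshots]
percentages = [orfan_percentage(total, orfans) for _, total, orfans in snapshots]

for (label, total, orfans), pct in zip(snapshots, percentages):
    print(f"{label:>6}: {orfans:>6} ORFans out of {total:>9} genes = {pct:.1f}%")
```

In this toy series the count goes up tenfold while the percentage drops tenfold, the same qualitative pattern as the solid and dashed lines in the figure.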
Back in April I issued the following challenge to Paul Nelson:
I would also assume, in respect for scholarly accuracy, that in your next talks you will show the exponential decrease in percentage ORFans, as well as the increase in number, as well as mentioning the experimental tests of hypotheses about ORFans that have been under taken and their results.
He hasn't done that as yet, and I issue the challenge again, Paul. Show the exponential decrease graphs and mention how biologists are testing hypotheses about ORFans.
Oh, and don't conflate "unknown function" with ORFan.
Update: Just to make my point, I've gone through the essential unknown function and hypothetical proteins in
M. genitalium by hand (see table 2 in the
Supporting documents of the Glass et al., 2006 paper). They are present in large to modest families, not ORFans.
For example
MG125 (white bar in
Fig 2 of Glass et al., 2006), is of unknown function, but is a member of the Cof hydrolase family, and has many
homologs in a wide variety of bacteria.
MG442 is a conserved hypothetical protein of unknown function that has
similarities to GTPase enzymes, but we still don't know what it does.
MG461 is an HD domain protein of unknown function, but with
widely distributed homologs.
MG459 is a conserved hypothetical protein of unknown function with many homologs in other bacteria, similar in structure to MECDP_synthase.
MG074 is a conserved hypothetical similar to trichodiene synthase in a number of bacteria.
And so on. You can do this for yourself; Paul Nelson could have done it himself after I alerted him to the problem back in April. Of the 110 "unknowns", 65 are from systems which have good phylogenies and known relations to functional proteins, but we just don't know what they do yet; 45 are
conserved hypothetical proteins, where we don't even have structural clues to function yet. But note the "conserved": that's because they are found in a wide variety of organisms.
Bottom line: not ORFans.
Update to the Update: I have created a table with BLAST tree views of the 45
conserved hypothetical proteins. It turns out that many of them are either members of
Clusters of Orthologous Groups (i.e. parts of an extended family; for example, MG337 is a member of COG0822C) or close to members of COGs (like MG101, which has a
HTH_GTNR domain and appears to be related to transcription factors). Anyway, if you want a copy of the table, with hyperlinks to the trees and COGs, go to
Southern Skywatch Feedback and click on the email address there. The Treeviews can be misleading, as they don't show all the related proteins (click on the show removed sequences link to see the more distant relatives). I'm off to the
Australian Health and Medical Research Congress for a week, so I'm turning off the comments while I'm away, as I won't be able to moderate. Sorry.
*ORFans and proteins with unknown function can of course coincide, but in this case they don't.
#I'm ignoring the small ORFans, most of which are likely artefacts or bits of viral DNA, see my original post for details and Daubin & Ochman 2004.
+ Finding relatives is not necessarily easy, but the recent solution of crystal structures for ORFans is greatly increasing our ability to find relatives (e.g. Siew et al., 2005).
Daubin V, Ochman H. Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 2004 Jun;14(6):1036-42.
Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Res. 2004;32:5452-63.
Glass JI, et al Essential genes of a minimal bacterium. Proc Natl Acad Sci U S A. 2006 Jan 10;103(2):425-30.
Hutchison CA, et al., Global transposon mutagenesis and a minimal Mycoplasma genome. Science. 1999 Dec 10;286(5447):2165-9.
Siew N, Fischer D. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins. 2003 Nov 1;53(2):241-51
Siew N, Saini HK, Fischer D. A putative novel alpha/beta hydrolase ORFan family in Bacillus. FEBS Lett. 2005 Jun 6;579(14):3175-82.
25 Comments
Sir_Toejam · 20 November 2006
W. Kevin Vicklund · 20 November 2006
Andrea Bottaro · 20 November 2006
Ian Musgrave · 20 November 2006
Glen Davidson · 20 November 2006
In addition to the other reasons for why ORFans may exist, can't Nelson understand that all of the relatives of some genes may have gone extinct? I think we're back at God of the gaps yet again, the notion that if not all genes have extant relatives, then poof, the magic dragon did it.
Still no curiosity from Nelson as to why homologies exist at all, I see. Of course this comes down to one of the problems of theology (or at least much theology), one is not supposed to ask "why" or "how" of God. Since "God could do it" according to their description of this unobserved "entity", what should they care about how, why, or what it means?
Explain the relatedness of the large majority of genes that are presently known, Nelson. And if that explanation can be determined to be sound and truly explanatory, then make a plea about the ORFans. As you're not interested in explaining anything so far, leave the relevant questions up to the big boys and girls.
Glen D
http://tinyurl.com/b8ykm
Lurker · 20 November 2006
From the Christian Post article: "Nelson gave his colleague William Dembski's basic definition of Intelligent Design as the study of patterns in nature that are best explained as a result of intelligence."
I don't get how something is "a study" when it is already "best explained." Can somebody please elaborate that for me? Thanks.
bjm · 20 November 2006
Yes, it means 'we are not going to waste our time on any work - there's preaching to be done'
Sir_Toejam · 20 November 2006
...and:
"we don't need your pathetic level of detail!"
MarkP · 20 November 2006
It would be nice to see them just once admit openly that the reason they do no research is because they already know the answers. Behe came close in Dover.
Nick (Matzke) · 20 November 2006
Unsympathetic reader · 20 November 2006
Ian writes: (To Paul Nelson) "Show the exponential decrease graphs and mention how biologists are testing hypotheses about ORFans."
Better yet, he could describe how ID theorists have been testing hypotheses about ORFans. After all, the sorts of sequence analyses used in this field are dirt cheap and easy to do. The data is available in free databases. So where are the ID publications?
What a fruitful research program...
PvM · 20 November 2006
In Paul's defense, he is a young earth creationist, which means that scientific evidence can be interpreted only with the Truth in mind.
Marek 14 · 21 November 2006
"In Paul's defense, he is a young earth creationist, which means that scientific evidence can be interpreted only with the Truth in mind."
I cannot help but imagine the Truth in his mind, as a beautiful, shining white light that, unfortunately, sears the brain from within.
I wonder which is more dangerous - Truth, Lie, or Uncertainty. Or maybe it's some other grand concept, like Ingrown Nails...
Paul A. Nelson · 21 November 2006
Hi Ian,
Well, I can't say I wasn't warned. At an ID research meeting this past spring, during my presentation on ORFans (no, I can't shut up about them), a molecular geneticist there interrupted me to say that "unknown function" and "ORFans" strictly defined are not co-extensive classes. All ORFans are unknown function, but not all unknown function genes are ORFans, which I myself wrote when we last discussed ORFans here. So I knew that, but have been sloppy in my public presentations. I'll issue a correction, and ask that any videos or audios of my lectures on the topic include a clarification.
[Ian, in my original reply, the next paragraph described in detail how I'm going to work through the whole Glass et al. 2006 data set, which I should have done months ago (ouch). Now however I see from your update that you've done that already. If you have a file of your results, would you mind sending it to me? Short of that, can you send me the blast parameters that you used? I'm curious, for instance, how you would describe MG076, which I picked at random this morning to run blastp on. Also, how have you decided that short ORFans are annotation artifacts?]
So my Thanksgiving holiday will be spent working through the Glass et al. 2006 data set.
The second issue raised by your post concerns the evidential significance of ORFans to the design v. naturalistic evolution debate. I should tell you that many scientists on my side of the aisle have warned me in pretty strong terms that I'm paying too much attention to ORFans. 'They won't be significant, Paul,' they say, 'until their crystal structures are solved, and MANY more whole genomes are in. Genomic taxonomic sampling is much too sparse to say more than that.' I suppose to spare myself further embarrassment, I should agree and just shut up --- but then I run across findings like this:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16650908&query_hl=2&itool=pubmed_docsum
What else is out there? ORFans are woefully neglected with respect to their potential significance. So I talk about them a lot, obviously, hoping to attract attention to a really first-rate puzzle. Notice that, as I wrote the last time, decreasing percentage --- the feature of the puzzle you want me to stress --- pales in biological import to novel structures and taxonomically-restricted essentiality, which, when coupled with increasing number, do raise genuine problems for current theories of evolution. Indeed, if novel structures continue to be found, and those proteins are functionally essential but taxonomically restricted, the findings will have deep theoretical consequences. I think you underplay the shock that ORFans caused many genomics researchers (see, e.g., Russell Doolittle's comments in 2002 and 2005, surveying the ORFans and conserved hypothetical puzzles), shock that arises against the backdrop of the assumption of universal common ancestry and theories of gene and protein evolution. (Btw, in the Q & A of my public lectures, if asked, I always bring up the leading evolutionary hypotheses for ORFans, such as rapid sequence evolution and viral reservoirs.)
Anyway: thank you for calling me on my sloppiness. I should have paid more attention in April. The pain this time...okay, let's just say you've got my full attention.
Unsympathetic reader · 21 November 2006
Paul Nelson writes: "I should tell you that many scientists on my side of the aisle have warned me in pretty strong terms that I'm paying too much attention to ORFans."
Yes, it is interesting that the percentage of ORFans is decreasing over time. That suggests that "common design" must be far more common than "special design" and not as you originally anticipated. This suggests the creator was not of the 'Burger King' variety: i.e. Special orders *do* upset Us.
"ORFans are woefully neglected with respect to their potential significance."
Really? By IDists or biologists? I think the evolutionary biochemists are absorbing the data as it comes in.
PvM · 21 November 2006
Seems to me Paul reads too much into the 'shock' of ORFans. His focus on common ancestry, nay, his quest to prove Common Ancestry to be flawed (what happened to the Dissertation or the paper on ontogenetic depth btw?), causes him to see minor puzzles which arise when datasets increase and our tools and knowledge fail to keep up.
That's science...
It becomes ID when we throw up our hands and consider this ignorance to be evidence of something more.
Ian Musgrave · 21 November 2006
Steviepinhead · 21 November 2006
'Rev Dr' Lenny Flank · 21 November 2006
Hey Paul, I have some unanswered questions for you, here:
http://www.geocities.com/lflank/nelson.html
'Rev Dr' Lenny Flank · 21 November 2006
Ian Musgrave · 21 November 2006
B. Spitzer · 21 November 2006
'Rev Dr' Lenny Flank · 22 November 2006
Ian Musgrave · 23 November 2006
I've gone through and made a table of the conserved hypotheticals with hyperlinks to BLAST tree results. A lot of them do turn out to have COG or other recognised domains (you just have to know where to look in the BLAST results) that I missed first time round. I'm going to go around and add hyperlinks to the COG domains, but if anyone wants the table (e.g. Paul), let me know and I will send you a copy.
Ian Musgrave · 24 November 2006
G'Day All. I've put an update to my update in the main post. I'm off for a week at the Australian Health & Medical Research Congress, so I won't be able to moderate comments (or likely even view them, given that I'll be flat out), so I'm turning comments off while I'm away. If anyone wants a copy of the table of conserved hypothetical proteins and their hyperlinked tree views, email me via the link in the update to the update. Catch you later.