Nelson vs Mycoplasma: ORFans redux.

Posted 20 November 2006 by Ian Musgrave

Paul Nelson has developed a liking for ORFans, sequences of DNA that appear to code for proteins, but have (or had) no currently detectable homology to other genes. He feels they represent a difficulty for "Darwinian" accounts of gene origin and common descent. I have previously discussed why ORFans present no challenge to modern evolutionary theory, and Dr. Nelson even showed up in the discussion. More recently, he has been promoting ORFans again, without indicating he has learnt anything at all from our discussion. In particular, in a recent article in the Christian Post he claims that 28% of the genes in Mycoplasma genitalium are ORFans:

Nearly one-third of the protein-coding genes of mycoplasma, the simplest "free-living thing" up until last year, are unknown genes or ORFans.

Unfortunately for him, the actual number is zero. Yes, that's right, zero. Let me repeat that: the number of ORFans in Mycoplasma genitalium is zero (Siew et al., 2003). How did he get it so wrong? What Nelson has done is confuse two entirely different concepts, and it doesn't help his case at all. The papers Nelson is thinking of look at the number of essential genes in the bacterium Mycoplasma genitalium (Glass et al., 2006; Hutchison et al., 1999). M. genitalium has one of the smallest genomes known for a bacterium that can be grown in pure culture, with only 482 protein coding genes. What these papers did was systematically mutate these genes to find out which ones were truly essential. They found that 382 of the 482 protein coding genes were essential. However:

"One of the surprises about the essential gene set is its inclusion of 110 hypothetical proteins and proteins of unknown function. Some of these genes likely encode enzymes with activities reported in M. genitalium, such as transaldolase (21), but for which no gene has yet been annotated." (Glass et al., 2006)

That is, roughly 28% of the proteins in the essential set are proteins of unknown function. Here is where Nelson makes his mistake. He equates a protein having an unknown function with its being an ORFan (having no known homology with other genes). And this is just wrong. In fact, there are large families of proteins of unknown function that are found in nearly all organisms studied. They form perfectly good phylogenies; we just don't know what they do (Galperin & Koonin 2004). For example, the E. coli gene ybeM has homologs in 52 other organisms, from yeast to humans, and is a member of the nitrilase family, but we just don't know what it does. This is diametrically opposed to what Nelson wants to claim, that there are a large number of genes that have no apparent ancestors.
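To make the distinction concrete, here is a minimal Python sketch. The `Gene` record, its field names and the `made_up_orf` entry are hypothetical, invented purely for illustration; only the ybeM homolog count (52) and the Glass et al. figures (110 of 382) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    function_known: bool  # has anyone worked out what the protein does?
    homolog_count: int    # detectable relatives in other organisms


def is_orfan(gene: Gene) -> bool:
    # An ORFan is defined by the absence of detectable homologs,
    # regardless of whether its function is known.
    return gene.homolog_count == 0


def is_unknown_function(gene: Gene) -> bool:
    return not gene.function_known


# ybeM: function unknown, but 52 homologs (figure from the text above).
ybeM = Gene("ybeM", function_known=False, homolog_count=52)
# A made-up record showing what a true ORFan would look like.
made_up_orf = Gene("made_up_orf", function_known=False, homolog_count=0)

for gene in (ybeM, made_up_orf):
    print(gene.name, "unknown function:", is_unknown_function(gene),
          "| ORFan:", is_orfan(gene))

# Share of the essential set with unknown function (Glass et al., 2006).
unknown_share = 110 / 382
print(f"{unknown_share:.1%}")
```

Both records are "unknown function", but only the made-up one is an ORFan. The two tests look at different properties of a gene, which is why one percentage cannot be substituted for the other.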

"Where are the similar sequences that gave rise to these ORFan genes? Where are the necessary intermediates that must have been there? Where are the parents, if you will, of these mysterious genetic words?"

What probably got Nelson confused was that 28% of M. genitalium putative protein coding genes were orphans when the M. genitalium genome was first reported. The percentage of unknown protein coding genes is also 28%. The same figures, but completely different concepts. Unfortunately for Nelson, by 2003 homologs for all the putative protein coding genes had been found (Siew et al., 2003). Those similar percentages, possibly with help from Morton's Demon, caused Nelson to confabulate the two separate concepts, genes with no apparent homologs (ORFans) and genes with unknown function*. This is an object lesson in reading research papers carefully.

Another aspect Nelson keeps on misrepresenting is the numbers of ORFans. As we sample more genomes, the numbers of ORFans rise. But, because we are finding many more known genes, and finding relatives for genes that were previously ORFans, the percentage of ORFans is decreasing. The figure below (Taken from Siew et al., 2003) shows this. The dots represent the number of protein coding genes that we are finding, the solid line is the number of ORFans we are finding, and the dashed line is the percentage of ORFans we have found.
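The arithmetic behind "more ORFans, but a smaller percentage" can be sketched in a few lines of Python. The snapshot numbers here are made up for illustration and are not taken from Siew et al.; the only point is that a count can rise while its share of a faster-growing total falls.

```python
# Hypothetical snapshots: (label, total protein coding genes, ORFans).
# These values are invented for illustration, NOT data from any paper.
snapshots = [
    ("t1", 50_000, 7_000),
    ("t2", 120_000, 12_000),
    ("t3", 250_000, 16_000),
    ("t4", 400_000, 20_000),
]


def orfan_percentage(total_genes: int, orfans: int) -> float:
    """ORFans as a percentage of all protein coding genes sampled."""
    return 100.0 * orfans / total_genes


counts = [orfans for _, _, orfans in snapshots]
percentages = [orfan_percentage(total, orfans) for _, total, orfans in snapshots]

for (label, total, orfans), pct in zip(snapshots, percentages):
    print(f"{label}: {orfans:>6} ORFans of {total:>7} genes = {pct:.1f}%")
```

In this toy series the ORFan count nearly triples while the percentage drops by almost two thirds, because the denominator grows faster than the numerator.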

Clearly, as time goes on, we are finding more and more relatives for the ORFans: in 2003, the percentage of protein coding ORFans was around 5%#; by 2005 (Wilson et al 2005), it was down to 1.2%. Unfortunately for Nelson, we are finding the alleged "missing parents" of the ORFans+. And this is with only a minute sampling of the diversity of microbial life (we have sampled only 0.02% of all bacterial genomes, and we are biased towards a subset of human pathogens at that). Back in April I issued the following challenge to Paul Nelson:

I would also assume, in respect for scholarly accuracy, that in your next talks you will show the exponential decrease in percentage ORFans, as well as the increase in number, as well as mentioning the experimental tests of hypotheses about ORFans that have been under taken and their results.

He hasn't done that as yet, and I issue the challenge again, Paul. Show the exponential decrease graphs and mention how biologists are testing hypotheses about ORFans. Oh, and don't conflate "unknown function" with ORFan.

Update: Just to make my point, I've gone through the essential unknown function and hypothetical proteins in M. genitalium by hand (see table 2 in the Supporting documents of the Glass et al., 2006 paper). They are present in large to modest families, not ORFans. For example, MG125 (white bar in Fig 2 of Glass et al., 2006) is of unknown function, but is a member of the Cof hydrolase family, and has many homologs in a wide variety of bacteria. MG442 is a conserved hypothetical protein of unknown function that has similarities to GTPase enzymes, but we still don't know what it does. MG461 is an HD domain protein of unknown function, but with widely distributed homologs. MG459 is a conserved hypothetical protein of unknown function with many homologs in other bacteria, similar in structure to MECDP_synthase. MG074 is a conserved hypothetical similar to trichodiene synthase in a number of bacteria. And so on. You can do this for yourself; Paul Nelson could have done it himself after I alerted him to the problem back in April. Of the 110 "unknowns", 65 are from systems which have good phylogenies and known relations to functional proteins, but we just don't know what they do yet; 45 are conserved hypothetical proteins, where we don't even have structural clues to function yet. But note the "conserved": that's because they are found in a wide variety of organisms. Bottom line: not ORFans.
Update to the Update: I have created a table with BLAST tree views of the 45 conserved hypothetical proteins. It turns out that many of them either are members of Clusters of Orthologous Groups (ie parts of an extended family; for example MG337 is a member of COG0822C) or are close to members of COGs (like MG101, which has a HTH_GTNR domain and appears to be related to transcription factors). Anyway, if you want a copy of the table, with hyperlinks to the trees and COGs, go to Southern Skywatch Feedback and click on the email address there. The Treeviews can be misleading, as they don't show all the related proteins (click on the show removed sequences link to see the more distant relatives).

I'm off to the Australian Health and Medical Research Congress for a week, so I'm turning off the comments while I'm away, as I won't be able to moderate. Sorry.

* ORFans and proteins with unknown function can of course coincide, but in this case they don't.

# I'm ignoring the small ORFans, most of which are likely artefacts or bits of viral DNA; see my original post for details and Daubin & Ochman 2004.

+ Finding relatives is not necessarily easy, but the recent solution of crystal structures for ORFans is greatly increasing our ability to find relatives (eg Siew et al, 2005).

Daubin V, Ochman H. Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 2004 Jun;14(6):1036-42.

Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Res. 2004;32:5452-63.

Glass JI, et al. Essential genes of a minimal bacterium. Proc Natl Acad Sci U S A. 2006 Jan 10;103(2):425-30.

Hutchison CA, et al. Global transposon mutagenesis and a minimal Mycoplasma genome. Science. 1999 Dec 10;286(5447):2165-9.

Siew N, Fischer D. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins. 2003 Nov 1;53(2):241-51.

Siew N, Saini HK, Fischer D. A putative novel alpha/beta hydrolase ORFan family in Bacillus. FEBS Lett. 2005 Jun 6;579(14):3175-82.

25 Comments

Sir_Toejam · 20 November 2006

he has been promoting ORFans again, without indicating he has learnt anything at all from our discussion

not that it isn't obvious, but it's their job NOT to learn anything. ignorance is strength, remember?

W. Kevin Vicklund · 20 November 2006

The figure below (Taken from Siew et al., 2003) shows this.

I'm not seeing a figure. Is it my browser, or has it not yet been added (or is there a link I'm missing)?

Andrea Bottaro · 20 November 2006

Those similar percentages, possibly with help from Morton's Demon, caused Nelson to confabulate the two separate concepts, genes with no apparent homologs (ORFans) and genes with unknown function*

I initially thought you meant "conflate" there, but then I went to the dictionary and found:

con·fab·u·late (kən-făb′yə-lāt′) intr.v. con·fab·u·lat·ed, con·fab·u·lat·ing, con·fab·u·lates ... 2. Psychology. To fill in gaps in one's memory with fabrications that one believes to be facts.

LOL

Ian Musgrave · 20 November 2006

I'm not seeing a figure.

MovableType was playing up severely this morning, and appears to have eaten the figure. I've fixed it and the figure is there now.

Glen Davidson · 20 November 2006

In addition to the other reasons for why ORFans may exist, can't Nelson understand that all of the relatives of some genes may have gone extinct? I think we're back at God of the gaps yet again, the notion that if not all genes have extant relatives, then poof, the magic dragon did it.

Still no curiosity from Nelson as to why homologies exist at all, I see. Of course this comes down to one of the problems of theology (or at least much theology), one is not supposed to ask "why" or "how" of God. Since "God could do it" according to their description of this unobserved "entity", what should they care about how, why, or what it means?

Explain the relatedness of the large majority of genes that are presently known, Nelson. And if that explanation can be determined to be sound and truly explanatory, then make a plea about the ORFans. As you're not interested in explaining anything so far, leave the relevant questions up to the big boys and girls.

Glen D
http://tinyurl.com/b8ykm

Lurker · 20 November 2006

From the Christian Post article: "Nelson gave his colleague William Dembski's basic definition of Intelligent Design as the study of patterns in nature that are best explained as a result of intelligence."

I don't get how something is "a study" when it is already "best explained." Can somebody please elaborate that for me? Thanks.

bjm · 20 November 2006

Yes, it means 'we are not going to waste our time on any work - there's preaching to be done'

Sir_Toejam · 20 November 2006

...and:

"we don't need your pathetic level of detail!"

MarkP · 20 November 2006

It would be nice to see them just once admit openly that the reason they do no research is because they already know the answers. Behe came close in Dover.

Nick (Matzke) · 20 November 2006

What probably got Nelson confused was that 28% of M. genitalium putative protein coding genes were orphans when the M. genitalium genome was first reported. The percentage of unknown protein coding genes is also 28%. The same figures, but completely different concepts. Unfortunately for Nelson, by 2003 homologs for all the putative protein coding genes had been found (Siew et al., 2003). Those similar percentages, possibly with help from Morton's Demon, caused Nelson to confabulate the two separate concepts, genes with no apparent homologs (ORFans) and genes with unknown function*. This is an object lesson in reading research papers carefully.

Wow. This is clearly the Takedown of the Week.

Unsympathetic reader · 20 November 2006

Ian writes: (To Paul Nelson) "Show the exponential decrease graphs and mention how biologists are testing hypotheses about ORFans."

Better yet, he could describe how ID theorists have been testing hypotheses about ORFans. After all, the sorts of sequence analyses used in this field are dirt cheap and easy to do. The data is available in free databases. So where are the ID publications?

What a fruitful research program...

PvM · 20 November 2006

In Paul's defense, he is a young earth creationist, which means that scientific evidence can be interpreted only with the Truth in mind.

Marek 14 · 21 November 2006

"In Paul's defense, he is a young earth creationist, which means that scientific evidence can be interpreted only with the Truth in mind."

I cannot help but imagine the Truth in his mind, as a beautiful, shining white light that, unfortunately, sears the brain from within.

I wonder which is more dangerous - Truth, Lie, or Uncertainty. Or maybe it's some other grand concept, like Ingrown Nails...

Paul A. Nelson · 21 November 2006

Hi Ian,

Well, I can't say I wasn't warned. At an ID research meeting this past spring, during my presentation on ORFans (no, I can't shut up about them), a molecular geneticist there interrupted me to say that "unknown function" and "ORFans" strictly defined are not co-extensive classes. All ORFans are unknown function, but not all unknown function genes are ORFans, which I myself wrote when we last discussed ORFans here. So I knew that, but have been sloppy in my public presentations. I'll issue a correction, and ask that any videos or audios of my lectures on the topic include a clarification.

[Ian, in my original reply, the next paragraph described in detail how I'm going to work through the whole Glass et al. 2006 data set, which I should have done months ago (ouch). Now however I see from your update that you've done that already. If you have a file of your results, would you mind sending it to me? Short of that, can you send me the blast parameters that you used? I'm curious, for instance, how you would describe MG076, which I picked at random this morning to run blastp on. Also, how have you decided that short ORFans are annotation artifacts?]

So my Thanksgiving holiday will be spent working through the Glass et al. 2006 data set.

The second issue raised by your post concerns the evidential significance of ORFans to the design v. naturalistic evolution debate. I should tell you that many scientists on my side of the aisle have warned me in pretty strong terms that I'm paying too much attention to ORFans. 'They won't be significant, Paul,' they say, 'until their crystal structures are solved, and MANY more whole genomes are in. Genomic taxonomic sampling is much too sparse to say more than that.' I suppose to spare myself further embarrassment, I should agree and just shut up --- but then I run across findings like this:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16650908&query_hl=2&itool=pubmed_docsum

What else is out there? ORFans are woefully neglected with respect to their potential significance. So I talk about them a lot, obviously, hoping to attract attention to a really first-rate puzzle. Notice that, as I wrote the last time, decreasing percentage --- the feature of the puzzle you want me to stress --- pales in biological import to novel structures and taxonomically-restricted essentiality, which, when coupled with increasing number, do raise genuine problems for current theories of evolution. Indeed, if novel structures continue to be found, and those proteins are functionally essential but taxonomically restricted, the findings will have deep theoretical consequences. I think you underplay the shock that ORFans caused many genomics researchers (see, e.g., Russell Doolittle's comments in 2002 and 2005, surveying the ORFans and conserved hypothetical puzzles), shock that arises against the backdrop of the assumption of universal common ancestry and theories of gene and protein evolution. (Btw, in the Q & A of my public lectures, if asked, I always bring up the leading evolutionary hypotheses for ORFans, such as rapid sequence evolution and viral reservoirs.)

Anyway: thank you for calling me on my sloppiness. I should have paid more attention in April. The pain this time...okay, let's just say you've got my full attention.

Unsympathetic reader · 21 November 2006

Paul Nelson writes: "I should tell you that many scientists on my side of the aisle have warned me in pretty strong terms that I'm paying too much attention to ORFans."

Yes, it is interesting that the percentage of ORFans is decreasing over time. That suggests that "common design" must be far more common than "special design" and not as you originally anticipated. This suggests the creator was not of the 'Burger King' variety: i.e. Special orders *do* upset Us.

"ORFans are woefully neglected with respect to their potential significance."

Really? By IDists or biologists? I think the evolutionary biochemists are absorbing the data as it comes in.

PvM · 21 November 2006

Seems to me Paul reads too much into the 'shock' of ORFans. His focus on common ancestry, nay, his quest to prove Common Ancestry to be flawed (what happened to the Dissertation or the paper on ontogenetic depth btw?) causes him to see minor puzzles which arise when datasets increase and our tools and knowledge fail to keep up.
That's science...
It becomes ID when we throw up our hands and consider this ignorance to be evidence of something more.

Ian Musgrave · 21 November 2006

Well, I can't say I wasn't warned.
— Paul Nelson

No, you can't say that. Way back in April I said this. True, I should have done the BLAST searches then (but so should you have), but I had an honours student finishing up, so I was distracted.

So I knew that, but have been sloppy in my public presentations.
— Paul Nelson

Sloppy? Paul, it's not sloppy, your claim is out and out wrong! Surely the disjuncture between Siew et al 2003 (zero ORFans in Mycoplasma genitalium) and the Glass paper was a clue (it should have been a clue for me too, but I was preoccupied with amyloid cross-linking at the time).

I'll issue a correction, and ask that any videos or audios of my lectures on the topic include a clarification.
— Paul Nelson

Thank you Paul, you are a scholar. I trust you will also include the exponential decrease in the percentage of protein-coding ORFans in further presentations as well.

I'm going to work through the whole Glass et al. 2006 data set, which I should have done months ago (ouch). Now however I see from your update that you've done that already. If you have a file of your results, would you mind sending it to me? Short of that, can you send me the blast parameters that you used?
— Paul Nelson

No, I didn't keep a file. BLAST results don't keep as permanent URLs (GRRR), but the BLAST TREE view does (although you don't get the statistics with it). However, I didn't find this out until last night, and then between the PT server going flakey and the NCBI server going flakey, I just did a few in tree view. But to make your life simple, first go to the Excel spreadsheet from Glass and sort it on "Disrupted in Current Study" and "Common name". Choose only "conserved hypothetical protein"; anything else automatically has relatives. That cuts you down to 45 proteins. Then go to the Mycoplasma genitalium genome COG table and click on the "not in COGs" list. Several of the conserved hypothetical proteins are in the COGs list, and automatically have relatives. You can just copy the GI number from the "not in COGs" table and paste it directly into BLAST. I'm using bog standard BLAST, no fancy fine tuning.
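For anyone who would rather script the narrowing-down than sort a spreadsheet by hand, the steps above could be sketched roughly as follows. This is only an illustration: the rows, the locus names and the dictionary keys (`disrupted`, `common_name`, `in_cog`) are hypothetical stand-ins, not the real column layout of the Glass et al. supplementary table or the NCBI COG table.

```python
# Hypothetical rows standing in for the merged spreadsheet and COG table.
# Locus names and keys are invented; real files will look different.
rows = [
    {"locus": "gene_A", "disrupted": False,
     "common_name": "conserved hypothetical protein", "in_cog": True},
    {"locus": "gene_B", "disrupted": False,
     "common_name": "conserved hypothetical protein", "in_cog": False},
    {"locus": "gene_C", "disrupted": False,
     "common_name": "HD domain protein", "in_cog": False},
    {"locus": "gene_D", "disrupted": True,
     "common_name": "conserved hypothetical protein", "in_cog": False},
]


def needs_manual_blast(row: dict) -> bool:
    """Keep essential conserved hypotheticals that no COG already covers."""
    return (
        not row["disrupted"]  # not disrupted in the study, so essential
        and row["common_name"] == "conserved hypothetical protein"
        and not row["in_cog"]  # COG members automatically have relatives
    )


to_blast = [row["locus"] for row in rows if needs_manual_blast(row)]
print(to_blast)
```

Only the genes left in `to_blast` need an individual BLAST search; everything filtered out already has relatives by virtue of its annotation or its COG membership.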

I'm curious, for instance, how you would describe MG076, which I picked at random this morning to run blastp on.
— Paul Nelson

Probably a remote relative of cysteinyl-tRNA synthetase; running the search two different ways consistently picks this up. The C. elegans thing is unusual, but could be horizontal gene transfer. Then there is MG105, another non-COG protein. It has extensive homologies throughout the bacterial kingdom, and is probably a membrane protein, but no one seems to know what any of them do. It's still not an ORFan.

Also, how have you decided that short ORFans are annotation artifacts?
— Paul Nelson

Or viruses (or fragments of virus). See Daubin V, Ochman H. Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 2004 Jun;14(6):1036-42. Fukuchi S, Nishikawa K. Estimation of the number of authentic orphan genes in bacterial genomes. DNA Res. 2004 Aug 31;11(4):219-31, 311-313. Got to make my boys breakfasts now. More later.

Steviepinhead · 21 November 2006

Nelson, in full-on wishful-thinking mode:

many scientists on my side of the aisle

If only that were "many biologists," as opposed to engineers, opticians, chiropractors, and *ahem* "doctors of law," eh, Paul? Though I did think the "aisle" thingy was revealing, in a Freudian slip kind of way.

'Rev Dr' Lenny Flank · 21 November 2006

Hey Paul, I have some unanswered questions for you, here:

http://www.geocities.com/lflank/nelson.html

'Rev Dr' Lenny Flank · 21 November 2006

At an ID research meeting this past spring

Bible study group, Paul? What sort of, uh, "research" does DI *do*, anyway, Paul . . . . ?

Ian Musgrave · 21 November 2006

I should tell you that many scientists on my side of the aisle have warned me in pretty strong terms that I'm paying too much attention to ORFans. "They won't be significant, Paul," they say, "until their crystal structures are solved, and MANY more whole genomes are in."
— Paul Nelson

Listen to them. Look, we have only explored something like 0.02% of bacterial genomes, and we have hardly touched the bacteriophage genomes. There is a lot of stuff out there that we still do not know. We are currently still evaluating the genomes we have. A recent revision dropped many ORFs, including 45 ORFs from E. coli. Several other studies have had similar results: a reanalysis of genomes in 2005 found that a large number of ORFans were annotation artefacts, not to mention Fukuchi S, Nishikawa K. Estimation of the number of authentic orphan genes in bacterial genomes. DNA Res. 2004 Aug 31;11(4):219-31, 311-313. E. coli protein ykfE was a prototypical ORFan up until 2004, when it was discovered to have many relatives. In 2004 the genome of Methanococcus maripaludis was reported to have 129 ORFans, but they could only show 27 of them actually encoded proteins. If you run a BLASTP today, a mere two years later, you will find that most of these putative protein coding ORFans now have matches in GenBank. One, MMP0025, is even a member of a COG family (DUF7C1). That shows how fast things are progressing.

I suppose to spare myself further embarrassment, I should agree and just shut up - but then I run across findings like this: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16650908&query_hl=2&itool=pubmed_docsum What else is out there?
— Paul Nelson

DNA topoisomerase V is unique at the moment, just like ykfE was back in 2001. Given the paucity of databases sampled, that is not entirely surprising. Nor is it very likely to stay unique. The authors suggest that it may be of viral origin. Given the large number of E. coli ORFans that turn out to be from bacteriophages, a viral origin is entirely plausible. Also it is possible that it is a highly divergent member of existing, poorly sampled families. Some ORFans that are likely to encode proteins (rather than the larger group which may be artefacts or viruses or broken viruses) are apparently evolving rapidly, and this may erase their previous history. I'm sure there is a lot of weird stuff out there, still waiting to be discovered. Biology is like that.

ORFans are woefully neglected with respect to their potential significance.
— Paul Nelson

Not really, there is a lot of work going on with these things (you have to use a combination of search terms, like "unique ORFs", to turn up some of this work; not everyone uses ORFans as a keyword), and if the small ORFans in general represent bacteriophage horizontal gene transfer, rather than just the alpha-proteobacteriacea, that is important (see for example Current Opinion in Genetics & Development, Volume 14, Issue 6, December 2004, Pages 616-619). Even if these ORFans did represent de novo gene formation, we can still recover phylogenies. We can recover phylogenies in the face of significant horizontal gene transfer (eg see Ge F, Wang LS, Kim J. The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol. 2005 Oct;3(10):e316), and de novo gene formation is exactly the same in principle.

Notice that, as I wrote the last time, decreasing percentage - the feature of the puzzle you want me to stress - pales in biological import to novel structures and taxonomically-restricted essentiality....
— Paul Nelson

But it is essential to prevent misrepresentation of the issue. Novel structures aren't a problem for evolution (pace sperm-specific dynein, and maize ion channels, and the majority of protein coding ORFans turn out to be of limited novelty anyway), and as we've just seen, taxonomically restricted essentiality isn't. There is no point emphasising something that doesn't exist (or is so limited it easily falls within known processes of horizontal gene transfer and gene loss).

(Btw, in the Q & A of my public lectures, if asked, I always bring up the leading evolutionary hypotheses for ORFans, such as rapid sequence evolution and viral reservoirs.)
— Paul Nelson

That shouldn't be left to the Q&A session; it should be brought out up front.

B. Spitzer · 21 November 2006

DNA topoisomerase V is unique at the moment, just like ykfE was back in 2001. Given the paucity of databases sampled, that is not entirely surprising. Nor is it very likely to stay unique. The authors suggest that it may be of viral origin. Given the large number of E. coli ORFans that turn out to be from bacteriophages, a viral origin is entirely plausible. Also it is possible that it is a highly divergent member of existing, poorly sampled families. Some ORFans that are likely to encode proteins (rather than the larger group which may be artefacts or viruses or broken viruses) are apparently evolving rapidly, and this may erase their previous history. I'm sure there is a lot of weird stuff out there, still waiting to be discovered. Biology is like that.

I'm probably just marking myself with the Scarlet "N" (for Nerd), but I have to say it: I love biology. What a wild intricate ride it offers, and (like a good novel) so many of the plot twists are unanticipated. Genetic sequencing is making so much information available, so rapidly, that we're going to be catching up with it for decades. Brilliantly happy decades. --B, or, rather, N

'Rev Dr' Lenny Flank · 22 November 2006

I love biology.

As do I. In my case, that is DESPITE my schooling in the subject, not because of it. My high school biology teacher's idea of "teaching" was to sit in front of the class on a stool and read straight from the textbook, in the most deadly monotone you can imagine. We were halfway through the year before we even SAW any animal in class, live or dead. I learned more biology on one attentive walk through the woods than I ever did from that, uh, "teacher". (sigh)

Ian Musgrave · 23 November 2006

I've gone through and made a table of the conserved hypotheticals with hyperlinks to BLAST tree results. A lot of them do turn out to have COG or other recognised domains (you just have to know where to look in the BLAST results) that I missed first time round. I'm going to go around and add hyperlinks to the COG domains, but if anyone wants the table (eg Paul), let me know and I will send you a copy.

Ian Musgrave · 24 November 2006

G'Day All. I've put an update to my update in the main post. I'm off for a week at the Australian Health & Medical Research Congress, so I won't be able to moderate comments (or likely even view them, given that I'll be flat out), so I'm turning comments off while I'm away. If anyone wants a copy of the table of conserved hypothetical proteins and their hyperlinked tree views, email me via the link in the update to the update. Catch you later.