A couple of months ago we started looking at the concept of fitness landscapes and at some new papers that have significantly expanded our knowledge of the maps of these hypothetical spaces. Recall that a fitness landscape, basically speaking, is a representation of the relative fitness of a biological entity, mapped with respect to some measure of genetic change or diversity. The entity in question could be a protein or an organism or a population, mapped onto specific genetic sequences (a DNA or protein sequence) or onto genetic makeup of whole organisms. The purpose of the map is to depict the effects of genetic variation on fitness.
Suppose we want to examine the fitness landscape represented by the structure of a single protein. Our map would show the fitness of the protein (its function, measured somehow) and how fitness is affected by variations in the structure of the protein (its sequence, varied somehow). It's hard enough to explain or read such a map. Even more daunting is the task of creating a detailed map of such a widely-varying space. Two particular sets of challenges come to mind.
1. To make any kind of map at all, we need to match the identity of each of the variants with its function.
2. To create a detailed map, we need to examine many thousands -- or millions -- of variants. This means we need to be able to make thousands of variants of the protein.
So let's take the second challenge first: how do we make a zillion variants of a protein? Well, we can introduce mutations, randomly, into the gene sequence for the protein and use huge collections of those random variants in our analysis. The collection is called a library, and believe it or not, the creation of the library isn't our biggest challenge. Because if the library only contains gene sequences, then it's no use in an experiment on protein fitness. We need our library of gene sequences to be translated into a library of proteins. How are we going to do that? And remember the first challenge: we need to be able to identify each variant. So even if we can get our gene sequences made into protein, how will we be able to identify the sequences after we've mapped the fitness of all the variants?
Or, in simpler terms, here's the problem. It's pretty straightforward to make a library of DNA sequences. And it's pretty straightforward to study the function of a protein. (Note to hard-working molecular biologists and protein biochemists: no, I'm not saying it's easy.) The problem is getting the two together so that we can study the function of the proteins with biochemistry but then identify the interesting variants using the powerful tools of molecular biology. What we need is a bridge between the two.
The bridge most commonly used in such experiments is a technique called protein display. There are a few different ways to do it, but the basic idea is that the DNA sequence is encapsulated so that it remains linked to the protein it creates. One cool way to do this is to hijack a virus and force it to make itself using your library. The virus will use a DNA sequence from your library, dutifully make the protein that is encoded by that DNA sequence, and display that protein on its surface. There's our bridge: a virus, with the protein on the surface ready for analysis and the DNA sequence stored inside the same virus. Brilliant, don't you think?
Yes, but there's one more problem to be solved. We said we want to do this millions of times. That means we have to grab the viruses of interest, get the DNA out of them, and read off the sequence of that DNA. (That's how we can identify the nature of the variation.) Millions of times. Methods of protein display provided the bridge, but until very recently a crippling bottleneck remained: the sequencing of the DNA was too time-consuming to allow the identification of more than a few thousand variants at a time.
That was then. This is now: the era of next-generation sequencing, in which DNA sequences can be read at blinding speed and at moderate cost. (A currently popular technology is Illumina sequencing.) These techniques have given us unprecedented capacity to decode entire genomes and to assess genetic variation on genome-wide scales. A few months ago, the same methods were used to eliminate that last bottleneck in the use of protein display, demonstrating how a protein fitness map can be generated simply and at very high resolution. The article is "High-resolution mapping of protein sequence-function relationships" (doi) from Nature Methods, by Douglas Fowler and colleagues in Stan Fields' lab at the University of Washington.
The experiment focused on one interesting segment of one protein. The segment is called a WW domain and it's an interesting building block which is found in various proteins and which mediates interactions between different proteins. (A sort of docking site.) The authors chose the WW domain both for its interesting functions and because it has been used in protein display experiments of the type they performed. Then they created their tools.
1) They generated a library of more than 600,000 variants of the domain, displayed on the surface of their chosen bridge -- the T7 bacteriophage (a virus that targets bacteria).
2) They designed a means to assess the function of the variants. Because the function of the WW domain is docking, they used docking as their functional criterion, and then devised a straightforward system to detect the strength of the binding of the variants to a typical docking partner. (For the biochemically inclined, they used a simple peptide affinity binding assay on beads.)
Then the key experimental step: the authors used that system to select the variants that can still bind. In other words, they selected the functional variants. The selection step was moderate in strength, and the idea is that variants that bind really well will be enriched at the expense of variants that bind less well. Variants that don't bind at all will be removed from the library.
They repeated the selection step six times in succession. So, the original library was subjected to selection, generating a new library, which was subjected to selection again, and so on, until the experimenters had six new libraries. Why the repetition? It's one of the really smart aspects of the experiment and it has to do with the strength of selection. If selection were quite strong, such that only the strongest-binding variants survive, then the analysis will just yield a few strong-binding variants. That's a simple yes-or-no question, providing no information about the spectrum of binding that can be exhibited by the variants. Instead, the authors tuned the system so that selection is moderate, leading to enrichment but not complete dominance of the stronger-binding variants. Recall that binding represents fitness in this experiment; this means that the authors subjected their population to a moderate level of selection in order to map the fitness of a large number of variants. By repeating the selection, they could watch as some variants gradually increased in frequency. Sounds kind of like evolution, doesn't it?
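To make the moderate-versus-strong distinction concrete, here is a minimal sketch in Python. It is a toy model, not the authors' procedure: the variant names, the fitness values, and the `stringency` exponent are all hypothetical, and each round simply reweights a variant's frequency by its binding strength raised to that exponent.

```python
def select(freqs, fitness, stringency=1.0):
    """One round of selection: reweight each variant's frequency by
    fitness**stringency, then renormalise so frequencies sum to 1."""
    weighted = {v: f * fitness[v] ** stringency for v, f in freqs.items()}
    total = sum(weighted.values())
    return {v: w / total for v, w in weighted.items()}


def run_rounds(freqs, fitness, rounds, stringency=1.0):
    """Apply `rounds` successive selections; return the library after each one
    (index 0 is the starting library)."""
    history = [freqs]
    for _ in range(rounds):
        freqs = select(freqs, fitness, stringency)
        history.append(freqs)
    return history


# Hypothetical binding strengths (arbitrary units) for four made-up variants.
fitness = {"strong": 1.0, "good": 0.8, "weak": 0.4, "dead": 0.0}
start = {v: 0.25 for v in fitness}  # equal representation in the input library

moderate = run_rounds(start, fitness, rounds=6, stringency=1.0)
harsh = run_rounds(start, fitness, rounds=6, stringency=20.0)

for i in (0, 3, 6):
    print("moderate, round", i, {v: round(f, 3) for v, f in moderate[i].items()})
print("harsh, round 1", {v: round(f, 3) for v, f in harsh[1].items()})
```

In this toy model the "good" variant still makes up roughly a fifth of the library after six moderate rounds, so its fitness relative to "strong" can be read off from the trend. Under the harsh setting it is down to about 1% after a single round, which is the yes-or-no outcome described above.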
Finally, the scientists subjected those libraries to Illumina sequencing, thus closing the loop between function and sequence. (In genetic terms, we would say that they closed the loop between phenotype and genotype.) And at that point they were able to draw fitness landscapes of unprecedented resolution, shown in the graphs on the right. The top graph shows the original library. The height of each peak represents frequency in the library, and the two horizontal axes represent each possible sequence of that WW domain. Notice that the original library is complex and diverse, as indicated by numerous peaks on the graph. The second and third graphs show the library after three and six rounds of selection. Note the change in the number of peaks and in their relative sizes: selection has reduced the complexity of the library, removing variants that are far less fit and altering the relative amounts of the survivors. The first three rounds of selection reduced the library to 1/4 the original size, and after six rounds it was down to 1/6 original size, but still contained almost 100,000 variants.
The bottom graph, then, is a fitness landscape, of a segment of a protein, at very high resolution. More technically, it depicts the raw data (relative amounts of surviving variants) that the authors used to determine relative fitness; to make that assessment, they calculated "enrichment ratios" to account for the fact that the initial library didn't contain equal amounts of each variant. These enrichment data enabled them to calculate the extent to which each point in the sequence is amenable to change, and then to identify the particular changes at those points that led to changes in fitness. Now that's high resolution.
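For readers who like to see the arithmetic, here is a rough sketch of what an enrichment ratio looks like in Python. The read counts are made up, the function name is hypothetical, and taking the log2 of the ratio is my choice for readability; the exact definition and filtering used in the paper may differ.

```python
import math


def enrichment_ratios(input_counts, selected_counts):
    """log2(frequency after selection / frequency in the input library)
    for every variant that has reads in both libraries."""
    n_in = sum(input_counts.values())
    n_sel = sum(selected_counts.values())
    ratios = {}
    for variant, c_in in input_counts.items():
        c_sel = selected_counts.get(variant, 0)
        if c_in == 0 or c_sel == 0:
            continue  # the ratio is undefined without reads in both libraries
        ratios[variant] = math.log2((c_sel / n_sel) / (c_in / n_in))
    return ratios


# Made-up sequencing read counts; note the unequal starting amounts.
input_counts = {"WT": 500, "A": 100, "B": 300, "C": 100}
selected_counts = {"WT": 600, "A": 200, "B": 150, "C": 50}

ratios = enrichment_ratios(input_counts, selected_counts)
for variant, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(variant, round(r, 2))
```

Dividing by the input frequency is the point of the exercise: variant "B" has more reads than "A" before selection and only slightly fewer afterward, yet "A" doubled in frequency while "B" halved.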
The power of approaches like this should be obvious: disease-related mutations can be identified in candidate genes, and the same approach can be used to map the landscape of resistance to drugs in pathogens or cancer cells. And, of course, evolutionary questions of various kinds are much more tractable when tackled with methods like this. The authors expect the payoff to be immediate:
Because the key ingredients for this approach -- protein display, low-intensity selection and highly accurate, high-throughput sequencing -- are simple and are becoming widely available, this approach is readily applicable to many in vitro and in vivo questions in which the activity of a protein is known and can be quantitatively assessed.
Now, given these vast opportunities now available to scientists interested in protein evolution, wouldn't you think that design theorists who write on the topic will be eager to get involved in such studies? I sure would, especially since the lab that did this work is within a short drive of the epicenter of intelligent design research, a research institute headed by a scientist whose professional expertise and interest lie in the analysis of protein sequence-function relationships. As I've repeated throughout this series, there's something strange about a bunch of scientists who want to change the world but who can't be bothered to interact with the rest of the scientific community, a community that in this case is well-represented in active laboratories right down the road. (I'm eager to be proven wrong on this point, by learning that ID scientists have interacted with the Loeb lab or the Fields lab.)
More to the point, there's something tragically ironic about the fact that the ID movement is headquartered in Seattle, inveighing against "Darwinism" while obliviously amidst a world-class gathering of scientists who are busy tackling the very questions that ID claims to value.
(Cross-posted at Quintessence of Dust.)
-------
Fowler, D., Araya, C., Fleishman, S., Kellogg, E., Stephany, J., Baker, D., & Fields, S. (2010). High-resolution mapping of protein sequence-function relationships. Nature Methods, 7 (9), 741-746. DOI: 10.1038/nmeth.1492
84 Comments
mrg · 6 February 2011
DS · 6 February 2011
But there are no beneficial mutations! What? Oh.
But the experiment was intelligently designed and ... what? Oh, never mind.
DS · 6 February 2011
But selection is just a tautology! What? Oh.
But there is no new information ... what? Oh, never mind.
midwifetoad · 7 February 2011
It's pretty obvious why ID proponentsists don't want to do this research.
Their talking point is that fitness peaks are isolated -- you can't get here from there.
Most importantly, there's no such thing as a functional random sequence.
RBH · 7 February 2011
The shape of the gross topography shown on the three figures above depends on the ordering of points on the two horizontal axes. The original population displays a fitness landscape that looks 'rough' in the parlance of fitness landscapes, with the array of peaks associated with protein variants more or less randomly distributed on the surface. With increasing rounds of selection it at least superficially appears that more organization of the surface emerges. Most obviously, the last graph shows a distinct linear series of peaks on the right of the graph with a couple of isolated peaks to the left while the middle of the graph is bereft of significant peaks.
My question is what is the ordering principle for the points on the two horizontal axes? Is it something like sequence similarity measured in terms of amino acid differences? The axis labels are too small for my old eyes to make out.
RBH · 7 February 2011
I will add, by the way, that the existence of multiple peaks on the final graph puts the lie (yet again!) to the "one true sequence" notion typically assumed by IDiots' probability calculations.
RBH · 7 February 2011
Gromit · 7 February 2011
I am not sure that some of these small-minded comments are helpful to the credibility of the blog, or this site. I cannot speak for what is going on in Seattle, but I can say that scientists among my own colleagues who suspect that intelligence may have been a factor in encoding the digital software of life, are interested in these things. This aside, I read the blog and the paper by Fowler et al. and found this topic to be very interesting. I would like to make some comments:
1. I noticed that they state that their results capture general features of the WW domain evolutionary process. I was pleased to see this, as it is something that seems to be predicted using the evolutionary data available in the web-based Pfam database. Measuring the relative frequencies of each amino acid at each site is not a new thing. I have found that after about 500 sequences, the relative frequencies begin to stabilize and by the time 1,000 or more sequences for the same family are analyzed, there is little change in the frequency distribution of each amino acid at each site, even with the addition of additional sequences. What this suggests is that even though the number of likely functional sequences is far too great to ever adequately sample, evolution has had enough time to sufficiently sample sequence space such that we can get a pretty good idea of what these relative frequencies are from the resulting evolutionary data. Relative frequencies do begin to stabilize within just 500 or a 1,000 sequences.
2. What I particularly like about their experiment is that they are generating novel sequences not found in the evolutionary record preserved in Pfam. Since they are generating, selecting, and reading their own novel sequences, the utility of just working with a small domain is obvious. They seem to indicate in a couple places that their results do appear to be consistent with the relative frequencies found in nature.
3. I notice that their results also reflect the reality that some sequences are more fit than others. 97.2% of their sequences turned out to be deleterious relative to the wild-type. This underscores an important methodological point. There is a temptation, in measuring the relative frequencies, to remove redundant sequences, which then gives each sequence equal value. Of course, this should be done for double or triple entries of the same information, but I am of the opinion that it should not be done for sequences that prove to be identical for different taxonomic groups. I would expect that sequences that confer a higher fitness value to the organism will tend to appear more often in the evolutionary record. Indeed, in Fowler’s experiment, the wild-type increased in abundance by a factor of 1.75. This sort of redundancy in the record is important data and the relative frequencies need to reflect this if one wants to plot the size of functional sequence space for a given family of proteins. Conversely, removing all redundant sequences effectively gives every sequence the same fitness value, which is not the case, as Fowler’s results also show. In real-life databases, however, it is easier said than done to weed out double or triple entries, if they appear under different identifiers.
4. One last point (and some of the posters on this thread, along with the blogger, may find it disturbing if they follow up on this for themselves), is that the relative frequency distribution for a given protein family (providing sample size was large enough to stabilize the distribution) can provide us with an upper limit for the total number of estimated functional sequences and the size of functional sequence space for that family. I wonder why Fowler et al. did not do that. When that is done, the upper limit for the number of possible functional sequences is truly massive. However, in comparison with the size of overall sequence space, the size of functional sequence space for a typical protein family is disturbingly miniscule. I say ‘disturbingly’, because given the functional target sizes that emerge, an evolutionary search engine, plodding along at physicochemical speeds is vastly underpowered for the search .... and ‘vastly’ is an understatement. Personally, I think it is the elephant in the room that some, like Eugene Koonin have tried to address by postulating an infinite number of universes as a solution. Intelligence can easily encode functional genetic sequences into a genome. Indeed, we have started to build our own artificial proteins. But the notion of an evolutionary search, crawling along at pathetically slow physicochemical search speeds, really does need work in light of the amino acid frequency distributions and what they entail regarding the size of functional sequence space. To summarize, Fowler et al. may not realize it, nor the blogger, but the very next step after determining the frequency distribution for each amino acid at each site is to use that data to compute the target size of functional sequence space for a protein. The blogger may want to do some work on this and ponder his results. That will make for a most interesting blog indeed.
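The per-site relative frequencies described in point 1 above are easy to compute from a set of aligned sequences. The following is a minimal Python sketch, not anything taken from the paper or from Pfam: the four-residue toy alignment and both function names are hypothetical. The `max_shift` helper is one simple way to check the claim that frequencies stop changing as more sequences are added.

```python
from collections import Counter


def site_frequencies(aligned):
    """Relative frequency of each residue at each column of an alignment
    (a list of equal-length strings)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*aligned):
        counts = Counter(column)
        freqs.append({aa: n / len(aligned) for aa, n in counts.items()})
    return freqs


def max_shift(freqs_a, freqs_b):
    """Largest absolute change in any residue frequency at any site
    between two frequency tables of the same length."""
    shift = 0.0
    for site_a, site_b in zip(freqs_a, freqs_b):
        for aa in set(site_a) | set(site_b):
            shift = max(shift, abs(site_a.get(aa, 0.0) - site_b.get(aa, 0.0)))
    return shift


# Hypothetical toy alignment; real families have hundreds or thousands of rows.
toy = ["GWEE", "GWTE", "GWEK", "AWEE"]
freqs = site_frequencies(toy)
print(freqs[0])  # frequencies at the first site
print(max_shift(site_frequencies(toy[:2]), freqs))  # change from 2 to 4 sequences
```

On a real alignment one would call `max_shift` on growing subsamples and watch whether the value falls toward zero, which is what "stabilizing" would mean in practice.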
Gromit · 7 February 2011
I think that RBH does not understand the graphs in Figure 2a in the paper by Fowler et al. If RBH would read the caption to the Figure, he/she will see that the peaks do not represent functional sequences occurring in sequence space/landscape. Rather, they represent the relative frequency of each amino acid at each site. With regard to the repeatability of these peaks, I am very confident that if the experiment were repeated, the same peaks would emerge. I would go further and say that if evolutionary data is used for the WW domain, the same peaks will emerge as well. The graphs have nothing to do with how many functional sequences there are ...... there are likely to be more than we could possibly sample. As I mentioned in my previous post, however, we will get the same peaks with sample sizes of only 500 or 1,000 functional sequences.
mrg · 7 February 2011
eric · 7 February 2011
DS · 7 February 2011
Gromit wrote:
"When that is done, the upper limit for the number of possible functional sequences is truly massive. However, in comparison with the size of overall sequence space, the size of functional sequence space for a typical protein family is disturbingly miniscule. I say ‘disturbingly’, because given the functional target sizes that emerge, an evolutionary search engine, plodding along at physicochemical speeds is vastly underpowered for the search .… and ‘vastly’ is an understatement."
Well that might be true if there were only one organism that was evolving. But the reality is that there are billions, perhaps even trillions of organisms that are evolving or dying. Since many different variants with adaptive characteristics were recovered from only 600,000 variants, that hardly seems to be prohibitive for evolution.
So Gromit, when your colleagues, who are interested in such things and yet cannot seem to be bothered to perform any experiments such as this, try to pawn off the old one correct protein probability calculations, do you set them straight? Are you worried about their credibility?
RBH · 7 February 2011
raven · 7 February 2011
raven · 7 February 2011
mrg · 7 February 2011
Looks like Gromit has been reading STEALTH CREATIONISM FOR DOGS.
raven · 7 February 2011
mrg · 7 February 2011
Man, you guys actually read through that doubletalk? I would have suffocated before I got to the end of it.
John Vanko · 7 February 2011
fnxtr · 7 February 2011
"Encoding the digital software of life" pretty much gives it away.
Gromit, are you Steve P.?
Or just Trolling For Grades?
Stanton · 7 February 2011
Mike Elzinga · 7 February 2011
mrg · 8 February 2011
Terenzio the Troll · 8 February 2011
Hywel · 8 February 2011
I am at a loss to argue on any scientific basis, however...I did grow up in a Christian family, and had to attend church twice a week, every week until the age of 16. At 14-15 years old I attended a couple of seminars by the infamous creationist Ken Ham. Having been nurtured from a young age (before I could form opinions for myself based on alternative theories) I was convinced of the creation story. Ken Ham is a very effective speaker, and he mixes an appealing sense of humor into his lectures which made us all laugh at the plight of evolutionists. His approach was convincing and engaging, and listening to his talks within a room of at least another 150-200 Christians, it was incredibly easy to embrace his seemingly informed defence of creationism without much thought. When you come from a background such as mine, you are surrounded by people who all believe creation (and Christianity in its entirety) as complete indisputable fact - and most importantly, it is incredibly difficult to break away from it...which, at long last (I am now 27) I accomplished. It is a sobering thought to see comments from "Gromit" arguing from a viewpoint that I once held. The point was, being conditioned to believe ABSOLUTELY that the words of The Bible are completely literal - even to this day, I can still hear that inner voice (so carefully nurtured and maintained in early life by my Christian family) crying out each time I argue against creationism, but through endless questioning and the help of figures such as Dawkins, Hitchens, and funnily enough...a certain James Randi (look him up if you're not already familiar with him!), I managed to finally liberate myself from the beliefs of a religion (just like all religions in my opinion) which exercises its powerful message upon the young and naive. Gromit...well, you are deluded.
Set yourself free like I did, and rejoice in the sublime beauty of a sunrise from a non-believer's perspective...it's more beautiful than it ever was when I believed in Creation.
DS · 8 February 2011
Hywel,
Well said and congratulations. It is important to remind people how liberating it can be to throw off the oppression of indoctrination. The price is high but the rewards are great. Just remember that what you have earned is the right to have opinions informed by the evidence. Automatic rejection of any particular form of argument can lead to just another form of self delusion.
In this case, all you have to do is ask why Gromit provided only hand waving arguments and vague generalizations about large numbers when trying to impugn a detailed research finding. The desperation there is obvious. He simply has to deny evolution at any cost, even when it is staring him in the face. In this case, the evidence is clear and consistent with modern evolutionary theory. It is inconsistent with many creationist talking points and you can use this evidence as ammunition if anyone tries to use those fallacious arguments on you.
eric · 8 February 2011
Gromit · 8 February 2011
A few comments:
1. Lads, sifting through the rubbish posted above, I observe that a good deal of energy is going into ad hominem responses. To help move this discussion along, let us assume that I am an Inmate in the Local Insane Asylum who manages to sneak over to the computer at the nursing station late at night when the nurse is down wing. That way, we will not have to waste time wondering if I am a scientist or not, or even if I am a moron, and can thus focus on the merit of points raised.
2. I have little interest in, or knowledge of, what transpires in the USA with regard to these discussions. If the above responses are exemplary of how Americans engage this topic, then maybe one or two can grasp why the rest of the world has little interest in this sort of drivel. Given my confession of disinterest in the American controversy over ID, my comments will have to be more general in nature. From what I observe, if the unwashed masses somehow feel that there is something dodgy about certain aspects of Darwinian theory, the problem is staring at you every time you look in the mirror. The public needs a lot more than a stream of rubbish, of the sort we see above, to convince them that you really know what you are talking about. Forget about the creationists; they are not a threat (at least in my experience). The real threat is that you do not know your stuff. Some examples follow below.
3. No one here seems to have a clue as to how to compute the size of functional sequence space for any given protein family once you have the relative frequencies of each amino acid at each site. If you are going to come up with a story on how an evolutionary search happened to find thousands of protein families, then step one is to determine the size of the search targets. You have all the data you need in web based archives such as Pfam to do that. It is a gold mine of evolutionary search history. Use it. The equations to use are simple and available as well. The software to run the data can easily be coded by even upper school graduates. Once you have the frequency of occurrence of each amino acid at each site in a sequence, it is easy to calculate an upper limit for the frequency of occurrence of functional sequences for a protein family. You need to know that stuff if you are going to convince the public. This sort of American ‘Redneck’ Darwinism that I see on this forum is not all that persuasive.
4. No one here seems to have actually sat down to calculate how many evolutionary searches could have taken place over the past 4 billion years by the entirety of organic life. You do not have a clue, do you! If you are going to create stories of how functional sequences were discovered, you are going to need to know how many searches or trials you have at your disposal. Get some numbers ready for the public. Perhaps then you will be a wee bit more persuasive.
5. RBH needs to read the original paper so that he understands the figure used in the initial blog. The figure in the blog is from the paper.
6. Terenzio the Troll needs to learn how to convert from base 4 to base 2.
7. Mike Elzinga needs to forget about what education the Inmate in the Asylum has (see my opening) and start putting forth something of substance. Try figuring out (3) and (4). I should not have to spoon-feed him.
8. Eric needs to stop quoting his hero, Mr. Behe, and learn the difference between evolving a new disulfide bond and locating a novel protein family in sequence space. Big difference, Eric!
9. Amongst my colleagues er ... fellow inmates at the asylum here, I have never heard of anyone who believes that there is only one, true functional sequence per protein family. A minute or two in Pfam should lay to rest any such delusion. The upper limit for the average 300-residue protein family is many orders of magnitude greater than just one. Good grief! Why do you get your nappies in a knot arguing that there is more than just one true sequence ..... if there are doubters, point them to Pfam ..... and Pfam only lists a minuscule sampling of what is likely to be a set that is numerous orders of magnitude larger.
10. I’m trying to help you here. Your assignment for tomorrow is to figure out the answers to (3) and (4), and please show your work; do not just give me the answer. If you are going to present a persuasive case to the unwashed masses, you will need to understand how you got those numbers.
DS · 8 February 2011
Gromit,
Actually, no we don't. You are the one who is arguing against the most predictive and most explanatory theory in the history of science. You are the one who is arguing that something or other is impossible. The evidence is clear that evolution has indeed happened. If you dispute that it can happen, the burden of proof is on you to provide evidence that it cannot.
So I guess that you do set your "colleagues" straight every time they use the one true protein crap. Good for you.
Now perhaps you can explain to us why, if the search space is so large, multiple adaptive variants were discovered in a mere 600,000 variants. Perhaps you can explain why all of the search space must be explored in order to find any of these adaptive peaks. Then you can explain all of the other mechanisms, such as gene duplication, that enable more comprehensive searches. Then you can explain how the number of bacteria in a ton of soil disproves evolution.
mrg · 8 February 2011
raven · 8 February 2011
Flint · 8 February 2011
Gromit seems obsessed with explaining IN DETAIL, EXACTLY why the bumblebee can't fly. So far, he says that nobody understands the precise mechanisms of bumblebee flight in any detail, that all possible ways it can fly haven't been explored, and nobody even knows how to explore it. Nobody can specify the number of possible ways that hypothetical bumblebees MIGHT fly, if any actually could. Every possible point in bumblebee space has not been examined. Nobody can specify the degree to which life has explored bumblebee space over billions of years.
In response, people rather reasonably point out that bumblebees DO fly. Perhaps not very efficiently, or elegantly, or fast, but well enough for their purposes. And that the claim that bumblebees cannot fly is wrong by direct observation, even if nobody knows whether bumblebees actually explored all the ways they might have done it better.
raven · 8 February 2011
Kevin B · 8 February 2011
mrg · 8 February 2011
Mike Elzinga · 8 February 2011
Mike Elzinga · 8 February 2011
Science Avenger · 8 February 2011
eric · 8 February 2011
Dale Husband · 8 February 2011
If Gromit thinks a long post full of crap is any more effective in science than a short post of crap, he is sadly mistaken. He might fool a few people who are scientifically illiterate, of course. That's the people Creationism and the phony claims and arguments supporting it always appeal to.
Steve Matheson · 8 February 2011
John Vanko · 8 February 2011
Gromit = self-serving, self-aggrandising arrogance.
Instructing others to go "do their homework" when he is completely incapable of "doing the homework" himself.
Reminds me of nothing so much as that blowhard Timothy Wallace, author of TrueOrigins.org - all haughty words, absolutely no substance.
I think Gromit is Wallace (or at the very least his clone).
See if you agree.
mrg · 8 February 2011
SWT · 8 February 2011
Terenzio the Troll · 9 February 2011
John Vanko · 9 February 2011
Stanton · 9 February 2011
mrg · 9 February 2011
mrg · 9 February 2011
PS: Besides, Stanton, I wasn't complaining about tactlessness. I said that Gromit should have expected as much.
Stanton · 9 February 2011
Stanton · 9 February 2011
mrg · 9 February 2011
Mike Elzinga · 9 February 2011
mrg · 9 February 2011
My own specific annoyance with crackpots is the game of: "You can't convince me I'm wrong!" That is unarguably true -- when Dorothy Parker was asked to use "horticulture" in a sentence, she replied: "You can lead a horticulture but you can't make her think." -- but beside the point: "I have no obligation to listen to you for a second."
I get over 125,000 unique visitors a month on my website these days, and not one of them has the least interest in it except to the extent I've got something to offer them. The audience does not serve me, I serve the audience, and if I don't serve them, they leave and there's nothing I can say about it.
Hubert Humphrey once said: “The right to be heard does not automatically include the right to be taken seriously.”
Gromit · 9 February 2011
It is like a dream here! I will close with two points:
1. It looks like I even have to hold Matheson’s hand who doesn’t seem to have the wit to understand what ‘relative frequency’ is. Read the label on the vertical axis, Mr. Matheson. Note the phrase ‘mutants/total’. ‘Mutants/total’ is the relative frequency of each amino acid at each site for sequences selected for functionality. I get the sense that you are talking about a paper, the methodology of which you do not even understand. One thing is clear, you have no concept of the implications of what one can do with that information. I will say this again and hopefully, there is someone here with enough neurons firing, that he/she will actually be able to grasp this ... once you have the relative frequency of each amino acid at each site for a large set of functional sequences (or for Mr. Matheson’s sake, ‘mutants/total’ for each amino acid at each site in a set of sequences selected for functionality), you are then in a position to compute the total number of functional mutant sequences/total sequences both functional and non-functional. To spoon feed a bit here, Fowler et al. examined a 33 residue sequence. If we use the 20 amino acids most commonly appearing in organic life, that gives us a total set of 20^33 sequences (the set of all possible sequences if there is no requirement for functionality). The target size in sequence space for functional WW domain sequences can be calculated once you have the numbers for the relative frequency of each amino acid at each site for those sequences that are functional (provided your sample size was large enough for the frequencies to stabilize). These data can be obtained either by experimentation, as Fowler et al. did, or by the experimental results provided by evolution, as listed in databases such as Pfam, where one can find multiple sequence alignments of hundreds or thousands of different functional sequences for the same protein family.
In the case of the WW domain, Pfam has 2,909 sequences listed, although some may be redundant. I am not going to take the time to run those sequences (it would take me a couple hours to do the computational analysis from start to finish), but based on my work with other short protein sequences, I would estimate that the upper limit for the number of functional sequences for the WW domain would be somewhere around 10^11 (or 100 billion). In other words, I am estimating that the upper limit for the number of functional WW domain sequences is roughly 100 billion. However, due to pairwise, 3rd order, 4th order, etc. dependencies in the 3D structure, the actual number of functional sequences will probably be several orders of magnitude smaller. Some of the intelligentsia here are greatly encouraged due to the fact that it was so easy for Fowler et al. to find functional sequences for the WW domain. If they will read the paper, however, there was only an average of two mutations/sequence. My own work suggests that an average of 3 different amino acids/site are functional (this is just an average .... some sites are highly conserved and others will permit all 20 amino acids). Given this, if one generates a large number of sequences that differ from the wild type in only two locations, the probability of generating a large number of functional, mutant sequences is very high. 100 billion functional sequences sounds like a lot, but in a 33-residue sequence space, it represents a target size of only 10^-22. That is a sobering thought. That is why Fowler et al. did not use randomly generated 33-residue sequences. Of course, as I stated, my 100 billion functional sequence estimate is only an estimate.
I could give you a much more accurate upper limit if I ran the 2,909 sequences listed in Pfam to compute the relative frequencies of each amino acid at each site and then ran the computation to see what the overall relative frequency of entire functional sequences in sequence space is. I have done this for several protein families, and the paper is now in the review process. You are not going to like the results. As an aside, it will not surprise me if the comprehension level of many in this group was insufficient to complete the reading of this first point and they concluded that this was nothing but sophistry.
2. I can see that this hooting, chest-thumping group of primates has neither the desire nor the wit to estimate an upper limit for the total number of trials for the evolutionary search engine of life. So much for testable science. As my last act of kindness before I retire from this discussion, I will not do your work for you, but I will give you a reference that you can use. David Dryden et al. published a short paper in the Journal of the Royal Society Interface in 2008, in which they estimated an ‘extreme upper limit’ of 4 x 10^43 different amino acid sequences that could have been explored. My own calculations (using 10^30 life forms, 4 billion years, an average genome size, a fast replication rate, and a fast mutation rate) suggest that they have been generous, but that is fine. Dryden suggested that a reduced sequence space has been adequately explored. Actual research, however, is suggesting that most of the 20 amino acids are indispensable for most 3D structural domains, so a reduced search space is not on.
So there you go; a freebie from me. When it comes to finding the thousands of different protein families of life, you have a total of 10^43 trials. That sounds like a pretty big number, does it not? I expect to see you use this number from now on (though please have the honesty to say that it is an extreme upper limit). With that number, the full diversity and disparity of protein families has been discovered .... at least so it is thought ... but since science is testable, you should be able to test that hypothesis, though I see no hint here that any of this crowd will be involved in doing real, testable science.
Closing comment: I have had it here. I am very disappointed in the nonsense and utter lack of thinking demonstrated in this forum. You could not even start the assignment I gave you yesterday. Hopefully, I have helped you out a wee bit by giving you 10^43 trials to play with, although I very much doubt you will know what to do with it.
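The arithmetic in point 1 of the comment above can be checked directly. The sketch below is only an illustration under stated assumptions: it takes the comment's own inputs at face value (20 amino acids, 33 residues, an assumed 10^11 functional sequences), and it adds one possible way to turn per-site relative frequencies into a sequence count, namely an entropy-based "effective count" that treats sites as independent. The comment does not say which method its author uses, so that part, the function names, and the toy alignment are all hypothetical. Working in log10 keeps the numbers readable. With these inputs, 20^33 is about 8.6 x 10^42, so 10^11 functional sequences correspond to a fraction near 10^-32; the 10^-22 figure quoted in the comment does not follow from the inputs as stated.

```python
import math
from collections import Counter


def log10_sequence_space(length, alphabet_size=20):
    """log10 of the number of possible sequences of the given length."""
    return length * math.log10(alphabet_size)


def log10_target_fraction(log10_functional, length, alphabet_size=20):
    """log10 of (functional sequences / all possible sequences)."""
    return log10_functional - log10_sequence_space(length, alphabet_size)


def site_entropy_bits(column):
    """Shannon entropy (bits) of the residue frequencies at one site."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def log10_effective_sequences(alignment):
    """Assumption: sites are independent, so the effective number of
    sequences is 2 ** (sum of per-site entropies). This is an illustrative
    estimator, not the method used in the comment or in any cited paper."""
    length = len(alignment[0])
    total_bits = sum(
        site_entropy_bits([seq[i] for seq in alignment]) for i in range(length)
    )
    return total_bits * math.log10(2)


if __name__ == "__main__":
    # Figures quoted in the comment: 33 residues, 20 amino acids, 10^11 functional.
    print("log10(20^33)          =", round(log10_sequence_space(33), 2))
    print("log10(10^11 / 20^33)  =", round(log10_target_fraction(11, 33), 2))
    # "An average of 3 amino acids per site" read as 3^33 sequences.
    print("log10(3^33)           =", round(log10_sequence_space(33, 3), 2))
    # Hypothetical toy alignment: site 1 is fixed, site 2 has two residues.
    print("toy effective count   =", round(10 ** log10_effective_sequences(["AC", "AD"]), 2))
```

Because the independent-site estimate ignores the pairwise and higher-order dependencies mentioned in the comment, it should be read as a rough ceiling at best, and it says nothing about how functional sequences are distributed around any particular starting sequence.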
mrg · 9 February 2011
raven · 9 February 2011
DS · 9 February 2011
Gromit,
So you reject the one true protein for the "you have to make every conceivable protein starting from just one protein" argument. Great. I'm sure everyone is completely fooled by your nonsense.
Look dude, if you are unwilling to answer questions, why on earth would you expect others to afford you the same respect you deny to them? Good bye.
eric · 9 February 2011
mrg · 9 February 2011
SWT · 9 February 2011
Mike Elzinga · 9 February 2011
John Vanko · 9 February 2011
Dear god, he does like to hear himself talk.
And he takes great pleasure in belittling others, nice guy.
Good riddance.
Stanton · 9 February 2011
David · 9 February 2011
OK, here's something I've been wondering for a while now. Gromit, if you care to answer for yourself and your "colleagues" (BTW love the air of mystery: "we have top men working on it right now... top men") go right ahead. If anyone else cares to chime in who’s familiar with the Biologic people, all the better.
So, in Doug Axe’s view, every protein structure is an island, and going from one protein to another, speaking mixaphorically, is like the backwoods of Maine: “you can’t get there from here”. But the only evidence presented is this “the fraction of all possible sequences that fold into this particular structure with this particular function is very small” argument which a) nobody really disputes and b) is completely irrelevant. It’s like using the fact that less than 30% of the earth is dry land to figure out if you can walk from NY to SF without getting your feet wet. This is a really crucial point- what you’re currently standing on tells you a lot more about what’s nearby than just knowing the global average.
Axe, presumably, in that he trained with very competent scientists, accepts that close orthologs developed from stepwise mutation from some common ancestor. But if that’s plausible, what about orthologs that share 50% or even only 20% identity? And beyond that, what about paralogs with clear remote identity but very different functions, and at that point, what about all sequences in the same family or superfamily?
I’m not even arguing here that drawing the line at any particular place is implausible. I just don’t see any clearly explained criterion for where that line lies. The average prevalence in all of sequence space is a useless statistic here. So where’s the line, and why? I don’t even need to see a peer-reviewed paper, vanity press would be fine, heck, even a supercilious comment in a blog would be a start.
fnxtr · 9 February 2011
mrg · 9 February 2011
Steve Matheson · 9 February 2011
Flint · 9 February 2011
Mike Elzinga · 9 February 2011
Dale Husband · 9 February 2011
DS · 10 February 2011
What Gromit doesn't realize is that when he runs away he loses. He is the one who is trying to convince everyone else that they are wrong and he is right. So far, he hasn't convinced anyone of anything. Of course, if he really wants to convince scientists, the only way to do that is in the peer-reviewed literature.
This guy hasn't even stated exactly what he thinks is impossible, let alone why. Apparently he thinks that if a single generation of bacteria in a cubic foot of soil cannot produce every possible protein starting with just a single one, then somehow evolution is impossible. And of course, he hasn't even considered any of the mechanisms for generating genetic variation besides simple point mutations.
No wonder he fixated on a few mildly rude comments and used them as an excuse to run away. That was all he had left after all the bluster and false bravado. Does he really think that anyone is going to be fooled by that?
mrg · 10 February 2011
harold · 10 February 2011
eric · 10 February 2011
Flint · 10 February 2011
eric · 10 February 2011
mrg · 10 February 2011
"Probability calculations are the last refuge of a scoundrel." -- Jeff Shallit
eric · 10 February 2011
Shallit is incorrect; they are the first. :)
Mike Elzinga · 10 February 2011
Shebardigan · 10 February 2011
Mike Elzinga · 10 February 2011
Shebardigan · 10 February 2011
Terenzio the Troll · 11 February 2011