PICS-Ord or: How to Stop Worrying and Use Ambiguous Regions in Phylogenetic Analysis
Posted 30 January 2011 by Reed A. Cartwright
I've had the good fortune of having some papers published recently, and my intention is to eventually summarize all of them for our readers. The first paper that I'm going to discuss is a methodology paper concerning a way of extracting phylogenetic information from regions of multiple sequence alignments that are full of indels (insertions and deletions) and difficult to align:
PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination. (link)
Robert Lücking, Brendan P Hodkinson, Alexandros Stamatakis and Reed A Cartwright
BMC Bioinformatics 2011, 12:10 doi:10.1186/1471-2105-12-10
Background We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord). The method works via ordination of sequence identity or cost scores matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method.
Results Including ambiguous regions coded by means of PICS-Ord increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICS-Ord performs equal to or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantage of a practically unlimited alignment size and increased analytical speed and the possibility of PICS-Ord scores to be analyzed together with DNA data in a partitioned maximum likelihood model.
Conclusions Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICS-Ord codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word- and frequency-based methods, PICS-Ord maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework. RAxML (version 7.2.6 or higher that was developed for this study) allows up to 32-state ordered or unordered characters. A GTR, MK, or ORDERED model can be applied to analyse the PICS-Ord codes partition, with GTR performing slightly better than MK and ORDERED.
Availability An implementation of the PICS-Ord algorithm is available from http://scit.us/projects/ngila/wiki/PICS-Ord. It requires both the statistical software, R http://www.r-project.org and the alignment software Ngila http://scit.us/projects/ngila.
My co-author, Brendan Hodkinson, has already covered it on his blog, and the software can be downloaded from my website.
In phylogenetic analysis, regions of sequences that contain a lot of overlapping indels are difficult to align, and whatever alignment that is produced is considered to have a low signal-to-noise ratio. Typically these regions are removed from any phylogenetic analysis, but we propose a method to convert the information they contain into mutli-state characters.
In molecular biology, an alignment is a partial reconstruction of the evolutionary history of a group of sequences. In an alignment, all residues found in the same column are considered to be descended from a single residue in the ancestral sequence. (Of course, insertions violate this description, but I won't get into that.) Alignments are not direct observations. They are actually inferences based on the patterns of sequences found in the dataset. Often times there are particular areas in which the alignment is difficult to resolve. Take this example:
A typical problem in multiple sequence alignments where a section is full of gaps and contains a complicated phylogenetic signal. Dark red: high certainty that alignment is accurate; Dark blue: low certainty that alignment is accurate..
It was constructed via the GUIDANCE webserver. (A great resource that everyone should use.) In this example, we have a region defined by a lot of sequence variation created by many insertions and deletions. The alignment is not well defined here, and in most applications it will just be removed, and the data "thrown away".
But is this the only solution? In our paper we develop a methodology, dubbed PICS-Ord (download), that provides an easy solution for extracting phylogenetic information from problematic regions chosen by its user. PICS-Ord works through a three-step process:
Realign the segments in pairs using Ngila, and calculate the likelihood of the alignment from an evolutionary model. This produces a distance matrix of the segments.
This might seem a bit odd at first. "Why not just use the distance matrix directly?" That would be great if we could, but there aren't any phylogenetic programs that we know off that allow the mixing of distance matrices and sequence data. With our method, we get discrete, ordered characters that can be used in popular programs like, RAxML.
There are three example files in the PICS-Ord distribution, and I'll illustrate its usage with example1.fas. The alignment of these sequence fragments is messy:
But instead of throwing it away, you can process it with PICS-Ord and get a clean set of ordered characters that contain approximately the same phylogenetic information as the sequences above. The command ./picsord example1.fas produces the following output.
This dataset can now be analyzed along side the rest of your data using my co-author's Alexis Stamatakis's, RAxML phylogenetic software. Let's assume that you have a set of four sequences, and you want to find their phylogeny. However, there are two regions of the alignment that are difficult to align and you have generated PICS-Ord codes for them.
Once this is done, you need to create a Phylip formatted file containing all the data (data.phy):
4 14
Seq1 0123ACGTTT2345
Seq2 1022ACGGTT2234
Seq3 0031AGTTTT1201
Seq4 1010ACGTTT2101
And a partition file (part.txt) that will describe to RAxML the different sections:
MULTI, p1 = 1-4
DNA, p2 = 5-10
MULTI, p3=11-14
Next you can estimate phylogenetic trees from your data using RAxML can now generate your trees: ./raxmlHPC -q part.txt -s data.phy -K GTR -m GTRGAMMA -n T1. The -K parameter defines the model to use for multi-state characters (PICS-Ord) codes.
We feel that PICS-Ord opens up a new way of handling alignment issues for phylogenetic analysis. In our paper we find that it tends to improve phylogenetic analysis. On several datasets used in lichen systematics, we found that it tended to increase the accuracy of the trees. The confidence of accurately resolved clades tended to increase, while the confidence of incorrectly resolved clades tended to decrease. In other words, both true positives and false positives were identified more strongly when using PICS-Ord encoded data.
We've talked about a couple related methods while working on this project. If we develop any enhancements, I hope to be able to share them with you.
I saw the talk on this at the Evolution 2010 conference in Portland IIRC, glad that it is now published and available, looks exciting. I think I'll suggest we do this paper in our Bioinformatics journal club sometime in the near future, we certainly have some problematic multi-gene datasets that this could prove useful on.
If you could assume distances are multivariate normally distributed (approximately) and have covariances and variances obtained from the tree -- which takes some assuming -- you could get likelihoods for a set of distances on the tree, and use likelihood as the common “currency” in a joint method of inferring trees using both a statistical model for the nucleotide sequences and a corresponding statistical model for the distances. Note I said “if”.
There have been generalized least squares methods (I am thinking particularly of work by Kishino and Hasegawa and earlier by Ranajit Chakraborty) which derived expectations and covariances for distances from an underlying nucleotide model.
By default we use the the evolutionary model described in Cartwright (2009) as implemented in Ngila. This model is used to setup the substitution and log-affine gap costs used by Ngila aligner code. Ngila optimizes global alignment using the method of Miller and Myers (1988), which supports more complex monotonic penalty functions than the standard Gotoh approach.
Of course, PICS-Ord is written to be flexible. So much of what we do can be swapped out for something you prefer.
John Harshman · 31 January 2011
This being Panda's Thumb, this would be a good opportunity for you to spoon-feed me (and anyone else who might be interested) with an explanation of the model. Could it be used in an attempt to create the simultaneous alignment/tree-building program that everyone is looking for? Is is useful beyond pairwise comparisons? That is, could it be used to decide likelihoods for multi-sequence alignments?
FYI: Most of this comment will be about research that predates PICS-Ord.
The model in Cartwright (2009) is not an attempt to create an alignment+tree program. (See Bali-Phy for a good implementation of that.) The model I developed is only for pair-wise alignments, and was developed as part of a project to estimate indel mutation rates without alignment biases.
To understand the model, first thing you need to do is look at Figure 1. It is a hidden-Markov model that represents alignments based on a chain of events (match/mismatch, insertion, deletion). The important thing that it does is that it allows indels to follow a power-law distribution, which improves the correlation of alignments with biology. All the parameters are expressed in evolutionary terms.
Right now, I haven't adapted the model to multiple sequence alignments; I may at somepoint. But I am really interested in pushing the limits of what I can do using just pairwise comparisons.
John Harshman · 31 January 2011
I'm just an old country systematist, but I dimly see where you're going. Perhaps "dimly" should be emphasized. I'm interested in indel evolution, for which among other things it would be nice to be able to infer the size distribution of indels. Though since there are several indel-generating processes, it seems as if a good distribution would be the sum of several different functions, at least for insertions. Maybe deletions are all the same.
For hard-to-align regions, if you could come up with a variety of indel histories, attach a likelihood to each, and integrate over the full set, that would help.
I can already size distributions of indels using pairwise alignments. So if that is what you want to do with your data, we should talk.
We've also talked about using some other software of mine that instead of finding an optimal alignment using the Cartwright (2009), calculates the evolutionary divergence of two sequences by integrating over possible alignments
John Harshman · 31 January 2011
Reed A. Cartwright said:
I can already size distributions of indels using pairwise alignments. So if that is what you want to do with your data, we should talk.
Perhaps we should. I have computed a size distribution that first assumes the multiple alignment and phylogeny are correct, then optimizes gaps over the tree by parsimony, treating contiguous gaps as single indels. But when indels overlap in local regions of the tree, there are ambiguities that I just ignore, and this probably distorts the distribution to some unknown extent. Taking into account alternative indel histories over the tree, weighted by likelihood, might allow incorporation of those ambiguous bits into the estimate. This is a great data set for that sort of thing, by the way, since there are many thousands of indels across 169 species.
Your distributions are likely to be highly biased. (Of course we could compare methods and see if they differ in practice.)
Multiple sequence alignment programs typically do a bad job modeling indels, often penalizing them too much. Plus they assume a non-biological model: a geometric distribution.
What are your 169 species? Is this your bird dataset? What are the sequences you are using? Etc.
John Harshman · 1 February 2011
Yes, it's the birds. I'm using all the introns in that set, a total of 33 in 15 loci. Yes, most alignment programs are bad at indels. Much of the alignment is a product of heavy tweaking by eye, or just plain eyeball alignment from the beginning. You might like to play with it.
For what it's worth, I've tested the claim that alignments are biased toward deletions by using PRANK, and the deletion bias shown in the published alignment isn't affected. (Though I don't much like PRANK's alignment, at least that parameter remains the same.)
John Harshman · 1 February 2011
If you have any interest, you can get the full alignment here:
Me too, as you can see. We have more, by the way. There's an enlarged matrix (both in taxa and introns) in preparation.
D. P. Robin · 3 February 2011
I realize that this is off on a tangent, but is there a good book or other resource for understanding cladistics? I have several books (like "On the Origin of Phyla") and I keep getting confused by the terminology. I think if it were all laid out, I could get a better handle on it.
dpr
John Harshman · 3 February 2011
D. P. Robin said:
I realize that this is off on a tangent, but is there a good book or other resource for understanding cladistics?
Have you looked at Wikipedia? No book comes to mind.
Henry J · 3 February 2011
Cladistics: groups within larger groups within still larger groups within even larger still groups, etc.
(i.e., another name for a study of the branches in an evolving nested hierarchy.)
18 Comments
Dan Gaston · 30 January 2011
I saw the talk on this at the Evolution 2010 conference in Portland IIRC, glad that it is now published and available, looks exciting. I think I'll suggest we do this paper in our Bioinformatics journal club sometime in the near future, we certainly have some problematic multi-gene datasets that this could prove useful on.
Joe Felsenstein · 30 January 2011
If you could assume distances are multivariate normally distributed (approximately) and have covariances and variances obtained from the tree -- which takes some assuming -- you could get likelihoods for a set of distances on the tree, and use likelihood as the common “currency” in a joint method of inferring trees using both a statistical model for the nucleotide sequences and a corresponding statistical model for the distances. Note I said “if”.
There have been generalized least squares methods (I am thinking particularly of work by Kishino and Hasegawa and earlier by Ranajit Chakraborty) which derived expectations and covariances for distances from an underlying nucleotide model.
Reed A. Cartwright · 30 January 2011
Thanks, Joe. I do think that is one of the directions we want to take this.
John Harshman · 30 January 2011
What model are you using to get the likelihoods in the ambiguous regions? And for simplicity, I might also ask how Ngila gets its alignments.
Reed A. Cartwright · 30 January 2011
By default we use the the evolutionary model described in Cartwright (2009) as implemented in Ngila. This model is used to setup the substitution and log-affine gap costs used by Ngila aligner code. Ngila optimizes global alignment using the method of Miller and Myers (1988), which supports more complex monotonic penalty functions than the standard Gotoh approach.
Of course, PICS-Ord is written to be flexible. So much of what we do can be swapped out for something you prefer.
John Harshman · 31 January 2011
This being Panda's Thumb, this would be a good opportunity for you to spoon-feed me (and anyone else who might be interested) with an explanation of the model. Could it be used in an attempt to create the simultaneous alignment/tree-building program that everyone is looking for? Is is useful beyond pairwise comparisons? That is, could it be used to decide likelihoods for multi-sequence alignments?
Reed A. Cartwright · 31 January 2011
FYI: Most of this comment will be about research that predates PICS-Ord.
The model in Cartwright (2009) is not an attempt to create an alignment+tree program. (See Bali-Phy for a good implementation of that.) The model I developed is only for pair-wise alignments, and was developed as part of a project to estimate indel mutation rates without alignment biases.
To understand the model, first thing you need to do is look at Figure 1. It is a hidden-Markov model that represents alignments based on a chain of events (match/mismatch, insertion, deletion). The important thing that it does is that it allows indels to follow a power-law distribution, which improves the correlation of alignments with biology. All the parameters are expressed in evolutionary terms.
Right now, I haven't adapted the model to multiple sequence alignments; I may at somepoint. But I am really interested in pushing the limits of what I can do using just pairwise comparisons.
John Harshman · 31 January 2011
I'm just an old country systematist, but I dimly see where you're going. Perhaps "dimly" should be emphasized. I'm interested in indel evolution, for which among other things it would be nice to be able to infer the size distribution of indels. Though since there are several indel-generating processes, it seems as if a good distribution would be the sum of several different functions, at least for insertions. Maybe deletions are all the same.
For hard-to-align regions, if you could come up with a variety of indel histories, attach a likelihood to each, and integrate over the full set, that would help.
Reed A. Cartwright · 31 January 2011
I can already size distributions of indels using pairwise alignments. So if that is what you want to do with your data, we should talk.
We've also talked about using some other software of mine that instead of finding an optimal alignment using the Cartwright (2009), calculates the evolutionary divergence of two sequences by integrating over possible alignments
John Harshman · 31 January 2011
Reed A. Cartwright · 1 February 2011
Your distributions are likely to be highly biased. (Of course we could compare methods and see if they differ in practice.)
Multiple sequence alignment programs typically do a bad job modeling indels, often penalizing them too much. Plus they assume a non-biological model: a geometric distribution.
What are your 169 species? Is this your bird dataset? What are the sequences you are using? Etc.
John Harshman · 1 February 2011
Yes, it's the birds. I'm using all the introns in that set, a total of 33 in 15 loci. Yes, most alignment programs are bad at indels. Much of the alignment is a product of heavy tweaking by eye, or just plain eyeball alignment from the beginning. You might like to play with it.
For what it's worth, I've tested the claim that alignments are biased toward deletions by using PRANK, and the deletion bias shown in the published alignment isn't affected. (Though I don't much like PRANK's alignment, at least that parameter remains the same.)
John Harshman · 1 February 2011
If you have any interest, you can get the full alignment here:
http://www.biology.ufl.edu/earlybird/aln.html
Reed A. Cartwright · 1 February 2011
Ooo. I like introns.
John Harshman · 2 February 2011
Me too, as you can see. We have more, by the way. There's an enlarged matrix (both in taxa and introns) in preparation.
D. P. Robin · 3 February 2011
I realize that this is off on a tangent, but is there a good book or other resource for understanding cladistics? I have several books (like "On the Origin of Phyla") and I keep getting confused by the terminology. I think if it were all laid out, I could get a better handle on it.
dpr
John Harshman · 3 February 2011
Henry J · 3 February 2011
Cladistics: groups within larger groups within still larger groups within even larger still groups, etc.
(i.e., another name for a study of the branches in an evolving nested hierarchy.)