Title | Principles of Microbial Diversity |
---|---|
Author | James W. Brown |
Genre | Biology |
Series | |
Publisher | Biology |
Year of publication | 0 |
isbn | 9781683673415 |
In the Jukes and Cantor method, any difference between two sequences is scored equivalently; at each position in a pairwise comparison, the bases either match or they do not. A commonly used alternative is the Kimura two-parameter model, in which transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or pyrimidine to purine) are scored differently because transitions are much more common than transversions (Fig. 5.1). The two scores are set by first scanning the alignment to determine the relative frequency of transitions and transversions, and the two types of change are then scored accordingly. It is even possible to use a six-parameter model, in which each type of substitution (G:A, G:C, G:U, A:U, A:C, and U:C) is scored differently (Fig. 5.2).
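The difference between the two models can be made concrete with a short sketch. The code below is an illustration only, not the implementation used by any particular treeing program: it counts transitions and transversions between two aligned sequences and then applies the standard closed-form Jukes and Cantor and Kimura two-parameter distance corrections. The function names are hypothetical, gap positions (`-`) are assumed to be skipped, and both sequences are assumed to use the same alphabet (all U or all T).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U", "T"}


def count_differences(seq1, seq2):
    """Return (compared, transitions, transversions) for two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gaps count as neither match nor mismatch
        compared += 1
        if a == b:
            continue
        # Same chemical class (both purines or both pyrimidines) = transition
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return compared, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: every kind of difference is scored the same."""
    n, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance: transitions and transversions differ."""
    n, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

For two sequences that differ only by transitions, the two-parameter distance comes out slightly larger than the Jukes and Cantor distance, because the model treats those changes as drawn from a faster, more easily saturated class of substitution.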
It is also possible to “weight” the score of each position (column) in an alignment according to how conserved that position is; a difference at a conserved position is then scored as a greater difference than a difference at a more variable position. This requires alignments with many sequences so that the variability at each position can be measured reliably, so the weights are very often predetermined for the class of RNA being analyzed. The Weighbor algorithm used by the Ribosomal Database Project does this; the name stands for “weighted neighbor joining.” Distance matrices from protein alignments usually use a scoring table derived from the observed relative frequency with which any amino acid is substituted by another in a huge collection of aligned protein sequences, e.g., the PAM tables.
There are also different ways in which gaps can be dealt with. In most treeing algorithms, gaps are ignored; these positions are counted as neither a match nor a mismatch. This is not because they are unimportant; in fact, because insertions and deletions are less common than nucleotide substitutions, they are potentially more important than substitutions. However, it is not clear how to deal with gaps for a variety of reasons. The obvious case is where the alignment contains sequence fragments, i.e., partial sequences, instead of full-length sequences (Fig. 5.3). Partial sequences have two kinds of gaps: gaps that represent bases that are not present in that sequence (indels), and gaps that represent regions of sequence outside of the region for which the sequence is available. The algorithm cannot distinguish between these because alignments do not distinguish between different kinds of gaps.
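One heuristic sometimes applied before treeing is to treat the leading and trailing runs of gaps in each sequence as missing data, while leaving internal gaps marked as indels. The sketch below illustrates that idea only; the function name and the `?` missing-data symbol are assumptions, not part of any specific alignment format described here. The heuristic is imperfect: a sequence that genuinely lacks bases at its ends will be marked as missing data as well.

```python
def mask_terminal_gaps(seq, gap="-", missing="?"):
    """Replace leading and trailing gap runs with a missing-data symbol.

    Internal gaps are left untouched and are still read as indels.
    """
    stripped = seq.strip(gap)
    if not stripped:
        # The whole row is gaps: nothing is known about this sequence
        return missing * len(seq)
    lead = len(seq) - len(seq.lstrip(gap))    # length of the leading gap run
    trail = len(seq) - len(seq.rstrip(gap))   # length of the trailing gap run
    return missing * lead + stripped + missing * trail
```

Applied to the aligned fragment `--AC-GU---`, this marks the two leading and three trailing positions as missing and keeps the single internal gap as an indel.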
Figure 5.2 A six-parameter substitution model scores each possible substitution differently. doi:10.1128/9781555818517.ch5.f5.2
It is also difficult to deal with the fact that adjacent gaps are not independent. A string of gaps probably represents the insertion or deletion of more than one base at the same time, not the one-at-a-time insertion/deletion of individual bases. For example, a five-base string of gaps most likely represents a single insertion/deletion of five nucleotides, not five independent insertions/deletions of single nucleotides. More sophisticated algorithms charge a large scoring penalty for opening a gap but only a very small additional penalty for each adjacent gap position that extends it.
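This scoring scheme is usually called an affine gap penalty. A minimal sketch follows; the function name and the penalty values (10 to open a gap, 0.5 for each further adjacent position) are arbitrary assumptions chosen for illustration, not values taken from any particular program.

```python
import re


def affine_gap_penalty(seq, gap_open=10.0, gap_extend=0.5, gap="-"):
    """Total gap penalty for one aligned sequence.

    Each run of adjacent gaps is charged gap_open once, plus gap_extend
    for every additional gap position in the same run.
    """
    penalty = 0.0
    for run in re.finditer(re.escape(gap) + "+", seq):
        length = run.end() - run.start()
        penalty += gap_open + gap_extend * (length - 1)
    return penalty
```

With these values, one five-base string of gaps (`AC-----GU`) costs 12.0, whereas five separate single-base gaps (`A-C-G-U-A-C`) cost 50.0, which reflects the view that the single five-base event is far more likely.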
In addition, it is not clear how to deal with variation at the 5′ and 3′ ends of the RNA; for example, some RNase P RNAs have the rho-independent terminator stem-loop at the end of the RNA removed while some do not (and in at least some organisms, the RNA exists in both versions). What to do with all the gaps in aligned RNAs in which this structural element is removed? And because most RNA sequences are determined from their genes, the exact ends of the encoded RNAs are most often not even known.
Figure 5.3 An alignment showing two fundamentally different types of gaps. All of the gaps in the upper half of the sequences are indels; at least one sequence in the database has nucleotides at these positions, but these sequences do not. Some of the sequences in the bottom half of the alignment are partial sequences, i.e., sequence fragments that use gaps wherever there are no sequence data. doi:10.1128/9781555818517.ch5.f5.3
The special case of G∙C bias
Sometimes even rRNA sequences change adaptively—the bane of phylogenetic analysis. The most common example is the tendency of sequences to differ in G+C content, either because the genome has an unusual G+C content (i.e., there is pressure toward either G+C or A+T richness in the genome) or because the organism is a thermophile and so might prefer G=C over A=U base pairs in its RNAs. This can cause havoc in a tree. One way around this is to do a transversion analysis, which ignores transitions and only scores transversions. The common way to do this is simply to convert all of the A’s in the alignment to G’s and all U’s to C’s. Trees are generated from these alignments in the usual fashion. These trees are, of course, based on fewer data since more than half of the phylogenetic information in the alignment has been discarded, but they should be free of G+C bias artifacts.
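The recoding step described above is simple enough to sketch in a few lines. The function name is hypothetical, and converting T as well as U to C is an added assumption so that DNA-alphabet alignments are handled the same way.

```python
# Collapse each purine to G and each pyrimidine to C, so that only
# purine/pyrimidine (transversion) differences remain in the alignment.
TRANSVERSION_RECODE = str.maketrans({"A": "G", "U": "C", "T": "C"})


def recode_for_transversions(seq):
    """Return the sequence with A->G and U (or T)->C; gaps are unchanged."""
    return seq.upper().translate(TRANSVERSION_RECODE)
```

After recoding, two sequences that differ only by a transition (for example `GAUC` and `GGUC`) become identical, while a transversion (`GAUC` versus `GCUC`) is still visible as a difference.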
Long-branch attraction
One of the things substitution models fight is a treeing artifact called long-branch attraction. Long-branch attraction is the result (primarily) of an underestimation of the evolutionary distance of distantly related sequences. This underestimation results in a tendency for the longest branches in a tree to artificially cluster together; this also results in the artificial clustering of short branches. Figure 5.4 shows a very simple demonstration of how long-branch attraction can result in incorrect trees.
Long-branch attraction happens because of the difference in evolutionary rates in the branches. Therefore, it is always worth worrying about the details of trees containing branches with very different evolutionary rates, i.e., those with branches of very different lengths.
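The underestimation of large distances can be illustrated numerically. The sketch below assumes the simple Jukes and Cantor model, under which the expected fraction of differing positions between two sequences levels off at 0.75 no matter how many substitutions have actually occurred; the function name is hypothetical.

```python
import math


def expected_observed_difference(true_distance):
    """Expected fraction of differing sites under the Jukes-Cantor model.

    true_distance is the actual number of substitutions per site.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * true_distance / 3.0))


# Compare the true distance with what would be observed by simple counting
for d in (0.1, 0.5, 1.0, 2.0):
    p = expected_observed_difference(d)
    print(f"true distance {d:.1f} -> observed difference {p:.3f}")
```

At short distances the observed difference is close to the true distance, but at a true distance of 2.0 substitutions per site, fewer than 0.75 differences per site are observed. Long branches are therefore underestimated far more than short ones unless a substitution model corrects for the multiple changes.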
Figure 5.4 Generation of a “long-branch attraction” artifact in a phylogenetic tree. If the sub-tree to the left is the representation of how these sequences are actually related, imagine what would happen in a neighbor-joining analysis. Sequences A and B are more alike (i.e., they have a smaller evolutionary distance between them) than either is to C, and so they will be erroneously joined, as shown on the right. doi:10.1128/9781555818517.ch5.f5.4
One of the primary causes of strikingly long branches, by the way, is bad sequence or poor alignment. If the primary sequence data are poor, every mistake in the data will be counted as an evolutionary change by the treeing algorithm. Likewise, poor alignment causes most of the bases in the poorly aligned region to be counted as evolutionary changes, lengthening the branch leading to that sequence. Again, beware of trees with unexpectedly long branches! Poor alignment or bad sequence data, resulting in long branches, can combine with long-branch attraction to make trees meaningless.
Treeing algorithms
Fitch-Margoliash: an alternative distance-matrix treeing method
Another useful method for generating trees from distance matrices is that of Fitch and Margoliash, commonly called Fitch. This algorithm starts with two of the sequences, separated by a line whose length equals the evolutionary distance between them. For example, for this distance matrix:
Evolutionary distance | |||||
A | B | C | D | E | |
A | — | — | — | — | — |