Friday, October 25, 2013

Neighbor Joining Trees

October 25, 2013. For the past two weeks, I have been reading introductions to several major methods for estimating phylogenetic trees. There are primarily two approaches to tree estimation:

  • Algorithmic - uses an algorithm to estimate a tree from the data. The advantages of this approach are speed and the fact that it yields only a single tree. Algorithmic methods include Neighbor Joining and UPGMA.
  • Tree-searching - estimates many trees and then uses some criterion to decide which is the best tree of all.
Another way to categorize these methods is as distance vs. character-based methods.
  • distance - Neighbor Joining, UPGMA
  • character-based - Parsimony, Maximum Likelihood, Bayesian Inference
Our discussion today focused on the Neighbor Joining (NJ) method. Nonetheless, it is very important to acknowledge that we can never be sure we have the 'right' tree, since we cannot know exactly what happened in the past. All of these methods only allow us to deduce the order in which existing taxa (sequences) diverged from a hypothetical common ancestor, and to calculate the amount of change along the branches between the divergence events.

NJ is one of the most popular distance-based algorithmic methods. It produces a single, strictly bifurcating tree (meaning that each internal node has exactly two branches descending from it). I downloaded another file for practice, which I am going to show you here.
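To make the idea of "an algorithm that yields a single tree" more concrete, here is a minimal Python sketch of the textbook Neighbor Joining procedure. It is not what MEGA runs internally (I have not seen that code); the function name `neighbor_joining` and the small distance matrix are made up for illustration, and ties between equally good pairs are broken arbitrarily.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Build an unrooted NJ tree and return it as a Newick string."""
    if len(names) < 3:
        raise ValueError("need at least three taxa")
    # dist[a][b] holds the current distance between clusters a and b.
    dist = {a: {b: matrix[i][j] for j, b in enumerate(names) if i != j}
            for i, a in enumerate(names)}
    while len(dist) > 3:
        n = len(dist)
        # r[a] is the sum of distances from a to every other cluster.
        r = {a: sum(row.values()) for a, row in dist.items()}
        # Join the pair that minimizes Q(a, b) = (n - 2) * d(a, b) - r[a] - r[b].
        a, b = min(combinations(dist, 2),
                   key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dab = dist[a][b]
        la = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))  # branch from a to new node
        lb = dab - la                                 # branch from b to new node
        u = f"({a}:{la:g},{b}:{lb:g})"                # new internal node, as Newick text
        new_row = {k: (dist[a][k] + dist[b][k] - dab) / 2
                   for k in dist if k not in (a, b)}
        del dist[a], dist[b]
        for k, row in dist.items():
            del row[a], row[b]
            row[u] = new_row[k]
        dist[u] = new_row
    # Three clusters left: connect them to one central node.
    x, y, z = dist
    lx = (dist[x][y] + dist[x][z] - dist[y][z]) / 2
    ly = (dist[x][y] + dist[y][z] - dist[x][z]) / 2
    lz = (dist[x][z] + dist[y][z] - dist[x][y]) / 2
    return f"({x}:{lx:g},{y}:{ly:g},{z}:{lz:g});"

# Made-up distances between five hypothetical taxa.
demo_names = ["a", "b", "c", "d", "e"]
demo_matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(demo_names, demo_matrix))
```

Every join creates a node with exactly two descendants, which is why the result is strictly bifurcating apart from the central node of the unrooted tree.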

First I opened the file LargeData.meg in MEGA. The window shows a DNA sequence alignment. The program only shows a base when it differs from the base at that position in the first sequence; otherwise it just shows ".".
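The dot display is easy to mimic. This is only a small sketch of the idea in Python, with a made-up function name (`dot_display`) and made-up sequences, not MEGA's own code:

```python
def dot_display(seqs):
    """Show the first sequence in full; in the others, replace every base
    that matches the first sequence at that position with '.'."""
    reference = seqs[0]
    shown = [reference]
    for seq in seqs[1:]:
        shown.append("".join("." if base == ref_base else base
                             for base, ref_base in zip(seq, reference)))
    return shown

# Hypothetical three-sequence alignment.
for line in dot_display(["ACGTAC", "ACGTTC", "TCGTAC"]):
    print(line)
```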


Then we calculated the average Jukes-Cantor (JC) distance for our data. The data are not suitable for NJ if the average JC distance is greater than 1.0. In this case, I got 0.811.
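For reference, the standard Jukes-Cantor formula is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of sites that differ between two sequences. Below is a rough Python sketch of averaging it over all pairs. The function names are my own, it skips any position that is not A, C, G or T, and MEGA's handling of gaps and missing data may well differ.

```python
import math
from itertools import combinations

def p_distance(seq1, seq2):
    """Proportion of compared sites that differ (positions with gaps or
    ambiguous bases are skipped)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jc_distance(p):
    """Jukes-Cantor correction of a p-distance; undefined once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

def average_jc(seqs):
    """Mean JC distance over all pairs of aligned sequences."""
    dists = [jc_distance(p_distance(a, b)) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists)

# Hypothetical alignment, just to show the call.
print(round(average_jc(["ACGTACGT", "ACGTACGA", "ACGAACGA"]), 3))
```

The correction grows faster than p itself, because it tries to account for sites that have changed more than once.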

Next, I constructed the tree by choosing Construct/Test Neighbor Joining Tree from the Phylogeny menu. An Option Summary window pops up. While it is fine to stay with the default setting (Maximum Composite Likelihood), my mentor and I discussed several other models for the Model/Method section (for simplicity, I won't go into detail here):
  • Jukes-Cantor model
  • Kimura 2-Parameter model (K2P)
  • Tamura-Nei model
In order to test the reliability of our tree, we also selected "Bootstrap method" for Phylogeny Test, with 1000 replications. Bootstrapping is, in essence, a way to test how well the data support a certain "split" by repeatedly resampling the data on the computer and rebuilding the tree.
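As far as I understand it, each bootstrap replicate is made by drawing alignment columns at random, with replacement, until the new alignment is as long as the original. A tree is then built from each replicate, and the bootstrap value of a split is the percentage of replicate trees that contain it. Here is a sketch of the resampling step only, in Python; the function name and the tiny alignment are hypothetical.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Return one pseudo-replicate: columns drawn with replacement from the
    original alignment, same number of sequences and same length."""
    length = len(seqs[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in seqs]

# Hypothetical alignment of three sequences.
alignment = ["ACGTAC", "ACGTTC", "TCGTAC"]
rng = random.Random(2013)  # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

Because the same column index is used for every sequence, each resampled column is still a real column of the original alignment; only how often each one appears changes.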

As we hit compute...
The number assigned to each internal node is the bootstrap value (bootstrap percentage). A high number suggests that the split is well supported by the data rather than being a chance result. We can get a clearer view of the tree by clicking Display Only Topology.
Each internal node only splits into two branches (strictly bifurcating)
A bootstrap value below 50% means we really have no idea what the branching order is. Scientists usually collapse those branches into polytomies - nodes from which more than two branches descend.
Majority rule tree
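Collapsing weakly supported branches can be sketched in a few lines of Python. The tree representation here is my own assumption: a leaf is a string, and an internal node is a pair (bootstrap value, list of children). Any internal node below the cutoff is removed and its children are attached directly to its parent.

```python
def collapse(node, threshold=50):
    """Collapse internal nodes whose bootstrap value is below the threshold,
    turning their parent into a polytomy."""
    if isinstance(node, str):          # a leaf (taxon name)
        return node
    support, children = node
    kept = []
    for child in children:
        child = collapse(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            kept.extend(child[1])      # splice the grandchildren into this node
        else:
            kept.append(child)
    return (support, kept)

# Hypothetical tree; the root is given 100 so it is never collapsed.
tree = (100, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])
print(collapse(tree))
```

In this made-up tree the node with a value of 40 disappears, so the root ends up with three branches descending from it.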
Finally, we can do various things to change the appearance of the tree to make it best represent our data...
The same tree in a circular layout (an NJ tree is unrooted, whichever layout is used)
This chapter was quite hard for me because it involved a lot of equations and concepts. Nonetheless, I am happy that I constructed a decent-looking phylogenetic tree in the end! Next week I will discuss the bootstrapping method with my mentor in more detail, and hopefully it will make more sense to me! :)

Tuesday, October 8, 2013

More MEGA 5 with Protein Sequences

Last Friday (October 4th, 2013) I learned more about using MEGA 5 and BLAST for constructing phylogenetic trees. This time, however, we used protein BLAST instead of nucleotide BLAST, because searching with protein sequences can detect much more distantly related homologs than searching with nucleotide sequences. If two sequences are "homologous," we assume that they descended from a common ancestor (this needs to be distinguished from "similarity"). Recall that an amino acid is coded by a codon and that the same amino acid can be coded by several codons, so a silent substitution does not change the amino acid sequence. For DNA sequences, there are only 4 possible states (A, T, C, G) for each character, so even two unrelated sequences are expected to be about 25% identical by chance. When sequences have diverged so much that they are only about 25% identical, the program would classify them as not closely related, even though they may have very similar amino acid sequences once you translate them. Thus, the solution is to use a protein sequence as the query. Proteins have 20 possible states, so the lower limit of detectable homology drops to about 5% (instead of 25%).
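Here is a small Python illustration of that point. The two DNA sequences are made up; the codon table is the standard genetic code written out compactly. The sequences agree at only 2 of 9 positions, yet they translate to exactly the same three amino acids.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon (any trailing partial codon is ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def identity(seq1, seq2):
    """Fraction of aligned positions that are identical."""
    return sum(a == b for a, b in zip(seq1, seq2)) / len(seq1)

# Two hypothetical genes that use different codons for Leu, Ser and Arg.
dna1 = "CTGTCTCGT"
dna2 = "TTAAGCAGA"
print(translate(dna1), translate(dna2))
print("DNA identity:    ", round(identity(dna1, dna2), 2))
print("protein identity:", identity(translate(dna1), translate(dna2)))
```

This is an extreme, hand-picked case, but it shows why a nucleotide search can miss a relationship that a protein search picks up.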

For this exercise we used the EbgC protein sequence as the query and ran blastp.


These protein sequences were identified as closely related to our query sequence. If you click on the first entry...

The first protein entry actually covers 1087 DNA sequences. These DNA sequences all code for the same protein sequence even though the DNA sequences themselves vary (silent mutations). Therefore, all 1087 sequences are very closely related. If we had done a blastn instead of a blastp, the results would have shown only 100 of these 1087 sequences, and we wouldn't even have seen the other, more distantly related sequences. The second entry (evolved beta-galactosidase subunit beta [Escherichia coli]) is an example of a more distantly related sequence (even though it is still pretty close).

Later I chose several protein sequences, translated them back to DNA sequences, aligned them in MEGA, and built a tree. The whole process took so long! Nonetheless, the main idea here is that protein sequences give us a bigger picture of the relatedness between species. We go from DNA sequence --> protein sequence --> protein structure --> function. The higher the level we look at, the more distantly related the species we can incorporate.

I will be learning some command-line code over the next few weeks. However, I will not meet with Dr. Miller for the next two weeks because of schedule conflicts. In the meantime, I will continue reading the book and get more familiar with BLAST and MEGA :)