Friday, November 22, 2013

Happy Holiday!

Today (November 22, 2013), we had a short phone call with the zoo people from San Diego with whom we will be working. Even without seeing their faces, they came across as a very vibrant and intelligent group. Although I spent most of the call listening, since there was a lot to digest, I was nonetheless very excited about this project! After several follow-up emails, my goal was narrowed down to a single, clear task: what is the divergence time for these Mycobacterium avium strains?

Dr. Miller and me looking at data on our computer. Most of our work is done with computer programs, so there's no fancy gear or specimens to show B-)


We won't be meeting next Friday because it's Thanksgiving break (Black Friday shopping!!)! Happy holidays, everyone! :)

Tuesday, November 12, 2013

Bayesian Inference vs. Maximum Likelihood

Last Friday (November 8, 2013), my mentor and I discussed a bit about Bayesian inference of trees and the program BEAST.

Like maximum likelihood (ML), Bayesian inference (BI) is a character-based tree method; both generate many trees and use some criterion to decide which tree is best. However, BI differs from ML in that BI "seeks the tree that is most likely given the data and the chosen substitution model," whereas ML "seeks the tree that makes the data the most likely" (Hall, 140). Sounds the same, I know. However, after a short period of intense research and a lot of alien equations, I interpreted the major distinction like this (please correct me if I am wrong, clarification needed!!):

  • ML - build different candidate trees and calculate the likelihood of the data under each tree; pick the tree with the highest likelihood.
  • BI - find the tree that is most probable given the observed data X and our prior knowledge (e.g., a substitution model); a toy sketch contrasting the two follows this list.
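
To make the distinction concrete for myself, here is a tiny Python sketch. All the tree names and probabilities are invented toy numbers, not real phylogenetics: ML picks the tree maximizing P(data | tree), while BI scores each tree by P(tree | data), which is proportional to P(data | tree) times P(tree).

```python
# Toy illustration of ML vs. BI tree choice. All numbers are invented.
# likelihood[t] = P(data | tree t); prior[t] = P(tree t).
likelihood = {"((A,B),C)": 0.010, "((A,C),B)": 0.002, "((B,C),A)": 0.008}
prior      = {"((A,B),C)": 0.1,   "((A,C),B)": 0.3,   "((B,C),A)": 0.6}

# Maximum likelihood: pick the tree that makes the data most likely.
ml_tree = max(likelihood, key=likelihood.get)

# Bayesian inference: posterior P(tree | data) is proportional to
# prior * likelihood; normalize so the posteriors sum to 1.
unnormalized = {t: prior[t] * likelihood[t] for t in likelihood}
total = sum(unnormalized.values())
posterior = {t: p / total for t, p in unnormalized.items()}
bi_tree = max(posterior, key=posterior.get)

print("ML picks:", ml_tree)   # ((A,B),C) -- highest raw likelihood
print("BI picks:", bi_tree)   # ((B,C),A) -- a strong prior shifts the choice
```

With these made-up numbers the two methods actually pick different trees, which is exactly the point: the prior matters in BI.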
Mathematically, according to Bayes' Theorem:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

In Bayesian inference, the event B (the "given knowledge," i.e., the data) is fixed, and we wish to consider the impact of its having been observed on our belief in various possible events A (how the tree splits). In such a situation the denominator of the expression above, the probability of the given evidence B, is fixed; what we want to vary is A. Thus, the posterior probability is proportional to the numerator:


P(A \mid B) \propto P(A)\,P(B \mid A)

  • P(B|A) is the likelihood: it can be interpreted as P(data | hypothesis). P(A) is the prior probability, which indicates our state of knowledge about the truth of a hypothesis before we have observed the data.
  • P(A|B) is the posterior probability, which we can read as P(hypothesis | data). It shows how well our hypothesis agrees with the observed data --> this is the Bayesian quantity of interest. A worked number-plugging example follows this list.
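As a quick sanity check with made-up numbers (nothing to do with our actual data), suppose the prior P(A) = 0.3, the likelihood P(B|A) = 0.8, and P(B) = 0.5. Then:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.8 \times 0.3}{0.5} = 0.48

Observing B raised our belief in A from the prior 0.3 to the posterior 0.48.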
I drew a picture and hopefully it helps with visualization. Suppose we know there are species A, B, C, D, E, and F. For ML, we ask: which arrangement of the species on the tree makes the likelihood of the data the largest? For BI, we know from the data that A, D, E, and B are arranged as drawn; given this knowledge, we find where C and F go so that the posterior probability is greatest.

This is at least what I've got so far. Nonetheless, BI can be burdensome if no obvious prior distribution for a parameter exists, and the researcher needs to ensure that the prior selected is not inadvertently influencing the posterior distribution of the parameters of interest. I clarify my thoughts every time I write them out. Before we get into the bird data, we are trying to ask ourselves some fundamental questions about phylogenetic trees. Next Friday, I will be joining a talk with the data holders to discuss our participation in the Avian mycobacteria project.

Tuesday, November 5, 2013

More Phylo Talks!

Last Friday (November 1st, 2013), my mentor and I discussed more about bootstrapping methods and how they work (a minimal sketch of the idea follows the list below); he also gave me a paper, which I will be reading this week. We also talked about the derivation of the Jukes-Cantor model, which is the simplest evolutionary model used to predict the rate at which nucleotide substitutions occur during evolution. It has two main assumptions:

  • Equal frequencies of the four bases
  • The probability of changing from one state to a different state is always equal, i.e. A-->G is as likely to happen as G-->C
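
Circling back to bootstrapping: here is a minimal sketch of the resampling idea, assuming a toy alignment (a list of equal-length sequences) and a placeholder build_tree() function (both are hypothetical, standing in for whatever real program does the tree building):

```python
import random
from collections import Counter

def bootstrap_support(alignment, build_tree, replicates=100):
    """Resample alignment columns with replacement, rebuild the tree each
    time, and report how often each topology shows up. `build_tree` is a
    hypothetical stand-in for a real tree-building method and should
    return something hashable (e.g., a Newick string)."""
    n_sites = len(alignment[0])
    counts = Counter()
    for _ in range(replicates):
        # Draw n_sites column positions with replacement (the bootstrap step).
        cols = [random.randrange(n_sites) for _ in range(n_sites)]
        resampled = ["".join(seq[c] for c in cols) for seq in alignment]
        counts[build_tree(resampled)] += 1
    # Support value = fraction of replicates in which a topology appeared.
    return {tree: n / replicates for tree, n in counts.items()}
```

In practice, support values are usually reported per clade rather than per whole tree, but the column-resampling step is the same.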

Q Matrix of Jukes Cantor Model
f(t) represents the substitution probability, which is the same for every change from one nucleotide to a different one. The probability of A staying A is then 1 - 3f(t), since each row of probabilities (one stay plus three changes) must sum to 1.
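Since the matrix itself lives in the figure, here is my reconstruction in LaTeX, with rows and columns ordered A, C, G, T (assuming f(t) is the probability of each particular substitution, as above):

P(t) =
\begin{pmatrix}
1-3f(t) & f(t) & f(t) & f(t)\\
f(t) & 1-3f(t) & f(t) & f(t)\\
f(t) & f(t) & 1-3f(t) & f(t)\\
f(t) & f(t) & f(t) & 1-3f(t)
\end{pmatrix}

Strictly speaking, this is the transition-probability matrix; the instantaneous rate (Q) matrix has the same shape, with a rate in place of f(t) off the diagonal and negative row sums on the diagonal.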

However, many modifications of this model have since been published, including Felsenstein 81 (F81), Kimura 2-parameter (K2P), HKY85, TN93, GTR, etc.

While we are still waiting for the bird data for my project, Dr. Miller sent me a practice data set, and he would like me to explore his website on my own first (http://www.phylo.org/). So I looked at the demo, uploaded the data, and ran it with the default settings. I found the website pretty user-friendly, yet I don't really know what those results mean. Hopefully I will learn more about what I can get from these results this Friday.