Dr. Miller and I at the end-of-year presentation.
Tuesday, May 20, 2014
Final Poster Session & Bicentennial!
My internship has finally come to an end. On April 30, 2014, we presented our poster to faculty and students during the assembly block. I am glad that many people stopped by and asked me questions. After analyzing my data, we concluded that most birds acquired the infection independently, while some acquired it through bird-to-bird transmission (not shown in the tree). What made me even happier was that my mentor was able to come to my presentation, and he also enjoyed the other students' presentations.
Fortunately, I was also able to share my experience in one of the Academy classes over the bicentennial weekend. The group asked a lot of questions, and I was really happy to share my experience with the STEM program. I think the fact that Emma Willard is developing this Signature Program is incredible, and I am glad to be one of the earliest participants. After 2 years as a STEM intern, I'm more certain of what I want to do in the future, and I hope the program can help more students find their own passions!
Monday, April 21, 2014
18 Samples + ATCC Reference Strain
I didn't meet with my mentor last week because I was on a college visit. Nonetheless, the week before (April 11, 2014), I attempted to make several BEAST trees with 6 different BEAUti settings (no time included). I would totally include some visualizations since I had some cool pictures, except that my laptop decided to crash and everything on it was lost (all the data runs were probably too demanding for the computer, oops). So I will try to explain what I did in words. Please refer to my previous posts if needed!
I mainly focused on the effect of the clock model and the site model on the tree. Although I didn't actually look at the output tree visualization, I examined the quality of my output data in Tracer (see my previous posts). Out of the 6 runs, only one looked decent (approx. Normal), with the Strict Clock model and the GTR site model. So I went on to include the sampling times in BEAUti and ran it in BEAST, yet the resulting tree didn't look so good.
However, my mentor gave me a new input file in which our previous reference strain, MAV104, was replaced by an ATCC strain. MAV104, which was isolated from an AIDS patient, was too distant from our M. avium samples, whereas the ATCC strain originated in birds as well. Therefore, I ran the new data on BEAST (time included) as well as on GARLI and RAxML, where the trees were built based exclusively on the sequences. Something promising happened! The BEAST tree looks almost exactly like the RAxML tree with a few small variations, which is reasonable since BEAST builds a tree based on both sequences and time. I will go ahead and analyze the GARLI tree to see if I get the same resemblance.
In the meantime, I am starting to work on my final poster! Whoo, the time has gone by so quickly! I just got my new emergency laptop and am trying to reinstall a lot of software. Thankfully, I ran most of my data online, so I can still download the results to my computer. The thing that will take a while is the actual pictures of the trees, which I will try to rebuild over the next few days.
Finally, since there is no visualization in this post, I would like to share a picture from my college visit in LA! This is me standing next to my future school mascot - the Bruin! Can you guess where I will be spending my next 4 years? Yes - UCLA!!!! I am so glad that the application process is finally over and I can't wait for my college life!!!
Sunday, April 6, 2014
Spring:)
Last Friday (April 4, 2014), I finally met with my mentor again after so long. However, we had a shorter meeting because I had a flight to catch that evening to Atlanta, GA for a college visit. Also, since I have missed several posts (I have been travelling a lot this month!), I will include what has been going on in this post. He forwarded me several emails over the break to keep me updated about his conversations with the zoo people. Over the break, the zoo people reassembled many of their mycobacterium sample sequences to increase the number of signals and to eliminate sequencing errors. Meanwhile, my mentor has been assessing the sequence alignment of the samples, and he did find some assembly errors: one sequence contains an extra 1000 bp and another contains an extra 99 bp. Additionally, he found 3 pairs of identical sequences, so we will later eliminate the duplicates.
Before Friday, I edited the 21-sample file Dr. Miller sent me and ran it on BEAST with the default settings. However, the graph in Tracer did not look so good: it had 3 bumps and was skewed to the left overall instead of having the nice Normal shape we want, so I did not proceed to making the tree. I showed my result to my mentor, and we will just keep testing different settings.
Meanwhile, he showed me what he has been doing over the break: comparing each sample sequence to a reference sequence, in this case MAV104, using software called Mauve to assess the validity of each sample sequence and how similar to or different from the reference it is. To align the sequences, I clicked Align with progressiveMauve and added two sequences, MAV104 (reference) and myc01.
The top one is the reference sequence and the bottom one is our sample (myc01). Each colored segment represents a contig, which is a set of overlapping DNA segments that together represent a consensus region of DNA. The bottom sequence is two-sided simply because when we sequenced the sample, some pieces were copied from the positive strand and others from the negative strand. Thus, it is useful to reorder the contigs and convert them all to the positive strand, which you can do by selecting Tools --> Move Contigs.
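Converting a contig from the negative strand to the positive strand amounts to taking its reverse complement. Here is a minimal Python sketch of just that one step. The function name and the toy sequence are made up for illustration, and this is not how Mauve itself is implemented.

```python
# Hypothetical helper: rewrite a contig that was read from the negative
# strand so that it is in positive-strand orientation.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Swap each base for its partner, then read the result backwards.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    toy_contig = "ATGCCG"  # toy sequence, not from our samples
    print(reverse_complement(toy_contig))
```

Applying the function twice gives back the original sequence, which is a handy sanity check.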
After the reordering, we can really see how similar our sample sequence is to the reference. The white gaps simply mean that certain places do not match. However, this image does not show the exact order of the contigs in our sample, since we've reordered them. A sequence may be preserved in bacteria and viruses without staying at exactly the same site, since these small organisms have a great ability to rearrange their DNA.
This week my assignment is to compare several more contigs with the reference to get familiar with Mauve while trying to reinforce our tree. I will continue working on our tree with different BEAUti settings, including a relaxed clock and UPGMA starting trees.
Thursday, March 20, 2014
Tree Comparison
Right before the break (early March), Dr. Miller and I had been working on constructing the 26-sample tree using various programs (see previous posts). After a lot of trial and error, we finally got decent trees from each program. However, since the output trees looked very different in each program, we used a program called TreeGraph to standardize and compare them. When loading the files into the software, you have to turn each file into a NEXUS file, which we did simply by adding .nex to the file name. Below is the comparison between the 4 trees:
Fortunately, the trees from GARLI and MrBayes looked exactly the same, while the other two resembled them without great differences. This gave us a pretty good picture of the time-included tree that we will be working on later. This is a rather short post because the process was so lengthy and complicated that I don't think I would be able to explain it comprehensively here.
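We compared the trees by eye in TreeGraph, but the same idea can be put into a number: count the groups of samples (clades) that appear in one tree and not in the other. Below is a rough Python sketch of that count, in the spirit of a Robinson-Foulds distance for rooted trees. The nested-tuple tree format, the function names, and the toy trees are all my own assumptions for illustration, not the output of any of the programs above.

```python
# A tree is written as nested tuples; a leaf is just a sample name (str).
# This toy format is an assumption for the sketch, not a real file format.

def collect_clades(tree):
    """Return the set of clades (frozensets of leaf names) in a tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        # An internal node's clade is the union of its children's leaves.
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def clade_difference(tree_a, tree_b):
    """Count the clades that appear in exactly one of the two trees."""
    return len(collect_clades(tree_a) ^ collect_clades(tree_b))


if __name__ == "__main__":
    # Toy topologies only; these are NOT our real results.
    tree_1 = (("myc01", "myc02"), ("myc16", "myc23"))
    tree_2 = (("myc23", "myc16"), ("myc02", "myc01"))  # same tree, drawn differently
    tree_3 = (("myc01", "myc16"), ("myc02", "myc23"))
    print(clade_difference(tree_1, tree_2))
    print(clade_difference(tree_1, tree_3))
```

A result of 0 means the two trees group the samples identically even if they are drawn in a different order, which is what "looked exactly the same" means for a pair of trees.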
Time is going fast, and it'll be April when we return from the break! Hopefully we will be able to get some work done before the year ends!
In order: RAxML, GARLI, MrBayes, BEAST
Sunday, February 23, 2014
More Trees with GARLI and RAxML
Last Friday (February 21, 2014), I didn't meet with my mentor because he was out of town. Nonetheless, I continued making trees with GARLI (using the right one this time) and RAxML.
The numbers on the branches are the bootstrap results. In this trial, I set the number of bootstrap replicates to 50. Basically, the bigger the number, the greater the support we have for that particular branch, with a maximum of 50. Though overall the tree looks pretty decent, notice that in this tree we have only a little support for myc 1, 2, 5, 23, 16, and 30.
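Since the maximum depends on how many replicates were run, it can help to turn the raw numbers into percentages before comparing trees from different runs. A tiny Python sketch, where the function name and the example counts are made up for illustration:

```python
def support_percent(count, replicates):
    """Convert a raw bootstrap count into a percentage of replicates."""
    if replicates <= 0:
        raise ValueError("replicates must be positive")
    if not 0 <= count <= replicates:
        raise ValueError("count must be between 0 and replicates")
    return 100.0 * count / replicates


if __name__ == "__main__":
    # Example counts are hypothetical, not read off our tree.
    for count in (50, 25, 5):
        print(count, "of 50 replicates ->", support_percent(count, 50), "%")
```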
Next, I moved on to running the samples with RAxML. The branch lengths varied significantly, and I am still trying to understand what that implies. The maximum bootstrap number here is 100. While most of the numbers were pretty big, the numbers for some branches were still very small (split to the extreme). This lack of support was due to identical sequences: myc01 is identical to myc02, whereas myc16 is identical to myc23.
Therefore, the next thing I did was to run both GARLI and RAxML again with the identical sequences eliminated so that they wouldn't confuse the programs.
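Spotting and dropping identical sequences can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the short sequences below are made-up stand-ins (the real alignment is far longer). Only the fact that myc01 matches myc02 and myc16 matches myc23 comes from our data.

```python
from collections import defaultdict


def group_identical(alignment):
    """Return lists of sample names that share exactly the same sequence."""
    groups = defaultdict(list)
    for name, seq in alignment.items():
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


def drop_duplicates(alignment):
    """Keep only the first sample seen for each distinct sequence."""
    seen = set()
    kept = {}
    for name, seq in alignment.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept[name] = seq
    return kept


if __name__ == "__main__":
    # Toy sequences, invented for this sketch.
    toy = {
        "myc01": "ACGTAC",
        "myc02": "ACGTAC",
        "myc05": "ACGTAT",
        "myc16": "ACCTAC",
        "myc23": "ACCTAC",
    }
    print(group_identical(toy))
    print(list(drop_duplicates(toy)))
```

Keeping the first sample of each group is an arbitrary choice; which duplicate to keep may matter once sampling dates are involved.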
This time the resulting trees all had branches with very high bootstrap numbers, suggesting that our trees were very well supported.
The next thing I am going to do is to run the samples with MrBayes and compare the resulting topologies. I am finally meeting with my mentor this coming Friday, and hopefully we'll discuss the results some more!
GARLI_1
GARLI_3 with identical samples eliminated.
RAxML_2 with identical samples eliminated.
Sunday, February 16, 2014
Examining Different Trees Using Different Programs
Last Friday (February 14, 2014), my mentor wasn't able to come because his flight was cancelled due to the snowstorm. However, we did chat through Skype and accomplished some work via email. To continue our tree making, after reading several publications, we decided to run the larger data set (26 samples) all at once, since that might be more accurate. Wayne, one of the zoo people, sent us an updated genomic sequence of the samples, this time with more identified SNPs. We also want to examine the overall topology before taking time into consideration, so that we can first get a sense of what our tree would look like. Thus, we decided to look at different tree visualization tools and software for describing the differences between any two phylogenetic trees.
The programs I will be exploring in addition to BEAST are RAxML, GARLI, and MrBayes. I ran GARLI first, and did it on my mentor's website, CIPRES. The whole process took a while and is too complicated to describe here. Basically, our goal was to find a way to get a majority-rule consensus tree from the GARLI output. I later found out that I had chosen the wrong tool to run my data (there were several GARLI choices), but I still went ahead and analyzed the data. I converted the output to a NEXUS file so it could be read in Archaeopteryx, a powerful tree visualization tool that supports many file formats. My final tree looked like this:
I won't know how good this tree is until I make more with other tools so that I can compare them. I will also run GARLI again, using the right tool this time!
Tuesday, February 4, 2014
A Week Off
Friday, January 24, 2014
Small Mycobacterium Trees!
January 24, 2014. Last Friday I didn't meet with my mentor because he was out of town. Yet, over the week I was able to successfully run BEAST on my mentor's website (finally!!). Today we made our first attempt to generate a tree for the small mycobacterium data set!
The two main outputs of BEAST are the .log and the .trees files. First we analyzed our data by using a program called Tracer. After I imported the .log file...
One of the most important columns is the effective sample size (ESS) on the left. ESS is the number of independent samples that the trace is equivalent to, and it can help identify autocorrelation in our samples that might result from poor mixing. Ideally the number should be above 200. To get there, I eventually went back to BEAUti and changed the chain length to 5,000,000 with a sampling step of 2000 (that increases our sample size in the trace to 2500). It is also ideal for our graph to look Normal (this one is pretty good:)).
After confirming that our data converged to a stable posterior distribution, I used TreeAnnotator to summarize the information from a sample of trees (we have 2500) produced by BEAST onto a single “target” tree. The output of TreeAnnotator is a .nex file, which I then loaded into FigTree for visualization.
Voila, my first tree! Everything looked good except the part in red. In our data, myco16 and 23 were genetically identical. However, the sampling times of 23 and 30 were closer together, making them closer together on the tree. Therefore, we wondered how much the dates vs. the DNA weigh in Bayesian trees. I went back to BEAUti and excluded myco16 from the taxa so that each sequence corresponds to only one sample. The order of the samples on my second tree was good, except this time the "time" (in days) was way too large!
I eventually spent the rest of the time going back and forth to see which settings in BEAUti have which effects on my tree. Again, I generated at least 10 files in this process, including some that failed in the BEAST run (I had to go over ALL the steps from BEAUti to FigTree). However, at least we were finally able to get a sense of what our tree would look like:) More trees next week!
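To make the Tracer numbers in this post a bit more concrete, here is a small Python sketch of the arithmetic behind the sample count and of one simple way to estimate an effective sample size from a trace. It is only an illustration under my own assumptions: the function names are made up, the traces are simulated rather than taken from a real .log file, and Tracer's own ESS calculation may well differ from this basic autocorrelation formula.

```python
import random


def n_logged_samples(chain_length, log_every):
    """Number of states written to the log for a given chain length."""
    return chain_length // log_every


def effective_sample_size(trace):
    """Rough ESS: n / (1 + 2 * sum of positive-lag autocorrelations).

    The sum stops at the first non-positive autocorrelation. This is a
    simple textbook-style estimator, assumed here for illustration only.
    """
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    var = sum(c * c for c in centered) / n
    if var == 0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


if __name__ == "__main__":
    print(n_logged_samples(5_000_000, 2000))  # 2500, as in the post

    random.seed(0)
    # Simulated traces: one well mixed, one strongly autocorrelated.
    well_mixed = [random.gauss(0, 1) for _ in range(1000)]
    sticky = [0.0]
    for _ in range(999):
        sticky.append(0.95 * sticky[-1] + random.gauss(0, 1))
    print(effective_sample_size(well_mixed))
    print(effective_sample_size(sticky))
```

The autocorrelated trace has the same number of logged samples but a much smaller ESS, which is why a longer chain can be needed to get past 200.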
Saturday, January 11, 2014
First Meet in Second Semester!
January 10, 2014. Happy New Year everyone!! After enjoying some home time, I had my first meeting after so long! However, I felt that I wasn't in my best condition, since the jetlag made me a bit dizzy throughout the meeting. Nonetheless, since we have been falling behind for a month, we decided to start running BEAST with the smaller data set we got from the zoo people.
We started making our input file with BEAUti. The data consisted of 6 samples. I had previously converted them into a NEXUS file, which was the only format accepted by BEAUti. Dr. Miller sent me a paper (estimating the divergence time of viruses is close to estimating that of bacteria) and suggested that we compare and discuss our work after working separately for a while.
So, for my part, I entered the dates of my samples (the day each was sampled) in months (since Jan. 2001). Then I moved on to setting the substitution model to the one suggested in the paper - HKY. I would like to test out different models later, as I have read about several other combinations that would fit our data. But for our first run I would just stick with the tutorial.
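Turning a calendar date into "months since Jan. 2001" is simple arithmetic, sketched below in Python. The function name and the example dates are hypothetical (they are not our real sampling dates), and counting whole months while ignoring the day of the month is my own simplifying assumption.

```python
from datetime import date

ORIGIN = date(2001, 1, 1)  # "month 0" is January 2001


def months_since_origin(sampled, origin=ORIGIN):
    """Whole months between the origin month and the sampling month."""
    return (sampled.year - origin.year) * 12 + (sampled.month - origin.month)


if __name__ == "__main__":
    # Hypothetical sampling dates, for illustration only.
    examples = {
        "sample_A": date(2001, 1, 15),
        "sample_B": date(2002, 3, 1),
        "sample_C": date(2005, 11, 30),
    }
    for name, sampled in examples.items():
        print(name, months_since_origin(sampled))
```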
Tips Lane - sampling date
As for the Clock lane, I set a strict clock (constant rate) because of the low-diversity data we are analyzing. Next is the Tree Prior. The Priors panel allows the user to specify informative priors for all the parameters in the model, which can be helpful or burdensome, especially if no obvious prior distribution for a parameter exists (like in our case). Thus, we will need to try out different settings later. My initial settings are shown below.
Exponential because bacteria usually grow exponentially
When all was ready, I clicked generate BEAST file. However, when I uploaded the file to run BEAST, it repeatedly gave an error! For some reason it said that my settings resulted in a prior probability of 0. One possible reason, which I read about in the BEAST user group online, was that "BEAST starts with a randomly generated tree and if you set tight priors for the parameters, the starting tree may have an extremely low probability which makes it impossible for BEAST to proceed." I went back to the Tree Prior panel and changed the random starting tree to a UPGMA starting tree. While it ran fine on my computer initially, it terminated with an error the next time. After going back and forth, I generated 7 files in total!! Still not working. We decided to make our own starting tree using my mentor's website next time. Yet, the whole process kind of gave me a headache from staring at the screen for 3 hours. To be honest, I am quite intimidated by the computer programs, as we couldn't really understand the math behind those parameters well enough to fully understand what we were doing. Then again, those tools are indispensable to many scientists even though they did not invent them themselves. I can only hope that they will start making more sense along the way. Trial and error - GO SCIENCE!!