---
title: "DateLife Workflow"
author: "Luna L. Sanchez Reyes"
date: "`r Sys.Date()`"
output:
      pdf_document
header-includes: \usepackage{graphicx}
graphics: yes
vignette: >
  %\VignetteIndexEntry{DateLife Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

DateLife is a service for searching and processing information on the ages of any
number of taxa <!--or group of taxa-->of interest, across chronograms that come from
published, peer-reviewed studies and are available in public data repositories. It can
also generate new taxon age information<!--by using expert available phylogenies and scaling them
to absolute time using information on ages obtained from published chronograms,--> by
linking several external services and tested algorithms.

It only requires a set of taxon names as input, in the form of a comma-separated
list or vector, or of a phylogeny with taxon names on the tips.
Taxon names can correspond to binomial species names or clades
<!--common names (will we implement it?)-->.
When taxon names are clades, DateLife pulls all accepted species names within the
clade from the reference taxonomy of the Open Tree of Life (OToL), up to OToL's limit
of _____ species, using a service of the rphylotastic R package<!--citation-->.
Names belonging to subspecies or any other infraspecific category are treated as
species.
DateLife can process input names with the taxon name resolution service (TNRS),
<!--which "corrects misspelled names and authorities, standardizes variant spellings,
and converts nomenclatural synonyms to accepted names" (Boyle et al. 2013)-->
which corrects misspelled names or typos, and standardizes spelling variants and
synonyms [@Boyle2013], increasing the probability of correctly finding the queried
taxa in the chronogram database. DateLife uses TNRS to compare names against OToL's
reference taxonomy using a service from the R package rotl [@Michonneau2016].
<!--we'll implement the option to use other reference taxonomies in the future -->
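
As a minimal sketch of the first input form, a comma-separated listing can be turned into a character vector of taxon names with base R before any name resolution is attempted. The helper `parse_taxon_input()` is hypothetical and is not a DateLife function; it only illustrates the shape of input described above.

```r
# Hypothetical helper (not a DateLife function): turn a comma-separated
# listing of taxon names into a clean character vector.
parse_taxon_input <- function(x) {
  # Split on commas; works for one string or for a vector of strings
  parts <- unlist(strsplit(x, ",", fixed = TRUE))
  # Treat underscores as spaces and trim surrounding whitespace
  parts <- trimws(gsub("_", " ", parts, fixed = TRUE))
  # Drop empty entries and duplicates
  unique(parts[nzchar(parts)])
}

taxa <- parse_taxon_input("Rhea americana, Pterocnemia pennata,Struthio_camelus")
print(taxa)
```

The resulting vector is what a name resolution step such as TNRS would then receive.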

DateLife's main function searches taxon names across the chronogram database specified
by the user. At the moment, it queries chronograms from the OToL [@Hinchliff2015]
<!-- it cannot query chronograms from TreeBASE -there is an earlier reference from
1994 but it does not exist in its journal database: Sanderson MJ, Donoghue MJ, Piel
WH, Eriksson T: TreeBASE: a prototype database of phylogenetic analyses and an interactive
tool for browsing the phylogeny of life. Am J Bot. 1994, 81 (6): 183--> repository.
DateLife identifies chronograms containing at least two of the queried taxon names, and
subsets them to contain only the taxa of interest. It then stores taxon age information
from each chronogram individually as a patristic matrix, named with the citation of the
original study. This format allows a rapid summary in a number of different ways,
including:

1) citations of the original studies containing the subset chronograms,
2) a list of the ages of the most recent common ancestor (MRCA) of each subset chronogram,
3) a list of complete subset chronograms in Newick or `phylo` format,
4) a table containing all information retrieved, in HTML or R's data frame format, or
5) a single chronogram summarized from subset chronograms using the Super Distance Matrix (SDM) supertree construction approach [@Criscuolo2006] or using the median of branch lengths. <!--(how to better explain the latter?? Also, I just discovered the median gives non ultrametric trees...).-->
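
To illustrate why patristic matrices are convenient to summarize, the sketch below uses two small hand-made matrices for the same three taxa. It relies on one assumption: the chronograms are ultrametric, so the patristic distance between two tips is twice the age of their most recent common ancestor. The study names and distances are made up, and the element-wise median is shown only as one simple way of combining matrices, not as DateLife's exact procedure.

```r
# Toy patristic matrices (in million years) for the same three taxa, as if
# they came from two different studies. Names and values are made up.
taxa <- c("A", "B", "C")
study1 <- matrix(c( 0, 10, 40,
                   10,  0, 40,
                   40, 40,  0),
                 nrow = 3, byrow = TRUE, dimnames = list(taxa, taxa))
study2 <- matrix(c( 0, 14, 50,
                   14,  0, 50,
                   50, 50,  0),
                 nrow = 3, byrow = TRUE, dimnames = list(taxa, taxa))
patristic <- list("Study 1 (placeholder)" = study1,
                  "Study 2 (placeholder)" = study2)

# In an ultrametric tree both tips of a pair are equally distant from their
# MRCA, so the age of the deepest node is half the largest patristic distance.
mrca_ages <- sapply(patristic, function(m) max(m) / 2)
print(mrca_ages)

# Element-wise median across studies: stack the matrices into a 3 x 3 x 2
# array and take the median over the third dimension.
stacked <- array(unlist(patristic), dim = c(3, 3, length(patristic)))
median_matrix <- apply(stacked, c(1, 2), median)
dimnames(median_matrix) <- list(taxa, taxa)
print(median_matrix)
```

Because every matrix is named after its study, the names of `mrca_ages` double as the list of source citations.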

DateLife also stores information on input taxon presence/absence across subset chronograms.
Users can choose to add ages of missing taxa to subset chronograms in different
ways, depending on the amount of knowledge they want to input or how much they want
to be involved in the steps of the addition process.
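
The presence/absence bookkeeping can be pictured with a small base R sketch. The taxa, the study names and the tip labels below are toy data, and the object names are illustrative rather than taken from DateLife.

```r
# Queried taxa and the tip labels found in each subset chronogram (toy data;
# study names are placeholders).
input_taxa <- c("A", "B", "C", "D")
chronogram_tips <- list(
  "Study 1 (placeholder)" = c("A", "B", "C"),
  "Study 2 (placeholder)" = c("B", "D")
)

# Rows are chronograms, columns are input taxa; TRUE marks presence
presence <- t(sapply(chronogram_tips, function(tips) input_taxa %in% tips))
colnames(presence) <- input_taxa
print(presence)

# Taxa that would have to be added to each chronogram
missing_taxa <- lapply(chronogram_tips, function(tips) setdiff(input_taxa, tips))
print(missing_taxa)
```
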
If users have no access to biological information (i.e., a character, DNA or protein
matrix), missing taxa can be added to any chronogram simply at random, or by following
taxonomic or phylogenetic knowledge from expert sources. Many open reference taxonomies
are available, such as the Catalogue of Life [@Roskov2017]
or the NCBI taxonomy database [@Federhen2012]. Expert phylogenies (with or without
branch lengths) to be used as a topological constraint (backbone) can also be obtained
from a number of public repositories, such as OToL [@Hinchliff2015], TreeBASE [@Piel2002]
and Dryad (<https://www.datadryad.org//>).
At the moment, DateLife only uses OToL's synthetic tree and reference taxonomy as
expert knowledge to automatically add missing taxa to chronograms. Alternatively,
users can input a reference taxonomy or topological constraint of their choosing
or making. <!-- at some point we will implement the possibility for users to choose
different reference taxonomies automatically-->
If OToL's synthetic tree is not satisfactorily resolved for the taxa of interest,
DateLife can construct a sequence data matrix from DNA markers available from the
Barcode of Life Database [BOLD; @Ratnasingham2007] to attempt to resolve polytomies
further, using OToL's synthetic tree as a backbone.
<!--This molecular tree can then be dated with chronosMPL from the ape package, but is it really needed?? To estimate node ages chronosMPL uses mean path length method from Britton et al. 2002-->
<!--It can rapidly construct trees with branch lengths from
a set of lineages with sequence data from the Barcode Of Life Database (BOLD, <http://www.boldsystems.org/>), using the synthetic OToL (<https://tree.opentreeoflife.org>) as backbone, and to obtain divergence dates.
It allows direct comparison of dates obtained with different markers
available in BOLD (in plants and fungi in particular).-->
To use information from a topological constraint, DateLife calls the congruification
method described by @Eastman2013 to find shared nodes between trees (congruent
nodes). It then fixes their ages, and adds ages to the remaining nodes with a dating
method that can be specified by the user. If users have access to biological data,
they can input a tree with branch lengths proportional to relative substitution
rates as a topological constraint. In this case, age data from congruent nodes will
be used as calibration points.
Age data from several chronograms can be combined and congruified to be used as
calibration points in a single analysis.
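
The idea of congruent nodes can be sketched in base R by describing each internal node by the set of tips it subtends. This is a deliberately simplified illustration, not the congruification algorithm itself: it assumes that a node in the constraint tree matches a node in the chronogram when both subtend the same tips once only shared taxa are considered, and it ignores the case where several constraint nodes collapse onto the same set. All node names, tip sets and ages are made up.

```r
# Each tree is represented by the tip sets of its internal nodes (toy data).
chronogram_clades <- list(n1 = c("A", "B", "C", "D"), n2 = c("A", "B"))
chronogram_ages   <- c(n1 = 50, n2 = 12)   # made-up node ages
constraint_clades <- list(
  m1 = c("A", "B", "C", "D", "E"),
  m2 = c("A", "B", "E"),
  m3 = c("C", "D")
)

# Taxa present in both trees
shared <- intersect(unlist(chronogram_clades), unlist(constraint_clades))

# Canonical key for a clade: its shared tips, sorted and pasted together
clade_key <- function(tips) paste(sort(intersect(tips, shared)), collapse = "|")

chron_keys <- vapply(chronogram_clades, clade_key, character(1))
const_keys <- vapply(constraint_clades, clade_key, character(1))

# A constraint node is treated as congruent when its key matches a chronogram
# node; it then inherits that node's age as a fixed age or calibration point.
hit <- match(const_keys, chron_keys)
calibrations <- chronogram_ages[names(chron_keys)[hit[!is.na(hit)]]]
names(calibrations) <- names(const_keys)[!is.na(hit)]
print(calibrations)
```

Here `m3` receives no age because the chronogram has no node subtending only `C` and `D`; its age would be left to the chosen dating method.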

Several dating methods are implemented in DateLife.
Branch Length Adjuster (BLADJ) is a simple algorithm to distribute ages of undated
nodes evenly, which minimizes age variance in the chronogram [@Webb2008].
DateLife implements BLADJ from the development version of phylocomr
(<https://github.com/ropensci/phylocomr>), the R package for phylocom [@Webb2008].
It can only be used when there is a topological constraint with no branch lengths.
PATHd8 is a non-clock, rate-smoothing method for dating trees [@Britton2007]. It is
also called through R. <!--It could be used with a reference taxonomy, if we provide a tree with polytomies-->
treePL is a semi-parametric, rate-smoothing, penalized likelihood dating method
[@Smith2012]. It is called through R.
The MrBayes program [@Huelsenbeck2001; @Ronquist2003] can be used when adding taxa at
random, following a reference taxonomy or a topological constraint. It draws ages
from a pure birth model, as implemented by Jetz and collaborators [-@Jetz2012].
DateLife calls MrBayes through an R function.
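
The even spacing that BLADJ performs can be pictured on a single lineage. The sketch below is not the phylocom implementation; `space_ages_evenly()` is a hypothetical helper that only shows the arithmetic, assuming a run of undated nodes sits between an older and a younger node whose ages are fixed.

```r
# Hypothetical helper (not the phylocom implementation): place `n_undated`
# nodes at equal intervals between an older and a younger dated node.
space_ages_evenly <- function(age_old, age_young, n_undated) {
  # seq() returns both end points; drop them to keep only the undated nodes
  all_ages <- seq(age_old, age_young, length.out = n_undated + 2)
  all_ages[-c(1, length(all_ages))]
}

# Three undated nodes between a node fixed at 60 Myr and a tip at 0 Myr
print(space_ages_evenly(60, 0, 3))
```
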

DateLife can also correct negative branch lengths in several ways.
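
As a hedged illustration of what such a correction can look like, the sketch below flags negative branch lengths and truncates them at zero. This is only one simple option and is not necessarily among the ways DateLife applies; changing a branch length also changes the root-to-tip distance of every descendant tip, so the result may need further adjustment to remain ultrametric. The values are made up.

```r
# Toy vector of branch lengths, as might come from a summary tree (made up)
branch_lengths <- c(5.2, -0.3, 1.1, 0, -1e-8)

# Flag the offending branches, then truncate them at zero
is_negative <- branch_lengths < 0
fixed <- pmax(branch_lengths, 0)
print(which(is_negative))
print(fixed)
```
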