Disclaimer: This is intended to be a short introduction to the idea of phylogenetically structured data. My main goal is to make sure you know that you have to be careful when doing statistics on species data. If you want an introduction to phylogenetic comparative methods (a rich subfield of evolutionary biology) I recommend starting with Luke Harmon’s free online textbook or any of many other texts on the topic (or workshop notes online or here).

Comparative Data

This lesson is focused on how to deal with “phylogenetically structured” data. For example, imagine that we have some trait data for a set of species of bacteria, perhaps about mean cell size and minimum generation time. It is often supposed that larger cells are associated with rapid maximal growth rates. It seems that it would be a simple matter to test this association - one could simply calulate the correlation between the two variables. In fact, if you proceed without caution you run the risk of obtaining a significant, but spurious statistical result.

Some other examples of research questions involving phylogenetically structured data:
- trying to associate pathogenicity with other traits (e.g., genome size)
- trying to associate the traits of different species with different habitat types (e.g., do beige things live in the desert and dark green things in the forst)

The problem is that most (simple) statistical tests (parametric and non-parametric) assume independent datapoints, and your data is anything but. Consider that the species in your dataset share an evolutionary history, leading their traits to share a hierarchical dependency structure. Any time you are making comparisons between sets of species, you need to account for phylogenetic structure. Felsenstein (1985) was the first to really articulate the severity of this problem and provide an appropriate statistical treatment (the original linked “phylogenetic independent constrasts” paper is very clearly written and I highly recommend it).

I’m going to take you through this issue by generating some example phylogenetically-structured datasets in R and showing you how statistical tests can fail on these data.

I want to emphasize that I am not just being nitpicky - it is very easy to get spurious results with comparative data. There is a great paragraph near the end of the original Felsenstein paper:

“Some reviewers of this paper felt that the message was”rather nihilistic," and suggested that it would be much improved if I could present a simple and robust method that obviated the need to have an accurate knowledge of the phylogeny. I entirely sympathize, but do not have a method that solves the problem. The best we can do is perhaps to use pairs of close relatives as suggested above, although this discards at least half of the data. Comparative biologists may understandably feel frustrated upon being told that they need to know the phylogenies of their groups in great detail, when this is not something they had much interest in knowing. Nevertheless, efforts to cope with the effects of the phylogeny will have to be made. Phylogenies are fundamental to comparative biology; there is no doing it without taking them into account." - Felsenstein (1985)

Before we start you will need to make sure you have several packages installed. The ape package is the general R package for dealing with phylogenies, and the phytools package has a ton of extra tools for manipulating and analyzing phylogenetic data. We will use a function in the caper package to check for phylogenetic signal and a function in the picante package to run phylogenetic generalized least squares (PGLS).

if (!require("ggplot2")){
  install.packages("ggplot2")
  require("ggplot2")
}
if (!require("ape")){
  install.packages("ape")
  require("ape")
}
if (!require("phytools")){
  install.packages("phytools")
  require("phytools")
}
if (!require("picante")){
  install.packages("picante")
  require("picante")
}
if (!require("caper")){
  install.packages("caper")
  require("caper")
}

Spurious Results With Phylogenetic Structure

Let’s start be generating some random data to work with. I am going to simulate the evolution of two traits on a phylogeny, and we are going to see how often those traits end up with correlated values even though they will have no real association (other than evolving on the same tree, that is).

First let’s make some phylogenies:

# Generate 9 random phylogenies
n_trees <- 9
n_tips <- 1000
trees <- replicate(n_trees, rcoal(n_tips), simplify = FALSE)
par(mfrow=c(3,3))
for(i in 1:n_trees){
  plot.phylo(as.phylo(trees[[i]]), show.tip.label = FALSE)
}

par(mfrow=c(1,1))

Now I am going to simulate the evolution of two traits on each of these phylogenies. We don’t need to get into it now, but I am using a simple Brownian Motion model of trait evolution that is pretty standard in the field. It results in normally-distributed changes in trait values along branches on the tree. Remember, I am going to let these traits evolve independently on the tree.

# Randomly evolve two traits on these trees
trait_df <- data.frame(Trait1 = numeric(0),
                       Trait2 = numeric(0),
                       Tree = numeric(0))
for(i in 1:n_trees){
  trait_df <- rbind(trait_df, 
                    data.frame(Trait1 = fastBM(trees[[i]]),
                               Trait2 = fastBM(trees[[i]]),
                               Tree = i))
}
ggplot(trait_df,aes(x = Trait1, y = Trait2)) + 
  geom_point() + 
  stat_smooth(method="lm") + 
  facet_wrap(vars(Tree), scales = "free")