This lesson will teach you how to apply the MK-test using the iMKT R package. This package includes a number of methodological extensions to the MK-test that help correct for the influence of weak purifying selection which can mask the influence of positive selection. We will implement one of these, the Fay, Wycoff, and Wu (FWW) correction, which basically just removes low-frequency polymorphisms below some threshold (15% is a good baseline). Importantly, iMKT also has a useful web-interface where you can submit your genome alignments, which we will not be using (it is well documented on their website, but being able to run the test locally is useful when trying to automate large-scale analyses).

Preparing the Data

We are going to be working with an alignment of all genes from a number of Listeria monocytogenes genomes as well as an L. innocua outgroup, obtained from the ATGC database. The data comes in an inconvenient format, and I’ve included a helper script to extract alignments for individual genes which we will run in the command line from our main lesson folder:

cd lesson-mktest/
Scripts/reformatATGC_listeria.sh
## bash: line 0: cd: lesson-mktest/: No such file or directory

I won’t go into how this script works since it’s not particularly relevant unless you are using data from ATGC, but it will create a folder called COGs (clusters of orthologous genes) in your Data/ directory that will contain infividual fasta files corresponding to multiple alignments of each gene family. To get the analysis below to work with your own data you should produce a multiple alignment of you gene(s) of interest, with the last row in the alignment (entry in the fasta) being your outgroup. Importantly, iMKT requires that your outgroup always be the LAST entry, and that you have at least 4 sequences from your population to assess polymorphism (so 5 sequences total).

iMKT Input

Next we need to count synonymous and non-synonymous polymorphisms and substitutions. The iMKT package wants these numbers in a very specific format, and will not work otherwise. Conveniently, they have provided us with a python script to process our alignments for us. You MUST have python2 installed on your machine with the pandas and pyfaidx modules for this script to work.

python2 Scripts/sfsFromFasta_v2.py --multiFasta  Data/COGs/synCOG270381.fasta --daf my_example.daf --div my_example.div --codonTable standard

This script created two files summarizing our diversity, my_example.div and my_example.daf. If you want more details about these formats see the iMKT documentation on their website.

Inconveniently, this script doesn’t actually output data in the correct format. It seems that columns have been reordered (likely in an update of the program), so we need to rearrange the output of their python script. We use awk to shuffle columns around (for details on how to use awk see the Happy Belly Bioinformatics tutorial)

awk '{print $1"\t"$3"\t"$2}' my_example.daf > my_example_rearranged.daf
awk '{print $3"\t"$2"\t"$4"\t"$1}' my_example.div > my_example_rearranged.div

Running the Test

Finally, we can run the MK-test. Make sure you have already installed the iMKT package using devtools. I also use dplyr a lot so install that as well.

#Install dplyr
if("dplyr" %in% rownames(installed.packages()) == FALSE){install.packages("dplyr")}
#Make sure devtools installed
if("devtools" %in% rownames(installed.packages()) == FALSE){install.packages("devtools")}
# ## Install iMKT package from GitHub repository
if("iMKT" %in% rownames(installed.packages()) == FALSE){devtools::install_github("sergihervas/iMKT")}
library(iMKT)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

And now we simply load our input files into R and run the MK-test.

daf <- read.delim("my_example_rearranged.daf",
                  header = T,
                  stringsAsFactors = F)
div <- read.delim("my_example_rearranged.div",
                  header = T,
                  stringsAsFactors = F)
head(daf)
##     daf Pi P0
## 1 0.025  0  0
## 2 0.075  0  0
## 3 0.125  0  0
## 4 0.175  0  1
## 5 0.225  0  0
## 6 0.275  0  0
head(div)
##    mi D0 m0 Di
## 1 352  0 78  2
standardMKT(daf, div)
## Warning in checkInput(daf, divergence, 0, 1): Pi contains all values == 0.
## Warning in checkInput(daf, divergence, 0, 1): D0 == 0.
## $alpha.symbol
## [1] 1
## 
## $`Fishers exact test P-value`
## [1] 0.006535948
## 
## $`MKT table`
##                Polymorphism Divergence
## Neutral class            16          0
## Selected class            0          2
## 
## $`Divergence metrics`
##            Ka Ks omega omegaA omegaD
## 1 0.005681818  0   Inf    Inf    NaN

Voila! Positive selection! I’d actually be a little careful here, iMKT is warning you that you don’t have that many polymorphisms/substitutions in a couple cells in the table. In general, the more data you have the more polymorphisms you are going to get, which can be helpful.

But wait, there’s more. Positive selection can be masked by weak purifying selection in a gene. By removing low-frequency polymorphisms we can get a better picture of positive selection in a gene.

FWW(daf, div, listCutoffs = c(0, 0.15))
## Warning in checkInput(daf, divergence, 0, 1): Pi contains all values == 0.
## Warning in checkInput(daf, divergence, 0, 1): D0 == 0.
## $Results
##                alpha.symbol Fishers exact test P-value
## Cutoff =  0               1                0.006535948
## Cutoff =  0.15            1                0.006535948
## 
## $`Divergence metrics`
## $`Divergence metrics`$`Global metrics`
##            Ka Ks omega
## 1 0.005681818  0   Inf
## 
## $`Divergence metrics`$`Estimates by cutoff`
##                omegaA.symbol omegaD.symbol
## Cutoff =  0              Inf           NaN
## Cutoff =  0.15           Inf           NaN
## 
## 
## $`MKT tables`
## $`MKT tables`$`Cutoff =  0`
##                Polymorphism Divergence
## Neutral class            16          0
## Selected class            0          2
## 
## $`MKT tables`$`Cutoff =  0.15`
##                Polymorphism Divergence
## Neutral class            16          0
## Selected class            0          2

We didn’t really see a difference here (since we were already seeing quite a lot of positive selection in this gene). But the difference will become apparent in a moment when we look across the whole genome. Finally, I want to note that you could have just uploaded the gene alignment to the iMKT website to get the same result.

Automating a Genome-Wide Scan

We want to search for selection in all 1463 gene alignments we have for L. monocytogenes.

First we will generate our input files using a for loop in bash.

# First make a list of all your fasta files
cd Data/COGs
find -type f -name "*.fasta" | awk 'gsub("./","")' > ../cogs.txt

# Now iterate over these files
cd ../
mkdir iMKTinput
while read COG; do
    #Run python script
    python2 ../Scripts/sfsFromFasta_v2.py --multiFasta COGs/$COG --daf iMKTinput/$COG.daf.tmp --div iMKTinput/$COG.div.tmp --codonTable standard
    #rearrange output
    awk '{print $1"\t"$3"\t"$2}' iMKTinput/$COG.daf.tmp > iMKTinput/$COG.daf
    awk '{print $3"\t"$2"\t"$4"\t"$1}' iMKTinput/$COG.div.tmp > iMKTinput/$COG.div
    #delete intermediate files
    rm iMKTinput/$COG.daf.tmp
    rm iMKTinput/$COG.div.tmp
done < cogs.txt

# List output
cd iMKTinput
find -type f -name "*.daf" | awk 'gsub("./","")' > ../daf_files.txt

Now we will repeatedly run the MK test.

dafs <- readLines("Data/daf_files.txt")
mkt_list <- list()
FWW_list <- list()
for(i in 1:length(dafs)){
  daf_file <- paste0("Data/iMKTinput/",
                     dafs[i])
  div_file <- gsub(".daf",".div",daf_file)
  daf <- try(read.delim(daf_file,
                    header = T,
                    stringsAsFactors = F))
  div <- try(read.delim(div_file,
                    header = T,
                    stringsAsFactors = F))
  
  mkt_list[[i]] <- try(standardMKT(daf, div))
  FWW_list[[i]] <- try(FWW(daf, div, listCutoffs = c(0, 0.15)))
}

In order to unpack our output into a nice dataframe we will define a couple of helper functions

getP <- function(x){
  if(!inherits(x,"try-error")){
    x$Results$`Fishers exact test P-value`[2] %>% return()
  } else {
    return(NA)
  }
}
getA <- function(x){
  if(!inherits(x,"try-error")){
    x$Results$alpha.symbol[2] %>% return()
  } else {
    return(NA)
  }
}

Finally we will unpack out outputs.

# MK-test
p_vals <- data.frame(p=mkt_list %>% lapply("[",2) %>% unlist(),
                     alpha=mkt_list %>% lapply("[",1) %>% unlist(),
                     gene=dafs %>% gsub(pattern=".daf",replace=""),
                     stringsAsFactors = FALSE) %>%
  subset(!is.na(p))
# FWW correction
p_vals_FWW <- data.frame(p=FWW_list %>% lapply(getP) %>% unlist(),
                     alpha=FWW_list %>% lapply(getA) %>% unlist(),
                     gene=dafs %>% gsub(pattern=".daf",replace=""),
                     stringsAsFactors = FALSE) %>%
  subset(!is.na(p))

Notice that I am removing NA values where there weren’t sufficient polymorphisms in the gene to perform the test (more data might help solve this). We should probably also correct for multiple testing.

# MK-test
p_vals$p.adj <- p.adjust(p_vals$p, method = "BH")
# FWW correction
p_vals_FWW$p.adj <- p.adjust(p_vals_FWW$p, method = "BH")

Finally, observe that with the FWW correction more genes have significant evidence of selection.

# MK-test
sum(p_vals$p < 0.05)
## [1] 8
sum(p_vals$p.adj < 0.05)
## [1] 3
# FWW correction
sum(p_vals_FWW$p < 0.05)
## [1] 12
sum(p_vals_FWW$p.adj < 0.05)
## [1] 3

And now it’s up to you. Do these genes do interesting things? Could selection be a spurious result of demographic changes? Not my problem, but it could be yours. For further reading on this and other tests for selection I highly recommend Matthew Hahn’s Book: Molecular Population Genetics.

interesting_genes <- p_vals_FWW %>% subset(p.adj < 0.05)
head(interesting_genes)
##                 p     alpha               gene        p.adj
## 535  9.381743e-05 1.0000000 synCOG272278.fasta 4.847234e-03
## 1211 5.981631e-17 0.9798712 synCOG270333.fasta 9.271529e-15
## 1251 6.410585e-05 0.9807692 synCOG270673.fasta 4.847234e-03