The point of this lesson is to show that you can build a time-lagged network in R, no LSA program required. This skips the “local” element of Local Similarity Analysis, but in my own research I found I was electing just to do time-lagged Spearman networks anyway. One could use this same approach to build a time-lagged SparCC network, or to apply most other statistics.
This may or may not work for graphical lasso approaches.
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.2.1     ✓ purrr   0.3.3
✓ tibble  2.1.3     ✓ dplyr   0.8.3
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.4.0
── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
x purrr::compose()       masks igraph::compose()
x tidyr::crossing()      masks igraph::crossing()
x dplyr::filter()        masks stats::filter()
x dplyr::groups()        masks igraph::groups()
x dplyr::lag()           masks stats::lag()
x purrr::simplify()      masks igraph::simplify()
library(lubridate)
Attaching package: ‘lubridate’
The following object is masked from ‘package:igraph’:
%--%
The following object is masked from ‘package:base’:
date
library(zoo)
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
library(SpiecEasi)
library(psych)
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
The following objects are masked from ‘package:SpiecEasi’:
cor2cov, shannon
library(igraph)
Let's bring in some time-series data from the SPOT dataset. These are ARISA fragments; their lengths are associated with species identity. I downloaded the data here:
https://www.bco-dmo.org/dataset/535915
And then removed some variables that we are not using yet.
The data set is somewhat large, so I've zipped it to save space.
Fortunately, R can read zipped csv files just fine. Some of the months are missing ARISA data and are loaded in as “nd”; we can remove those later, but we tell R about them so it doesn’t freak out.
# read in data
spotSurface <- read_csv("SPOT/spot_arisa_surface.zip", na = "nd") %>%
# treat date column as a date, rather than a character string
mutate(date_local = ymd(date_local))
Parsed with column specification:
cols(
  date_local = col_date(format = ""),
  arisa_frag = col_character(),
  rel_abund = col_double()
)
How many times do we see each taxon? How abundant are they? We filter the taxa a lot if we only keep things with at least five sightings and a mean relative abundance of at least 0.01.
howMany <- spotSurface %>%
group_by(arisa_frag) %>%
summarize(sightings = sum(na.omit(rel_abund > 0)),
meanAbun = mean(na.omit(rel_abund))) %>%
arrange(-sightings)
keepTaxa <- howMany %>%
filter(sightings >= 5, meanAbun >= 0.01) %>%
pull(arisa_frag)
keepTaxa
[1] "ARISA_670.5" "ARISA_538.9" "ARISA_666.4" "ARISA_662" "ARISA_686.9" "ARISA_435.5" "ARISA_682.4" "ARISA_419.5"
[9] "ARISA_676.9" "ARISA_421.5" "ARISA_689.8" "ARISA_729.4" "ARISA_570.1" "ARISA_667.6" "ARISA_573.6" "ARISA_679.4"
[17] "ARISA_561.8" "ARISA_402.4" "ARISA_828.8" "ARISA_831.8" "ARISA_594.1"
spotSurface2 <- spotSurface %>% filter(arisa_frag %in% keepTaxa)
Let's reshape everything into a wide-format data frame.
spotWide <- spotSurface2 %>% pivot_wider(names_from = arisa_frag, values_from = rel_abund)
There are lots of months with missing data, and lots of missing months that don't even have rows. Let's update this so that we have a row for every month, even if it holds NA values.
spotWide2 <- spotWide %>%
na.omit %>%
mutate(yr = year(date_local), mth = month(date_local)) %>%
select(yr, mth, date_local, everything()) %>%
arrange(yr, mth)
goalDates <- tibble(
yr = rep(first(spotWide2$yr):last(spotWide2$yr), each = 12),
mth = rep(1:12, last(spotWide2$yr) - first(spotWide2$yr) + 1)
) %>%
mutate(filler_date = ymd(paste(yr, mth, 15, sep = "_"))) %>%
filter(filler_date > min(spotWide2$date_local) &
filler_date < max(spotWide2$date_local)) %>%
select(-filler_date)
spotWideAllMonths <- left_join(goalDates, spotWide2)
Joining, by = c("yr", "mth")
Ok. So we’re going to address this missing-not-at-random data in a way that critics might call reckless: linearly interpolating it.
spotInterp <- spotWideAllMonths %>%
mutate_at(vars(matches("ARISA")), na.approx)
spotInterp2 <- spotInterp %>%
mutate(date_local = if_else(
is.na(date_local),
ymd(paste(yr, mth, 15)),
date_local)
) %>%
select(-c(yr, mth))
spotInterpMtx <- spotInterp2 %>%
column_to_rownames("date_local") %>%
as.matrix()
spotClrMtx <- clr(spotInterpMtx)
Ok, so with this dataset, we can do any of the things from the earlier lessons. We can also export it for local similarity analysis outside of R.
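The export step is just a csv write; here is a sketch with a toy stand-in matrix and an illustrative file name (in this lesson the object to export would be spotClrMtx):

```r
# sketch: write a samples-by-taxa matrix to csv for an external LSA program
# (toy stand-in matrix; in this lesson you would export spotClrMtx)
mtx <- matrix(rnorm(12), nrow = 4,
              dimnames = list(paste0("2019-0", 1:4, "-15"),
                              paste0("ARISA_", 1:3)))
outFile <- file.path(tempdir(), "matrix_for_lsa.csv")
write.csv(mtx, file = outFile)
```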
Ok, let's make a new matrix where the rows are still dates, but alongside the original taxon columns there is a second set of the same columns lagged by one month.
spotUnLag <- spotInterpMtx[-1,]
spotLag <- spotInterpMtx[-nrow(spotInterpMtx),]
colnames(spotUnLag) <- paste("nolag", colnames(spotUnLag), sep = "-")
colnames(spotLag) <- paste("lag1", colnames(spotLag), sep = "-")
spotWLag <- cbind(spotUnLag, spotLag)
Now we’ll correlate everything against everything. Keep in mind that the unlagged-vs-unlagged comparisons are missing one row (the first month drops out when we lag). We could get around this by doing everything in two batches.
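Here is a minimal sketch of that two-batch idea using base R's cor() on toy data (not the SPOT matrices): batch one correlates the full unlagged matrix with itself, so it keeps all the rows, while batch two correlates the aligned unlagged and lagged blocks.

```r
set.seed(1)
# toy time series: 24 months of 3 taxa
x <- matrix(runif(72), nrow = 24, dimnames = list(NULL, paste0("taxon", 1:3)))
unlagged <- x[-1, ]            # months 2..24
lagged   <- x[-nrow(x), ]      # months 1..23, row-aligned with unlagged
# batch 1: unlagged vs unlagged, using all 24 rows
batch1 <- cor(x, method = "spearman")
# batch 2: unlagged vs lagged, using the 23 aligned rows
batch2 <- cor(unlagged, lagged, method = "spearman")
```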
spearCorTestClr <- corr.test(spotInterpMtx, method = "spearman", adjust = "none")
spearCorClr <- spearCorTestClr$r
spearPClr <- spearCorTestClr$p
spearCorTestClr_Lagged <- corr.test(spotWLag, method = "spearman", adjust = "none")
spearCorClr_Lagged <- spearCorTestClr_Lagged$r
spearPClr_Lagged <- spearCorTestClr_Lagged$p
source("jacob_library.R")
Some data wrangling, as per my earlier lesson
reordered_spearCorTestClr_Lagged <- reorder_cor_and_p(spearCorClr_Lagged, spearPClr_Lagged)
spearCorClr_Lagged_Reordered <- reordered_spearCorTestClr_Lagged$r
spearPClr_Lagged_Reordered <- reordered_spearCorTestClr_Lagged$p
spearCorClr_Lagged_Proc <- spearCorClr_Lagged_Reordered %>% get_upper_tri() %>% reshape2::melt() %>% na.omit() %>% rename(rho = value)
spearPClr_Lagged_Proc <- spearPClr_Lagged_Reordered %>% get_upper_tri() %>% reshape2::melt() %>% na.omit() %>% rename(p = value)
spearRhoP_lagged <- left_join(spearCorClr_Lagged_Proc, spearPClr_Lagged_Proc, by = c("Var1", "Var2"))
spearRhoP_lagged
wrangle_lagged <- function(preWrangled, coef = "rho"){ # I need to allow the coefficient name to change for the sparcc stuff
# Initial wrangling
postWrangled <- preWrangled %>%
separate(Var1, c("V1", "Var1"), sep = "-") %>%
separate(Var2, c("V2", "Var2"), sep = "-") %>%
# get rid of rows where both variables are lagged
filter(!(V1 == "lag1" & V2 == "lag1")) %>%
filter(Var1 != Var2) %>%
mutate(delay = if_else(V1 == "lag1" & V2 == "nolag", -1,
if_else(V1 == "nolag" & V2 == "lag1", 1,
if_else(V1 == "nolag" & V2 == "nolag", 0, -9999)
)
)
) %>%
mutate(fdr = p.adjust(p, method = "BH"))
# Now we select the delay with the highest score.
#
# Note that we calculated the false discovery rate *before* we selected the value with the highest score, but *after* we removed the delay-vs-delay comparisons.
bestLag <- postWrangled %>%
group_by(Var1, Var2) %>%
top_n(1, !!as.name(coef)) %>%
select(-c(V1, V2))
## Add arrows for easier igraph plotting
arrowedData <- bestLag %>%
#filter(fdr < 0.05) %>%
mutate(arrow = recode(delay, `-1` = "<", `1` = ">", `0` = "-"))
arrowedData
}
spearRhoP_lagged4 <- spearRhoP_lagged %>% wrangle_lagged()
Now we select the delay with the highest score.
Note that we calculated the false discovery rate before we selected the value with the highest score, but after we removed the delay-vs-delay comparisons.
And there you have a table that can go into a network. Some thoughts: there is a lot more data handling needed to do this with SparCC, but it can be done. I really ought to automate this into a general time-delay function. I don't think this works for graphical lasso, but it could be tried.
Autocorrelation may inflate some of these scores.
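As a quick sanity check (a sketch on simulated data, not part of the original analysis), you can estimate each column's lag-1 autocorrelation with acf(); taxa with strong autocorrelation are the ones most likely to produce inflated lagged scores.

```r
set.seed(2)
# toy matrix: one strongly autocorrelated AR(1) series and one white-noise series
mtx <- cbind(ar1   = as.numeric(arima.sim(list(ar = 0.8), n = 100)),
             noise = rnorm(100))
# lag-1 autocorrelation of each column
lag1 <- apply(mtx, 2, function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2])
```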
I want a network where arrows point from leading to lagging nodes. Unlagged connections should be represented as lines.
LaggedSpearGraph <- graph_from_data_frame(spearRhoP_lagged4 %>% filter(fdr < 0.05))
LaggedSpearGraph
set.seed(333)
plot(LaggedSpearGraph,vertex.size=2, vertex.label.cex = 0.75, edge.arrow.mode = E(LaggedSpearGraph)$arrow, vertex.label = NA, edge.arrow.size = .5)
Clearly, I’m not an igraph artist but you get the idea.
As above, but this time with SparCC
SparCC doesn’t accept relative abundance data; it has to be counts. We can fake this with the ARISA data by multiplying everything by 1000 and then rounding to the nearest integer.
# convert relative abundances to pseudo-counts for SparCC
spotWLagCounts <- round(spotWLag * 1000)
tp0 <- proc.time()
lagSparcc <- sparcc(spotWLagCounts)
tp1 <- proc.time()
tp1 - tp0
user system elapsed
0.361 0.021 0.379
Bootstrapping step, so we can have p values
tp0 <- proc.time()
bootSparcc <- sparccboot(spotWLagCounts, R = 100)
tp1 <- proc.time()
tp1 - tp0
user system elapsed
66.196 0.092 66.290
Slow, of course. In real life you’d want at least 1000 bootstrap replicates.
Calculate P values
PSparcc <- pval.sparccboot(bootSparcc)
data.frame(PSparcc$cors, PSparcc$pvals)
Extract the values from the triangular matrix. I ought to make a function to do this automatically.
clean_Psparcc <- function(ps_mtx, cNames){
cors <- ps_mtx$cors
pvals <- ps_mtx$pvals
nVars <- length(cNames)
# Dump the values into a rectangular matrix
# Empty matrix
sparCCpcors <- diag(0.5, nrow = nVars, ncol = nVars)
# Fill in upper triangle
sparCCpcors[upper.tri(sparCCpcors, diag=FALSE)] <- cors
# Fill in lower triangle
sparCCpcors <- sparCCpcors + t(sparCCpcors)
# As above, but for p values
sparCCpval <- diag(0.5, nrow = nVars, ncol = nVars)
sparCCpval[upper.tri(sparCCpval, diag=FALSE)] <- pvals
sparCCpval <- sparCCpval + t(sparCCpval)
rownames(sparCCpcors) <- cNames
colnames(sparCCpcors) <- cNames
rownames(sparCCpval) <- cNames
colnames(sparCCpval) <- cNames
return(list(cors = sparCCpcors, p = sparCCpval))
}
cleanSparccData <- clean_Psparcc(PSparcc, colnames(spotWLag))
reshape_sparcc <- function(csparcc){
sparCCpcors <- csparcc$cors
sparCCpval <- csparcc$p
reordered_all_sparcc <- reorder_cor_and_p(sparCCpcors, sparCCpval)
reordered_sparccCor <- reordered_all_sparcc$r
reordered_sparccP <- reordered_all_sparcc$p
sparccCor_processed <- reordered_sparccCor %>% get_upper_tri() %>% reshape2::melt() %>% na.omit() %>% rename(cor = value)
sparccP_processed <- reordered_sparccP %>% get_upper_tri() %>% reshape2::melt() %>% na.omit() %>% rename(p = value)
# join the two data frames
SparccP <- left_join(sparccCor_processed, sparccP_processed, by = c("Var1", "Var2")) #%>%
# # remove self correlations
# filter(Var1 != Var2) %>%
# calculate the false discovery rate to adjust for multiple p values, not yet
#mutate(fdr = p.adjust(p, method = "BH"))
SparccP
}
longLaggedSparccData <- reshape_sparcc(cleanSparccData)
head(longLaggedSparccData)
So all of my p-values are coming out really high, and the strongest correlations are returning as NA.
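One quick diagnostic (a sketch using a toy stand-in for the pval.sparccboot result, which is a list with $cors and $pvals): count the NA p-values and check whether they line up with the largest correlations.

```r
# toy stand-in with the same shape as the pval.sparccboot output
psToy <- list(cors  = c(0.9, -0.8, 0.1, 0.05),
              pvals = c(NA,   NA,  0.4, 0.9))
# how many p-values are NA?
nNA <- sum(is.na(psToy$pvals))
# mean |correlation| among the pairs whose p-value is NA
meanAbsNAcor <- mean(abs(psToy$cors[is.na(psToy$pvals)]))
```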
wrangledLaggedSparccData <- wrangle_lagged(longLaggedSparccData, coef = "cor")
wrangledLaggedSparccData
wrangledLaggedSparccData %>% pull(p)
[1] 0.12 0.00 0.00 0.02 0.00 0.00 0.64 0.20 0.40 0.12 0.24 0.44 0.00 0.00 0.00 0.00 0.86 0.00 0.00 0.04 0.00
[22] 0.00 0.00 0.58 0.70 0.16 0.02 0.18 0.02 0.62 0.02 0.12 0.04 0.02 0.12 0.00 0.00 0.00 0.06 0.14 0.00 0.46
[43] 0.06 0.00 0.00 0.00 0.32 0.54 0.04 0.60 0.00 0.00 0.18 0.32 0.12 0.88 0.30 0.58 0.20 0.02 0.06 0.66 0.04
[64] 0.02 0.16 0.34 0.60 0.00 0.40 0.92 0.08 0.92 0.22 0.08 0.00 0.26 0.30 0.82 0.64 0.80 0.82 0.44 0.16 0.74
[85] 0.52 0.66 0.68 0.56 0.78 0.70 0.64 0.82 0.00 0.96 0.54 0.50 0.60 0.04 0.74 0.10 0.58 0.80 0.10 0.98 0.02
[106] 0.10 0.00 0.02 0.82 0.74 0.00 0.06 0.36 0.38 0.00 0.00 0.02 0.46 0.16 0.32 0.56 0.12 0.00 0.08 0.00 0.02
[127] 0.00 0.00 0.44 0.86 0.10 0.00 0.14 0.32 0.10 0.10 0.94 0.50 0.82 0.18 0.00 0.38 0.10 0.10 0.82 0.76 0.96
[148] 0.10 0.32 0.40 0.50 0.12 0.10 0.00 0.26 0.08 0.08 0.58 0.38 0.02 0.04 0.16 0.60 0.02 0.44 0.00 0.08 0.02
[169] 0.00 0.84 0.08 0.68 0.34 0.12 0.58 0.38 0.34 0.60 0.18 0.80 0.60 0.30 0.02 0.04 0.06 0.04 0.06 0.98 0.94
[190] 0.14 0.36 0.64 0.68 0.26 0.82 0.54 0.92 0.24 0.62 0.22 0.62 0.14 0.00 0.88 0.56 0.32 0.74 0.52 0.02 0.08
[211] 0.02 0.00 0.00 0.40 0.98 0.84 0.00 0.74 0.74 0.04 0.00 0.00 0.00 0.66 0.10 0.00 0.06 0.00 0.08 0.00 0.04
LaggedSparccGraph <- graph_from_data_frame(wrangledLaggedSparccData %>% filter(fdr < 0.05))
LaggedSparccGraph
IGRAPH 5b38e28 DN-- 21 48 --
+ attr: name (v/c), cor (e/n), p (e/n), delay (e/n), fdr (e/n), arrow (e/c)
+ edges from 5b38e28 (vertex names):
[1] ARISA_573.6->ARISA_662 ARISA_573.6->ARISA_570.1 ARISA_662 ->ARISA_573.6 ARISA_570.1->ARISA_573.6
[5] ARISA_666.4->ARISA_686.9 ARISA_570.1->ARISA_435.5 ARISA_666.4->ARISA_435.5 ARISA_686.9->ARISA_435.5
[9] ARISA_570.1->ARISA_538.9 ARISA_573.6->ARISA_538.9 ARISA_686.9->ARISA_538.9 ARISA_435.5->ARISA_538.9
[13] ARISA_686.9->ARISA_666.4 ARISA_570.1->ARISA_561.8 ARISA_573.6->ARISA_561.8 ARISA_435.5->ARISA_561.8
[17] ARISA_594.1->ARISA_561.8 ARISA_666.4->ARISA_594.1 ARISA_435.5->ARISA_594.1 ARISA_561.8->ARISA_594.1
[21] ARISA_435.5->ARISA_689.8 ARISA_594.1->ARISA_689.8 ARISA_686.9->ARISA_682.4 ARISA_670.5->ARISA_682.4
[25] ARISA_828.8->ARISA_831.8 ARISA_662 ->ARISA_402.4 ARISA_686.9->ARISA_402.4 ARISA_828.8->ARISA_402.4
[29] ARISA_831.8->ARISA_402.4 ARISA_670.5->ARISA_419.5 ARISA_828.8->ARISA_419.5 ARISA_421.5->ARISA_419.5
+ ... omitted several edges
set.seed(333)
plot(LaggedSparccGraph, vertex.size = 2, vertex.label.cex = 0.75, edge.arrow.mode = E(LaggedSparccGraph)$arrow, vertex.label = NA, edge.arrow.size = 0.5)
As you can see, this network looks more like a mola mola as drawn by Picasso than the Spearman one does, but otherwise it's qualitatively pretty similar.