The data you will download from PhylogatR will consist of a series of folders for each species. In each folder there will be unaligned and aligned fasta files for each locus1 that is present for that species in the database. These files contain the genetic data for a species. Unaligned fasta files will have the .fa extension and aligned fasta files will have the .afa extension. In addition to these files, each species folder will contain an occurrence file, named “occurrences.txt”. This file contains metadata associated with individual species occurrences, such as accession number, taxonomy, basis of record, and geographic coordinates.
We will start by looking at the data for just one species. The next two sections of this module will take you through the process of importing your data from PhylogatR and preparing it for analysis. Data preparation is a very important part of any scientific analysis, and can often take many steps, especially when dealing with large datasets in which the data wasn’t specifically collected with the current study in mind (i.e., repurposed data). But, with the proper preparation, repurposed data (such as what you’ll encounter on PhylogatR) can be used to answer new and exciting questions!
The sections following data preparation (sections 4-6) will walk you through assigning individuals to populations to visualize sampling density and assesing spatial genetic structure through Spatial Principal Component Analysis (sPCA).
First, load all the necessary packages. If you have not already installed these packages, you will need to do that first. See here for more information on package installation in R.
#load neccesary packages 
library("ape")
library("phangorn")
library("apex")
library("ade4")
library("adegenet")
library("ggplot2")
library("maps")
library("plyr")
library("spdep")
library("adespatial")We will import all the sequence alignments for one species. In this case, there is a folder on my Desktop called “phylogatr-results” and we are going to look at data from the tundra shrew, Sorex tundrensis (pictured below). When you import your own data from PhylogatR, you will need to change the path to the folder accordingly.
Photo credit: Andrew Hope, USGS
First, we need to store the path names of all of the aligned fasta files for our chosen species. Again, make sure your path here is set to the folder of the species that you want to inspect. We will generate a list of paths to all of the aligned fasta files (i.e., all files in the species folder that end in “.afa”) using the function list.files and store this list in R as “myfiles”.
#get paths to aligned fasta files in species folder (all .afa files in downloaded PhylogatR folder)
myfiles<-list.files(path="/Users/parsons.463/Desktop/phylogatr-results/Sorex-tundrensis", pattern=".afa", full.names=TRUE)Next, we will use the paths stored in “myfiles” to import our aligned fasta files and create a multidna object. A multidna object is a class of object used to store multiple DNA sequence alignments. Sequences are stored as a list, with each element of the list being a seperate DNA alignment stored as a DNAbin matrix. For more information on DNAbin and multidna objects, see here.
We will use the function read.multiFASTA to create a multidna object containing our aligned fasta files. In the likely case that some individuals are not sequenced for all genes, setting the argument “add.gaps” to false will ensure that gap only sequences will not be added to the alignments.
#create S4 class multidna object (contains DNAbin objects for all loci/.afa alignment files)
multidna<-read.multiFASTA(files=myfiles, add.gaps=FALSE)Now we will prepare our multidna object (which is used to store multiple alignments) for its final conversion into a genind object (which is used to store individual genotypes). To do this conversion, we have to ensure that the loci names in our multidna object do not have any periods in them. Becasue the loci names in a multidna object are automatically set to the names of imported files used to generate it, (in our case, aligned fasta files ending in “.afa”), we will use the gsub function to remove the “.afa” file extension from our multidna locus names.
The returned text should contain a list of new locus names. We can see that the species Sorex tundrensis has sequence information for four loci, APOB, BRCA1, COI, and CYTB. Before moving on, be sure to double check that all file extensions (i.e., “.afa”) have been successfully removed.
#prepare multidna object for conversion to genind(s) by removing "." from locus names (which are automatically named after filenames)
(setLocusNames(multidna)<-gsub(".afa", "", getLocusNames(multidna)))## [1] "Sorex-tundrensis-APOB"  "Sorex-tundrensis-BRCA1" "Sorex-tundrensis-COI"  
## [4] "Sorex-tundrensis-CYTB"The function multidna2genind concatenates seperate DNA alignments, and then extracts SNPs of the resulting alignment to create a single genind object. We will now use this function to convert our multidna object, “multidna”, into a genind object, named “seq_genind”. Setting the argument genes=TRUE will ensure that all genes from an individual are concatenated into a single alignment in our final genind object. Alternitavely, the argument genes=LocusName can be used to create a genind object for a single gene.2
#convert to genind object
seq_genind<-multidna2genind(multidna, genes=TRUE)Now that we’ve successfully imported and prepared our genetic data, we can get started with our geographic data. First, we will import our “occurrences.txt” file into R and save it as a data frame called “occurrences”.
In this case, there is a folder on my Desktop called “phylogatr-results” and we are going to look at the data from the “occurrences.txt” file located within the species Sorex tundrensis. When you import your own data from PhylogatR, you will need to change this path accordingly.
Once that data has been imported into R, we will extract just the columns containing longitude and latitude (i.e., columns 4 and 5) and save them in R as “geo”.
#read in geographic coordinates
occurrences<-as.data.frame(read.table("/Users/parsons.463/Desktop/phylogatr-results/Sorex-tundrensis/occurrences.txt", sep="\t", fill=TRUE, header=TRUE))
geo<-occurrences[,c(5:4)]Because our downstream analysis does not allow for for duplicate localities (i.e., multiple occurrences with identical geographic coordinates), we will need to add some random noise to our coordinates to disentangle identical locations. First, we will use the function as.matrix to convert our geographic data “geo” into a matrix, which we will save as “xy”. We will then use the jitter function to add random noise to the coordinates in our “xy” matrix. The jittered coordinates will be saved in R as “xyrand”.
#jitter coordinates to remove duplicate localities
xy=as.matrix(geo)
xyrand=jitter(xy, factor=2)Finally, we will insert the jittered coordinates from “xyrand” into the “other” slot in our genind object, where we will save them in a list as “xy”. This will link our geographic coordinates to the thier appropriate genetic sequence.
#add jittered coordinates to genind object 
seq_genind$other$xy<-xyrandFirst, we will make use of genind accessors to learn more about our data. Accessors are functions that allow us to interact with the different slots of our genind object. You can learn more about genind slots and accessors by referring to pages 8-12 of the adegenet tutorial (Jombart 2015).3 We will use the accessor nInd to check the number of individuals included in our genind object.
#check number of individuals in genind object
nInd(seq_genind)## [1] 139Now we will assign each of the individuals in our chosen species to a collection locality. Individuals with the same geographic coordinates will be assigned to a single collection locality. First, we will generate locality IDs using the geographic data stored in “geo”. We will then save this information in the population slot, “pop”, of our genind object.
Note: Here, we are using the population slot of our genind object to assign individuals to collection localities. However, the population slot can also be used to include population level information, if you have that for your species.
#generate locality IDs and add assignments to genind object 
geo$ID<-id(geo[c("latitude", "longitude")], drop=TRUE)
seq_genind$pop<-as.factor(geo$ID)Now we will determine the number of individuals in each locality and plot our geographic coordinates on a map to visualize sampling denisty. In the resulting map, each individual dot represents a collection locality and the size of the dot represents how many individuals belong to that locality. Please keep in mind that when using geographic coordinates in R, longitude (the x-axis) comes first, and latitude (the y-axis) comes second.
#determine number of individuals in each locality 
counts<-ddply(geo, .(ID), "count")
#plot sampling density
world<-map_data("world")
ggplot()+ 
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill = "lightgrey", color = "gray30", size = .25)+
  coord_fixed(1.3)+
  xlab("Longitude")+
  ylab("Latitude")+
  ggtitle("Number of Individuals per Sampling Locality")+
  geom_point( data=counts, aes(x=counts$longitude, y=counts$latitude, size=freq),  color="blue", alpha = 0.4)An sPCA is deigned to investigate non-random spatial distributions of genetic variation. In other words, is the genetic variation that we observe in our species due to population structure, or is it the result of random mating?
We will use the spca function to perform a spatial principle component analysis (sPCA) on our data. This analysis implements a connection network in order to incorporate spatial data. We will use Delaunay triangulation to generate a connection network by setting the argument type=1. The connection network type can be changed by providing a different value for the type argument, and more information regarding the different types of connection networks available can be found with the command ?chooseCN.
Our genind object “seq_genind” will act as input for the analysis, which we will save in R as “mySpca”. For now, we will retain the first positive axis (by setting nfposi=1) and the first negative axis (by setting nfnega=1). The figure returned is a plot of our connection network.
#performs a sPCA using the Delaunay triangulation as connection network (type=1)
mySpca <- spca(seq_genind, type=1, ask=FALSE, scannf=FALSE, nfposi=1, nfnega=1)Now that we have completed our sPCA, we can use the results to determine how many principal component (PC) axes we want to retain for further analysis. Our goal here is to retain only the axes that explain the largest portion of the spatial genetic structure of our species. We will evaluate axis contribution in two ways: 1) plotting the sPCA eigenvalues and 2) plotting the spatial and genetic variance components of each eigenvalue.
First, we will use the function barplot to display a barplot of eigenvalues from our sPCA. Positive eigenvalues, which are located on the left side of the plot in warmer colors, indicate the presence of global structure. Global structures display positive spatial aturocorrelation, which is typically observed when populations are split into patches or located along clines. Negative eigenvalues, which are located on the right side of the plot in cooler colors, indicate the presence of local structure. Local structures display negative spatial aturocorrelation, which is typically observed when neighboring individuals are more genetically distinct than expected at random. For more information on global and local structure, see Jombart et al. (2008).4
The presence of either global or local structure is indicated by a more extreme eigenvalue. As each bar on the plot represents a single PC axis, we want to retain the axes with the most extreme (i.e., noticeably different) values. For example, the plot below suggests our data has a single global structure (as indicated by the leftmost red bar), and a potentially informative local structure (as indicated by the slightly smaller, rightmost blue bar). This would suggest that we want to retain the first positive axis and the first negative axis.
#plot eigenvalues of the analysis (stored inside the $eig component as a numeric vector) in decreasing order
barplot(mySpca$eig, main="sPCA eigenvalues", col=spectral(length(mySpca$eig)))
legend("topright", fill=spectral(2),
       leg=c("Global structures", "Local structures"))
abline(h=0,col="grey")We can confirm that our chosen axes contain the majority of spatially meaningful genetic variation by plotting the spatial and genetic variance portions of the eigenvalues. Each eigenvalue can be decomposed into a variance component (i.e., genetic variation) and a spatial autocorrelation component (i.e., spatial variation). We will use the screenplot function to graph the variance and spatial autocorrelation component of each of our sPCA eigenvalues. In the resulting figure, eigenvalues of the sPCA are denoted as λi (with i = 1, . . . , r), where λ1 is the highest positive eigenvalue (which corresponds to the leftmost bar on the previous bargraph), and λr is the highest negative eigenvalue (which corresponds to the rightmost bar an the previous bargraph). The x-axis represents variance, and the y-axis represents spatial autocorrelation.
#produce a figure that represents eigenvalues of sPCA
screeplot(mySpca)Much like what we were looking for in the previous graph, we want to make sure we are retaining the axes that contain spatially meaningful genetic variation (i.e., the ones that stand out). Fortunately, both the first (λ1) and last (λ56) axis are distinct from the rest, so our initial choice to retain the first and last axis when running the sPCA holds up. If your results suggest you should retain different axes, please refer to the following section, “Retaining Different PC Axes”, otherwise, move directly to the final section, “Visualizing Spatial Genetic Structure”.
Retaining Different PC Axes
The following example is meant to show you how to change the number and type of axes retained in your sPCA.
The barplot below suggests that the spatially meaningful genetic variance in this dataset is contained in the first three positive axes (bars indicated by blue arrows). As you can see, there is comparatively very little variance in local structure, so we won’t retain any negative axes.
The eigenvalue decompisition plot supports this notion, as the values belonging to the first three positive axes (λ1, λ2, λ3; circled in blue) clearly stand out from the rest.
Both plots suggest that the majority of spatially meaningful genetic variance in our dataset is contained in the first three positive axes. Therefore, we need to rerun our initial sPCA, while retaining just those axes. We can do this using the same spca function as before, but changing the values set for the arguments nfposi and nfnega. The argument nfposi refers to the number of positive axes to be retained, while the argument nfnega refers to the number of negative axes to be retained. For the example shown above, we will set nfposi=3 (retain first three positive axes), and nfnega=0 (retain zero negative axes).
The new sPCA will be saved as “mySpca”, rewriting any previous sPCA saved under the same name.
#performs a sPCA using the Delaunay triangulation as connection network (type=1); change nfposi=val and nfnega=val to change axes retained
mySpca <- spca(seq_genind,type=1,ask=FALSE,scannf=FALSE, nfposi=3, nfnega=0)Finally, we can use our chosen sPCA axes to visualize the spatial genetic structure of our species. We will use the colorplot function to plot PC lagged scores onto geographical space (lagged scores reduce overall noise of the data). In the plot below, the x-axis represents longitude, the y-axis represents latitude, and each dot represents an individual, with colors representing genetic similarity (i.e., similar colors = genetically similar; distinct colors = genetically distinct).
We can use the axes argument to change which PC axes are plotted in the figure. Setting axes=1 will plot just the first saved axis (here the first positive axis), while axes=2 will plot just the second saved axis (here the first negative axis). Setting axes=1:2 will plot combined variation from both these axes, as is shown in the figure below.
#visualize spatial genetic structure of sPCA axes 
colorplot(seq_genind$other$xy, mySpca$ls, axes=1:2, transp=T, alpha=0.75, cex=3, add=F, main="Colorplot of PC (lagged scores)")After plotting the combinied variation for the both retained axes (first positive axis and first negative axis), we can see that genetic variation in the tundra shrew, Sorex tundrensis, does seem to be at least partially spatially structured, with genetic variation (here shown by color) clustering in spatial groups.
A locus is a specific position in the genome where a particular gene or genetic marker is located. As such, not all loci necessarily correspond to a gene. Because PhylogatR alignments reflect genetic data uploaded to GenBank, you may also encounter alignments representing anonymous loci.↩
If you use the argument genes=LocusName to create a genind object for a single gene, you will need to manually edit the “occurrences.txt” file to contain just the coordinates of the individuals that have sequence data for that particular gene. Once you have created your genind object, you can use the command indNames(seq_genind) to obtain a list of accessions for all of the individuals in your genind object. Your “occurences.txt” file should contain these same individuals (and only these individuals), in the same order.↩
Jombart, T. 2015. An Introduction to adegenet 2.0.0. Imperial College London. MRC Centre for Outbreak Analysis and Modelling. http://adegenet.r-forge.r-project.org/files/tutorial-basics.pdf↩
Jombart, T., Devillard, S., Dufour, A-B., and Pontier, D. 2008. Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity. 101, 92-103. https://doi.org/10.1038/hdy.2008.34↩