1. Purpose

This module is designed to explore spatial genetic structure within a species from a PhylogatR download.

The code presented here explores genetic and geographic data downloaded from the PhylogatR website and walks you through the steps of: i) importing your genetic and geographic data, ii) visualizing sample density, and iii) exploring the genetic variation and how it is clustered.

Genetic variation refers to diversity in the base pairs of DNA. We can observe the patterns in genetic variation to make inferences about evolutionary processes such as migration, selection, and drift, as well as estimate the relationship among individuals in a sample and visualize these relationships using a phylogenetic tree.

The data you will download from PhylogatR will consist of a series of folders for each species. In each folder there will be unaligned and aligned fasta files for each locus1 that is present for that species in the database. These files contain the genetic data for a species. Unaligned fasta files will have the .fa extension and aligned fasta files will have the .afa extension. In addition to these files, each species folder will contain an occurrence file, named “occurrences.txt”. This file contains metadata associated with individual species occurrences, such as accession number, taxonomy, basis of record, and geographic coordinates.

We will start by looking at the data for just one species. The next two sections of this module will take you through the process of importing your data from PhylogatR and preparing it for analysis. Data preparation is a very important part of any scientific analysis, and can often take many steps, especially when dealing with large datasets in which the data wasn’t specifically collected with the current study in mind (i.e., repurposed data). But, with the proper preparation, repurposed data (such as what you’ll encounter on PhylogatR) can be used to answer new and exciting questions!

The sections following data preparation (sections 4-6) will walk you through assigning individuals to populations to visualize sampling density and assesing spatial genetic structure through Spatial Principal Component Analysis (sPCA).

First, load all the necessary packages. If you have not already installed these packages, you will need to do that first. See here for more information on package installation in R.

#load neccesary packages 

2. Importing Genetic Data

We will import all the sequence alignments for one species. In this case, there is a folder on my Desktop called “phylogatr-results” and we are going to look at data from the tundra shrew, Sorex tundrensis (pictured below). When you import your own data from PhylogatR, you will need to change the path to the folder accordingly.

Photo credit: Andrew Hope, USGS

First, we need to store the path names of all of the aligned fasta files for our chosen species. Again, make sure your path here is set to the folder of the species that you want to inspect. We will generate a list of paths to all of the aligned fasta files (i.e., all files in the species folder that end in “.afa”) using the function list.files and store this list in R as “myfiles”.

#get paths to aligned fasta files in species folder (all .afa files in downloaded PhylogatR folder)
myfiles<-list.files(path="/Users/parsons.463/Desktop/phylogatr-results/Sorex-tundrensis", pattern=".afa", full.names=TRUE)

Next, we will use the paths stored in “myfiles” to import our aligned fasta files and create a multidna object. A multidna object is a class of object used to store multiple DNA sequence alignments. Sequences are stored as a list, with each element of the list being a seperate DNA alignment stored as a DNAbin matrix. For more information on DNAbin and multidna objects, see here.

We will use the function read.multiFASTA to create a multidna object containing our aligned fasta files. In the likely case that some individuals are not sequenced for all genes, setting the argument “add.gaps” to false will ensure that gap only sequences will not be added to the alignments.

#create S4 class multidna object (contains DNAbin objects for all loci/.afa alignment files)
multidna<-read.multiFASTA(files=myfiles, add.gaps=FALSE)

Now we will prepare our multidna object (which is used to store multiple alignments) for its final conversion into a genind object (which is used to store individual genotypes). To do this conversion, we have to ensure that the loci names in our multidna object do not have any periods in them. Becasue the loci names in a multidna object are automatically set to the names of imported files used to generate it, (in our case, aligned fasta files ending in “.afa”), we will use the gsub function to remove the “.afa” file extension from our multidna locus names.

The returned text should contain a list of new locus names. We can see that the species Sorex tundrensis has sequence information for four loci, APOB, BRCA1, COI, and CYTB. Before moving on, be sure to double check that all file extensions (i.e., “.afa”) have been successfully removed.

#prepare multidna object for conversion to genind(s) by removing "." from locus names (which are automatically named after filenames)
(setLocusNames(multidna)<-gsub(".afa", "", getLocusNames(multidna)))
## [1] "Sorex-tundrensis-APOB"  "Sorex-tundrensis-BRCA1" "Sorex-tundrensis-COI"  
## [4] "Sorex-tundrensis-CYTB"

The function multidna2genind concatenates seperate DNA alignments, and then extracts SNPs of the resulting alignment to create a single genind object. We will now use this function to convert our multidna object, “multidna”, into a genind object, named “seq_genind”. Setting the argument genes=TRUE will ensure that all genes from an individual are concatenated into a single alignment in our final genind object. Alternitavely, the argument genes=LocusName can be used to create a genind object for a single gene.2

#convert to genind object
seq_genind<-multidna2genind(multidna, genes=TRUE)

3. Importing Geographic Data

Now that we’ve successfully imported and prepared our genetic data, we can get started with our geographic data. First, we will import our “occurrences.txt” file into R and save it as a data frame called “occurrences”.

In this case, there is a folder on my Desktop called “phylogatr-results” and we are going to look at the data from the “occurrences.txt” file located within the species Sorex tundrensis. When you import your own data from PhylogatR, you will need to change this path accordingly.

Once that data has been imported into R, we will extract just the columns containing longitude and latitude (i.e., columns 4 and 5) and save them in R as “geo”.

#read in geographic coordinates
occurrences<-as.data.frame(read.table("/Users/parsons.463/Desktop/phylogatr-results/Sorex-tundrensis/occurrences.txt", sep="\t", fill=TRUE, header=TRUE))


Because our downstream analysis does not allow for for duplicate localities (i.e., multiple occurrences with identical geographic coordinates), we will need to add some random noise to our coordinates to disentangle identical locations. First, we will use the function as.matrix to convert our geographic data “geo” into a matrix, which we will save as “xy”. We will then use the jitter function to add random noise to the coordinates in our “xy” matrix. The jittered coordinates will be saved in R as “xyrand”.

#jitter coordinates to remove duplicate localities

xyrand=jitter(xy, factor=2)

Finally, we will insert the jittered coordinates from “xyrand” into the “other” slot in our genind object, where we will save them in a list as “xy”. This will link our geographic coordinates to the thier appropriate genetic sequence.

#add jittered coordinates to genind object 

4. Visualizing Sample Density

First, we will make use of genind accessors to learn more about our data. Accessors are functions that allow us to interact with the different slots of our genind object. You can learn more about genind slots and accessors by referring to pages 8-12 of the adegenet tutorial (Jombart 2015).3 We will use the accessor nInd to check the number of individuals included in our genind object.

#check number of individuals in genind object
## [1] 139

Now we will assign each of the individuals in our chosen species to a collection locality. Individuals with the same geographic coordinates will be assigned to a single collection locality. First, we will generate locality IDs using the geographic data stored in “geo”. We will then save this information in the population slot, “pop”, of our genind object.

Note: Here, we are using the population slot of our genind object to assign individuals to collection localities. However, the population slot can also be used to include population level information, if you have that for your species.

#generate locality IDs and add assignments to genind object 
geo$ID<-id(geo[c("latitude", "longitude")], drop=TRUE)


Now we will determine the number of individuals in each locality and plot our geographic coordinates on a map to visualize sampling denisty. In the resulting map, each individual dot represents a collection locality and the size of the dot represents how many individuals belong to that locality. Please keep in mind that when using geographic coordinates in R, longitude (the x-axis) comes first, and latitude (the y-axis) comes second.

#determine number of individuals in each locality 
counts<-ddply(geo, .(ID), "count")

#plot sampling density
  geom_polygon(data = world, aes(x=long, y = lat, group = group), fill = "lightgrey", color = "gray30", size = .25)+
  ggtitle("Number of Individuals per Sampling Locality")+
  geom_point( data=counts, aes(x=counts$longitude, y=counts$latitude, size=freq),  color="blue", alpha = 0.4)