1. Purpose


Genetic diversity refers to variation in the base pairs of DNA. We can observe the patterns in genetic diversity to make inferences about evolutionary processes such as migration, selection, and drift. Genetic diversity among individuals within a species might increase with geographic distance; therefore, you might expect the larger a species range, the higher the genetic diversity.

Learning objectives

  1. Navigate an open-source database.
  2. Explain how genetic diversity is related to species characteristics such as range size and dispersal ability.
  3. Run R code that automates data analysis.
  4. Exapling the value of automating data analysis.

The code presented here loops through all the sequence alignments from a phylogatR download and calculates genetic diversity & the size of the geographic range of sampling. Next, it runs a regression analysis to ask the question: does the size of the geographic range of sampling predict levels of genetic diversity?

The data you will download from PhylogatR will consist of a series of folders for each species. In each folder there will be unaligned and aligned fasta files for each locus1 that is present for that species in the database. These files contain the genetic data for a species. Unaligned fasta files will have the .fa extension and aligned fasta files will have the .afa extension. In addition to these files, each species folder will contain an occurrence file, named occurrences.txt. This file contains metadata associated with individual species occurrences, such as GenBank accession number, GBIF ID, and geographic coordinates. In the root folder of your download, you will find a tab-delimited genes.txt file that includes metadata from your download, such as taxonomy for each species, number of individuals per alignment, and how many sequences were removed after data cleaning steps.

Throughout this exercise data from the genus Plethodon, a group of terrestrial salamanders from North America, will be used as an example.

Open the genes.txt file.

First, load all the necessary packages. If you have not already installed these packages, you will need to do that first. See here for more information on package installation in R.

#load necessary packages
library(tools)
library(seqinr)
library(pegas)
library(raster)
library(rgeos)
library(ggplot2)
library(dplyr)

2. Explore data

When importing your own data from PhylogatR, you will need to change the file path or set your current working directory accordingly.

Import the genes.txt file to extract information from the download.

genes <- read.delim("phylogatr-results/genes.txt")

nrow(genes)
## [1] 33
unique(genes$species)
##  [1] "Plethodon cinereus"    "Plethodon fourchensis" "Plethodon glutinosus" 
##  [4] "Plethodon grobmani"    "Plethodon hubrichti"   "Plethodon kiamichi"   
##  [7] "Plethodon mississippi" "Plethodon montanus"    "Plethodon ouachitae"  
## [10] "Plethodon richmondi"   "Plethodon serratus"    "Plethodon sherando"   
## [13] "Plethodon shermani"    "Plethodon vehiculum"   "Plethodon wehrlei"

Thenrows() function counts how many rows are in your dataset.

3. Import sequence data


First, we are going to create a list of all the sequence alignments and put them in an object called myfiles.

myfiles <- list.files(path = "phylogatr-results", pattern = ".afa", full.names=TRUE, recursive = TRUE)

4. Calculate genetic diversity and sampling range


Here we loop through the list stored in myfiles to calculate genetic diversity for each sequence alignment and the area that the geographic coordinates from each species occupies. All data will be written to a csv file called phylogatr-data.csv and stored at the path specified at the end of the loop. Here, we are storing the file on my desktop, but you will need to change the path to where you want the file stored accordingly.

for (f in myfiles) {

  #get species and gene name
  sp <- basename(f)
  sp <- file_path_sans_ext(sp)
  
  #read in fasta file
  seq <- fasta2DNAbin(f, quiet=TRUE)
  
  #get number of sequences
  n <- nrow(seq)

  #calculate nucleotide diversity
  pi <- nuc.div(seq)
  
  #navigate to species folder to get GPS coordinates
  s <- sub("/[^/]+$", "", f)
  c <- paste(s, "/occurrences.txt", sep="")
  occurrences <- read.delim(c)
  xy <- occurrences[,c(5:4)]
    
  #get area based on species GPS coordinates
  ch <- chull(xy)
  coords <- xy[c(ch, ch[1]), ]
  sp_poly <- SpatialPolygons(list(Polygons(list(Polygon(coords)), ID=1)))
  area<-gArea(sp_poly)
  format(area, scientific = FALSE)
  
  #write data to file
  write.table(data.frame(sp, pi, area, n), file="phylogatr-pi-area.csv", quote=FALSE, row.names=FALSE, col.names=!file.exists("phylogatr-pi-area.csv"), append=TRUE, sep=",")
}

You will see the error "wrong file extension". This can be ignored. It is because we are using .fa and .fas as our file extensions instead of the extpected .fasta.

5. Regression analysis


First, import the output from the previous step. If necessary, change the path to the file accordingly.

data <- read.csv("phylogatr-pi-area.csv")

Run the regression analysis.

r <- lm(pi~area, data=data)
summary(r)
## 
## Call:
## lm(formula = pi ~ area, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.02761 -0.02084 -0.01694  0.01481  0.08229 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.083e-02  5.431e-03   3.835 0.000577 ***
## area        8.525e-05  2.096e-04   0.407 0.686949    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02906 on 31 degrees of freedom
## Multiple R-squared:  0.00531,    Adjusted R-squared:  -0.02678 
## F-statistic: 0.1655 on 1 and 31 DF,  p-value: 0.6869

In this case, the relationship between area and pi is almost zero (R-squared = 0.005) and not significant (p = 0.69).

Create a plot to see what the data look like.

ggplot(data, aes(x=area, y=pi)) + geom_point(color="red")

There are some clear outliers in these Plethodon data and we will explore these further.

outlier.pi <- filter(data, pi > 0.1)
outlier.pi
##                         sp        pi     area n
## 1 Plethodon-glutinosus-COI 0.1045455 16.75395 5
outlier.area <- filter(data, area > 80)
outlier.area
##                        sp           pi    area  n
## 1  Plethodon-cinereus-18S 0.0004401408 84.7279  8
## 2  Plethodon-cinereus-COI 0.0051136364 84.7279 55
## 3 Plethodon-cinereus-CYTB 0.0674329502 84.7279  4

We chose to explore the high level of genetic diversity of P. serratus by looking at the sequence alignment for this species and gene (see here for guidance. After visually inspecting the alignment, we suspect that one of the individual sequences in this alignment might have been mis-idantified and so we will remove this species from the analysis for now. However, in some circumstances you could remove an odd looking sequence, and keep that species in the analysis.

Photo credit: apmbibiaweb.org: Tod Pierson

Given what we know about the distribution of P. cinereus this species might be an outlier, but is likley not a mistake in the data. This species has a VERY large distribution compared to other Plethodon species, but it is interesting that correspondingly its genetic diveristy isn't higher, as expected.

Photo credit: apmbibiaweb.org: John White Map credit: iucnredlist.org

In most cases where you have an obvious geographic outlier, we recommend plotting the GPS coordinates for that species and making a decision about the GPS coordinates for that species. Once the GPS coordinates for any outlier species are cleaned up, those species may be able to remain in the analysis. See here for guidance. For now, we will simply remove it and redo our analysis.

data_pruned <- filter(data, area < 80, pi < 0.10)
r_pruned <- lm(pi~area, data=data_pruned)
summary(r_pruned)
## 
## Call:
## lm(formula = pi ~ area, data = data_pruned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01996 -0.01804 -0.01569  0.01568  0.05173 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.019962   0.004932   4.048  0.00039 ***
## area        -0.001069   0.001429  -0.748  0.46097    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02435 on 27 degrees of freedom
## Multiple R-squared:  0.0203, Adjusted R-squared:  -0.01599 
## F-statistic: 0.5594 on 1 and 27 DF,  p-value: 0.461
ggplot(data_pruned, aes(x=area, y=pi)) + geom_point(color="red")

This changes things, but not in any meaningful way (R-squared = 0.0.02, p = 0.46). We are not advocating removing data until you obtain significant results, but I am curious what will happen after removing the other points that stand out.

data_pruned2 <- filter(data_pruned, area < 15)
r_pruned2 <- lm(pi~area, data=data_pruned2)
summary(r_pruned2)
## 
## Call:
## lm(formula = pi ~ area, data = data_pruned2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.01944 -0.01856 -0.01536  0.01577  0.05224 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.0194407  0.0055856   3.480  0.00178 **
## area        -0.0003502  0.0036706  -0.095  0.92471   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0248 on 26 degrees of freedom
## Multiple R-squared:  0.0003501,  Adjusted R-squared:  -0.0381 
## F-statistic: 0.009105 on 1 and 26 DF,  p-value: 0.9247
ggplot(data_pruned2, aes(x=area, y=pi)) + geom_point(color="red")

Our hypothesis was not supported. The size of the geographic range, as measured here, does not explain variation in intra-specific genetic diveristy and there are likley other factors that influence genetic diversity within the species of this group. However, the removal of outliers did influnce the results showing more of a trend (R-squared = 0.0003, p = 0.92). The outliers we detected might be worth exploring further.


  1. A locus is a specific position in the genome where a particular gene or genetic marker is located. As such, not all loci necessarily correspond to a gene. Because PhylogatR alignments reflect genetic data uploaded to GenBank, you may also encounter alignments representing anonymous loci.