1. Purpose


This module is designed to explore data downloads from PhylogatR for observations that look spurious. Data cleaning, and filtering out bad data, is the most important step in a data science project. Bad data = bad results. The code presented here pulls genetic and geographic data downloaded from the PhylogatR website and allows you to create a list of sequence alignments that stand out in the data based on estimates of nucleotide divesity.

Learning objectives

  1. Navigate an open-source database.
  2. Explain why data cleaning is a necessary step in data analysis.
  3. Quantify outliers from data.

2. PhylogatR

PhylogatR is a database that supplies users with DNA sequence data from thousands of species that are ready to analyze. There are several steps that needs to be carried out before DNA sequences can be used for data analysis.

Choose a species to assess sequence alignments and geographic coordinates. Navigate to the PhylogatR website to download data. In the example below, fungi was typed into the filter box, but you can also click on any names to keep those in your search. If you choose to type in a filter term, you will want to check the search list because it will retain anything that has fungi in the name. For example, see row two below (Family: Fungiidae). It is a coral, which is an animal, but the Family name has fungi nested within it. If you only want fungi, you will need to click on fungi under Kingdom so everything else is removed. Though for this exercise you will choose just a single species. p

Your download will be a zip file called phylogatr-results. If you are using a Mac, the zip file will unzip when you double click on it. If you are using Windows, you must extract the files first (this is very important, otherwise the computer won't recognize the file).

The data you download from PhylogatR will consist of a series of folders for several taxonomic levels until you get to species. In each species folder there will be unaligned and aligned fasta files for each locus1 that is present for that species in the database. These files contain the genetic data for a species. Unaligned fasta files will have the .fa extension and aligned fasta files will have the .afa extension. In addition to these files, each species folder will contain an occurrence file, named occurrences.txt. This file contains information associated with individual occurrences, such as accession number, taxonomy, basis of record, and geographic coordinates.

This module will take you through the process of estimating nucleotide diversity in a group of closely related species and determine which estimates fall outside of the 95% range of values for this summary statistic.

R and RStudio

R is a language and environment for statistical computing and graphics.

RStudio is a set of integrated tools designed to help you be more productive with R.

You will need to install R and RStudio on your computer. See here for instructions.

You will need to load R packages in order to obtain the functions necessary for the analyses that will be conducted. If you have not already installed these packages, you will need to do that first. See here for more information on package installation in R. You will need to install the following packages using the install.packages() function. You only need to install packages once on your computer, but you need to load the library every time you open a new R session.

For example, copy and paste install.packages("tools") into the console in R (bottom left). Do the same thing for each package that needs to be installed for each of the required packages listed below.

  • tools
  • sequinr
  • pegas

Next, load all the packages necessary for this exercise.

#load necessary packages
library(tools)
library(seqinr)
library(pegas)
library(adegenet)

2. Explore data

When importing your own data from PhylogatR, you will need to change the file path or set your current working directory accordingly. We are going to explore a small set of mammal data for demonstration purposes.

Import the genes.txt file to extract information from the download.

genes <- read.delim("phylogatr-results/genes.txt")

nrow(genes)
## [1] 62
unique(genes$species)
##  [1] "Anourosorex squamipes"      "Blarina brevicauda"        
##  [3] "Chimarrogale platycephalus" "Condylura cristata"        
##  [5] "Crocidura dsinezumi"        "Crocidura macmillani"      
##  [7] "Crocidura olivieri"         "Crocidura suaveolens"      
##  [9] "Crocidura tanakae"          "Crocidura wuchihensis"     
## [11] "Cryptotis lacertosus"       "Cryptotis magna"           
## [13] "Cryptotis mam"              "Cryptotis merriami"        
## [15] "Cryptotis mexicana"         "Cryptotis oreoryctes"      
## [17] "Neomys fodiens"             "Sorex araneus"             
## [19] "Sorex arcticus"             "Sorex bedfordiae"          
## [21] "Sorex caecutiens"           "Sorex cinereus"            
## [23] "Sorex daphaenodon"          "Sorex fumeus"              
## [25] "Sorex haydeni"              "Sorex hoyi"                
## [27] "Sorex isodon"               "Sorex longirostris"        
## [29] "Sorex minutissimus"         "Sorex minutus"             
## [31] "Sorex monticolus"           "Sorex palustris"           
## [33] "Sorex roboratus"            "Sorex trowbridgii"         
## [35] "Sorex tundrensis"           "Sorex vagrans"             
## [37] "Sorex volnuchini"

Thenrows() function counts how many rows are in your dataset.

3. Import sequence data


First, we are going to create a list of all the sequence alignments and put them in an object called myfiles.

myfiles <- list.files(path = "phylogatr-results", pattern = ".afa", full.names=TRUE, recursive = TRUE)

4. Calculate genetic diversity


Here we loop through the list stored in myfiles to calculate genetic diversity for each sequence alignment. All data will be written to a csv file called phylogatr-data.csv and stored at the path specified at the end of the loop. Here, we are storing the file on my desktop, but you will need to change the path to where you want the file stored accordingly.

for (f in myfiles) {

  #get species and gene name
  sp <- basename(f)
  sp <- file_path_sans_ext(sp)
  
  #read in fasta file
  seq <- fasta2DNAbin(f, quiet=TRUE)
  
  #get number of sequences
  n <- nrow(seq)

  #calculate nucleotide diversity
  pi <- nuc.div(seq)
  
  
  #write data to file
  write.table(data.frame(sp, pi, n), file="phylogatr-pi.csv", quote=FALSE, row.names=FALSE, col.names=!file.exists("phylogatr-pi.csv"), append=TRUE, sep=",")
}

You will see the error "wrong file extension". This can be ignored. It is because we are using .fa and .fas as our file extensions instead of the extpected .fasta.

5. Outlier analysis


First, import the output from the previous step. If necessary, change the path to the file accordingly.

data <- read.csv("phylogatr-pi.csv")

Print out a list of sequence alignments that fall outside the 95th percentile of the distribution. These may contain poor sequence alignments, though we do recommend checking those that are deemed suspicious by eye. For more information on screening for outliers please see here.

lower_bound <- quantile(data$pi, 0.025)
lower_bound
## 2.5% 
##    0
upper_bound <- quantile(data$pi, 0.975)
upper_bound
##      97.5% 
## 0.05139108
outliers <- which(data$pi < lower_bound | data$pi > upper_bound)
data[outliers,]
##                          sp         pi n
## 11 Crocidura-suaveolens-COI 0.08494409 7
## 24   Cryptotis-mexicana-COI 0.05503145 3

  1. A locus is a specific position in the genome where a particular gene or genetic marker is located. As such, not all loci necessarily correspond to a gene. Because PhylogatR alignments reflect genetic data uploaded to GenBank, you may also encounter alignments representing anonymous loci.