1. Purpose

This module is designed to explore data downloads from PhylogatR for observations that look spurious. Data cleaning, and filtering out bad data, is THE MOST IMPORTANT STEP in a data science project. Bad data = bad results.

The code presented here allows you to visualize: i) aligned DNA sequences for all loci present for a particular species in the database and ii) geographic coordinates for all individuals within that species.

The data you will download from PhylogatR will contain a series of folders for each species. In each folder there will be unaligned and aligned fasta files for each locus1 that is present for that species in the database. These files contain the genetic data for a species. Unaligned fasta files will have the .fa extension and aligned fasta files will have the .afa extension. In addition to these files, each species folder will contain an occurrence file, named “occurrences.txt”. This file contains metadata associated with individual species occurrences, such as accession number, taxonomy, basis of record, and geographic coordinates.

We will start by looking at the data for just one species.This module will take you through the process of importing your data from PhylogatR, visualizing sequence alignments for individual loci simultaneously so that you may check them by eye, and plotting geographic coordinates of occurrences on a map for visual inspection.

First, load all the necessary packages. If you have not already installed these packages, you will need to do that first. See here for more information on package installation in R.

#load necessary packages

2. Importing data

We will import all the sequence alignments for one species. In this case, there is a folder on my Desktop called “phylogatr-results” and we are going to look at the data from the species Sorex monticolus. When you import your own data from PhylogatR, you will need to change the path to the folder accordingly.

First, we need to store the path names of all of the aligned fasta files for our chosen species. Again, make sure your path here is set to the folder of the species that you want to inspect. We will generate a list of paths to all of the aligned fasta files (i.e., all files in the species folder that end in .afa) using the function list.files and store this list in R as myfiles.

#get paths to aligned fasta files in species folder (all .afa files in downloaded PhylogatR folder)
myfiles<-list.files(path="/Users/parsons.463/Desktop/phylogatr-results/Sorex-monticolus", pattern=".afa", full.names=TRUE)

Next, we will use the paths stored in “myfiles” to import our aligned fasta files and create a multidna object. A multidna object is a class of object used to store multiple DNA sequence alignments. Sequences are stored as a list, with each element of the list being a separate DNA alignment stored as a DNAbin matrix. For more information on DNAbin and multidna objects, see here.

We will use the function read.multiFASTA to create a multidna object containing our aligned fasta files. In the likely case that some individuals are not sequenced for all loci, setting the argument add.gaps=FALSE will ensure that gap only sequences will not be added to the alignments.

#create S4 class multidna object (contains DNAbin objects for all loci/.afa alignment files)
multidna<-read.multiFASTA(files=myfiles, add.gaps=FALSE)

3. Visualizing sequence alignments

Now we will use the plot function to visualize the separate alignments (i.e., loci) in our multidna object simultaneously. When set to false, the argument rows will return each locus in a separate image. If set to true, all loci will be shown in a single image (note: depending on how many alignments you have, setting rows=TRUE may make your alignments too small to inspect)

The resulting alignments will consist of a series of rows (each representing the DNA sequence of one individual) and columns (each representing a particular nucleotide position in the locus). Accession numbers for each individual are located to the left of each row and the species and gene/locus name are located at the bottom of the alignment. Nucleotides are color coded according to the legend located at the top of the alignment.

#visualize the seperate alignments in your multidna object simultaneously
plot(multidna, rows=FALSE, ask=FALSE)