Install the following packages.

For the sequence data manipulation and missingness plotting:

library(seqinr)
library(gplots)
library(RColorBrewer)
library(graphics)

For the population structure visualisation in pophelper:

library(Cairo)
library(ggplot2)
library(gridExtra)
library(gtable)
library(tidyr)
library(pophelper)

For the population genetics statistics:

library(adegenet)
library(pegas)
library(poppr)
library(hierfstat)

FILE FORMATS

FASTA (seqinr)

With the package seqinr you can read in a fasta file. Below you see an example where I used an aligned fasta file with triplicated samples (library in triplicate, treated as separate samples during assembly). I want to know whether these sequences have differences between replicates of individuals, or whether differences could be attritbuted to missing data.

First we read in the data:

original<-read.fasta("RAD_Bufo_replicated_individuals.fasta", as.string=TRUE, set.attributes = FALSE)

Now I am making a for loop to look at differences caused by missing data (- and n), or by alternative genotype calling (c, t, g, a or IUPAC codes).

First I make empty matrices to fill during the for loop.

Out_Perc_diff<-matrix(nrow=length(original), ncol=0)
Out_Perc_no_worries<-matrix(nrow=length(original), ncol=0)

Then I loop trough the fasta file.

for (i in 1:length(original)){
  #Make empty vectors for second loop
  col_diff<-c()
  col_no_worries<-c()
  Seq_1<-getSequence(original[[i]])
  
  for (j in 1:length(original)){
    #Get raw percentage of difference between sequences, includes ALL differences
    Seq_2<-getSequence(original[[j]])
    Same<-sum(Seq_1==Seq_2, na.rm=TRUE)          #All positions which are equal
    Diff<-sum(!Seq_1==Seq_2, na.rm=TRUE)         #All positions which are different
    stopifnot((Same+Diff)==length(Seq_1))        #Check
  
    Perc_diff<-(Diff/length(Seq_1))
    
    #Get percentage of difference because of N
    Seq_1_N<-Seq_1=="n"
    Seq_2_N<-Seq_2=="n"
    
    #Get percentage of difference because of -
    Seq_1_I<-Seq_1=="-"
    Seq_2_I<-Seq_2=="-"
    
    #Get the exceptions, the differences we are not afraid of (missing data: N and -)
    #The - (indel) in paired end is very common when not sequencing the whole selected fragment
    Diff_no_w_1<-sum(!Seq_1_N==Seq_2_N, na.rm=TRUE)
    Diff_no_w_2<-sum(!Seq_1_I==Seq_2_I, na.rm=TRUE)
    Diff_no_w_3<-sum(Seq_1_N==TRUE & Seq_1_N==Seq_2_I | Seq_1_I==TRUE & Seq_1_I==Seq_2_N, na.rm=TRUE)
    
    Perc_no_worries<-(((Diff_no_w_1+Diff_no_w_2)-Diff_no_w_3)/length(Seq_1))
    
    #Add the numbers to make a column
    col_diff<-c(col_diff, Perc_diff)
    col_no_worries<-c(col_no_worries, Perc_no_worries)
  }
  #Add the columns to the matrix
  Out_Perc_diff<-cbind(Out_Perc_diff, col_diff)
  Out_Perc_no_worries<-cbind(Out_Perc_no_worries, col_no_worries)
}

Then I add rownames to the matrices, mainly to be able to produce figures.

rownames(Out_Perc_diff)<-names(original)
colnames(Out_Perc_diff)<-names(original)
rownames(Out_Perc_no_worries)<-names(original)
colnames(Out_Perc_no_worries)<-names(original)

Then I calculate what we came for: the actual number of ‘wrongly’ called SNPs!

Diff_worry<-Out_Perc_diff-Out_Perc_no_worries

I order them in an order that I prefer, as I want to use this for visualisation.

All_diff<-c(Diff_worry['1572_A','1572_B'], Diff_worry['1572_B','1572_C'],Diff_worry['1572_A','1572_C'],
  Diff_worry['1573_A','1573_B'], Diff_worry['1573_B','1573_C'],Diff_worry['1573_A','1573_C'],
  Diff_worry['1574_A','1574_B'], Diff_worry['1574_B','1574_C'],Diff_worry['1574_A','1574_C'],
  Diff_worry['4112_A','4112_B'], Diff_worry['4112_B','4112_C'],Diff_worry['4112_A','4112_C'],
  Diff_worry['4113_A','4113_B'], Diff_worry['4113_B','4113_C'],Diff_worry['4113_A','4113_C'],
  Diff_worry['4114_A','4114_B'], Diff_worry['4114_B','4114_C'],Diff_worry['4114_A','4114_C']
  )

And then I take the mean of all the differences that worried me!

mean(All_diff)
## [1] 0.004901468

Good, we can see that only a very small portion (<1%) of the differences between the replicates of individuals is due to an actually different nucleotide.

Now I want to visualise this with “heat tables” to compare between the influence of real mistakes vs missing data.

First, I made a custom palette to colour my “heat table”.

my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)

And I define the color breaks manually for a “skewed” color transition, because I know some values are not included.

col_breaks = c(seq(0,.3,length=100),                 # for green
               seq(0.3001,0.6,length=100),            # for yellow
               seq(.6001,.9,length=100))             # for red

And I make Figure 1: to only see the influence of missing data. First I need to set the dimensions of the “image” I am creating.

dim1 <- ncol(Out_Perc_no_worries)
dim2 <- ncol(Diff_worry)

Then I make the actual image, give it axes, and fill it with the numbers I calculated above.

image(1:dim1, 1:dim1, Out_Perc_no_worries, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
  axis(1, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=3)+
  axis(2, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=1)+
  text(expand.grid(1:dim1, 1:dim1), sprintf("%0.2f", Out_Perc_no_worries), cex=0.54)

## numeric(0)

Question

Now we also want to see this figure for the “really concerning” differences between the sequences.

image(1:dim2, 1:dim2, Diff_worry, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
  axis(1, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=3)+
  axis(2, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=1)+
  text(expand.grid(1:dim2, 1:dim2), sprintf("%0.2f", Diff_worry), cex=0.54)

## numeric(0)

Cool! Compared to figure 1, figure 2 shows really less differences between the replicates between individuals.

There are other packages available, which you can use to explore your data, also for other file formats. We are not going trough all of them, but here are a few examples of packages you can use.

VCF

vcfR

To read in and manipulate a vcf file, one can use the R package vcfR: https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html

BAM/SAM

scanBam

To import, count, index, filter, sort, and merge BAM files: https://rdrr.io/bioc/Rsamtools/man/scanBam.html

Rsamtools

To read in and manipulate a BAM/SAM file, one can use the R package Rsamtools: https://kasperdanielhansen.github.io/genbioconductor/html/Rsamtools.html

Data filtering

It is nice to be able to filter your data manually. This allows you to understand better what you are working with. Keep in mind that with large SNP data this is probably not possible. We here will only discuss what to do with smaller datasets. For larger datasets we may set up another session!

Examples of SNPs that you might want to filter out are singletons. Or you might want to work with only the di-allelic SNPs in your dataset, because many softwares only utilise di-allelic SNPs. In my case, I wanted to see which SNPs were present as one variant in one of my species, and as another in another species. With these SNPs I would be able to analyse the hybrids between both species. In the next code, I filtered my data for such ‘species specific’ SNPs based on my assumed ‘pure’ populations.

First I read in my SNPs.

SNPtable<-as.data.frame(read.table("RAD_Bufo_SNPtable.txt", header=TRUE))
knitr::kable(SNPtable[1:5,1:5])
sample Pop rad133_snp0 rad462_snp0 rad462_snp1
1572_A_S25_A 1 NA NA NA NA
1572_A_S25_B 1 NA NA NA NA
1573_A_S26_A 1 NA NA NA NA
1573_A_S26_B 1 NA NA NA NA
1574_A_S27_A 1 NA NA NA NA

Then I remove columns where one species only has NA (possibly null aleles). I used the maximum amount of individuals with “NA” by using “<=54” and “<=30”.

SNPtable_1<-SNPtable[,colSums(is.na(SNPtable[1:74,])) <= 54]
SNPtable_2<-SNPtable_1[,colSums(is.na(SNPtable_1[75:114,])) <= 30]

Then I check my dimensions.

nrow(SNPtable_2)
## [1] 114
ncol(SNPtable_2)
## [1] 583

And I split the table up in two tables for each species that I have: B_bufo and B_spino.

B_bufo<-SNPtable_2[1:74,]
B_spino<-SNPtable_2[75:114,]

One final check, are the first columns the right populations?

knitr::kable(unique(B_bufo[,1]))
x
1
2
3
4
5
knitr::kable(unique(B_spino[,1]))
x
18
19
20
21
22

Now I can start comparing the two tables

Again I prepare empty matrices, but I do not know how many SNPs of each type I will get.

Not_diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)
Not_diagnostic<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)

And now I loop through all SNPs (columns) in this table:

for (i in 3:ncol(SNPtable_2)) {
  x_tot<-B_bufo[,i]
  y_tot<-B_spino[,i]
  #compare the two sets of numbers, returns a block with true/false 
  ans<-vapply(x_tot, function(x_tot) x_tot==y_tot, logical(length(y_tot)))
  #if true occurs save in table with not_diagnostic markers
  if(any(ans, na.rm=TRUE)|(all(is.na(y_tot))|all(is.na(x_tot)))){
    Not_diagnostic<-cbind(Not_diagnostic_old, SNPtable_2[,i, drop=FALSE])
    #if only false or NA save in table with diagnostic markers
  } else {
    Diagnostic<-cbind(Diagnostic_old, SNPtable_2[,i, drop=FALSE])
  }
  Not_diagnostic_old<-Not_diagnostic
  Diagnostic_old<-Diagnostic
}

Awesome, now I want to summarize and check if all went well!

Tot_original<-ncol(SNPtable)-2
Tot_original_B_bufo<-ncol(B_bufo)-2
Tot_original_B_spino<-ncol(B_spino)-2
Tot_not_diagnostic<-ncol(Not_diagnostic_old)
Tot_diagnostic<-ncol(Diagnostic_old)
Tot_not_diagnostic+Tot_diagnostic
## [1] 581
Perc_diagnostic<-Tot_diagnostic/Tot_original*100

And finally I can write the diagnostic snps to a file.

write.table(Diagnostic_old, file="RAD_Bufo_Diagnostic.txt", quote=FALSE, sep=" ", col.names=TRUE)

Questions

Challenge

Write a code that takes one SNP per RAD locus randomly.

Structure

You can run structure-like analysis in R (includes some nice mapping options): http://membres-timc.imag.fr/Olivier.Francois/tutoRstructure.pdf

Or: https://github.com/Tom-Jenkins/admixture_pie_chart_map_tutorial/blob/master/pie_chart_admixture_map_tutorial.R

I love the possibilities with visualisation with Pophelper, but other packages are available.

First, read in some data. These are the q scores for the nuclear structure run, and frequencies for mtDNA.

Struct_table<-read.table("Structure_31SNPs_qscores_nuclear.txt", header = TRUE)
Struct_table_mtDNA<-read.table("Structure_31SNPs_qscores_mtDNA.txt", header = TRUE)

Now make the data structure appropriate for PopHelper, for each structure run you want to visualise, you need to make a list with the clusters assigned. For example, if want to visualise k=3 for your data, you would need to extend the list with “Cluster3” and the values.

Q1<-list(structure_1=data.frame(Cluster1=c(Struct_table[,3]),
                                Cluster2=c(Struct_table[,4])))
Q2<-list(mtDNA=data.frame(Cluster1=c(Struct_table_mtDNA[,3]),
                          Cluster2=c(Struct_table_mtDNA[,4])))

Then I make the population labels appropriate for PopHelper.

Pop_label<-Struct_table[,1, drop = FALSE]
Pop_label$Pop <- as.character(Pop_label$Pop)

And I check whether this now includes all my populations I want to visualise.

nrow(Pop_label)
## [1] 306

And the master-piece plot! It is only one line of code!! b(-)d

Look at the many options available for your population structure plotting in R with the PopHelper package here: http://www.royfrancis.com/pophelper/articles/index.html#plotq

Population genetic analyses in adegenet

adegenet

Read in the data & replace missing data with the mean frequency (there are other methods available). Other file formats are possible, but here we use a structure file.

First, we read in the data.

And I check the populations in this new “obj1” to see if I have all my data now.

levels(obj1$pop)
##  [1] "T1_16" "T1_13" "T1_12" "T1_11" "T1_10" "T1_8"  "T1_14" "T1_1"  "T1_2" 
## [10] "T1_5"  "T1_3"  "T1_6"  "T1_9"  "T1_20" "T1_19" "T1_21" "T1_22" "T1_18"
## [19] "T1_4"  "T1_7"  "T1_15" "T2_17" "T2_16" "T2_6"  "T2_12" "T2_11" "T2_10"
## [28] "T2_8"  "T2_7"  "T2_9"  "T2_13" "T2_14" "T2_15"

There is an option to replace missing data with the mean frequency.

NA_mean<-tab(obj1, NA.method="mean")

Principal Component Analysis

We can perform a PCA with adegenet.

pca1<-dudi.pca(NA_mean, scannf=F, scale=F,nf=3)

And we make a nice plot.

s.class(pca1$li, pop(obj1), axesel=FALSE, cstar=1, cpoint=1, col=c(rep("black", 7),
                                                                   rep("blue", 4), 
                                                                   rep("black", 2), 
                                                                   rep("red", 5), "blue", 
                                                                   "black", "black",
                                                                   rep("black", 12)))

This looks good.

We can also ask for summary of this PCA.

summary(pca1)
## Class: pca dudi
## Call: dudi.pca(df = NA_mean, scale = F, scannf = F, nf = 3)
## 
## Total inertia: 1244
## 
## Eigenvalues:
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  334.01   34.58   16.21   14.51   12.98 
## 
## Projected inertia (%):
##     Ax1     Ax2     Ax3     Ax4     Ax5 
##  26.841   2.779   1.302   1.166   1.043 
## 
## Cumulative projected inertia (%):
##     Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
##   26.84   29.62   30.92   32.09   33.13 
## 
## (Only 5 dimensions (out of 372) are shown)

Question

Summaries

You can get basic information on your data using summaries, which returns both a visible output and an invisible list. This is the summary for population 11 in transect 1.

summary(obj1[obj1$pop=="T1_11",1:100])
## 
## // Number of individuals: 15
## // Group sizes: 15
## // Number of alleles per locus: 0 1 1 1 1 1 2 1 1 1 2 0 1 1 1 0 1 2 1 1 1 1 2 3 2 2 1 1 1 1 1 1 1 1 2 0 1 0 0 2 2 1 1
## // Number of alleles per group: 48
## // Percentage of missing data: 46.67 %
## // Observed heterozygosity: NaN 0 0 0 0 0 0.4 0 0 0 0 NaN 0 0 0 NaN 0 0 0 0 0 0 0.25 0 0.1 0.25 0 0 0 0 0 0 0 0 0.33 NaN 0 NaN NaN 0.1 0 0 0.71
## // Expected heterozygosity: 1 0 0 0 0 0 0.39 0 0 0 0.5 1 0 0 0 1 0 0.5 0 0 0 0 0.33 0.64 0.1 0.22 0 0 0 0 0 0 0 0 0.49 1 0 1 1 0.1 0.5 0 0

Hardy Weinberg equilibrium

You can check whether your data is out of Hardy Weinberg equilibrium with adegenet, too.

For example, we can do the Hardy Weinberg test without the Monte Carlo test, for population 11.

Test_pop_T1_11<-hw.test(obj1[obj1@pop=="T1_11",1:100], B=0)

And we can check the output.

knitr::kable(Test_pop_T1_11[1:10,])
chi^2 df Pr(chi^2 >)
L0001 NA NA NA
L0002 0.0000000 0 1.0000000
L0003 0.0000000 0 1.0000000
L0004 0.0000000 0 1.0000000
L0005 0.0000000 0 1.0000000
L0006 0.0000000 0 1.0000000
L0007 0.0077479 1 0.9298589
L0008 0.0000000 0 1.0000000
L0009 0.0000000 0 1.0000000
L0010 0.0000000 0 1.0000000

Questions

F-statistics

You can also look for population structure with F-statistics. You can calculate Fst (), pop/totalFit (ind/total), and Fis (ind/pop) by locus.

First select only a couple of the populations that we know are diverse (hybrid), this will also make the F-statistics analysis run faster. A good package to know is poppr.

obj2<-poppr::popsub(obj1, sublist = c("T1_11", "T1_12", "T1_13", "T1_14"))

Then we only want to use a couple of loci.

obj3<-obj2[loc=c(1:10)]

And we need to convert this to a loci object.

loc1<-as.loci(obj3)

Now our loc1 object has 10 loci for 4 populations.

length(loc1)
## [1] 11
length(unique(loc1$population))
## [1] 4

And we are able to calculate the different types of F-statistics for each of our loci

Fst(loc1, pop=loc1$population)
##             Fit         Fst        Fis
## L0002       NaN         NaN        NaN
## L0003       NaN         NaN        NaN
## L0004 1.0000000  0.03554970  1.0000000
## L0005       NaN         NaN        NaN
## L0006 0.4795871 -0.03983449  0.4995234
## L0007 0.1533101  0.24066700 -0.1150442
## L0008       NaN         NaN        NaN
## L0009       NaN         NaN        NaN
## L0010       NaN         NaN        NaN
## L0011 1.0000000  0.02800735  1.0000000

And we can caluclate the pairwise Fst for each population (needs a gene object again):

pairwise.fst(obj3, pop=obj3$pop, res.type = "matrix")
##            T1_13      T1_12      T1_11     T1_14
## T1_13 0.00000000 0.15914451 0.07219285 0.2290108
## T1_12 0.15914451 0.00000000 0.07030307 0.2234629
## T1_11 0.07219285 0.07030307 0.00000000 0.2988845
## T1_14 0.22901078 0.22346290 0.29888448 0.0000000

Inbreeding

We can check for inbreeding, which is a real common problem in amphibian populations:

temp <- inbreeding(obj3, N=10)

Then we can look for the results for the first individual in our dataset (1480_S104).

knitr::kable(temp$`1480_S104`)
x
0.6805103
0.7209798
0.4365282
0.3660192
0.5465846
0.5763880
0.9131494
0.8467496
0.2283945
0.6174487

We can then take the mean for each individual.

MeanFst <- sapply(temp, mean)

And plot it.

hist(MeanFst)

Adegenet has many other options:

DAPC & Compoplot With adegenet: http://adegenet.r-forge.r-project.org/files/tutorial-dapc.pdf

THE END HAPPY CODING!!