Install the following packages.
For the sequence data manipulation and missingness plotting:
library(seqinr)
library(gplots)
library(RColorBrewer)
library(graphics)
For the population structure visualisation in pophelper:
library(Cairo)
library(ggplot2)
library(gridExtra)
library(gtable)
library(tidyr)
library(pophelper)
For the population genetics statistics:
library(adegenet)
library(pegas)
library(poppr)
library(hierfstat)
seqinr
)With the package seqinr
you can read in a fasta file. Below you see an example where I used an aligned fasta file with triplicated samples (library in triplicate, treated as separate samples during assembly). I want to know whether these sequences have differences between replicates of individuals, or whether differences could be attritbuted to missing data.
First we read in the data:
original<-read.fasta("RAD_Bufo_replicated_individuals.fasta", as.string=TRUE, set.attributes = FALSE)
Now I am making a for loop to look at differences caused by missing data (- and n), or by alternative genotype calling (c, t, g, a or IUPAC codes).
First I make empty matrices to fill during the for loop.
Out_Perc_diff<-matrix(nrow=length(original), ncol=0)
Out_Perc_no_worries<-matrix(nrow=length(original), ncol=0)
Then I loop trough the fasta file.
for (i in 1:length(original)){
#Make empty vectors for second loop
col_diff<-c()
col_no_worries<-c()
Seq_1<-getSequence(original[[i]])
for (j in 1:length(original)){
#Get raw percentage of difference between sequences, includes ALL differences
Seq_2<-getSequence(original[[j]])
Same<-sum(Seq_1==Seq_2, na.rm=TRUE) #All positions which are equal
Diff<-sum(!Seq_1==Seq_2, na.rm=TRUE) #All positions which are different
stopifnot((Same+Diff)==length(Seq_1)) #Check
Perc_diff<-(Diff/length(Seq_1))
#Get percentage of difference because of N
Seq_1_N<-Seq_1=="n"
Seq_2_N<-Seq_2=="n"
#Get percentage of difference because of -
Seq_1_I<-Seq_1=="-"
Seq_2_I<-Seq_2=="-"
#Get the exceptions, the differences we are not afraid of (missing data: N and -)
#The - (indel) in paired end is very common when not sequencing the whole selected fragment
Diff_no_w_1<-sum(!Seq_1_N==Seq_2_N, na.rm=TRUE)
Diff_no_w_2<-sum(!Seq_1_I==Seq_2_I, na.rm=TRUE)
Diff_no_w_3<-sum(Seq_1_N==TRUE & Seq_1_N==Seq_2_I | Seq_1_I==TRUE & Seq_1_I==Seq_2_N, na.rm=TRUE)
Perc_no_worries<-(((Diff_no_w_1+Diff_no_w_2)-Diff_no_w_3)/length(Seq_1))
#Add the numbers to make a column
col_diff<-c(col_diff, Perc_diff)
col_no_worries<-c(col_no_worries, Perc_no_worries)
}
#Add the columns to the matrix
Out_Perc_diff<-cbind(Out_Perc_diff, col_diff)
Out_Perc_no_worries<-cbind(Out_Perc_no_worries, col_no_worries)
}
Then I add rownames to the matrices, mainly to be able to produce figures.
rownames(Out_Perc_diff)<-names(original)
colnames(Out_Perc_diff)<-names(original)
rownames(Out_Perc_no_worries)<-names(original)
colnames(Out_Perc_no_worries)<-names(original)
Then I calculate what we came for: the actual number of ‘wrongly’ called SNPs!
Diff_worry<-Out_Perc_diff-Out_Perc_no_worries
I order them in an order that I prefer, as I want to use this for visualisation.
All_diff<-c(Diff_worry['1572_A','1572_B'], Diff_worry['1572_B','1572_C'],Diff_worry['1572_A','1572_C'],
Diff_worry['1573_A','1573_B'], Diff_worry['1573_B','1573_C'],Diff_worry['1573_A','1573_C'],
Diff_worry['1574_A','1574_B'], Diff_worry['1574_B','1574_C'],Diff_worry['1574_A','1574_C'],
Diff_worry['4112_A','4112_B'], Diff_worry['4112_B','4112_C'],Diff_worry['4112_A','4112_C'],
Diff_worry['4113_A','4113_B'], Diff_worry['4113_B','4113_C'],Diff_worry['4113_A','4113_C'],
Diff_worry['4114_A','4114_B'], Diff_worry['4114_B','4114_C'],Diff_worry['4114_A','4114_C']
)
And then I take the mean of all the differences that worried me!
mean(All_diff)
## [1] 0.004901468
Good, we can see that only a very small portion (<1%) of the differences between the replicates of individuals is due to an actually different nucleotide.
Now I want to visualise this with “heat tables” to compare between the influence of real mistakes vs missing data.
First, I made a custom palette to colour my “heat table”.
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
And I define the color breaks manually for a “skewed” color transition, because I know some values are not included.
col_breaks = c(seq(0,.3,length=100), # for green
seq(0.3001,0.6,length=100), # for yellow
seq(.6001,.9,length=100)) # for red
And I make Figure 1: to only see the influence of missing data. First I need to set the dimensions of the “image” I am creating.
dim1 <- ncol(Out_Perc_no_worries)
dim2 <- ncol(Diff_worry)
Then I make the actual image, give it axes, and fill it with the numbers I calculated above.
image(1:dim1, 1:dim1, Out_Perc_no_worries, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
axis(1, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=3)+
axis(2, 1:dim1, colnames(Out_Perc_no_worries), cex.axis = 0.7, las=1)+
text(expand.grid(1:dim1, 1:dim1), sprintf("%0.2f", Out_Perc_no_worries), cex=0.54)
## numeric(0)
Now we also want to see this figure for the “really concerning” differences between the sequences.
image(1:dim2, 1:dim2, Diff_worry, axes = FALSE, xlab="", ylab="", col=my_palette, breaks=col_breaks)+
axis(1, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=3)+
axis(2, 1:dim2, colnames(Diff_worry), cex.axis = 0.7, las=1)+
text(expand.grid(1:dim2, 1:dim2), sprintf("%0.2f", Diff_worry), cex=0.54)
## numeric(0)
Cool! Compared to figure 1, figure 2 shows really less differences between the replicates between individuals.
There are other packages available, which you can use to explore your data, also for other file formats. We are not going trough all of them, but here are a few examples of packages you can use.
vcfR
To read in and manipulate a vcf file, one can use the R package vcfR: https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html
scanBam
To import, count, index, filter, sort, and merge BAM files: https://rdrr.io/bioc/Rsamtools/man/scanBam.html
Rsamtools
To read in and manipulate a BAM/SAM file, one can use the R package Rsamtools: https://kasperdanielhansen.github.io/genbioconductor/html/Rsamtools.html
It is nice to be able to filter your data manually. This allows you to understand better what you are working with. Keep in mind that with large SNP data this is probably not possible. We here will only discuss what to do with smaller datasets. For larger datasets we may set up another session!
Examples of SNPs that you might want to filter out are singletons. Or you might want to work with only the di-allelic SNPs in your dataset, because many softwares only utilise di-allelic SNPs. In my case, I wanted to see which SNPs were present as one variant in one of my species, and as another in another species. With these SNPs I would be able to analyse the hybrids between both species. In the next code, I filtered my data for such ‘species specific’ SNPs based on my assumed ‘pure’ populations.
First I read in my SNPs.
SNPtable<-as.data.frame(read.table("RAD_Bufo_SNPtable.txt", header=TRUE))
knitr::kable(SNPtable[1:5,1:5])
sample | Pop | rad133_snp0 | rad462_snp0 | rad462_snp1 | |
---|---|---|---|---|---|
1572_A_S25_A | 1 | NA | NA | NA | NA |
1572_A_S25_B | 1 | NA | NA | NA | NA |
1573_A_S26_A | 1 | NA | NA | NA | NA |
1573_A_S26_B | 1 | NA | NA | NA | NA |
1574_A_S27_A | 1 | NA | NA | NA | NA |
Then I remove columns where one species only has NA (possibly null aleles). I used the maximum amount of individuals with “NA” by using “<=54” and “<=30”.
SNPtable_1<-SNPtable[,colSums(is.na(SNPtable[1:74,])) <= 54]
SNPtable_2<-SNPtable_1[,colSums(is.na(SNPtable_1[75:114,])) <= 30]
Then I check my dimensions.
nrow(SNPtable_2)
## [1] 114
ncol(SNPtable_2)
## [1] 583
And I split the table up in two tables for each species that I have: B_bufo and B_spino.
B_bufo<-SNPtable_2[1:74,]
B_spino<-SNPtable_2[75:114,]
One final check, are the first columns the right populations?
knitr::kable(unique(B_bufo[,1]))
x |
---|
1 |
2 |
3 |
4 |
5 |
knitr::kable(unique(B_spino[,1]))
x |
---|
18 |
19 |
20 |
21 |
22 |
Now I can start comparing the two tables
Again I prepare empty matrices, but I do not know how many SNPs of each type I will get.
Not_diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic_old<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)
Not_diagnostic<-matrix(NA,nrow=nrow(SNPtable_2), ncol=0)
Diagnostic<-matrix(NA,nrow=nrow(SNPtable_2),ncol=0)
And now I loop through all SNPs (columns) in this table:
for (i in 3:ncol(SNPtable_2)) {
x_tot<-B_bufo[,i]
y_tot<-B_spino[,i]
#compare the two sets of numbers, returns a block with true/false
ans<-vapply(x_tot, function(x_tot) x_tot==y_tot, logical(length(y_tot)))
#if true occurs save in table with not_diagnostic markers
if(any(ans, na.rm=TRUE)|(all(is.na(y_tot))|all(is.na(x_tot)))){
Not_diagnostic<-cbind(Not_diagnostic_old, SNPtable_2[,i, drop=FALSE])
#if only false or NA save in table with diagnostic markers
} else {
Diagnostic<-cbind(Diagnostic_old, SNPtable_2[,i, drop=FALSE])
}
Not_diagnostic_old<-Not_diagnostic
Diagnostic_old<-Diagnostic
}
Awesome, now I want to summarize and check if all went well!
Tot_original<-ncol(SNPtable)-2
Tot_original_B_bufo<-ncol(B_bufo)-2
Tot_original_B_spino<-ncol(B_spino)-2
Tot_not_diagnostic<-ncol(Not_diagnostic_old)
Tot_diagnostic<-ncol(Diagnostic_old)
Tot_not_diagnostic+Tot_diagnostic
## [1] 581
Perc_diagnostic<-Tot_diagnostic/Tot_original*100
And finally I can write the diagnostic snps to a file.
write.table(Diagnostic_old, file="RAD_Bufo_Diagnostic.txt", quote=FALSE, sep=" ", col.names=TRUE)
How many diagnostic markers do I have?
How many non-diagnostic markers do I have?
Does it add up to the total of markers in the input file?
Where did the other markers go?
Write a code that takes one SNP per RAD locus randomly.
You can run structure-like analysis in R (includes some nice mapping options): http://membres-timc.imag.fr/Olivier.Francois/tutoRstructure.pdf
I love the possibilities with visualisation with Pophelper, but other packages are available.
First, read in some data. These are the q scores for the nuclear structure run, and frequencies for mtDNA.
Struct_table<-read.table("Structure_31SNPs_qscores_nuclear.txt", header = TRUE)
Struct_table_mtDNA<-read.table("Structure_31SNPs_qscores_mtDNA.txt", header = TRUE)
Now make the data structure appropriate for PopHelper, for each structure run you want to visualise, you need to make a list with the clusters assigned. For example, if want to visualise k=3 for your data, you would need to extend the list with “Cluster3” and the values.
Q1<-list(structure_1=data.frame(Cluster1=c(Struct_table[,3]),
Cluster2=c(Struct_table[,4])))
Q2<-list(mtDNA=data.frame(Cluster1=c(Struct_table_mtDNA[,3]),
Cluster2=c(Struct_table_mtDNA[,4])))
Then I make the population labels appropriate for PopHelper.
Pop_label<-Struct_table[,1, drop = FALSE]
Pop_label$Pop <- as.character(Pop_label$Pop)
And I check whether this now includes all my populations I want to visualise.
nrow(Pop_label)
## [1] 306
And the master-piece plot! It is only one line of code!! b(-)d
Look at the many options available for your population structure plotting in R with the PopHelper package here: http://www.royfrancis.com/pophelper/articles/index.html#plotq
adegenet
adegenet
Read in the data & replace missing data with the mean frequency (there are other methods available). Other file formats are possible, but here we use a structure file.
First, we read in the data.
And I check the populations in this new “obj1” to see if I have all my data now.
levels(obj1$pop)
## [1] "T1_16" "T1_13" "T1_12" "T1_11" "T1_10" "T1_8" "T1_14" "T1_1" "T1_2"
## [10] "T1_5" "T1_3" "T1_6" "T1_9" "T1_20" "T1_19" "T1_21" "T1_22" "T1_18"
## [19] "T1_4" "T1_7" "T1_15" "T2_17" "T2_16" "T2_6" "T2_12" "T2_11" "T2_10"
## [28] "T2_8" "T2_7" "T2_9" "T2_13" "T2_14" "T2_15"
There is an option to replace missing data with the mean frequency.
NA_mean<-tab(obj1, NA.method="mean")
We can perform a PCA with adegenet.
pca1<-dudi.pca(NA_mean, scannf=F, scale=F,nf=3)
And we make a nice plot.
s.class(pca1$li, pop(obj1), axesel=FALSE, cstar=1, cpoint=1, col=c(rep("black", 7),
rep("blue", 4),
rep("black", 2),
rep("red", 5), "blue",
"black", "black",
rep("black", 12)))
This looks good.
We can also ask for summary of this PCA.
summary(pca1)
## Class: pca dudi
## Call: dudi.pca(df = NA_mean, scale = F, scannf = F, nf = 3)
##
## Total inertia: 1244
##
## Eigenvalues:
## Ax1 Ax2 Ax3 Ax4 Ax5
## 334.01 34.58 16.21 14.51 12.98
##
## Projected inertia (%):
## Ax1 Ax2 Ax3 Ax4 Ax5
## 26.841 2.779 1.302 1.166 1.043
##
## Cumulative projected inertia (%):
## Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
## 26.84 29.62 30.92 32.09 33.13
##
## (Only 5 dimensions (out of 372) are shown)
You can get basic information on your data using summaries, which returns both a visible output and an invisible list. This is the summary for population 11 in transect 1.
summary(obj1[obj1$pop=="T1_11",1:100])
##
## // Number of individuals: 15
## // Group sizes: 15
## // Number of alleles per locus: 0 1 1 1 1 1 2 1 1 1 2 0 1 1 1 0 1 2 1 1 1 1 2 3 2 2 1 1 1 1 1 1 1 1 2 0 1 0 0 2 2 1 1
## // Number of alleles per group: 48
## // Percentage of missing data: 46.67 %
## // Observed heterozygosity: NaN 0 0 0 0 0 0.4 0 0 0 0 NaN 0 0 0 NaN 0 0 0 0 0 0 0.25 0 0.1 0.25 0 0 0 0 0 0 0 0 0.33 NaN 0 NaN NaN 0.1 0 0 0.71
## // Expected heterozygosity: 1 0 0 0 0 0 0.39 0 0 0 0.5 1 0 0 0 1 0 0.5 0 0 0 0 0.33 0.64 0.1 0.22 0 0 0 0 0 0 0 0 0.49 1 0 1 1 0.1 0.5 0 0
You can check whether your data is out of Hardy Weinberg equilibrium with adegenet, too.
For example, we can do the Hardy Weinberg test without the Monte Carlo test, for population 11.
Test_pop_T1_11<-hw.test(obj1[obj1@pop=="T1_11",1:100], B=0)
And we can check the output.
knitr::kable(Test_pop_T1_11[1:10,])
chi^2 | df | Pr(chi^2 >) | |
---|---|---|---|
L0001 | NA | NA | NA |
L0002 | 0.0000000 | 0 | 1.0000000 |
L0003 | 0.0000000 | 0 | 1.0000000 |
L0004 | 0.0000000 | 0 | 1.0000000 |
L0005 | 0.0000000 | 0 | 1.0000000 |
L0006 | 0.0000000 | 0 | 1.0000000 |
L0007 | 0.0077479 | 1 | 0.9298589 |
L0008 | 0.0000000 | 0 | 1.0000000 |
L0009 | 0.0000000 | 0 | 1.0000000 |
L0010 | 0.0000000 | 0 | 1.0000000 |
Why do we have rows with no output?
Why do we have only 43 loci?
You can also look for population structure with F-statistics. You can calculate Fst (), pop/totalFit (ind/total), and Fis (ind/pop) by locus.
First select only a couple of the populations that we know are diverse (hybrid), this will also make the F-statistics analysis run faster. A good package to know is poppr.
obj2<-poppr::popsub(obj1, sublist = c("T1_11", "T1_12", "T1_13", "T1_14"))
Then we only want to use a couple of loci.
obj3<-obj2[loc=c(1:10)]
And we need to convert this to a loci object.
loc1<-as.loci(obj3)
Now our loc1 object has 10 loci for 4 populations.
length(loc1)
## [1] 11
length(unique(loc1$population))
## [1] 4
And we are able to calculate the different types of F-statistics for each of our loci
Fst(loc1, pop=loc1$population)
## Fit Fst Fis
## L0002 NaN NaN NaN
## L0003 NaN NaN NaN
## L0004 1.0000000 0.03554970 1.0000000
## L0005 NaN NaN NaN
## L0006 0.4795871 -0.03983449 0.4995234
## L0007 0.1533101 0.24066700 -0.1150442
## L0008 NaN NaN NaN
## L0009 NaN NaN NaN
## L0010 NaN NaN NaN
## L0011 1.0000000 0.02800735 1.0000000
And we can caluclate the pairwise Fst for each population (needs a gene object again):
pairwise.fst(obj3, pop=obj3$pop, res.type = "matrix")
## T1_13 T1_12 T1_11 T1_14
## T1_13 0.00000000 0.15914451 0.07219285 0.2290108
## T1_12 0.15914451 0.00000000 0.07030307 0.2234629
## T1_11 0.07219285 0.07030307 0.00000000 0.2988845
## T1_14 0.22901078 0.22346290 0.29888448 0.0000000
We can check for inbreeding, which is a real common problem in amphibian populations:
temp <- inbreeding(obj3, N=10)
Then we can look for the results for the first individual in our dataset (1480_S104).
knitr::kable(temp$`1480_S104`)
x |
---|
0.6805103 |
0.7209798 |
0.4365282 |
0.3660192 |
0.5465846 |
0.5763880 |
0.9131494 |
0.8467496 |
0.2283945 |
0.6174487 |
We can then take the mean for each individual.
MeanFst <- sapply(temp, mean)
And plot it.
hist(MeanFst)
Adegenet has many other options:
DAPC & Compoplot With adegenet: http://adegenet.r-forge.r-project.org/files/tutorial-dapc.pdf