added twisst and GEM analysis

0de79b16 · Kristian Ullrich · 0d17cfda · 0de79b16 · 0de79b16
Commit 0de79b16 authored 7 years ago by Kristian Ullrich
--- a/README.md
+++ b/README.md
@@ -60,13 +60,13 @@ _used software:_

 ## Construction of consensus sequences for natural populations

-To construct population specific consensus sequences, we used ANGSD ([Korneliussen et al. 2014](http://www.biomedcentral.com/1471-2105/15/356/abstract)) and the 'doFasta 2' option.
+To construct population specific consensus sequences, we used 'ANGSD' ([Korneliussen et al. 2014](http://www.biomedcentral.com/1471-2105/15/356/abstract)) and the 'doFasta 2' option.

 ##### CONSENSUS FASTA

 In addition to the 'doFasta 2' optin, which extracts the most common nucleotide, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.

-CONSENSUS FASTA example for the population _Mus musculus musculus AFG_:
+CONSENSUS FASTA example for the population _Mus musculus musculus AFG_
 ```
 #example for populaton Mmm_AFG for chr1

@@ -101,13 +101,13 @@ All CONSENSUS FASTA files can be obtained from:

 ## Construction of individual sequences using IUPAC code

-To construct individual sequences, we used ANGSD and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.
+To construct individual sequences, we used 'ANGSD' and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.

 ##### INDIVIDUAL FASTA

 In addition to the 'doFasta 4' optin, which extracts IUPAC code, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.

-CONSENSUS FASTA example for the population _Mus musculus musculus AFG_:
+CONSENSUS FASTA example for the population _Mus musculus musculus AFG_
 ```
 #example for one Mus musculus musculus individual - AFG1 - 396 for chr1

@@ -138,7 +138,7 @@ dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}}
 \end{equation}
 where $d_{XO_{w_{i}}} and $d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance ([Kimura 1980](https://www.ncbi.nlm.nih.gov/pubmed/7463489)) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window ($w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package ([Pages et al. 2009](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)).

-example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_:
+example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_
 ```
 #example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG

@@ -170,7 +170,7 @@ DISTMODEL <- "K80"
 #to run the script
 nohup nice R CMD BATCH --vanilla get_dK80.r &

-#result file will be in this case:
+#to check result file

 tail -f /tmp/Mmd_FRA_Mmd_GER.Mmd_IRA.Mmm_AFG.chr1.tsv
 ```
@@ -185,16 +185,68 @@ _used software:_

 + R version 3.4.1 (2017-06-30)
 + R package ape_4.1
-+ R package Biostrings_2.40.2
-+ R package S4Vectors_0.10.3
-+ R package XVector_0.12.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0
 + R package IRanges_2.6.1
-+ R package BiocGenerics_0.18.0
+ R package BiocGenerics_0.22.0
 + get_dK80.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_dK80.r>

 ## TWISST analysis

+Phylogenetic trees per 25kbp windows were calculated with the 'bionjs' function of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) and used as input files for 'twisst' ([Martin and Belleghem 2017](http://www.genetics.org/content/early/2017/03/21/genetics.116.194720)). The 'bionjs' function can reconstruct a phylogenetic tree from a distance matrix with possibly missing values. We developed an R package (https://github.com/kullrich/distIUPAC) to calculate pairwise 'literal distances' using IUPAC encoded nucleotide sequences. Briefly, only bi-allelic nucleotide states are considered for the pairwise distance calculation and all tri-nucleotide states, gaps or missing data are discarded. The distance score matrix applied is based on ([Chang et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/28819774)), whereby two homologous SNPs yield 0 if they were identical or identically heterozygous, 1 if different, and 0.5 if one of them was heterozygous. Windows with complete deletion sites (missing in all individuals) higher than 50\% were removed. The resulting pairwise distance matrix per window were used to calculate phylogenetic trees which were further used as input trees for 'twisst'.

+example for chromosome 3 for the tree calculations 
+```
+#example for chromosome 3
+
+#change the bottom part of the script 'get_bionj_trees_for_twisst.r' for each chromosome
+
+#here you can find the example for chromosome 3
+
+TMP_DIR <- "/tmp"
+
+SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/chr3.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
+chr <- "chr3"
+
+WSIZE <- 25000
+WJUMP <- 25000
+
+OUT_FILE <- paste0(TMP_DIR,"/",chr,".sw",WSIZE,".","wj",WJUMP,".newick.trees")
+
+#to run the script
+nohup nice R CMD BATCH --vanilla get_bionj_trees_for_twisst.r &
+
+#to check result file
+tail -f /tmp/chr3.sw25000.wj25000.newick.trees
+
+#to extract only the trees which have passed 50% missing complete deletion sites
+awk '{if($6!="NA") print $0}' chr3.sw25000.wj25000.newick.trees > chr3.sw25000.wj25000.newick.trees.noNA
+
+#to extract the trees for twisst
+awk '{print $6}' chr3.sw25000.wj25000.newick.trees.noNA | tail -n +2 > chr3.sw25000.wj25000.newick.trees.noNA.twisst
+
+#twisst analysis
+python twisst.py -t chr3.sw25000.wj25000.newick.trees.noNA.twisst -w chr3.sw25000.wj25000.newick.trees.noNA.twisst.weightsfile -D chr3.sw25000.wj25000.newick.trees.noNA.twisst.distfile -g SPRE SPRE_SP36,SPRE_SP39,SPRE_SP41,SPRE_SP51,SPRE_SP62,SPRE_SP68,SPRE_SP69,SPRE_SP70 -g CAS CAS_1,CAS_2,CAS_3,CAS_4,CAS_5,CAS_6,CAS_7,CAS_8,CAS_9,CAS_10 -g AFG AFG_396,AFG_413,AFG_416,AFG_424,AFG_435,AFG_444 -g IRA IRA_AH15,IRA_AH23,IRA_JR11,IRA_JR15,IRA_JR2_F1C,IRA_JR5_F1C,IRA_JR7_F1C,IRA_JR8_F1A -g GER GER_TP1,GER_TP121B,GER_TP17_2,GER_TP3_02,GER_TP4a,GER_TP51D,GER_TP7_10F1A2,GER_TP81B -g FRA FRA_14,FRA_15B,FRA_16B,FRA_18B,FRA_B2C,FRA_C1,FRA_E1,FRA_F1B --method complete
+```
+
+_used software:_
+
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.22.0
+ R package Rcpp_0.12.12
+ R package RcppParallel_4.3.20
+ R dev package distIUPAC_0.1.0
+ get_bionj_trees_for_twisst.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_bionj_trees_for_twisst.r>
+
+All twisst analysis files can be obtained from:
+
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/twisst/>

 ## Simulation

@@ -264,6 +316,24 @@ _used software:_

 FASTA sequence construction and dK80 calculation was performed as described above.

+## Mappability mm10 genome
+
+To construct a mappability track we used the software GEM ([Derrien et al. 2012](https://www.ncbi.nlm.nih.gov/pubmed/22276185)) with follofing options on the mm10 genome.
+
+K-MER LENGTH: 100
+APPROXIMATION THRESHOLD: 7
+MAX MISMATCHES: 4
+MAX ERRORS: 4
+MAX BIG INDEL LENGTH: 15
+MIN MATCHED BASES: 80
+STRATA AFTER BEST: 1
+
+The bigWig track file (http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mm10.gem.mappability.25kbp.sorted.bed.bw) represents the mean values for 25kbp windows.
+
+_used software:_
+
+ GEM-mappability build 1.315
+
 ## Data visualization and availability

 All data can be obtained from:

--- a/scripts/get_bionj_trees_for_twisst.r
+++ b/scripts/get_bionj_trees_for_twisst.r
+#author: Kristian K Ullrich
+#date: November 2017
+#email: ullrich@evolbio.mpg.de
+
+TMP_DIR <- "/tmp"
+
+SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/chr3.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
+chr <- "chr3"
+
+WSIZE <- 25000
+WJUMP <- 25000
+
+OUT_FILE <- paste0(TMP_DIR,"/",chr,".sw",WSIZE,".","wj",WJUMP,".newick.trees")
+
+#####
+library(distIUPAC)
+library(ape)
+source("https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/sliding_window_steps_generator.r")
+options(scipen=22)
+
+#####MAIN#####
+
+sink(OUT_FILE)
+cat("scaffold\tstart\tend\tmid\tsites\ttree\n")
+sink(NULL)
+
+tmp.chr.seq<-readBStringSet(SEQ_FILE)
+tmp.sw<-sliding_window_steps_generator(window=WSIZE,jump=WJUMP,start.by=1,end.by=unique(width(tmp.chr.seq)))
+for(i in 1:dim(tmp.sw)[2]){
+    tmp.out<-subseq(tmp.chr.seq,tmp.sw[1,i],tmp.sw[2,i])
+    tmp.window.len<-(tmp.sw[2,i]-tmp.sw[1,i]+1)[[1]]
+    tmp.complete.deletion.len<-length(which(unlist(apply(as.matrix(tmp.out),2,function(x) all(x=="N")))))
+    if(tmp.complete.deletion.len/tmp.window.len>0.5){
+        cat(c(chr,tmp.sw[,i],tmp.window.len-tmp.complete.deletion.len,NA),sep="\t",file=OUT_FILE,append=TRUE)
+        cat("\n",file=OUT_FILE,append=TRUE)
+        next()
+    }
+    tmp.out.dist<-distIUPAC(as.character(tmp.out))
+    colnames(tmp.out.dist)<-rownames(tmp.out.dist)<-gsub("-","_",unlist(lapply(strsplit(names(tmp.out),"\\."),function(x) paste0(x[1],"_",x[2]))))
+    tmp.out.dist.bionjs<-bionjs(as.dist(tmp.out.dist))
+    tmp.tree<-write.tree(tmp.out.dist.bionjs)
+    cat(c(chr,tmp.sw[,i],tmp.window.len-tmp.complete.deletion.len,tmp.tree),sep="\t",file=OUT_FILE,append=TRUE)
+    cat("\n",file=OUT_FILE,append=TRUE)
+}
+
+#####
+
+quit()
+