Gitlab Community Edition Instance

Skip to content
Snippets Groups Projects
Commit 0de79b16 authored by Kristian Ullrich's avatar Kristian Ullrich
Browse files

added twisst and GEM analysis

parent 0d17cfda
No related branches found
No related tags found
No related merge requests found
......@@ -60,13 +60,13 @@ _used software:_
## Construction of consensus sequences for natural populations
To construct population specific consensus sequences, we used ANGSD ([Korneliussen et al. 2014](http://www.biomedcentral.com/1471-2105/15/356/abstract)) and the 'doFasta 2' option.
To construct population specific consensus sequences, we used 'ANGSD' ([Korneliussen et al. 2014](http://www.biomedcentral.com/1471-2105/15/356/abstract)) and the 'doFasta 2' option.
##### CONSENSUS FASTA
In addition to the 'doFasta 2' optin, which extracts the most common nucleotide, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_:
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_
```
#example for populaton Mmm_AFG for chr1
......@@ -101,13 +101,13 @@ All CONSENSUS FASTA files can be obtained from:
## Construction of individual sequences using IUPAC code
To construct individual sequences, we used ANGSD and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.
To construct individual sequences, we used 'ANGSD' and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.
##### INDIVIDUAL FASTA
In addition to the 'doFasta 4' optin, which extracts IUPAC code, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_:
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_
```
#example for one Mus musculus musculus individual - AFG1 - 396 for chr1
......@@ -138,7 +138,7 @@ dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}}
\end{equation}
where $d_{XO_{w_{i}}} and $d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance ([Kimura 1980](https://www.ncbi.nlm.nih.gov/pubmed/7463489)) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window ($w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package ([Pages et al. 2009](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)).
example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_:
example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_
```
#example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG
......@@ -170,7 +170,7 @@ DISTMODEL <- "K80"
#to run the script
nohup nice R CMD BATCH --vanilla get_dK80.r &
#result file will be in this case:
#to check result file
tail -f /tmp/Mmd_FRA_Mmd_GER.Mmd_IRA.Mmm_AFG.chr1.tsv
```
......@@ -185,16 +185,68 @@ _used software:_
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.40.2
+ R package S4Vectors_0.10.3
+ R package XVector_0.12.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.18.0
+ R package BiocGenerics_0.22.0
+ get_dK80.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_dK80.r>
## TWISST analysis
Phylogenetic trees per 25kbp windows were calculated with the 'bionjs' function of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) and used as input files for 'twisst' ([Martin and Belleghem 2017](http://www.genetics.org/content/early/2017/03/21/genetics.116.194720)). The 'bionjs' function can reconstruct a phylogenetic tree from a distance matrix with possibly missing values. We developed an R package (https://github.com/kullrich/distIUPAC) to calculate pairwise 'literal distances' using IUPAC encoded nucleotide sequences. Briefly, only bi-allelic nucleotide states are considered for the pairwise distance calculation and all tri-nucleotide states, gaps or missing data are discarded. The distance score matrix applied is based on ([Chang et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/28819774)), whereby two homologous SNPs yield 0 if they were identical or identically heterozygous, 1 if different, and 0.5 if one of them was heterozygous. Windows with complete deletion sites (missing in all individuals) higher than 50\% were removed. The resulting pairwise distance matrix per window were used to calculate phylogenetic trees which were further used as input trees for 'twisst'.
example for chromosome 3 for the tree calculations
```
#example for chromosome 3
#change the bottom part of the script 'get_bionj_trees_for_twisst.r' for each chromosome
#here you can find the example for chromosome 3
TMP_DIR <- "/tmp"
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/chr3.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
chr <- "chr3"
WSIZE <- 25000
WJUMP <- 25000
OUT_FILE <- paste0(TMP_DIR,"/",chr,".sw",WSIZE,".","wj",WJUMP,".newick.trees")
#to run the script
nohup nice R CMD BATCH --vanilla get_bionj_trees_for_twisst.r &
#to check result file
tail -f /tmp/chr3.sw25000.wj25000.newick.trees
#to extract only the trees which have passed 50% missing complete deletion sites
awk '{if($6!="NA") print $0}' chr3.sw25000.wj25000.newick.trees > chr3.sw25000.wj25000.newick.trees.noNA
#to extract the trees for twisst
awk '{print $6}' chr3.sw25000.wj25000.newick.trees.noNA | tail -n +2 > chr3.sw25000.wj25000.newick.trees.noNA.twisst
#twisst analysis
python twisst.py -t chr3.sw25000.wj25000.newick.trees.noNA.twisst -w chr3.sw25000.wj25000.newick.trees.noNA.twisst.weightsfile -D chr3.sw25000.wj25000.newick.trees.noNA.twisst.distfile -g SPRE SPRE_SP36,SPRE_SP39,SPRE_SP41,SPRE_SP51,SPRE_SP62,SPRE_SP68,SPRE_SP69,SPRE_SP70 -g CAS CAS_1,CAS_2,CAS_3,CAS_4,CAS_5,CAS_6,CAS_7,CAS_8,CAS_9,CAS_10 -g AFG AFG_396,AFG_413,AFG_416,AFG_424,AFG_435,AFG_444 -g IRA IRA_AH15,IRA_AH23,IRA_JR11,IRA_JR15,IRA_JR2_F1C,IRA_JR5_F1C,IRA_JR7_F1C,IRA_JR8_F1A -g GER GER_TP1,GER_TP121B,GER_TP17_2,GER_TP3_02,GER_TP4a,GER_TP51D,GER_TP7_10F1A2,GER_TP81B -g FRA FRA_14,FRA_15B,FRA_16B,FRA_18B,FRA_B2C,FRA_C1,FRA_E1,FRA_F1B --method complete
```
_used software:_
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.22.0
+ R package Rcpp_0.12.12
+ R package RcppParallel_4.3.20
+ R dev package distIUPAC_0.1.0
+ get_bionj_trees_for_twisst.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_bionj_trees_for_twisst.r>
All twisst analysis files can be obtained from:
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/twisst/>
## Simulation
......@@ -264,6 +316,24 @@ _used software:_
FASTA sequence construction and dK80 calculation was performed as described above.
## Mappability mm10 genome
To construct a mappability track we used the software GEM ([Derrien et al. 2012](https://www.ncbi.nlm.nih.gov/pubmed/22276185)) with follofing options on the mm10 genome.
K-MER LENGTH: 100
APPROXIMATION THRESHOLD: 7
MAX MISMATCHES: 4
MAX ERRORS: 4
MAX BIG INDEL LENGTH: 15
MIN MATCHED BASES: 80
STRATA AFTER BEST: 1
The bigWig track file (http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mm10.gem.mappability.25kbp.sorted.bed.bw) represents the mean values for 25kbp windows.
_used software:_
+ GEM-mappability build 1.315
## Data visualization and availability
All data can be obtained from:
......
#author: Kristian K Ullrich
#date: November 2017
#email: ullrich@evolbio.mpg.de
TMP_DIR <- "/tmp"
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/chr3.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
chr <- "chr3"
WSIZE <- 25000
WJUMP <- 25000
OUT_FILE <- paste0(TMP_DIR,"/",chr,".sw",WSIZE,".","wj",WJUMP,".newick.trees")
#####
library(distIUPAC)
library(ape)
source("https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/sliding_window_steps_generator.r")
options(scipen=22)
#####MAIN#####
sink(OUT_FILE)
cat("scaffold\tstart\tend\tmid\tsites\ttree\n")
sink(NULL)
tmp.chr.seq<-readBStringSet(SEQ_FILE)
tmp.sw<-sliding_window_steps_generator(window=WSIZE,jump=WJUMP,start.by=1,end.by=unique(width(tmp.chr.seq)))
for(i in 1:dim(tmp.sw)[2]){
tmp.out<-subseq(tmp.chr.seq,tmp.sw[1,i],tmp.sw[2,i])
tmp.window.len<-(tmp.sw[2,i]-tmp.sw[1,i]+1)[[1]]
tmp.complete.deletion.len<-length(which(unlist(apply(as.matrix(tmp.out),2,function(x) all(x=="N")))))
if(tmp.complete.deletion.len/tmp.window.len>0.5){
cat(c(chr,tmp.sw[,i],tmp.window.len-tmp.complete.deletion.len,NA),sep="\t",file=OUT_FILE,append=TRUE)
cat("\n",file=OUT_FILE,append=TRUE)
next()
}
tmp.out.dist<-distIUPAC(as.character(tmp.out))
colnames(tmp.out.dist)<-rownames(tmp.out.dist)<-gsub("-","_",unlist(lapply(strsplit(names(tmp.out),"\\."),function(x) paste0(x[1],"_",x[2]))))
tmp.out.dist.bionjs<-bionjs(as.dist(tmp.out.dist))
tmp.tree<-write.tree(tmp.out.dist.bionjs)
cat(c(chr,tmp.sw[,i],tmp.window.len-tmp.complete.deletion.len,tmp.tree),sep="\t",file=OUT_FILE,append=TRUE)
cat("\n",file=OUT_FILE,append=TRUE)
}
#####
quit()
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment