changed general setup

f543a201 · Kristian Ullrich · d3e7253d · f543a201 · f543a201 · f543a201
Commit f543a201 authored 7 years ago by Kristian Ullrich
--- a/README.md
+++ b/README.md
 # Scripts for the Publication:

-Ullrich KK, Tautz D
+Introgression patterns between house mouse subspecies and species reveal genomic windows of frequent exchange
+
+Ullrich KK, Linnenbrink M, Tautz D

 ## Data sources

@@ -10,9 +12,9 @@ For mapping details please look into the original publication ([Harr et al. 2016

 ## Get masking regions for individual samples and natural populations

-For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes from the reference GRCm38 _mm10_ <http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/>.
+For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes (chr1-chr19,chrX,chrY,chrM) from the reference GRCm38 _mm10_ <http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/>.

-The BAM files were processed with 'genomeCoverageBed' to obtain site specific genome coverage and further united with 'unionBedGraphs'.
+The BAM files were processed with 'genomeCoverageBed' from the bedtools software suite ([Quinlan and Hall 2010](https://academic.oup.com/bioinformatics/article/26/6/841/244688/BEDTools-a-flexible-suite-of-utilities-for)) to obtain site specific genome coverage and further united with 'unionBedGraphs'.

 ##### Population specific masking

@@ -31,7 +33,7 @@ genomeCoverageBed example for the population _Mus musculus musculus AFG_:
 #AFG5_435.bam
 #AFG6_444.bam

-$REFERENCE=mm10.fasta
+REFERENCE=mm10.fasta

 for file in *.bam; do genomeCoverageBed -ibam $file -bga -g $REFFERENCE > $file".bga";done
 ```
@@ -60,6 +62,10 @@ bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
 awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
 ```

+United masked files generated can be obtained from:
+
+http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/
+
 ##### Individual specific masking

 For individuals regions with a coverage smaller than 5 were used as masking regions.
@@ -76,15 +82,14 @@ awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT

 _used software:_

-+ bedtools v2.24.0 <http://bedtools.readthedocs.io/en/latest/>
+ bedtools v2.26.0 <http://bedtools.readthedocs.io/en/latest/>
 + awk

 ## SNP and INDEL calling

-For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' with relaxed quality options to retain information in CNV regions.
-Due to the amount of data first mpileup files were generated for each BAM file and merged with 'bcftools merge'. After merging mpileup files, 'bcftools call' was used to generate the final VCF file.
+For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' ([Li et al. 2011](https://academic.oup.com/bioinformatics/article/25/16/2078/204688/The-Sequence-Alignment-Map-format-and-SAMtools)) with relaxed quality options to retain information in CNV regions. Each chromosome was analyzed seperately.

-samtools mpileup | bcftools call example for the population _Mus musculus musculus AFG_:
+samtools mpileup | bcftools call example for chromosome 1 for the population _Mus musculus musculus AFG_:
 ```
 #example for population Mmm_AFG:
 #
@@ -97,9 +102,26 @@ samtools mpileup | bcftools call example for the population _Mus musculus muscul
 #AFG5_435.bam
 #AFG6_444.bam

-#generating indiviual mpileup files for each BAM file
+REFERENCE=mm10.fasta

-$REFERENCE=mm10.fasta
+PLOIDY=ploidy.txt
+#chrX    1       171031299       M       1
+#chrY    1       91744698        M       1
+#chrY    1       91744698        F       0
+#chrM    1       16299   F       1
+#chrM    1       16299   M       1
+#*       *       *       M       2
+#*       *       *       F       2
+
+SAMPLES=AFG_samples.txt
+#396     M
+#413     M
+#416     M
+#424     M
+#435     F
+#444     M
+
+OUTPUT=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf

 INPUT1=AFG1_396.bam
 INPUT2=AFG2_413.bam
@@ -108,63 +130,42 @@ INPUT4=AFG4_424.bam
 INPUT5=AFG5_435.bam
 INPUT6=AFG6_444.bam

-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT1".mpileup.q0Q10.vcf" $INPUT1
-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT2".mpileup.q0Q10.vcf" $INPUT2
-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT3".mpileup.q0Q10.vcf" $INPUT3
-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT4".mpileup.q0Q10.vcf" $INPUT4
-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT5".mpileup.q0Q10.vcf" $INPUT5
-samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT6".mpileup.q0Q10.vcf" $INPUT6
-
-bgzip $INPUT1".mpileup.q0Q10.vcf"
-bgzip $INPUT2".mpileup.q0Q10.vcf"
-bgzip $INPUT3".mpileup.q0Q10.vcf"
-bgzip $INPUT4".mpileup.q0Q10.vcf"
-bgzip $INPUT5".mpileup.q0Q10.vcf"
-bgzip $INPUT6".mpileup.q0Q10.vcf"
-
-tabix $INPUT1".mpileup.q0Q10.vcf.gz"
-tabix $INPUT2".mpileup.q0Q10.vcf.gz"
-tabix $INPUT3".mpileup.q0Q10.vcf.gz"
-tabix $INPUT4".mpileup.q0Q10.vcf.gz"
-tabix $INPUT5".mpileup.q0Q10.vcf.gz"
-tabix $INPUT6".mpileup.q0Q10.vcf.gz"
+BAMLIST=Mmm_AFG.list

-#merge individual mpileup files
+echo $INPUT1 >> $BAMLIST
+echo $INPUT2 >> $BAMLIST
+echo $INPUT3 >> $BAMLIST
+echo $INPUT4 >> $BAMLIST
+echo $INPUT5 >> $BAMLIST
+echo $INPUT6 >> $BAMLIST

-MPILEUPLIST=AFG.mpileup.list
+#example for chromosome 1

-echo $INPUT1".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-echo $INPUT2".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-echo $INPUT3".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-echo $INPUT4".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-echo $INPUT5".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-echo $INPUT6".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
-
-MPILEUPOUTPUT=Mmm_AFG.mpileup.q0Q10.vcf.gz
-
-bcftools merge -m all -O z -o $MPILEUPOUTPUT -l $MPILEUPLIST
-
-#call SNP and INDEL
-
-OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
-
-bcftools call -O z -f GQ -m -v -o $OUTPUT $MPILEUPOUTPUT
+samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -r chr1 -uf $REFERENCE -b $BAMLIST | bcftools call -O v -f GQ -m -v --ploidy-file $PLOIDY -S $SAMPLES > $OUTPUT
+bgzip $OUTPUT
+tabix $OUTPUT".gz"
 ```

 _used software:_

-+ samtools 1.3.1-36-g613501f (using htslib 1.3.1-59-g0f2a88a)
-+ bcftools 1.3.1-39-gd797e86 (using htslib 1.3.1-59-g0f2a88a)
+ samtools 1.3.1 (using htslib 1.3.1)
+ bcftools 1.3.1 (using htslib 1.3.1)
 + bgzip v1.3
 + tabix v1.3

+All mpileup generated for each population can be obtained from:
+
+http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/
+
+NOTE: For each population all analyzed chromosomes were merged into one file.
+
 ## K80 distance calculation

 ### Get population specific SNPs

-To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools'. Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations").
+To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools' ([Danecek et al. 2011](https://academic.oup.com/bioinformatics/article/27/15/2156/402296/The-variant-call-format-and-VCFtools)). Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations").

-vcftools example for the population _Mus musculus musculus AFG_:
+vcftools example for chromosome 1 for the population _Mus musculus musculus AFG_:
 ```
 #example for population Mmm_AFG:
 #
@@ -186,12 +187,18 @@ echo "424" >> $POPIDS
 echo "435" >> $POPIDS
 echo "444" >> $POPIDS

-GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
-OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels
+GZVCF=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf.gz
+OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels

 vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --keep $POPIDS --out $OUTPUT
 ```

+All re-coded vcf files can be obtained from:
+
+http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/
+
+NOTE: For each population all analyzed chromosomes were merged into one file.
+
 vcfparser.py mvcf2consensus example for the population _Mus musculus musculus AFG_:
 ```
 #example for population Mmm_AFG:
@@ -205,86 +212,44 @@ vcfparser.py mvcf2consensus example for the population _Mus musculus musculus AF
 #435
 #444

-INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.vcf
-OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus
+INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.vcf
+OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus

-python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -samples 396,413,416,424,435,444 -id Mmm_AFG.mv
+python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1 -samples 396,413,416,424,435,444 -id Mmm_AFG.mv
 ```

+All CONSENSUS vcf files can be obtained from:
+
+http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/
+
+NOTE: For each population all analyzed chromosomes were merged into one file.
+
 vcfparser.py vcf2fasta example for the population _Mus musculus musculus AFG_:
 ```
 #example for population Mmm_AFG:
 #

 REFERENCE=mm10.fasta
-INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf
-OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr
+INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.vcf
+OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr1
 MASKFILE=Mmm_AFG.combined.bga.stcov5

-python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4
+python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1 -ibga $MASKFILE -cov2N 4
 ```

-_used software:_
-
-+ vcftools v0.1.15
-+ vcfparser.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py>

-### Calculate K80 distance between populations using the CONSENSUS pseudo-genome files

-## Nucleotide diversity calculations
+All pseudo-genome FASTA files can be obtained from:

-### Get individual specific SNPs

-To get individual specific VCF files the VCF file produced with 'bcftools call' was first re-coded for each individual with 'vcftools'. Further the individual specific VCF files were used to generate pseudo-genomes files per individual with 'vcfparser.py vcf2fasta' using individual masking regions (see "Get masking regions for individual samples and natural populations").
-
-vcftools example for individual 396 of the population _Mus musculus musculus AFG_:
-```
-#example for individual 396 of the population Mmm_AFG:
-#
-#VCF IDs:
-#
-#396
-
-GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
-OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels
-
-vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --indv 396 --out $OUTPUT
-```
-
-vcfparser.py vcf2fasta example for individual 396 of the population _Mus musculus musculus AFG_:
-```
-#example for individual 396 of the population Mmm_AFG:
-#
-
-REFERENCE=mm10.fasta
-INPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf
-OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr
-MASKFILE=Mmm_AFG.combined.bga.stcov5
-
-python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4
-```

 _used software:_

 + vcftools v0.1.15
 + vcfparser.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py>

-### Calculate nucleotide diversity within each population
-
-_used software:_
-
-+ variscan v2.0.3
-
-## Dxy distance calculation
-
-### Calculate Dxy distance between populations
-
-_used software:_
-
-### Calculate Dxy distance between individuals and populations
-
-_used software:_
+### Calculate dK80 distance between populations using the CONSENSUS pseudo-genome files

 ## Simulation

-
+To simulate genomes, first we estimated the number of pair-wise segregating sites with the CONSENSUS pseudo-genome files adding the reference mm10. Subsequently we used  
--- a/scripts/gccontent.py
+++ b/scripts/gccontent.py
+#!/usr/bin/python
+# -*- coding: UTF-8 -*-
+
+'''
+author: Kristian K Ullrich
+date: July 2017
+email: ullrich@evolbio.mpg.de
+License: MIT
+
+The MIT License (MIT)
+
+Copyright (c) 2017 Kristian K Ullrich
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+This software relies on the following python packages:
+sys
+os
+argparse
+numpy https://pypi.python.org/pypi/numpy
+Bio https://github.com/biopython/DIST
+'''
+
+
+import sys
+import os
+import argparse
+from Bio import SeqIO
+import numpy as np
+from Bio.SeqUtils import GC
+
+
+def sliding_window_steps_generator(sw,j,start,end):
+    if end<=start:
+        print 'end must be smaller than start'
+        return
+    start_seq = np.arange(start,end,j)
+    end_seq = np.arange(start+sw-1,end,j)
+    end_seq = np.append(end_seq,np.repeat(end,len(start_seq)-len(end_seq)))
+    mid_seq = (start_seq+end_seq)/2
+    return([start_seq,end_seq])
+
+
+#main
+def main():
+    parser = argparse.ArgumentParser(prog='gccontent', usage='%(prog)s [options] [<arguments>...]',
+                                     description='script to extract GC content from nucleotide fasta file')
+    parser.add_argument('-i', help='specify FASTA input file')
+    parser.add_argument('-o', help='specify input file')
+    parser.add_argument('-w', help='specify sliding window size [default:25000]', default = '25000', type = int)
+    parser.add_argument('-j', help='specify sliding window jump [default:25000]', default = '25000', type = int)
+    args = parser.parse_args()
+    if args.i is None:
+        parser.print_help()
+        sys.exit('\nPlease specify FASTA input file')
+    args.i = os.path.abspath(args.i)
+    if args.o is None:
+        parser.print_help()
+        sys.exit('\nPlease specify output file')
+    args.o = os.path.abspath(args.o)
+    print args
+    fasta = SeqIO.parse(args.i,'fasta')
+    with open(args.o,'w') as outhandle:
+        for r in fasta:
+            tmp_windows_start, tmp_windows_end = sliding_window_steps_generator(args.w, args.j, 1, len(r.seq))
+            for s,e in zip(tmp_windows_start, tmp_windows_end):
+                tmp_N_count = r.seq[s:e].count('N')
+                tmp_N_percent = float(tmp_N_count) / float(len(r.seq[s:e]))
+                if tmp_N_percent==1.0:
+                    tmp_GC_percent = 0.0
+                if tmp_N_percent!=1.0:
+                    tmp_G_count = r.seq[s:e].count('G')
+                    tmp_C_count = r.seq[s:e].count('C')
+                    tmp_A_count = r.seq[s:e].count('A')
+                    tmp_T_count = r.seq[s:e].count('T')
+                    tmp_GC_percent = (float(tmp_G_count) + float(tmp_C_count)) / (float(len(r.seq[s:e])) - float(tmp_N_count))
+                outhandle.write('%s\t%i\t%i\t%f\t%f\n' % (r.id,s,e,tmp_N_percent,tmp_GC_percent))
+
+
+if __name__ == '__main__':
+    main()
+
--- a/scripts/get_dK80.r
+++ b/scripts/get_dK80.r
+#author: Kristian K Ullrich
+#date: July 2017
+#email: ullrich@evolbio.mpg.de
+
+#chr1 - IRA_AFG
+#[X]: Mmd_FRA
+#[Y]: Mmd_GER
+#[Z]: Mmd_IRA
+#[O]: Mmm_AFG
+
+popX <- "Mmd_FRA"
+popY <- "Mmd_GER"
+popZ <- "Mmd_IRA"
+popO <- "Mmm_AFG"
+
+TMP_DIR <- "/tmp"
+
+popX.pos <- 2
+popY.pos <- 3
+popZ.pos <- 4
+popO.pos <- 5
+
+SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/CAS_FRA_GER_IRA_AFG_SPRE.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.refmajorsample.ref.consensus.fasta"
+chr <- "chr1"
+OUT_FILE <- paste0(TMP_DIR,"/",popX,"_",popY,".",popZ,".",popO,".",chr,".tsv")
+
+WSIZE <- 25000
+WJUMP <- 25000
+
+DISTMODEL <- "K80"
+
+#####
+
+library(Biostrings)
+library(ape)
+source("/data/codeprojects/r_scripts/sliding_window_steps_generator.r")
+options(scipen=22)
+
+calc_delta_consensus_four<-function(seq,popX.pos,popY.pos,popZ.pos,popO.pos,TMP,distMODEL="K80"){
+    MISSING<-NA
+    TREELENGTH<-NA
+    TREE<-NA
+    TREETOPOLOGY<-NA
+    dK80.XZO<-NA
+    dK80.YZO<-NA
+    OUT<-list(MISSING,dK80.XZO,dK80.YZO,TREELENGTH,TREE,TREETOPOLOGY)
+    names(OUT)<-c("MISSING","dK80.XZO","dK80.YZO","TREELENGTH","TREE","TREETOPOLOGY")
+    tmp.seq<-seq[c(popX.pos,popY.pos,popZ.pos,popO.pos)]
+    names(tmp.seq)<-c("X","Y","Z","O")
+    tmp.seq.XX<-gregexpr("X",tmp.seq[1])
+    tmp.seq.XY<-gregexpr("X",tmp.seq[2])
+    tmp.seq.XZ<-gregexpr("X",tmp.seq[3])
+    tmp.seq.XO<-gregexpr("X",tmp.seq[4])
+    tmp.seq.NX<-gregexpr("N",tmp.seq[1])
+    tmp.seq.NY<-gregexpr("N",tmp.seq[2])
+    tmp.seq.NZ<-gregexpr("N",tmp.seq[3])
+    tmp.seq.NO<-gregexpr("N",tmp.seq[4])
+    remove.pos<-unique(c(tmp.seq.XX[[1]],tmp.seq.XY[[1]],tmp.seq.XZ[[1]],tmp.seq.XO[[1]],tmp.seq.NX[[1]],tmp.seq.NY[[1]],tmp.seq.NZ[[1]],tmp.seq.NO[[1]]))
+    remove.pos<-remove.pos[remove.pos!="-1"]
+    if(length(remove.pos)!=0){
+        tmp.seq.rem<-BStringSet(apply(as.matrix(tmp.seq)[,-remove.pos,drop=FALSE],1,function(x) paste(x,collapse="")))
+    }
+    if(length(remove.pos)==0){
+        tmp.seq.rem<-tmp.seq
+    }
+    OUT$MISSING<-length(remove.pos)
+    writeXStringSet(tmp.seq.rem,file=paste0(TMP,"/tmp.seq.rem.fasta"))
+    tmp.dna<-read.dna(paste0(TMP,"/tmp.seq.rem.fasta"),format="fasta")
+    if(length(tmp.dna)==0){
+        return(OUT)
+    }
+    if(all(is.na(base.freq(tmp.dna)))){
+        return(OUT)
+    }
+    tmp.dist.dna.MODEL<-dist.dna(tmp.dna,as.matrix=T,model=distMODEL)
+    if(any(is.na(tmp.dist.dna.MODEL))){
+        return(OUT)
+    }
+    dXY<-tmp.dist.dna.MODEL[1,2]
+    dXZ<-tmp.dist.dna.MODEL[1,3]
+    dXO<-tmp.dist.dna.MODEL[1,4]
+    dYZ<-tmp.dist.dna.MODEL[2,3]
+    dYO<-tmp.dist.dna.MODEL[2,4]
+    dZO<-tmp.dist.dna.MODEL[3,4]
+    dK80.XZO<-dXO-dXZ
+    dK80.YZO<-dYO-dYZ
+    OUT$dK80.XZO<-dK80.XZO
+    OUT$dK80.YZO<-dK80.YZO
+    tmp.dna.nj<-midpoint(nj(tmp.dist.dna.MODEL))
+    OUT$TREELENGTH<-sum(tmp.dna.nj$edge.length)
+    OUT$TREE<-write.tree(tmp.dna.nj)
+    tmp.dna.nj$edge.length<-NULL
+    OUT$TREETOPOLOGY<-write.tree(tmp.dna.nj)
+    return(OUT)
+}
+
+#####MAIN#####
+
+sink(OUT_FILE)
+
+cat("#[X]: ")
+cat(popX)
+cat("\n")
+cat("#[Y]: ")
+cat(popY)
+cat("\n")
+cat("#[Z]: ")
+cat(popZ)
+cat("\n")
+cat("#[O]: ")
+cat(popO)
+cat("\n")
+cat("CHR\tSTART\tEND\t")
+cat("MISSING\tdK80.XZO\tdK80.YZO\tTREELENGTH\tTREE\tTREETOPOLOGY\n")
+
+tmp.chr.seq<-readBStringSet(SEQ_FILE)
+tmp.sw<-sliding_window_steps_generator(window=WSIZE,jump=WJUMP,start.by=1,end.by=unique(width(tmp.chr.seq)))
+pb<-txtProgressBar(min=1,max=dim(tmp.sw)[2],style=3)
+for(i in 1:dim(tmp.sw)[2]){
+    tmp.out<-subseq(tmp.chr.seq,tmp.sw[1,i],tmp.sw[2,i])
+    fourout<-calc_delta_consensus_four(tmp.out,popX.pos,popY.pos,popZ.pos,popO.pos,TMP_DIR,DISTMODEL)
+    cat(chr)
+    cat("\t")
+    cat(tmp.sw[1,i])
+    cat("\t")
+    cat(tmp.sw[2,i])
+    cat("\t")
+    cat(unlist(fourout),sep="\t")
+    cat("\n")
+}
+close(pb)
+
+sink(NULL)
+
+#####
+
+quit()
+
--- a/scripts/simdiv.py
+++ b/scripts/simdiv.py
@@ -2,14 +2,14 @@
 # -*- coding: UTF-8 -*-

 '''
-Author: Krisian Ullrich
-date: Dezember 2016
+author: Kristian K Ullrich
+date: July 2017
 email: ullrich@evolbio.mpg.de
 License: MIT

 The MIT License (MIT)

-Copyright (c) 2016 Kristian Ullrich
+Copyright (c) 2017 Kristian K Ullrich

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

--- a/scripts/sliding_window_steps_generator.r
+++ b/scripts/sliding_window_steps_generator.r
+#author: Kristian K Ullrich
+#date: July 2017
+#email: ullrich@evolbio.mpg.de
+
+sliding_window_steps_generator = function(window=10000,jump=1000,start.by=1,end.by=20000){
+    if(end.by<=start.by){
+        stop("end.by <= start.by")
+    }
+    start.seq = seq(start.by, end.by, by = jump)
+    end.seq = seq(start.by + window - 1, end.by, by = jump)
+    end.seq = c(end.seq, rep(end.by, length(start.seq)-length(end.seq)))
+    start.end.matrix = rbind(start.seq, end.seq)
+    start.end.matrix = rbind(start.end.matrix, apply(start.end.matrix, 2, function(x) {mean( c( x[1] - 1, x[2]) )} ) )
+    start.end.matrix[2,]<-as.numeric(sprintf("%.0f",start.end.matrix[2,]))
+    row.names(start.end.matrix) = c("start", "end", "mid")
+    return(start.end.matrix)
+}
+
--- a/scripts/vcfparser.py
+++ b/scripts/vcfparser.py