added simulation part

Commit 20e98084 authored 7 years ago by Kristian Ullrich
--- a/README.md
+++ b/README.md
@@ -64,7 +64,7 @@ awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT

 United masked files generated can be obtained from:

-http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/>

 ##### Individual specific masking

@@ -155,7 +155,7 @@ _used software:_

 All mpileup generated for each population can be obtained from:

-http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/>

 NOTE: For each population all analyzed chromosomes were merged into one file.

@@ -195,7 +195,7 @@ vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-

 All re-coded vcf files can be obtained from:

-http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/>

 NOTE: For each population all analyzed chromosomes were merged into one file.

@@ -220,7 +220,7 @@ python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1 -sa

 All CONSENSUS vcf files can be obtained from:

-http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/>

 NOTE: For each population all analyzed chromosomes were merged into one file.

@@ -238,7 +238,7 @@ python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm

 All pseudo-genome FASTA files can be obtained from:

-http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/>

 _used software:_

@@ -298,7 +298,83 @@ _used software:_
 + R package XVector_0.12.1
 + R package IRanges_2.6.1
 + R package BiocGenerics_0.18.0
+ get_dK80.r <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/get_dK80.r>

 ## Simulation

-To simulate genomes, first we estimated the number of pair-wise segregating sites with the CONSENSUS pseudo-genome files adding the reference mm10. Subsequently we used  
+Simulations were performed to evaluate dK80 under a phylogenetic scenario mimicking the real data without introgression, as well as to ensure that the mapping procedure does not lead to artefacts, especially in the copy number variable regions.
+
+### Generate simulated data
+
+First, to provide the appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) of all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were then used as a distance matrix to calculate a UPGMA tree. The branch lengths of this UPGMA tree were used as a proxy to simulate chromosome 1 with the Python script 'simdiv.py', taking the phylogenetic context into account.
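The counting and clustering step described above can be sketched as follows. This is a minimal illustration, not the script used for the analysis: the function names (`pairwise_counts`, `distance_matrix`, `upgma`) are hypothetical, sequences are assumed to be aligned strings of equal length, and `N`/`-` are assumed to mark missing sites.

```python
from itertools import combinations

MISSING = {"N", "-"}  # assumption: characters that mark a missing site


def pairwise_counts(a, b):
    """Return (polymorphic, informative) site counts for two aligned sequences."""
    polymorphic = informative = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in MISSING or y in MISSING:
            continue  # pair-wise missing site, excluded from both counts
        informative += 1
        if x != y:
            polymorphic += 1
    return polymorphic, informative


def distance_matrix(seqs):
    """Map each unordered pair of names to polymorphic / informative sites."""
    dist = {}
    for p, q in combinations(sorted(seqs), 2):
        poly, info = pairwise_counts(seqs[p], seqs[q])
        dist[frozenset((p, q))] = poly / info
    return dist


def upgma(dist, names):
    """Plain UPGMA; returns the merges as a list of (cluster, node height)."""
    sizes = {n: 1 for n in names}
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair, key=str)
        new = (a, b)
        merges.append((new, d[pair] / 2))  # node height is half the distance
        na, nb = sizes.pop(a), sizes.pop(b)
        for c in sizes:                    # size-weighted average to the rest
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
        del d[pair]
        sizes[new] = na + nb
    return merges
```

Branch lengths for the simulation would then follow from the differences between successive node heights.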
+
+```
+#SPRE
+python simdiv.py -i mm10.fasta -o mm10_0.008450_SPRE.fa -d 0.008450 -chr true
+#ancestral DMC
+python simdiv.py -i mm10.fasta -o mm10_0.004725_DMC.fa -d 0.004725 -chr true
+#ancestral MC
+python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.000639_MC.fa -d 0.000639 -chr true
+#AFG
+python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.fa -d 0.003087 -chr true
+#CAS
+python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.fa -d 0.003087 -chr true
+#ancestral D
+python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.002246_D.fa -d 0.002246 -chr true
+#newREF
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.001480_newREF.fa -d 0.001480 -chr true
+#ancestral FGI
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -d 0.000262 -chr true
+#IRA
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.fa -d 0.001217 -chr true
+#ancestral FG
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -d 0.000201 -chr true
+#FRA
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.fa -d 0.001016 -chr true
+#GER
+python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.fa -d 0.001016 -chr true
+```
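The commands above apply divergence step by step along the tree, each call taking the output of its parent node as input. The sketch below illustrates that idea only; it is not the implementation of 'simdiv.py', whose substitution model is not described here. The function `diverge` is hypothetical and simply substitutes a fraction `d` of the A/C/G/T sites uniformly at random, leaving other characters such as `N` untouched.

```python
import random

# Possible replacement bases for each nucleotide
SUBSTITUTIONS = {"A": "CGT", "C": "AGT", "G": "ACT", "T": "ACG"}


def diverge(seq, d, seed=None):
    """Return a copy of seq in which a fraction d of A/C/G/T sites is substituted."""
    rng = random.Random(seed)
    sites = [i for i, base in enumerate(seq) if base.upper() in SUBSTITUTIONS]
    n_changes = round(d * len(sites))
    out = list(seq)
    for i in rng.sample(sites, n_changes):  # each chosen site changes exactly once
        out[i] = rng.choice(SUBSTITUTIONS[out[i].upper()])
    return "".join(out)


# Chaining mirrors the tree: ancestor -> DMC -> MC -> AFG and CAS
ancestor = "ACGT" * 250
dmc = diverge(ancestor, 0.004725, seed=1)
mc = diverge(dmc, 0.000639, seed=2)
afg = diverge(mc, 0.003087, seed=3)
cas = diverge(mc, 0.003087, seed=4)
```

Because sister lineages start from the same parent sequence, they share all substitutions introduced on the ancestral branches and differ only by those added on their own terminal branches.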
+
+_used software:_
+
+ R version 3.4.1 (2017-06-30)
+ simdiv.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/simdiv.py>
+
+### Generate artificial Illumina reads
+
+Second, the simulated sequences representative of each population were used to generate artificial Illumina reads with 'ART' ([Huang et al. 2011](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr708)), both to test the influence of possible sequencing errors and to mimic the original sequence libraries.
+
+```
+art_illumina -sam -na -ss HS20 -i mm10_0.008450_SPRE.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.008450_SPRE.chr1.
+art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.
+art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.
+art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.
+art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.
+art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.
+```
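The six calls above differ only in the file prefix, so they can be generated from a list. The following sketch builds the same command lines without running them; the helper `art_command` and its parameter names are hypothetical, while the flags and values are copied from the commands above (HS20 profile, 20-fold coverage, 100 bp paired-end reads, mean fragment size 250, standard deviation 80).

```python
import shlex

# File prefixes of the simulated population sequences, as used above
PREFIXES = [
    "mm10_0.008450_SPRE",
    "mm10_0.004725_DMC_0.000639_MC_0.003087_AFG",
    "mm10_0.004725_DMC_0.000639_MC_0.003087_CAS",
    "mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA",
    "mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA",
    "mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER",
]


def art_command(prefix, fold=20, length=100, mean=250, sd=80):
    """Build the art_illumina argument list for one simulated sequence."""
    return [
        "art_illumina", "-sam", "-na", "-ss", "HS20",
        "-i", f"{prefix}.chr1.fa",
        "-f", str(fold), "-l", str(length),
        "-m", str(mean), "-s", str(sd),
        "-p", "-o", f"{prefix}.chr1.",
    ]


for prefix in PREFIXES:
    print(shlex.join(art_command(prefix)))
```

To execute a command instead of printing it, the argument list could be passed to `subprocess.run(..., check=True)`.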
+
+_used software:_
+
+ ART_Illumina Q Version 2.5.8 (June 7, 2016)
+
+### Mapping and data post-processing
+
+Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'bwa mem' ([Li et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/)), followed by sorting, marking and removing duplicates with the Picard software suite (<https://broadinstitute.github.io/picard/>) and an indel realignment step with 'GATK' ([McKenna et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2928508/)), as described in Harr et al. 2016 (<http://www.nature.com/articles/sdata201675>).
+
+_used software:_
+
+ bwa 0.7.12-r1039
+ picard.jar 2.9.2-SNAPSHOT
+ GenomeAnalysisTK.jar v3.7
+
+Masking, SNP calling, FASTA sequence construction and dK80 calculation were performed as described above.
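For orientation, the sketch below computes a Kimura 2-parameter (K80) distance between two aligned sequences, assuming that dK80 refers to this model. It is not a port of 'get_dK80.r'; the function name `k80` is hypothetical, and sites where either sequence has a character other than A, C, G or T are assumed to be skipped. With P the proportion of transitions and Q the proportion of transversions, the distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)).

```python
import math

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def k80(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    n = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in BASES or y not in BASES:
            continue  # skip missing or ambiguous sites
        n += 1
        if x == y:
            continue
        # transition: both purines (A<->G) or both pyrimidines (C<->T)
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The function raises an error when no comparable site exists or when the sequences are too divergent for the logarithm to be defined, so real data would need explicit handling of those cases.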
+
+## Data visualization and availability
+
+All data can be obtained from:
+
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/>
+
+See the readme.txt file for a detailed description of the content.
+
+<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/readme.txt>