Gitlab Community Edition Instance

Skip to content
Snippets Groups Projects
Commit 20e98084 authored by Kristian Ullrich's avatar Kristian Ullrich
Browse files

added simulation part

parent 9df63214
No related branches found
No related tags found
No related merge requests found
......@@ -64,7 +64,7 @@ awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
United masked files generated can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/>
##### Individual specific masking
......@@ -155,7 +155,7 @@ _used software:_
All mpileup generated for each population can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/>
NOTE: For each population all analyzed chromosomes were merged into one file.
......@@ -195,7 +195,7 @@ vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-
All re-coded vcf files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/>
NOTE: For each population all analyzed chromosomes were merged into one file.
......@@ -220,7 +220,7 @@ python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1 -sa
All CONSENSUS vcf files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/>
NOTE: For each population all analyzed chromosomes were merged into one file.
......@@ -238,7 +238,7 @@ python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm
All pseudo-genome FASTA files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/>
_used software:_
......@@ -298,7 +298,83 @@ _used software:_
+ R package XVector_0.12.1
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.18.0
+ get_dK80.r <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/get_dK80.r>
## Simulation
To simulate genomes, first we estimated the number of pair-wise segregating sites with the CONSENSUS pseudo-genome files adding the reference mm10. Subsequently we used
Simulations were performed to evaluate dK80 under a phylogenetic scenario mimicking the real data without introgression, as well as to ensure that the mapping procedure does not lead to artefacts, especially in the copy number variable regions.
### Generate simulated data
First, to provide the appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) of all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were then used as a distance matrix to calculate an UPGMA tree. The resulting UPGMA tree distances were used as a proxy to simulate chromosome 1 with the python script 'simdiv.py' taking the phylogenetic context into account.
```
#SPRE
python simdiv.py -i mm10.fasta -o mm10_0.008450_SPRE.fa -d 0.008450 -chr true
#ancsetral DMC
python simdiv.py -i mm10.fasta -o mm10_0.004725_DMC.fa -d 0.004725 -chr true
#ancestral MC
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.000639_MC.fa -d 0.000639 -chr true
#AFG
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.fa -d 0.003087 -chr true
#CAS
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.fa -d 0.003087 -chr true
#ancestral D
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.002246_D.fa -d 0.002246 -chr true
#newREF
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.001480_newREF.fa -d 0.001480 -chr true
#ancestral FGI
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -d 0.000262 -chr true
#IRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.fa -d 0.001217 -chr true
#ancestral FG
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -d 0.000201 -chr true
#FRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.fa -d 0.001016 -chr true
#GER
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.fa -d 0.001016 -chr true
```
_used software:_
+ R version 3.4.1 (2017-06-30)
+ simdiv.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/simdiv.py>
### Generate artificial illumina reads
Second, the simulated sequences representative of each population were used to generate artificial Illumina reads to test the influence of possible sequencing errors with 'ART' ([Huang et al. 2011](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr708)) and to mimic the original sequence libraries.
```
art_illumina -sam -na -ss HS20 -i mm10_0.008450_SPRE.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.008450_SPRE.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.
```
_used software:_
+ ART_Illumina Q Version 2.5.8 (June 7, 2016)
### Mapping and data post-processing
Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'bwamem' ([Li et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/)), followed by sorting, marking and removing duplicates with the picard software suite (<https://broadinstitute.github.io/picard/>) and an indel realignment step with 'GATK' ([McKenna et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2928508/)) as described in Harr et al. 2016 (<http://www.nature.com/articles/sdata201675>).
_used software:_
+ bwa 0.7.12-r1039
+ picard.jar 2.9.2-SNAPSHOT
+ GenomeAnalysisTK.jar v3.7
Masking, SNP calling, FASTA sequence construction and dK80 calculation was performed as described above.
## Data visualization and availability
All data can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/
See the readme.txt file for a detailed description of the content.
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/readme.txt
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment