To simulate genomes, we first estimated the number of pair-wise segregating sites from the CONSENSUS pseudo-genome files, including the mm10 reference.
Simulations were performed to evaluate dK80 under a phylogenetic scenario that mimics the real data without introgression, and to ensure that the mapping procedure does not introduce artefacts, especially in copy-number-variable regions.
### Generate simulated data
First, to provide an appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) on all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were assembled into a distance matrix, from which a UPGMA tree was calculated. The branch lengths of this UPGMA tree were then used as a proxy to simulate chromosome 1 with the Python script 'simdiv.py', taking the phylogenetic context into account.
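The counting and clustering steps can be illustrated with a minimal sketch. It is not taken from 'simdiv.py' or the original scripts: the function names (`pairwise_counts`, `distance_matrix`, `upgma`) are hypothetical, sequences are assumed to be aligned strings of equal length, and any character other than A, C, G or T is treated as missing. The distance used here (polymorphic sites divided by informative sites) is also an assumption about how the counts were combined.

```python
from itertools import combinations

VALID = set("ACGT")  # assumption: everything else (N, -, ...) counts as missing


def pairwise_counts(seq_a, seq_b):
    """Return (polymorphic, informative) site counts for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    polymorphic = informative = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID and b in VALID:  # skip pair-wise missing sites
            informative += 1
            if a != b:
                polymorphic += 1
    return polymorphic, informative


def distance_matrix(seqs):
    """Map each unordered pair of names to polymorphic / informative sites."""
    dist = {}
    for x, y in combinations(sorted(seqs), 2):
        poly, info = pairwise_counts(seqs[x], seqs[y])
        dist[frozenset((x, y))] = poly / info if info else float("nan")
    return dist


def upgma(dist, names):
    """Average-linkage clustering; returns nested (left, right, height) tuples."""
    clusters = {name: (name, 1) for name in names}  # key -> (subtree, size)
    d = dict(dist)
    next_id = 0
    while len(clusters) > 1:
        pair = min(d, key=d.get)  # closest pair of current clusters
        a, b = sorted(pair, key=str)
        height = d[pair] / 2  # node height in an ultrametric tree
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        new = f"#node{next_id}"
        next_id += 1
        for c in clusters:  # size-weighted average distance to the merged cluster
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters[new] = ((tree_a, tree_b, height), n_a + n_b)
    return next(iter(clusters.values()))[0]
```

For real chromosomes, the sequences would be read from the CONSENSUS FASTA files and the counts summed over all autosomes before computing the distances.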
Second, the simulated sequences representing each population were used to generate artificial Illumina reads with 'ART' ([Huang et al. 2011](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr708)), both to test the influence of possible sequencing errors and to mimic the original sequence libraries.
Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'bwa mem' ([Li et al. 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/)), followed by sorting, marking and removing duplicates with the Picard software suite (<https://broadinstitute.github.io/picard/>) and an indel realignment step with 'GATK' ([McKenna et al. 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2928508/)), as described in Harr et al. 2016 (<http://www.nature.com/articles/sdata201675>).
_used software:_
+ bwa 0.7.12-r1039
+ picard.jar 2.9.2-SNAPSHOT
+ GenomeAnalysisTK.jar v3.7
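The mapping steps described above could be assembled roughly as in the following sketch. It only builds the command lines and does not run them, and it is not the authors' exact invocation: the helper name `mapping_commands`, all file names, and the read-group string are hypothetical. It also assumes that the reference has already been indexed (`bwa index`, `samtools faidx`, Picard `CreateSequenceDictionary`), and it uses the legacy `KEY=value` Picard syntax and the GATK 3 `-T` walker syntax that match the versions listed above.

```python
import shlex


def mapping_commands(ref, fq1, fq2, sample,
                     picard_jar="picard.jar",
                     gatk_jar="GenomeAnalysisTK.jar"):
    """Return a list of (argv, stdout_file) pairs; stdout_file is None if unused."""
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dedup.metrics.txt"
    intervals = f"{sample}.realign.intervals"
    realigned = f"{sample}.realigned.bam"
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}"  # hypothetical read group
    picard = ["java", "-jar", picard_jar]
    gatk = ["java", "-jar", gatk_jar]
    return [
        # bwa mem writes SAM to stdout, so the output is redirected to a file
        (["bwa", "mem", "-R", read_group, ref, fq1, fq2], sam),
        (picard + ["SortSam", f"I={sam}", f"O={sorted_bam}",
                   "SORT_ORDER=coordinate"], None),
        (picard + ["MarkDuplicates", f"I={sorted_bam}", f"O={dedup_bam}",
                   f"M={metrics}", "REMOVE_DUPLICATES=true"], None),
        (picard + ["BuildBamIndex", f"I={dedup_bam}"], None),
        (gatk + ["-T", "RealignerTargetCreator", "-R", ref,
                 "-I", dedup_bam, "-o", intervals], None),
        (gatk + ["-T", "IndelRealigner", "-R", ref, "-I", dedup_bam,
                 "-targetIntervals", intervals, "-o", realigned], None),
    ]


def as_shell(commands):
    """Render the commands as shell lines for inspection or logging."""
    lines = []
    for argv, stdout_file in commands:
        line = shlex.join(argv)
        if stdout_file:
            line += " > " + shlex.quote(stdout_file)
        lines.append(line)
    return lines
```

Calling `as_shell(mapping_commands("sim_ref.fa", "pop1_1.fq", "pop1_2.fq", "pop1"))` returns one shell line per step, which can be reviewed before running anything with `subprocess.run`.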
Masking, SNP calling, FASTA sequence construction and dK80 calculation were performed as described above.