# Scripts for the Publication: Ullrich KK, Tautz D ## Data sources Genome mapping files for _Mus musculus domesticus GER_, _Mus musculus domesticus FRA_, _Mus musculus domesticus IRA_, _Mus musculus musculus AFG_, _Mus musculus castaneus CAS_ and _Mus spretus SPRE_ were obtained from <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_domesticus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_musculus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_castaneus/genomes_bam/>, <http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_spretus/genomes_bam/>. For mapping details please look into the original publication ([Harr et al. 2016](http://www.nature.com/articles/sdata201675)) <http://www.nature.com/article-assets/npg/sdata/2016/sdata201675/extref/sdata201675-s7.docx>. ## Get masking regions for individual samples and natural populations For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes from the reference GRCm38 _mm10_ <http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/>. The BAM files were processed with 'genomeCoverageBed' to obtain site specific genome coverage and further united with 'unionBedGraphs'. ##### Population specific masking The per population combined coverage was further processed to only retain regions with a coverage smaller than 5 resulting as the masking regions. genomeCoverageBed example for the population _Mus musculus musculus AFG_: ``` #example for populaton Mmm_AFG: # #used BAM files: # #AFG1_396.bam #AFG2_413.bam #AFG3_416.bam #AFG4_424.bam #AFG5_435.bam #AFG6_444.bam $REFERENCE=mm10.fasta for file in *.bam; do genomeCoverageBed -ibam $file -bga -g $REFFERENCE > $file".bga";done ``` unionBedGraphs example for the population _Mus musculus musculus AFG_: ``` INPUT1=AFG1_396.bam.bga INPUT2=AFG2_413.bam.bga INPUT3=AFG3_416.bam.bga INPUT4=AFG4_424.bam.bga INPUT5=AFG5_435.bam.bga INPUT6=AFG6_444.bam.bga OUTPUT=Mmm_AFG.combined.bga unionBedGraphs -i $INPUT1 $INPUT2 $INPUT3 $INPUT4 $INPUT5 $INPUT6 | awk -v OFS='\t' 'BEGIN {sum=0} {for (i=4; i<=NF; i++) sum+=$i; print $1,$2,$3,sum; sum=0}' > $OUTPUT ``` get masking region example for the population _Mus musculus musculus AFG_: ``` INPUT=Mmm_AFG.combined.bga OUTPUT=Mmm_AFG.combined.bga.stcov5 awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5" bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge" awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT ``` ##### Individual specific masking For individuals regions with a coverage smaller than 5 were used as masking regions. get masking region example for individual 396 of the population _Mus musculus musculus AFG_: ``` INPUT=AFG1_396.bam.bga OUTPUT=AFG1_396.bam.bga.stcov5 awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5" bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge" awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT ``` _used software:_ + bedtools v2.24.0 <http://bedtools.readthedocs.io/en/latest/> + awk ## SNP and INDEL calling For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' with relaxed quality options to retain information in CNV regions. Due to the amount of data first mpileup files were generated for each BAM file and merged with 'bcftools merge'. After merging mpileup files, 'bcftools call' was used to generate the final VCF file. samtools mpileup | bcftools call example for the population _Mus musculus musculus AFG_: ``` #example for population Mmm_AFG: # #used BAM files: # #AFG1_396.bam #AFG2_413.bam #AFG3_416.bam #AFG4_424.bam #AFG5_435.bam #AFG6_444.bam #generating indiviual mpileup files for each BAM file $REFERENCE=mm10.fasta INPUT1=AFG1_396.bam INPUT2=AFG2_413.bam INPUT3=AFG3_416.bam INPUT4=AFG4_424.bam INPUT5=AFG5_435.bam INPUT6=AFG6_444.bam samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT1".mpileup.q0Q10.vcf" $INPUT1 samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT2".mpileup.q0Q10.vcf" $INPUT2 samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT3".mpileup.q0Q10.vcf" $INPUT3 samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT4".mpileup.q0Q10.vcf" $INPUT4 samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT5".mpileup.q0Q10.vcf" $INPUT5 samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT6".mpileup.q0Q10.vcf" $INPUT6 bgzip $INPUT1".mpileup.q0Q10.vcf" bgzip $INPUT2".mpileup.q0Q10.vcf" bgzip $INPUT3".mpileup.q0Q10.vcf" bgzip $INPUT4".mpileup.q0Q10.vcf" bgzip $INPUT5".mpileup.q0Q10.vcf" bgzip $INPUT6".mpileup.q0Q10.vcf" tabix $INPUT1".mpileup.q0Q10.vcf.gz" tabix $INPUT2".mpileup.q0Q10.vcf.gz" tabix $INPUT3".mpileup.q0Q10.vcf.gz" tabix $INPUT4".mpileup.q0Q10.vcf.gz" tabix $INPUT5".mpileup.q0Q10.vcf.gz" tabix $INPUT6".mpileup.q0Q10.vcf.gz" #merge individual mpileup files MPILEUPLIST=AFG.mpileup.list echo $INPUT1".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST echo $INPUT2".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST echo $INPUT3".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST echo $INPUT4".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST echo $INPUT5".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST echo $INPUT6".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST MPILEUPOUTPUT=Mmm_AFG.mpileup.q0Q10.vcf.gz bcftools merge -m all -O z -o $MPILEUPOUTPUT -l $MPILEUPLIST #call SNP and INDEL OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz bcftools call -O z -f GQ -m -v -o $OUTPUT $MPILEUPOUTPUT ``` _used software:_ + samtools 1.3.1-36-g613501f (using htslib 1.3.1-59-g0f2a88a) + bcftools 1.3.1-39-gd797e86 (using htslib 1.3.1-59-g0f2a88a) + bgzip v1.3 + tabix v1.3 ## K80 distance calculation ### Get population specific SNPs To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools'. Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations"). vcftools example for the population _Mus musculus musculus AFG_: ``` #example for population Mmm_AFG: # #VCF IDs: # #396 #413 #416 #424 #435 #444 POPIDS=AFG.vcf.ids echo "396" >> $POPIDS echo "413" >> $POPIDS echo "416" >> $POPIDS echo "424" >> $POPIDS echo "435" >> $POPIDS echo "444" >> $POPIDS GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --keep $POPIDS --out $OUTPUT ``` vcfparser.py mvcf2consensus example for the population _Mus musculus musculus AFG_: ``` #example for population Mmm_AFG: # #VCF IDs: # #396 #413 #416 #424 #435 #444 INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.vcf OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -samples 396,413,416,424,435,444 -id Mmm_AFG.mv ``` vcfparser.py vcf2fasta example for the population _Mus musculus musculus AFG_: ``` #example for population Mmm_AFG: # REFERENCE=mm10.fasta INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr MASKFILE=Mmm_AFG.combined.bga.stcov5 python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4 ``` _used software:_ + vcftools v0.1.15 + vcfparser.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py> ### Calculate K80 distance between populations using the CONSENSUS pseudo-genome files ## Nucleotide diversity calculations ### Get individual specific SNPs To get individual specific VCF files the VCF file produced with 'bcftools call' was first re-coded for each individual with 'vcftools'. Further the individual specific VCF files were used to generate pseudo-genomes files per individual with 'vcfparser.py vcf2fasta' using individual masking regions (see "Get masking regions for individual samples and natural populations"). vcftools example for individual 396 of the population _Mus musculus musculus AFG_: ``` #example for individual 396 of the population Mmm_AFG: # #VCF IDs: # #396 GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --indv 396 --out $OUTPUT ``` vcfparser.py vcf2fasta example for individual 396 of the population _Mus musculus musculus AFG_: ``` #example for individual 396 of the population Mmm_AFG: # REFERENCE=mm10.fasta INPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr MASKFILE=Mmm_AFG.combined.bga.stcov5 python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4 ``` _used software:_ + vcftools v0.1.15 + vcfparser.py <https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py> ### Calculate nucleotide diversity within each population _used software:_ + variscan v2.0.3 ## Dxy distance calculation ### Calculate Dxy distance between populations _used software:_ ### Calculate Dxy distance between individuals and populations _used software:_ ## Simulation