Scripts for the Publication:
Ullrich KK, Tautz D
Data sources
Genome mapping files for Mus musculus domesticus GER, Mus musculus domesticus FRA, Mus musculus domesticus IRA, Mus musculus musculus AFG, Mus musculus castaneus CAS and Mus spretus SPRE were obtained from http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_domesticus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_musculus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_castaneus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_spretus/genomes_bam/.
For mapping details please look into the original publication (Harr et al. 2016) http://www.nature.com/article-assets/npg/sdata/2016/sdata201675/extref/sdata201675-s7.docx.
Get masking regions for individual samples and natural populations
For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes from the reference GRCm38 mm10 http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/.
The BAM files were processed with 'genomeCoverageBed' to obtain site specific genome coverage and further united with 'unionBedGraphs'. The per population combined coverage was further processed to only retain regions with a coverage smaller than 5 resulting as the masking regions.
genomeCoverageBed example for the population Mus musculus musculus AFG:
#example for populaton Mmm_AFG:
#
#used BAM files:
#
#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam
$REFERENCE=mm10.fasta
for file in *.bam; do genomeCoverageBed -ibam $file -bga -g $REFFERENCE > $file".bga";done
unionBedGraphs example for the population Mus musculus musculus AFG:
INPUT1=AFG1_396.bam.bga
INPUT2=AFG2_413.bam.bga
INPUT3=AFG3_416.bam.bga
INPUT4=AFG4_424.bam.bga
INPUT5=AFG5_435.bam.bga
INPUT6=AFG6_444.bam.bga
OUTPUT=Mmm_AFG.combined.bga
unionBedGraphs -i $INPUT1 $INPUT2 $INPUT3 $INPUT4 $INPUT5 $INPUT6 | awk -v OFS='\t' 'BEGIN {sum=0} {for (i=4: i<=NF; i++) sum+=$1; print $1,$2,$3,sum; sum=0}' > $OUTPUT
get masking region example for the population Mus musculus musculus AFG:
INPUT=Mmm_AFG.combined.bga
OUTPUT=Mmm_AFG.combined.bga.stcov5
awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
get masking region example for individual 396 of the population Mus musculus musculus AFG:
INPUT=AFG1_396.bam.bga
OUTPUT=AFG1_396.bam.bga.stcov5
awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT
used software:
- bedtools v2.24.0 http://bedtools.readthedocs.io/en/latest/
- awk
SNP and INDEL calling
For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' with relaxed quality options to retain information in CNV regions. Due to the amount of data first mpileup files were generated for each BAM file and merged with 'bcftools merge'. After merging mpileup files, 'bcftools call' was used to generate the final VCF file.
samtools mpileup | bcftools call example for the population Mus musculus musculus AFG:
#example for population Mmm_AFG:
#
#used BAM files:
#
#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam
#generating indiviual mpileup files for each BAM file
$REFERENCE=mm10.fasta
INPUT1=AFG1_396.bam
INPUT2=AFG2_413.bam
INPUT3=AFG3_416.bam
INPUT4=AFG4_424.bam
INPUT5=AFG5_435.bam
INPUT6=AFG6_444.bam
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT1".mpileup.q0Q10.vcf" $INPUT1
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT2".mpileup.q0Q10.vcf" $INPUT2
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT3".mpileup.q0Q10.vcf" $INPUT3
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT4".mpileup.q0Q10.vcf" $INPUT4
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT5".mpileup.q0Q10.vcf" $INPUT5
samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -vf $REFERENCE -o $INPUT6".mpileup.q0Q10.vcf" $INPUT6
bgzip $INPUT1".mpileup.q0Q10.vcf"
bgzip $INPUT2".mpileup.q0Q10.vcf"
bgzip $INPUT3".mpileup.q0Q10.vcf"
bgzip $INPUT4".mpileup.q0Q10.vcf"
bgzip $INPUT5".mpileup.q0Q10.vcf"
bgzip $INPUT6".mpileup.q0Q10.vcf"
tabix $INPUT1".mpileup.q0Q10.vcf.gz"
tabix $INPUT2".mpileup.q0Q10.vcf.gz"
tabix $INPUT3".mpileup.q0Q10.vcf.gz"
tabix $INPUT4".mpileup.q0Q10.vcf.gz"
tabix $INPUT5".mpileup.q0Q10.vcf.gz"
tabix $INPUT6".mpileup.q0Q10.vcf.gz"
#merge individual mpileup files
MPILEUPLIST=AFG.mpileup.list
echo $INPUT1".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
echo $INPUT2".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
echo $INPUT3".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
echo $INPUT4".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
echo $INPUT5".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
echo $INPUT6".mpileup.q0Q10.vcf.gz" >> $MPILEUPLIST
MPILEUPOUTPUT=Mmm_AFG.mpileup.q0Q10.vcf.gz
bcftools merge -m all -O z -o $MPILEUPOUTPUT -l $MPILEUPLIST
#call SNP and INDEL
OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
bcftools call -O z -f GQ -m -v -o $OUTPUT $MPILEUPOUTPUT
used software:
- samtools 1.3.1-36-g613501f (using htslib 1.3.1-59-g0f2a88a)
- bcftools 1.3.1-39-gd797e86 (using htslib 1.3.1-59-g0f2a88a)
- bgzip v1.3
- tabix v1.3
K80 distance calculation
Get population specific SNPs
To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools'. Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations").
vcftools example for the population Mus musculus musculus AFG:
#example for population Mmm_AFG:
#
#VCF IDs:
#
#396
#413
#416
#424
#435
#444
POPIDS=AFG.vcf.ids
echo "396" >> $POPIDS
echo "413" >> $POPIDS
echo "416" >> $POPIDS
echo "424" >> $POPIDS
echo "435" >> $POPIDS
echo "444" >> $POPIDS
GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels
vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --keep $POPIDS --out $OUTPUT
vcfparser.py mvcf2consensus example for the population Mus musculus musculus AFG:
#example for population Mmm_AFG:
#
#VCF IDs:
#
#396
#413
#416
#424
#435
#444
INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus
python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -samples 396,413,416,424,435,444 -id Mmm_AFG.mv
vcfparser.py vcf2fasta example for the population Mus musculus musculus AFG:
#example for population Mmm_AFG:
#
REFERENCE=mm10.fasta
INPUT=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr
MASKFILE=Mmm_AFG.combined.bga.stcov5
python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4
used software:
- vcftools v0.1.15
- vcfparser.py https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py
Calculate K80 distance between populations using the CONSENSUS pseudo-genome files
Nucleotide diversity calculations
Get individual specific SNPs
To get individual specific VCF files the VCF file produced with 'bcftools call' was first re-coded for each individual with 'vcftools'. Further the individual specific VCF files were used to generate pseudo-genomes files per individual with 'vcfparser.py vcf2fasta' using individual masking regions (see "Get masking regions for individual samples and natural populations").
vcftools example for individual 396 of the population Mus musculus musculus AFG:
#example for individual 396 of the population Mmm_AFG:
#
#VCF IDs:
#
#396
GZVCF=Mmm_AFG.mpileup.q0Q10.bcfcall.mv.vcf.gz
OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels
vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --indv 396 --out $OUTPUT
vcfparser.py vcf2fasta example for individual 396 of the population Mus musculus musculus AFG:
#example for individual 396 of the population Mmm_AFG:
#
REFERENCE=mm10.fasta
INPUT=Mmm_AFG1.396.mpileup.q0Q10.bcfcall.mv.remIndels.recode.consensus.vcf
OUTPUT=Mmm_AFG1.396.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr
MASKFILE=Mmm_AFG.combined.bga.stcov5
python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chrX,chrY -ibga $MASKFILE -cov2N 4
used software:
- vcftools v0.1.15
- vcfparser.py https://gitlab.gwdg.de/evolgen/introgression/blob/master/scripts/vcfparser.py
Calculate nucleotide diversity within each population
used software:
- variscan v2.0.3
Dxy distance calculation
Calculate Dxy distance between populations
used software:
Calculate Dxy distance between individuals and populations
used software: