Gitlab Community Edition Instance

Skip to content
Snippets Groups Projects
Name Last commit Last update
scripts
README.md

Scripts for the Publication:

Introgression patterns between house mouse subspecies and species reveal genomic windows of frequent exchange

Ullrich KK, Linnenbrink M, Tautz D

Data sources

Genome mapping files for Mus musculus domesticus GER, Mus musculus domesticus FRA, Mus musculus domesticus IRA, Mus musculus musculus AFG, Mus musculus castaneus CAS and Mus spretus SPRE were obtained from http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_domesticus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_musculus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_m_castaneus/genomes_bam/, http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/m_spretus/genomes_bam/.

For mapping details please look into the original publication (Harr et al. 2016) http://www.nature.com/article-assets/npg/sdata/2016/sdata201675/extref/sdata201675-s7.docx.

Get masking regions for individual samples and natural populations

For masking genomic regions in natural populations which showed low coverage based on the genomic mapping BAM files we only considered the stable chromosomes (chr1-chr19,chrX,chrY,chrM) from the reference GRCm38 mm10 http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/mouse/.

The BAM files were processed with 'genomeCoverageBed' from the bedtools software suite (Quinlan and Hall 2010) to obtain site specific genome coverage and further united with 'unionBedGraphs'.

Population specific masking

The per population combined coverage was further processed to only retain regions with a coverage smaller than 5 resulting as the masking regions.

genomeCoverageBed example for the population Mus musculus musculus AFG:

#example for populaton Mmm_AFG:

#used BAM files:

#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam

REFERENCE=mm10.fasta

for file in *.bam; do genomeCoverageBed -ibam $file -bga -g $REFFERENCE > $file".bga";done

unionBedGraphs example for the population Mus musculus musculus AFG:

INPUT1=AFG1_396.bam.bga
INPUT2=AFG2_413.bam.bga
INPUT3=AFG3_416.bam.bga
INPUT4=AFG4_424.bam.bga
INPUT5=AFG5_435.bam.bga
INPUT6=AFG6_444.bam.bga

OUTPUT=Mmm_AFG.combined.bga

unionBedGraphs -i $INPUT1 $INPUT2 $INPUT3 $INPUT4 $INPUT5 $INPUT6 | awk -v OFS='\t' 'BEGIN {sum=0} {for (i=4; i<=NF; i++) sum+=$i; print $1,$2,$3,sum; sum=0}' > $OUTPUT

get masking region example for the population Mus musculus musculus AFG:

INPUT=Mmm_AFG.combined.bga
OUTPUT=Mmm_AFG.combined.bga.stcov5

awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT

United masked files generated can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/masking/

Individual specific masking

For individuals regions with a coverage smaller than 5 were used as masking regions.

get masking region example for individual 396 of the population Mus musculus musculus AFG:

INPUT=AFG1_396.bam.bga
OUTPUT=AFG1_396.bam.bga.stcov5

awk '{if($4<5) print $0}' $INPUT > $INPUT".stcov5"
bedtools merge -i $INPUT".stcov5" > $INPUT".stcov5.merge"
awk -v OFS='\t' '{print $1,$2,$3,4}' $INPUT".stcov5.merge" > $OUTPUT

used software:

SNP and INDEL calling

For SNP and INDEL calling the BAM files were processed with 'samtools mpileup' and 'bcftools call' (Li et al. 2011) with relaxed quality options to retain information in CNV regions. Each chromosome was analyzed seperately.

samtools mpileup | bcftools call example for chromosome 1 for the population Mus musculus musculus AFG:

#example for population Mmm_AFG:

#used BAM files:

#AFG1_396.bam
#AFG2_413.bam
#AFG3_416.bam
#AFG4_424.bam
#AFG5_435.bam
#AFG6_444.bam

REFERENCE=mm10.fasta

PLOIDY=ploidy.txt
#chrX    1       171031299       M       1
#chrY    1       91744698        M       1
#chrY    1       91744698        F       0
#chrM    1       16299   F       1
#chrM    1       16299   M       1
#*       *       *       M       2
#*       *       *       F       2

SAMPLES=AFG_samples.txt
#396     M
#413     M
#416     M
#424     M
#435     F
#444     M

OUTPUT=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf

INPUT1=AFG1_396.bam
INPUT2=AFG2_413.bam
INPUT3=AFG3_416.bam
INPUT4=AFG4_424.bam
INPUT5=AFG5_435.bam
INPUT6=AFG6_444.bam

BAMLIST=Mmm_AFG.list

echo $INPUT1 >> $BAMLIST
echo $INPUT2 >> $BAMLIST
echo $INPUT3 >> $BAMLIST
echo $INPUT4 >> $BAMLIST
echo $INPUT5 >> $BAMLIST
echo $INPUT6 >> $BAMLIST

#example for chromosome 1

samtools mpileup -q 0 -Q 10 -A -d 99999 -t DP,AD,ADF,ADR -r chr1 -uf $REFERENCE -b $BAMLIST | bcftools call -O v -f GQ -m -v --ploidy-file $PLOIDY -S $SAMPLES > $OUTPUT
bgzip $OUTPUT
tabix $OUTPUT".gz"

used software:

  • samtools 1.3.1 (using htslib 1.3.1)
  • bcftools 1.3.1 (using htslib 1.3.1)
  • bgzip v1.3
  • tabix v1.3

All mpileup generated for each population can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/mpileup/

NOTE: For each population all analyzed chromosomes were merged into one file.

K80 distance calculation

Get population specific SNPs

To get population specific CONSENSUS VCF files the VCF file produced with 'bcftools call' was first re-coded into population specific VCF files with 'vcftools' (Danecek et al. 2011). Further the population specific VCF file containing multiple individuals was parsed with 'vcfparser.py mvcf2consensus' to obtain a CONSENSUS VCF file for each population. This CONSENSUS VCF files were used to generate pseudo-genomes files per natural population with 'vcfparser.py vcf2fasta' using also the masking regions (see "Get masking regions for individual samples and natural populations").

vcftools example for chromosome 1 for the population Mus musculus musculus AFG:

#example for population Mmm_AFG:

#VCF IDs:

#396
#413
#416
#424
#435
#444

POPIDS=AFG.vcf.ids

echo "396" >> $POPIDS
echo "413" >> $POPIDS 
echo "416" >> $POPIDS
echo "424" >> $POPIDS
echo "435" >> $POPIDS
echo "444" >> $POPIDS

GZVCF=mpileup.q0Q10.chr1.Mmm_AFG.bcfcall.mv.vcf.gz
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels

vcftools --gzvcf $GZVCF --remove-indels --recode --recode-INFO-all --non-ref-ac-any 1 --keep $POPIDS --out $OUTPUT

All re-coded vcf files can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/recode/

NOTE: For each population all analyzed chromosomes were merged into one file.

vcfparser.py mvcf2consensus example for the population Mus musculus musculus AFG:

#example for population Mmm_AFG:

#VCF IDs:

#396
#413
#416
#424
#435
#444

INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus

python vcfparser.py mvcf2consensus -ivcf $INPUT -o $OUTPUT -cdp 11 -chr chr1 -samples 396,413,416,424,435,444 -id Mmm_AFG.mv

All CONSENSUS vcf files can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/vcf/consensus/

NOTE: For each population all analyzed chromosomes were merged into one file.

vcfparser.py vcf2fasta example for the population Mus musculus musculus AFG:

#example for population Mmm_AFG:

REFERENCE=mm10.fasta
INPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.vcf
OUTPUT=Mmm_AFG.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.consensus.chr1
MASKFILE=Mmm_AFG.combined.bga.stcov5

python vcfparser.py vcf2fasta -ivcf $INPUT -o $OUTPUT -R $REFERENCE -samples Mmm_AFG.mv -chr chr1 -ibga $MASKFILE -cov2N 4

All pseudo-genome FASTA files can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/

used software:

Calculate dK80 distance between populations using the CONSENSUS pseudo-genome files

To calculate dK80 distance between populations, quartets ([X],[Y],[Z],[O]) are used. Within the population quartets, dK80 is calculated for trios (e.g. [X],[Z],[O]) on non-verlapping sequence windows (w) throughout the investigated genome between this population triplet on a window (w_{i}) as: \begin{equation} dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}} \end{equation} where $d_{XO_{w_{i}}} and d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance (Kimura 1980) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' (Paradis et al. 2004) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window (w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package (Pages et al. 2009).

example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: Mus musculus musculus AFG:

#example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG

#change the bottom part of the script 'get_dK80.r' for each chromosome and quartet

#here you can find the example for chromosome 1

popX <- "Mmd_FRA"
popY <- "Mmd_GER"
popZ <- "Mmd_IRA"
popO <- "Mmm_AFG"

TMP_DIR <- "/tmp"

popX.pos <- 2
popY.pos <- 3
popZ.pos <- 4
popO.pos <- 5

SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/fasta/consensus/CAS_FRA_GER_IRA_AFG_SPRE.mpileup.q0Q10.chr1.bcfcall.mv.remIndels.recode.refmajorsample.ref.consensus.fasta"
chr <- "chr1"
OUT_FILE <- paste0(TMP_DIR,"/",popX,"_",popY,".",popZ,".",popO,".",chr,".tsv")

WSIZE <- 25000
WJUMP <- 25000

DISTMODEL <- "K80"

All dK80 files can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mpileup_pop_mv/browser_tracks/dK80/25kbp_sw/

NOTE: For each quartet comparison all analyzed chromosomes were merged into one file.

used software:

Simulation

Simulations were performed to evaluate dK80 under a phylogenetic scenario mimicking the real data without introgression, as well as to ensure that the mapping procedure does not lead to artefacts, especially in the copy number variable regions.

Generate simulated data

First, to provide the appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) of all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were then used as a distance matrix to calculate an UPGMA tree. The resulting UPGMA tree distances were used as a proxy to simulate chromosome 1 with the python script 'simdiv.py' taking the phylogenetic context into account.

#SPRE
python simdiv.py -i mm10.fasta -o mm10_0.008450_SPRE.fa -d 0.008450 -chr true
#ancsetral DMC
python simdiv.py -i mm10.fasta -o mm10_0.004725_DMC.fa -d 0.004725 -chr true
#ancestral MC
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.000639_MC.fa -d 0.000639 -chr true
#AFG
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.fa -d 0.003087 -chr true
#CAS
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.fa -d 0.003087 -chr true
#ancestral D
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.002246_D.fa -d 0.002246 -chr true
#newREF
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.001480_newREF.fa -d 0.001480 -chr true
#ancestral FGI
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -d 0.000262 -chr true
#IRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.fa -d 0.001217 -chr true
#ancestral FG
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -d 0.000201 -chr true
#FRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.fa -d 0.001016 -chr true
#GER
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.fa -d 0.001016 -chr true

used software:

Generate artificial illumina reads

Second, the simulated sequences representative of each population were used to generate artificial Illumina reads to test the influence of possible sequencing errors with 'ART' (Huang et al. 2011) and to mimic the original sequence libraries.

art_illumina -sam -na -ss HS20 -i mm10_0.008450_SPRE.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.008450_SPRE.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o  mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.

used software:

  • ART_Illumina Q Version 2.5.8 (June 7, 2016)

Mapping and data post-processing

Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'bwamem' (Li et al. 2009), followed by sorting, marking and removing duplicates with the picard software suite (https://broadinstitute.github.io/picard/) and an indel realignment step with 'GATK' (McKenna et al. 2010) as described in Harr et al. 2016 (http://www.nature.com/articles/sdata201675).

used software:

  • bwa 0.7.12-r1039
  • picard.jar 2.9.2-SNAPSHOT
  • GenomeAnalysisTK.jar v3.7

Masking, SNP calling, FASTA sequence construction and dK80 calculation was performed as described above.

Data visualization and availability

All data can be obtained from:

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/

See the readme.txt file for a detailed description of the content.

http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/readme.txt