Scripts for the Publication:
Introgression patterns between house mouse subspecies and species reveal genomic windows of frequent exchange
Ullrich KK, Linnenbrink M, McConnell E, Tautz D
Data sources
Genomic sequence data files for Mus musculus domesticus GER, Mus musculus domesticus FRA, Mus musculus domesticus IRA, Mus musculus musculus AFG, Mus musculus castaneus CAS and Mus spretus SPRE were obtained from the European Nucleotide Archive https://www.ebi.ac.uk/ena according to table 1 from the original publication (Harr et al. 2016) https://www.nature.com/articles/sdata201675/tables/1 (see also https://www.ebi.ac.uk/ena/data/view/PRJEB9450, https://www.ebi.ac.uk/ena/data/view/PRJEB11742, https://www.ebi.ac.uk/ena/data/view/PRJEB14167, https://www.ebi.ac.uk/ena/data/view/PRJEB2176).
Mapping
Genome sequencing reads were mapped against mm10 version of the mouse reference genome (GRCm38/mm10) with 'ngm' (Sedlazeck et al. 2013), followed by sorting, marking and removing duplicates with the picard software suite (https://broadinstitute.github.io/picard/).
#example for one Mus musculus musculus individual - AFG1 - 396
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/005/ERR1425295/ERR1425295_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/006/ERR1425296/ERR1425296_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/007/ERR1425297/ERR1425297_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/008/ERR1425298/ERR1425298_1.fastq.gz
zcat ERR1425295_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425296_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425297_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425298_1.fastq.gz >> Mmm_AFG.396.1.fq
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/005/ERR1425295/ERR1425295_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/006/ERR1425296/ERR1425296_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/007/ERR1425297/ERR1425297_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/008/ERR1425298/ERR1425298_2.fastq.gz
zcat ERR1425295_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425296_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425297_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425298_2.fastq.gz >> Mmm_AFG.396.2.fq
FWD=Mmm_AFG.396.1.fq
REV=Mmm_AFG.396.2.fq
NGMOUTPUT=Mmm_AFG.396.mm10.sam
RGID=396
RGSM=396
ngm -1 $FWD -2 $REV -r mm10.fasta -o $NGMOUTPUT --no-unal --sensitive -t 24 --no-progress --rg-id $RGID --rg-sm $RGSM --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b
NGMSORTEDOUTPUT=Mmm_AFG.396.mm10.sorted.bam
java -jar picard.jar SortSam I=$NGMOUTPUT O=$NGMSORTEDOUTPUT SO=coordinate
NODUPOUTPUT=Mmm_AFG.396.mm10.sorted.nodup.bam
DUPMETRICS=Mmm_AFG.396.mm10.sorted.duplicate.metrics
java -jar picard.jar MarkDuplicates I=$NGMSORTEDOUTPUT O=$NODUPOUTPUT M=$DUPMETRICS REMOVE_DUPLICATES=true
used software:
- NextGenMap 0.5.3
- picard.jar 2.9.2-SNAPSHOT
Construction of consensus sequences for natural populations
To construct population specific consensus sequences, we used ANGSD (Korneliussen et al. 2014) and the 'doFasta 2' option.
CONSENSUS FASTA
In addition to the 'doFasta 2' optin, which extracts the most common nucleotide, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population Mus musculus musculus AFG:
#example for populaton Mmm_AFG for chr1
#used BAM files:
#Mmm_AFG.396.mm10.sorted.nodup.bam
#Mmm_AFG.413.mm10.sorted.nodup.bam
#Mmm_AFG.416.mm10.sorted.nodup.bam
#Mmm_AFG.424.mm10.sorted.nodup.bam
#Mmm_AFG.435.mm10.sorted.nodup.bam
#Mmm_AFG.444.mm10.sorted.nodup.bam
echo Mmm_AFG.396.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.413.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.416.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.424.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.435.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.444.mm10.sorted.nodup.bam >> AFG.list
angsd -doFasta 2 -doCounts 1 -minQ 13 -uniqueOnly -setMinDepthInd 5 -setMaxDepthInd 100 -b AFG.list -r chr1 -out AFG.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.chr1
used software:
- ANGSD v0.919 http://www.popgen.dk/angsd/index.php/ANGSD
NOTE: Population consensus sequences were merged for each chromosome.
All CONSENSUS FASTA files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/pop/
Construction of individual sequences using IUPAC code
To construct individual sequences, we used ANGSD and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.
INDIVIDUAL FASTA
In addition to the 'doFasta 4' optin, which extracts IUPAC code, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population Mus musculus musculus AFG:
#example for one Mus musculus musculus individual - AFG1 - 396 for chr1
#used BAM file:
#Mmm_AFG.396.mm10.sorted.nodup.bam
angsd -doFasta 4 -doCounts 1 -minQ 13 -uniqueOnly -setMinDepthInd 5 -setMaxDepthInd 100 -i Mmm_AFG.396.mm10.sorted.nodup.bam -r chr1 -out AFG.396.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.chr1
used software:
- ANGSD v0.919 http://www.popgen.dk/angsd/index.php/ANGSD
NOTE: Individual sequences were merged for each chromosome.
All INDIVIDUAL FASTA files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/
K80 distance calculation
Calculate dK80 distance between populations using the CONSENSUS pseudo-genome files
To calculate dK80 distance between populations, quartets ([X],[Y],[Z],[O]) are used. Within the population quartets, dK80 is calculated for trios (e.g. [X],[Z],[O]) on non-verlapping sequence windows (w) throughout the investigated genome between this population triplet on a window (w_{i}) as: \begin{equation} dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}} \end{equation} where $d_{XO_{w_{i}}} and d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance (Kimura 1980) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' (Paradis et al. 2004) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window (w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package (Pages et al. 2009).
example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: Mus musculus musculus AFG:
#example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG
#change the bottom part of the script 'get_dK80.r' for each chromosome and quartet
#here you can find the example for chromosome 1
popX <- "Mmd_FRA"
popY <- "Mmd_GER"
popZ <- "Mmd_IRA"
popO <- "Mmm_AFG"
TMP_DIR <- "/tmp"
popX.pos <- 2
popY.pos <- 3
popZ.pos <- 4
popO.pos <- 1
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/pop/chr1.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
chr <- "chr1"
OUT_FILE <- paste0(TMP_DIR,"/",popX,"_",popY,".",popZ,".",popO,".",chr,".tsv")
WSIZE <- 25000
WJUMP <- 25000
DISTMODEL <- "K80"
#to run the script
nohup nice R CMD BATCH --vanilla get_dK80.r &
#result file will be in this case:
tail -f /tmp/Mmd_FRA_Mmd_GER.Mmd_IRA.Mmm_AFG.chr1.tsv
All dK80 files can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/dK80/
NOTE: For each quartet comparison all analyzed chromosomes were merged into one file.
used software:
- R version 3.4.1 (2017-06-30)
- R package ape_4.1
- R package Biostrings_2.40.2
- R package S4Vectors_0.10.3
- R package XVector_0.12.1
- R package IRanges_2.6.1
- R package BiocGenerics_0.18.0
- get_dK80.r https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_dK80.r
TWISST analysis
Simulation
Simulations were performed to evaluate dK80 under a phylogenetic scenario mimicking the real data without introgression, as well as to ensure that the mapping procedure does not lead to artefacts, especially in the copy number variable regions.
Generate simulated data
First, to provide the appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) of all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were then used as a distance matrix to calculate an UPGMA tree. The resulting UPGMA tree distances were used as a proxy to simulate chromosome 1 with the python script 'simdiv.py' taking the phylogenetic context into account.
#SPRE
python simdiv.py -i mm10.fasta -o mm10_0.008450_SPRE.fa -d 0.008450 -chr true
#ancsetral DMC
python simdiv.py -i mm10.fasta -o mm10_0.004725_DMC.fa -d 0.004725 -chr true
#ancestral MC
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.000639_MC.fa -d 0.000639 -chr true
#AFG
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.fa -d 0.003087 -chr true
#CAS
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.fa -d 0.003087 -chr true
#ancestral D
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.002246_D.fa -d 0.002246 -chr true
#newREF
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.001480_newREF.fa -d 0.001480 -chr true
#ancestral FGI
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -d 0.000262 -chr true
#IRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.fa -d 0.001217 -chr true
#ancestral FG
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -d 0.000201 -chr true
#FRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.fa -d 0.001016 -chr true
#GER
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.fa -d 0.001016 -chr true
used software:
- R version 3.4.1 (2017-06-30)
- simdiv.py https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/simdiv.py
Generate artificial illumina reads
Second, the simulated sequences representative of each population were used to generate artificial Illumina reads to test the influence of possible sequencing errors with 'ART' (Huang et al. 2011) and to mimic the original sequence libraries.
art_illumina -sam -na -ss HS20 -i mm10_0.008450_SPRE.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.008450_SPRE.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.
used software:
- ART_Illumina Q Version 2.5.8 (June 7, 2016)
Mapping and data post-processing
Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'ngm' (Sedlazeck et al. 2013), followed by sorting, marking and removing duplicates with the picard software suite (https://broadinstitute.github.io/picard/).
used software:
- NextGenMap 0.5.3
- picard.jar 2.9.2-SNAPSHOT
FASTA sequence construction and dK80 calculation was performed as described above.
Data visualization and availability
All data can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/
See the readme.txt file for a detailed description of the content.
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/readme.txt