Newer
Older
Introgression patterns between house mouse subspecies and species reveal genomic windows of frequent exchange

Kristian Ullrich
committed
Ullrich KK, Linnenbrink M, McConnell E, Tautz D

Kristian Ullrich
committed
Genomic sequence data files for _Mus musculus domesticus GER_, _Mus musculus domesticus FRA_, _Mus musculus domesticus IRA_, _Mus musculus musculus AFG_, _Mus musculus castaneus CAS_ and _Mus spretus SPRE_ were obtained from the European Nucleotide Archive <https://www.ebi.ac.uk/ena> according to table 1 from the original publication ([Harr et al. 2016](http://www.nature.com/articles/sdata201675)) <https://www.nature.com/articles/sdata201675/tables/1> (see also <https://www.ebi.ac.uk/ena/data/view/PRJEB9450>, <https://www.ebi.ac.uk/ena/data/view/PRJEB11742>, <https://www.ebi.ac.uk/ena/data/view/PRJEB14167>, <https://www.ebi.ac.uk/ena/data/view/PRJEB2176>).

Kristian Ullrich
committed
## Mapping

Kristian Ullrich
committed
Genome sequencing reads were mapped against mm10 version of the mouse reference genome (GRCm38/mm10) with 'ngm' ([Sedlazeck et al. 2013](https://www.ncbi.nlm.nih.gov/pubmed/23975764)), followed by sorting, marking and removing duplicates with the picard software suite (<https://broadinstitute.github.io/picard/>).

Kristian Ullrich
committed
#example for one Mus musculus musculus individual - AFG1 - 396
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/005/ERR1425295/ERR1425295_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/006/ERR1425296/ERR1425296_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/007/ERR1425297/ERR1425297_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/008/ERR1425298/ERR1425298_1.fastq.gz

Kristian Ullrich
committed
zcat ERR1425295_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425296_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425297_1.fastq.gz >> Mmm_AFG.396.1.fq
zcat ERR1425298_1.fastq.gz >> Mmm_AFG.396.1.fq

Kristian Ullrich
committed
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/005/ERR1425295/ERR1425295_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/006/ERR1425296/ERR1425296_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/007/ERR1425297/ERR1425297_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR142/008/ERR1425298/ERR1425298_2.fastq.gz

Kristian Ullrich
committed
zcat ERR1425295_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425296_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425297_2.fastq.gz >> Mmm_AFG.396.2.fq
zcat ERR1425298_2.fastq.gz >> Mmm_AFG.396.2.fq

Kristian Ullrich
committed
FWD=Mmm_AFG.396.1.fq
REV=Mmm_AFG.396.2.fq
NGMOUTPUT=Mmm_AFG.396.mm10.sam
RGID=396
RGSM=396

Kristian Ullrich
committed
ngm -1 $FWD -2 $REV -r mm10.fasta -o $NGMOUTPUT --no-unal --sensitive -t 24 --no-progress --rg-id $RGID --rg-sm $RGSM --rg-lb lib1 --rg-pl illumina --rg-pu unit1 -b

Kristian Ullrich
committed
NGMSORTEDOUTPUT=Mmm_AFG.396.mm10.sorted.bam

Kristian Ullrich
committed
java -jar picard.jar SortSam I=$NGMOUTPUT O=$NGMSORTEDOUTPUT SO=coordinate

Kristian Ullrich
committed
NODUPOUTPUT=Mmm_AFG.396.mm10.sorted.nodup.bam
DUPMETRICS=Mmm_AFG.396.mm10.sorted.duplicate.metrics

Kristian Ullrich
committed
java -jar picard.jar MarkDuplicates I=$NGMSORTEDOUTPUT O=$NODUPOUTPUT M=$DUPMETRICS REMOVE_DUPLICATES=true

Kristian Ullrich
committed
+ NextGenMap 0.5.3
+ picard.jar 2.9.2-SNAPSHOT

Kristian Ullrich
committed
## Construction of consensus sequences for natural populations
To construct population specific consensus sequences, we used 'ANGSD' ([Korneliussen et al. 2014](http://www.biomedcentral.com/1471-2105/15/356/abstract)) and the 'doFasta 2' option.

Kristian Ullrich
committed
##### CONSENSUS FASTA

Kristian Ullrich
committed
In addition to the 'doFasta 2' optin, which extracts the most common nucleotide, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_

Kristian Ullrich
committed
#example for populaton Mmm_AFG for chr1

Kristian Ullrich
committed
#used BAM files:

Kristian Ullrich
committed
#Mmm_AFG.396.mm10.sorted.nodup.bam
#Mmm_AFG.413.mm10.sorted.nodup.bam
#Mmm_AFG.416.mm10.sorted.nodup.bam
#Mmm_AFG.424.mm10.sorted.nodup.bam
#Mmm_AFG.435.mm10.sorted.nodup.bam
#Mmm_AFG.444.mm10.sorted.nodup.bam
echo Mmm_AFG.396.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.413.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.416.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.424.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.435.mm10.sorted.nodup.bam >> AFG.list
echo Mmm_AFG.444.mm10.sorted.nodup.bam >> AFG.list
angsd -doFasta 2 -doCounts 1 -minQ 13 -uniqueOnly -setMinDepthInd 5 -setMaxDepthInd 100 -b AFG.list -r chr1 -out AFG.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.chr1

Kristian Ullrich
committed
_used software:_

Kristian Ullrich
committed
+ ANGSD v0.919 <http://www.popgen.dk/angsd/index.php/ANGSD>

Kristian Ullrich
committed
NOTE: Population consensus sequences were merged for each chromosome.

Kristian Ullrich
committed
All CONSENSUS FASTA files can be obtained from:

Kristian Ullrich
committed
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/pop/>

Kristian Ullrich
committed
## Construction of individual sequences using IUPAC code
To construct individual sequences, we used 'ANGSD' and the 'doFasta 4' option. For this option, after all filters have been applied, all multi-allelic sites will be encoded as IUPAC code.

Kristian Ullrich
committed
##### INDIVIDUAL FASTA

Kristian Ullrich
committed
In addition to the 'doFasta 4' optin, which extracts IUPAC code, other filters like 'minQ' and coverage filters like 'setMinDepthInd' and 'setMaxDepthInd' were applied.
CONSENSUS FASTA example for the population _Mus musculus musculus AFG_

Kristian Ullrich
committed
#example for one Mus musculus musculus individual - AFG1 - 396 for chr1

Kristian Ullrich
committed
#used BAM file:

Kristian Ullrich
committed
#Mmm_AFG.396.mm10.sorted.nodup.bam

Kristian Ullrich
committed
angsd -doFasta 4 -doCounts 1 -minQ 13 -uniqueOnly -setMinDepthInd 5 -setMaxDepthInd 100 -i Mmm_AFG.396.mm10.sorted.nodup.bam -r chr1 -out AFG.396.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.chr1

Kristian Ullrich
committed
_used software:_

Kristian Ullrich
committed
+ ANGSD v0.919 <http://www.popgen.dk/angsd/index.php/ANGSD>

Kristian Ullrich
committed
NOTE: Individual sequences were merged for each chromosome.

Kristian Ullrich
committed
All INDIVIDUAL FASTA files can be obtained from:

Kristian Ullrich
committed
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/>

Kristian Ullrich
committed
## K80 distance calculation
### Calculate dK80 distance between populations using the CONSENSUS pseudo-genome files
To calculate dK80 distance between populations, quartets ([X],[Y],[Z],[O]) are used. Within the population quartets, dK80 is calculated for trios (e.g. [X],[Z],[O]) on non-verlapping sequence windows ($w$) throughout the investigated genome between this population triplet on a window ($w_{i}$) as:
\begin{equation}
dK80_{XZO_{w_{i}}} = d_{XO_{w_{i}}} - d_{XZ_{w_{i}}}
\end{equation}
where $d_{XO_{w_{i}}} and $d_{XZ_{w_{i}}} are defined as the average Kimura's 2-parameter sequence distance ([Kimura 1980](https://www.ncbi.nlm.nih.gov/pubmed/7463489)) between the corresponding two populations calculated with the function 'dist.dna' of the of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) using the model 'K80'. Prior the calculation of dK80 all sites with missing data within the specified window ($w_{i}$) and the specified populations were removed across the whole quartet with the 'Biostrings' R package ([Pages et al. 2009](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)).
example for chromosome 1 for the dK80 calculation for the quartet [X]: _Mus musculus domesticus FRA; [Y]: _Mus musculus domesticus GER; [Z]: _Mus musculus domesticus IRA; [O]: _Mus musculus musculus AFG_
```
#example for the quartet [X]: FRA; [Y]: GER; [Z]: IRA; [O]: AFG
#change the bottom part of the script 'get_dK80.r' for each chromosome and quartet
#here you can find the example for chromosome 1
popX <- "Mmd_FRA"
popY <- "Mmd_GER"
popZ <- "Mmd_IRA"
popO <- "Mmm_AFG"
TMP_DIR <- "/tmp"
popX.pos <- 2
popY.pos <- 3
popZ.pos <- 4

Kristian Ullrich
committed
popO.pos <- 1

Kristian Ullrich
committed
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/pop/chr1.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
chr <- "chr1"
OUT_FILE <- paste0(TMP_DIR,"/",popX,"_",popY,".",popZ,".",popO,".",chr,".tsv")
WSIZE <- 25000
WJUMP <- 25000
DISTMODEL <- "K80"

Kristian Ullrich
committed
#to run the script
nohup nice R CMD BATCH --vanilla get_dK80.r &

Kristian Ullrich
committed
tail -f /tmp/Mmd_FRA_Mmd_GER.Mmd_IRA.Mmm_AFG.chr1.tsv
```
All dK80 files can be obtained from:

Kristian Ullrich
committed
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/dK80/
NOTE: For each quartet comparison all analyzed chromosomes were merged into one file.
_used software:_
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0

Kristian Ullrich
committed
+ get_dK80.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_dK80.r>
## TWISST analysis
Phylogenetic trees per 25kbp windows were calculated with the 'bionjs' function of the R package 'ape' ([Paradis et al. 2004](https://academic.oup.com/bioinformatics/article/20/2/289/204981/APE-Analyses-of-Phylogenetics-and-Evolution-in-R)) and used as input files for 'twisst' ([Martin and Belleghem 2017](http://www.genetics.org/content/early/2017/03/21/genetics.116.194720)). The 'bionjs' function can reconstruct a phylogenetic tree from a distance matrix with possibly missing values. We developed an R package (https://github.com/kullrich/distIUPAC) to calculate pairwise 'literal distances' using IUPAC encoded nucleotide sequences. Briefly, only bi-allelic nucleotide states are considered for the pairwise distance calculation and all tri-nucleotide states, gaps or missing data are discarded. The distance score matrix applied is based on ([Chang et al. 2017](https://www.ncbi.nlm.nih.gov/pubmed/28819774)), whereby two homologous SNPs yield 0 if they were identical or identically heterozygous, 1 if different, and 0.5 if one of them was heterozygous. Windows with complete deletion sites (missing in all individuals) higher than 50\% were removed. The resulting pairwise distance matrix per window were used to calculate phylogenetic trees which were further used as input trees for 'twisst'.

Kristian Ullrich
committed
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
example for chromosome 3 for the tree calculations
```
#example for chromosome 3
#change the bottom part of the script 'get_bionj_trees_for_twisst.r' for each chromosome
#here you can find the example for chromosome 3
TMP_DIR <- "/tmp"
SEQ_FILE <- "http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/fasta/ind/chr3.ngm.minQ13.uniqueOnly.setMinDepthInd5.setMaxDepthInd100.fa"
chr <- "chr3"
WSIZE <- 25000
WJUMP <- 25000
OUT_FILE <- paste0(TMP_DIR,"/",chr,".sw",WSIZE,".","wj",WJUMP,".newick.trees")
#to run the script
nohup nice R CMD BATCH --vanilla get_bionj_trees_for_twisst.r &
#to check result file
tail -f /tmp/chr3.sw25000.wj25000.newick.trees
#to extract only the trees which have passed 50% missing complete deletion sites
awk '{if($6!="NA") print $0}' chr3.sw25000.wj25000.newick.trees > chr3.sw25000.wj25000.newick.trees.noNA
#to extract the trees for twisst
awk '{print $6}' chr3.sw25000.wj25000.newick.trees.noNA | tail -n +2 > chr3.sw25000.wj25000.newick.trees.noNA.twisst
#twisst analysis
python twisst.py -t chr3.sw25000.wj25000.newick.trees.noNA.twisst -w chr3.sw25000.wj25000.newick.trees.noNA.twisst.weightsfile -D chr3.sw25000.wj25000.newick.trees.noNA.twisst.distfile -g SPRE SPRE_SP36,SPRE_SP39,SPRE_SP41,SPRE_SP51,SPRE_SP62,SPRE_SP68,SPRE_SP69,SPRE_SP70 -g CAS CAS_1,CAS_2,CAS_3,CAS_4,CAS_5,CAS_6,CAS_7,CAS_8,CAS_9,CAS_10 -g AFG AFG_396,AFG_413,AFG_416,AFG_424,AFG_435,AFG_444 -g IRA IRA_AH15,IRA_AH23,IRA_JR11,IRA_JR15,IRA_JR2_F1C,IRA_JR5_F1C,IRA_JR7_F1C,IRA_JR8_F1A -g GER GER_TP1,GER_TP121B,GER_TP17_2,GER_TP3_02,GER_TP4a,GER_TP51D,GER_TP7_10F1A2,GER_TP81B -g FRA FRA_14,FRA_15B,FRA_16B,FRA_18B,FRA_B2C,FRA_C1,FRA_E1,FRA_F1B --method complete
```
_used software:_
+ R version 3.4.1 (2017-06-30)
+ R package ape_4.1
+ R package Biostrings_2.44.2
+ R package S4Vectors_0.14.3
+ R package XVector_0.16.0
+ R package IRanges_2.6.1
+ R package BiocGenerics_0.22.0
+ R package Rcpp_0.12.12
+ R package RcppParallel_4.3.20
+ R dev package distIUPAC_0.1.0 <https://github.com/kullrich/distIUPAC>
+ get_bionj_trees_for_twisst.r <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/get_bionj_trees_for_twisst.r>
All twisst analysis files can be obtained from:
<http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/ngm/twisst/>
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
Simulations were performed to evaluate dK80 under a phylogenetic scenario mimicking the real data without introgression, as well as to ensure that the mapping procedure does not lead to artefacts, especially in the copy number variable regions.
### Generate simulated data
First, to provide the appropriate distance framework, all pair-wise polymorphic sites and all pair-wise informative sites (excluding pair-wise missing sites) of all autosomes were counted between the investigated populations using the CONSENSUS FASTA files. The resulting pair-wise distances were then used as a distance matrix to calculate an UPGMA tree. The resulting UPGMA tree distances were used as a proxy to simulate chromosome 1 with the python script 'simdiv.py' taking the phylogenetic context into account.
```
#SPRE
python simdiv.py -i mm10.fasta -o mm10_0.008450_SPRE.fa -d 0.008450 -chr true
#ancsetral DMC
python simdiv.py -i mm10.fasta -o mm10_0.004725_DMC.fa -d 0.004725 -chr true
#ancestral MC
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.000639_MC.fa -d 0.000639 -chr true
#AFG
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.fa -d 0.003087 -chr true
#CAS
python simdiv.py -i mm10_0.004725_DMC_0.000639_MC.fa -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.fa -d 0.003087 -chr true
#ancestral D
python simdiv.py -i mm10_0.004725_DMC.fa -o mm10_0.004725_DMC_0.002246_D.fa -d 0.002246 -chr true
#newREF
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.001480_newREF.fa -d 0.001480 -chr true
#ancestral FGI
python simdiv.py -i mm10_0.004725_DMC_0.002246_D.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -d 0.000262 -chr true
#IRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.fa -d 0.001217 -chr true
#ancestral FG
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -d 0.000201 -chr true
#FRA
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.fa -d 0.001016 -chr true
#GER
python simdiv.py -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG.fa -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.fa -d 0.001016 -chr true
```
_used software:_
+ R version 3.4.1 (2017-06-30)

Kristian Ullrich
committed
+ simdiv.py <https://gitlab.gwdg.de/evolgen/introgression/raw/master/scripts/simdiv.py>
### Generate artificial illumina reads
Second, the simulated sequences representative of each population were used to generate artificial Illumina reads to test the influence of possible sequencing errors with 'ART' ([Huang et al. 2011](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr708)) and to mimic the original sequence libraries.
```
art_illumina -sam -na -ss HS20 -i mm10_0.008450_SPRE.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.008450_SPRE.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_AFG.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.000639_MC_0.003087_CAS.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.001217_IRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_FRA.chr1.
art_illumina -sam -na -ss HS20 -i mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.fa -f 20 -l 100 -m 250 -s 80 -p -o mm10_0.004725_DMC_0.002246_D_0.000262_FGI_0.000201_FG_0.001016_GER.chr1.
```
_used software:_
+ ART_Illumina Q Version 2.5.8 (June 7, 2016)
### Mapping and data post-processing

Kristian Ullrich
committed
Subsequently, the artificial Illumina reads were mapped against the simulated reference with 'ngm' ([Sedlazeck et al. 2013](https://www.ncbi.nlm.nih.gov/pubmed/23975764)), followed by sorting, marking and removing duplicates with the picard software suite (<https://broadinstitute.github.io/picard/>).

Kristian Ullrich
committed
+ NextGenMap 0.5.3

Kristian Ullrich
committed
FASTA sequence construction and dK80 calculation was performed as described above.
## Mappability mm10 genome
To construct a mappability track we used the software GEM ([Derrien et al. 2012](https://www.ncbi.nlm.nih.gov/pubmed/22276185)) with follofing options on the mm10 genome.
K-MER LENGTH: 100
APPROXIMATION THRESHOLD: 7
MAX MISMATCHES: 4
MAX ERRORS: 4
MAX BIG INDEL LENGTH: 15
MIN MATCHED BASES: 80
STRATA AFTER BEST: 1
The bigWig track file (http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/mm10.gem.mappability.25kbp.sorted.bed.bw) represents the mean values for 25kbp windows.
_used software:_
+ GEM-mappability build 1.315
## Data visualization and availability
All data can be obtained from:
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/
See the readme.txt file for a detailed description of the content.
http://wwwuser.gwdg.de/~evolbio/evolgen/wildmouse/introgression/readme.txt